Google employee here. xmanager is one of the main ML experiment tracking/orchestration tools we use internally, and I'm pretty excited that it is now available for others to use!
In a nutshell, xmanager allows you to:
- define an experiment, which is a collection of one or more work units (think combinations of hyperparameters)
- manage the different jobs/executables required to run this experiment (TPU workers, tensorboard job, etc.)
- collect and display measurements from work units (loss, other metrics)
- keep a reproducible artifact which allows you to re-run the same experiment at any point in the future
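
To make the experiment / work unit idea concrete, here is a toy sketch in plain Python. It is not the xmanager API: the `Experiment` and `WorkUnit` classes, `add_sweep`, and `log` are hypothetical names made up for illustration, and nothing is actually launched. It only shows how a hyperparameter grid expands into work units that each collect their own measurements.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    """One hyperparameter combination plus the measurements it reports."""
    work_unit_id: int
    hyperparameters: dict
    measurements: list = field(default_factory=list)

    def log(self, step, **metrics):
        # Record one measurement point (e.g. the loss at a given step).
        self.measurements.append({"step": step, **metrics})


@dataclass
class Experiment:
    """A titled collection of work units (hypothetical, not xmanager's class)."""
    title: str
    work_units: list = field(default_factory=list)

    def add_sweep(self, **grid):
        # Expand the grid into its Cartesian product: one work unit per combination.
        names = list(grid)
        for values in itertools.product(*grid.values()):
            unit = WorkUnit(len(self.work_units) + 1, dict(zip(names, values)))
            self.work_units.append(unit)
        return self.work_units


experiment = Experiment("toy-sweep")
experiment.add_sweep(learning_rate=[0.1, 0.01], batch_size=[32, 64])

# Stand-in for real training: every work unit reports a dummy loss at step 0.
for unit in experiment.work_units:
    unit.log(step=0, loss=1.0)
```

In a real launcher the loop body would start jobs on actual hardware instead of logging a constant; the examples directory in the repo shows what that looks like.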
It's great this is open sourced. This technology was key to enabling ML folks to scale up computation without having to deal with Borg and a bunch of other low-level systems.
It's one of the few systems in ML that I've used and thought "huh, this was well-designed and properly architected from the start"
See e.g. https://github.com/deepmind/xmanager/blob/main/examples/ for a few concrete examples of launcher scripts.
I wish they had included screenshots of the tool itself in the repo. I'll make that suggestion :).