Combined machine learning framework for supercomputing


Blog post 2/3

In our previous blog post we discussed the need for combining modern cloud-based machine learning frameworks with the batch processing pipelines used in high performance computing (HPC) environments, including the world's fastest supercomputers. In this blog post, we will envision such a combined machine learning framework and list the primary components required. In the next and final part of this blog post series we will introduce existing frameworks for implementing the combined ML workflow for cloud and HPC environments.

Components of a combined framework

We think a combined ML framework should ideally contain the following components:

1. An easy-to-use web browser UI with Jupyter Notebook or a similar environment, where you can experiment and develop models. All the files of the project should be easily accessible, perhaps by having object storage such as CSC’s Allas mounted as regular folders. If you are not familiar with Jupyter Notebook, it’s a web-based interactive development environment where you can combine, for example, cells of live Python code with documentation and visualisation cells.

2. Model and dataset repositories. Applications of machine learning are increasingly based on fine-tuning and building on pre-trained models and existing datasets shared by others (see the first sketch after this list). The framework interface should therefore offer point-and-click access to model zoos (collections of pre-trained models) and to repositories of commonly used datasets. In the same way, the framework should enable researchers to share their own results, models, and datasets.

3. Model and dataset versioning. In addition to code, the produced models and datasets need to be stored in a versioned manner to enable reproducibility. In particular, it is important to have a record of the exact model specifications and dataset versions used for model training. In general, common version control systems such as git are not optimal for large binary artifacts.

4. A way to run batch jobs using the code developed in the Notebook environment against the given training dataset (possibly from a shared data repository). Very often there is a need to run a large number of different variants of the model; a sketch of submitting such variant runs follows this list. The produced models are saved to the project’s data storage or to a model repository, and validation results and other performance measures should also be logged.

5. Visualisation of the progress and results of batch job runs. In total, there might be hundreds or thousands of jobs running different model variants during development. It would be very useful to have a graphical user interface for comparing the results of those job runs, to facilitate further analysis and, for example, to help in selecting the optimal model (see the metric logging sketch after this list). This UI would also be useful for monitoring training progress during batch jobs: for example, one could identify model variants that perform poorly already at an early stage and interrupt their training jobs.

6. Support for inference deployment. When model development is finished and the best model for the task has been chosen, it is time to utilise the trained model in an ML application. In some cases there is a need to build an inference system to deploy the application as an online service for others to use (see the last sketch after this list).
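As a concrete illustration of point 2, loading a pre-trained model for fine-tuning is typically only a few lines of code. The sketch below uses the Hugging Face Transformers library and its model hub purely as one example of a model zoo; the model name and the two-class setup are placeholders rather than recommendations.

    # Minimal sketch: fetch a pre-trained model and tokenizer from a model zoo
    # (here the Hugging Face hub; any repository with a similar API would do).
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "bert-base-uncased"  # example model; pick one suited to your task
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # The model can now be fine-tuned on the project's own dataset
    # instead of being trained from scratch.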
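For point 4, the batch runs would in practice be handled by the HPC cluster’s scheduler. The following sketch submits one job per model variant from Python; it assumes a Slurm-based cluster (sbatch) and a hypothetical train.py training script, and the resource requests are placeholders.

    # Hypothetical sketch: submit one Slurm batch job per model variant.
    # Assumes sbatch is available and a train.py script exists; resources are placeholders.
    import subprocess

    learning_rates = [0.001, 0.0003, 0.0001]

    for lr in learning_rates:
        subprocess.run(
            ["sbatch",
             "--job-name", f"train-lr{lr}",
             "--time", "02:00:00",
             "--gres=gpu:1",
             "--wrap", f"python train.py --learning-rate {lr}"],
            check=True,
        )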
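Point 5 presumes that every batch job records its parameters and metrics in a form a comparison UI can read. As one possible convention, the sketch below logs them with MLflow’s tracking API; the parameter names, values, and the dummy training loop are purely illustrative.

    # Illustrative sketch: log run parameters and metrics so that a tracking UI
    # (here MLflow; TensorBoard or similar would also work) can compare runs.
    import mlflow

    def train_one_epoch(epoch):
        # Placeholder for the real training step; returns a fake validation score.
        return 0.5 + 0.05 * epoch

    with mlflow.start_run(run_name="example-variant-lr0.001"):
        mlflow.log_param("learning_rate", 0.001)
        mlflow.log_param("batch_size", 64)
        for epoch in range(10):
            val_accuracy = train_one_epoch(epoch)
            mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)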
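Finally, for point 6, deployment can be as simple as wrapping the chosen model in a small web service. The sketch below uses FastAPI as one common option; the input format and the stand-in "model" are placeholders for the real application.

    # Minimal sketch of an inference service using FastAPI (one option among many).
    # The input format and the stand-in prediction are placeholders.
    from typing import List

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Features(BaseModel):
        values: List[float]

    @app.post("/predict")
    def predict(features: Features):
        # Placeholder: replace with loading and calling the real trained model.
        prediction = sum(features.values)
        return {"prediction": prediction}

    # Launch with e.g.: uvicorn inference_service:app (assuming this file is inference_service.py)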

Finally, at least some of the above components would be implemented with an API, so that they can also be used programmatically. Ideally, all features of the framework would be usable programmatically via a comprehensive API, and the user interface parts would be built on that same API. In addition, some kind of project management needs to be integrated into all components: it would allow grouping datasets and models into projects and managing access using, for example, role-based access control mechanisms. The implementations should preferably all be free and open source, as we are working in an HPC environment supporting open research.

Joining two worlds

From a technical point of view, it makes most sense to have the interactive part running on Kubernetes, while the batch jobs would run in the HPC cluster. Utilising modern container technologies and keeping the code and datasets in versioned repositories helps ensure that the two environments stay compatible.

However, running an interactively edited Jupyter notebook as a batch job presents some challenges. Notebooks consist of a sequence of cells, which can be executed in an arbitrary order in interactive mode. Code cells can be modified or deleted while the effects of their earlier execution still remain in the currently running interactive Python session. In batch mode, however, the cells are executed serially from first to last, and no history is retained from earlier executions of the code. The user therefore needs to check that the notebook works correctly when run serially before submitting it as a batch job.
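One way to do this check is to execute the notebook programmatically from the first cell to the last in a fresh kernel, which is essentially what a batch run would do. The sketch below uses the nbformat and nbclient libraries as one possible approach; the notebook filename is just an example.

    # Sketch: run a notebook top-to-bottom in a fresh kernel, as a batch job would.
    # Uses nbformat + nbclient; "train.ipynb" is an example filename.
    import nbformat
    from nbclient import NotebookClient

    nb = nbformat.read("train.ipynb", as_version=4)
    client = NotebookClient(nb, timeout=600, kernel_name="python3")
    client.execute()  # raises CellExecutionError if any cell fails

    nbformat.write(nb, "train-executed.ipynb")  # save the executed copy with its outputs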

Another issue is the parameterisation of notebooks. A common use case for batch jobs is evaluating the performance of different variants of a model, so the notebook needs to be executed with a large number of different parameter values. Parameterisation therefore needs to be supported in the notebooks in some way.
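One existing approach is parameterised notebook execution, where designated cells receive their values at run time. The sketch below uses papermill as an example of such a tool; it assumes the notebook has a cell tagged "parameters" holding default values, and the notebook name and parameter values are illustrative.

    # Sketch: parameterised notebook execution with papermill (one possible tool).
    # Each run writes an executed copy of the notebook with its outputs preserved.
    import os
    import papermill as pm

    os.makedirs("runs", exist_ok=True)

    for lr in [0.001, 0.0003, 0.0001]:
        pm.execute_notebook(
            "train.ipynb",                 # input notebook (example name)
            f"runs/train-lr{lr}.ipynb",    # executed copy for this parameter value
            parameters={"learning_rate": lr},
        )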

The user also needs to know which version of the notebook and which parameter values were actually used in a specific batch job, as the "live" version of the code in the interactive editor may have changed several times since the job was submitted. Version control is particularly useful here, but on its own it might not make it convenient enough for the user to keep track of the various runs and their parameters.

Overall, from the user experience point of view, interactive notebook editing and batch job submission are quite different beasts: interactive programming takes place on a second-to-second time scale, while it might take hours or days for batch jobs to start and finish, depending on the requested resources and the length of the queues in the HPC cluster. Designing a sensible user interface that supports both scenarios is a big challenge. Another option would be to split the two tasks into separate user interfaces: an interactive notebook UI and a batch job management UI. Having two completely separate user interfaces might make the differences clearer and reflect the two modes of working more naturally.

In the next blog post we will look at some existing higher level frameworks for machine learning and assess how well they fit our requirements. Can we find the ideal framework, or can we combine several existing tools to achieve something close to perfection?


Juha Hulkkonen

The author is a data engineering and machine learning specialist in CSC's data analytics group, working with machine learning and big data workflows.

Aleksi Kallio

The author is the manager of CSC's data analytics group, coordinating development of machine learning and data engineering based services.

Markus Koskela

The author is a machine learning specialist in CSC's data analytics group, working with various machine learning applications and computing environments.

Mats Sjöberg

The author is a machine learning specialist in CSC's data analytics group, working with various machine learning applications and computing environments.