MLOps course for Nova-Lisbon.

Introduction

MLOps (Battina 2019) is the combination of the terms and practices of machine learning and DevOps.

DevOps, in turn, merges all the phases and roles of a development team into each and every member of the team, who will act by turns as code writer, QA lead and system programmer.

The objective of machine learning (ML) is to create models from data that can be used for prediction, classification or visualization, solving specific problems for specific customers. If you combine that with DevOps, what we have is the reproducible and seamless creation of ML workflows that can be continuously deployed to production.

Scientific computing and development, which includes data science, has generally been kept apart from mainstream best practices in companies. This is partly because science is driven by public funding rather than by a customer or a new product, and partly because of the traditional division of science into silos.

DevOps was an attempt to break silos (Development, QA, Ops) in software engineering; MLOps tries to break silos in machine learning, and possibly, in time, in data science.

By data science I mean the academic part of it.

This course, or this material, depending on how you arrive at it, is mainly addressed to students who have been working in traditional scientific computing environments. It will try to endow them with an agile mindset, a useful skill if they want to jump from academia to industry. If they are already in industry, it will try to help them get acquainted with data science concepts and techniques that can jump-start a career in the coveted data engineering wage bracket.

Hope this is useful for anyone arriving here from anywhere, before, during or after the in-person summer course.

The keys to data science

As revealed in many articles such as this one, being a successful data scientist or engineer requires a wide range of different skills.

  1. First, domain knowledge is necessary. Since software engineers always solve someone’s problem, the engineer has to become an expert in that person's domain in order to provide meaningful and useful solutions. This means that a machine learning problem, the same as a workshop, must start with choosing a domain and with the intention and drive to solve a problem in that domain, not with an already created, static dataset.
  2. Math skills are also needed to understand, if nothing else, what validation results mean, and how you need to select a model based on comparison with other models or the baseline. If you are in academia, you will probably need to go a bit further tweaking hyperparameters, and combining algorithms (or your paper will be rejected), but even basic industrial data scientists need to have some knowledge about statistics.
  3. The data scientist needs to know and follow best practices in software development, and also be acquainted with the tools, services and languages currently used in modern development. This includes DevOps and QA, which are incorporated into the ML/DS domain as MLOps.
  4. Communication and visualization skills: unlike software engineers, data engineers need to eventually deliver not only a product, but also the solution to the problem that has been reached through that product. In some cases, visualization is all that’s going to be needed; in others, visualization will be the last phase in a data pipeline.

These four pillars will be studied, with different intensity, in this course. We will not devote a lot of time to math and visualization (just enough to tell you that you need to use them). MLOps is, obviously, the focus of the third pillar, but we will also spend some time on understanding why domain knowledge is absolutely necessary for the design of successful MLOps systems.
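To make the second pillar a bit more concrete, the following is a minimal sketch of what "selecting a model by comparison with a baseline" can look like. Everything in it is an assumption made for illustration: the data is made up, and the function names (`majority_baseline`, `threshold_model`, `accuracy`) are hypothetical helpers, not part of any library or of this course's code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train):
    """Baseline: always answer with the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return lambda x: most_common


def threshold_model(x_train, y_train):
    """Candidate model: cut at the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(x_train, y_train) if y == 0) / y_train.count(0)
    mean1 = sum(x for x, y in zip(x_train, y_train) if y == 1) / y_train.count(1)
    cut = (mean0 + mean1) / 2
    # Predict 1 on the side of the cut where the class-1 mean lies.
    return lambda x: int((x > cut) == (mean1 > mean0))


# Toy, made-up data: one numeric feature and a binary label.
x_train = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0]
y_train = [0, 0, 0, 0, 1, 1, 1]
# Held-out validation data, never used for fitting.
x_val = [1.2, 2.2, 3.0, 6.2, 6.8]
y_val = [0, 0, 0, 1, 1]

baseline = majority_baseline(y_train)
model = threshold_model(x_train, y_train)

baseline_acc = accuracy(y_val, [baseline(x) for x in x_val])
model_acc = accuracy(y_val, [model(x) for x in x_val])

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

The point of the sketch is only that a validation score means little in isolation: it becomes informative when set against what a trivial predictor achieves on the same held-out data.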

Steps in MLOps

MLOps consists in the automation, within the framework of an agile team, of the following steps:

  1. Data extraction and engineering.
  2. Model development and engineering.
  3. Continuous training.
  4. Storage in model registries.
  5. Continuous deployment.

Throughout this course, we’ll try to go through these steps, starting with the very important step 0: create valuable products and develop using an agile mindset.
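As a rough illustration of how the five steps listed above fit together, here is a minimal, self-contained sketch. It is not how any particular MLOps tool works: the `ModelRegistry` class, the function names and the toy data are all hypothetical, and a real setup would presumably replace each of them with a dedicated service (a data store, a training job, a registry, a deployment target).

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """In-memory stand-in for a model registry (step 4); purely illustrative."""
    versions: list = field(default_factory=list)

    def register(self, model, score):
        entry = {"version": len(self.versions) + 1, "model": model, "score": score}
        self.versions.append(entry)
        return entry

    def best(self):
        return max(self.versions, key=lambda v: v["score"])


def extract_data():
    """Step 1: data extraction and engineering (hard-coded toy data here)."""
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up values following y = 2x + 1
    return xs, ys


def train_model(xs, ys):
    """Step 2: model development; ordinary least squares with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def evaluate(model, xs, ys):
    """Negative mean squared error, so that a higher score is better.
    Computed on the training data only to keep the sketch short."""
    preds = [model["slope"] * x + model["intercept"] for x in xs]
    return -sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def run_pipeline(registry):
    """Steps 1, 2 and 4 in sequence. Step 3 (continuous training) amounts to
    re-running this function whenever new data or code arrives, for instance
    from a scheduler or a CI job."""
    xs, ys = extract_data()
    model = train_model(xs, ys)
    return registry.register(model, evaluate(model, xs, ys))


def deploy(registry):
    """Step 5: continuous deployment; here it just returns a predict function
    built from the best registered model."""
    model = registry.best()["model"]
    return lambda x: model["slope"] * x + model["intercept"]


registry = ModelRegistry()
entry = run_pipeline(registry)
predict = deploy(registry)
print(entry["version"], predict(10.0))
```

Keeping each step behind its own function is a deliberate choice in this sketch: it is what allows every step to be automated, tested and replaced independently, which is the property the list above is after.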

Read also

The site Machine Learning Operations is a collection of concepts and tutorials about MLOps.

References

Battina, Dhaya Sindhu. 2019. “An Intelligent DevOps Platform Research and Design Based on Machine Learning.” Training 6 (3).

Kreuzberger, Dominik, Niklas Kühl, and Sebastian Hirschl. 2022. “Machine Learning Operations (MLOps): Overview, Definition, and Architecture.” arXiv Preprint arXiv:2205.02302.

Tamburri, Damian A. 2020. “Sustainable MLOps: Trends and Challenges.” In 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 17–23. IEEE.