
Agile development

TL;DR

Machine learning must focus on solving specific problems, and thus should follow best practices in software engineering. In this introduction we’ll talk about agile development and how it applies to science, and specifically to data science.

Learning outcomes of this unit

This session is mainly related to learning outcome A, “Understand the fundamental principles of agile development”. By the end of this session the student should be able to choose a problem to work on, and understand the principles of agile development as applied to machine learning.

Acceptance criteria

The teams will have created their repository and laid out a series of milestones, with minimally viable products to be delivered during the course.

Agile (Data) science

In software engineering circles, agile is a mindset that helps solve problems by putting the client first and focusing on always having a minimally viable product.

Science, and data science, need not be different; we need to put the problem first, and not any proxies to it in the shape of reports, papers, or participation in competitions. This is why the manifesto for agile data science was proposed (Merelo-Guervós and García-Valdez 2021), which not only tries to make data science more efficient, but also to close the gap between academic and industry practitioners.

In order to do that, you need to use the tools that are standard in the industry, such as source control, team management, and continuous integration/delivery tools, supplemented with machine-learning-specific tools, such as model registries.

Keys to agile development

The first thing you have to take into account is that, unlike the waterfall methodology that preceded it, agile is a set of best practices, not a methodology, let alone a single one. It mainly integrates many different trends in software development that emerged towards the end of the 20th century, and that allowed software teams to deliver value faster and more efficiently. By delivering value we mean actually making something that impacts the bottom line of the organization you are part of, be it a company, a research group, or a mixture of both.

Essentially, the waterfall methodology, so called because, when charted, it looked like a waterfall, with the different steps (planning, coding, testing, deploying) sequential and totally independent of each other, was too rigid and could not respond quickly to changing environments; but its main problem was that it did not deliver value until the very end, that is, until you arrived wherever the waterfall created its pool. This could be OK for certain kinds of projects, and it was certainly a big advantage in early mission-critical software projects, such as the Apollo flight software led by Margaret Hamilton.

In reality, however, it created a series of problems.

In the late 90s, trends like Extreme Programming and open source development, as well as tools like source control and static checks, came together in the Agile Manifesto (Fowler, Highsmith, et al. 2001). This manifesto has 12 principles that are used as mantras to create a series of best practices in development. It is not a methodology, but it creates a mindset for developing; this is why we talk about an agile mindset or agile teams, never an agile methodology.

In (Merelo-Guervós and García-Valdez 2021) we tried to expand this manifesto to also cover any scientific development that could be attached to a data science problem; it mainly extends the agile manifesto to include academic stakeholders, such as tutors or funding agencies, and also clarifies who the customer is in the case of a scientific paper. We will weave those ideas into the narrative below when we describe the principles that are most interesting for us.

So let’s put it to work through the following principles.

Problem and customer go first

Data science needs to solve a problem: either getting some insight into it, or providing an easy way to reach decisions, sometimes both. You need to understand the nature of the problem before settling down to work on it. And a problem begs for a customer; as we have seen in the design thinking chapter, it’s essential to understand that problem to come up with a solution.

What needs to be understood here is that, to create reproducible workflows that can go into production, MLOps needs to go beyond using already existing datasets: it needs to create and curate those datasets itself, eliminating the need to perform any of the steps manually. And you obviously need to understand the problem very well to code those steps so that they can be performed automatically.
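As an illustration, here is a minimal sketch of such an automated data-acquisition step; the URL, file layout, and function names are hypothetical, not the actual course pipeline:

```python
"""Minimal sketch of an automated data-acquisition step. The URL,
file layout, and function names are hypothetical, not the actual
course pipeline."""

from datetime import date
from io import StringIO
from pathlib import Path

import pandas as pd
import requests

SOURCE_URL = "https://example.org/daily-report.html"  # hypothetical source
RAW_DIR = Path("data/raw")


def download_report(url: str = SOURCE_URL) -> str:
    """Fetch the raw HTML report, failing loudly if the request fails."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def extract_table(html: str) -> pd.DataFrame:
    """Parse the first HTML table on the page into a data frame."""
    return pd.read_html(StringIO(html))[0]


def save_snapshot(table: pd.DataFrame, stamp: str) -> Path:
    """Write a dated CSV snapshot, so every run is reproducible."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    path = RAW_DIR / f"report-{stamp}.csv"
    table.to_csv(path, index=False)
    return path


if __name__ == "__main__":
    save_snapshot(extract_table(download_report()), date.today().isoformat())
```

Each run leaves a dated snapshot behind, so the dataset can be rebuilt and audited without performing any step by hand.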

There are a series of best practices related to including the user’s wishes in the development process; a user journey, for instance, will express how the client will use the specific product, or how it will be included in their decision-making process.

However, user stories are a much simpler mechanism: they describe, in the client’s or problem’s space, what their wishes are and what kind of benefit they will obtain from having the problem solved.

For instance, this user story reflects what an OSINT researcher would want when downloading data from the Ukrainian MoD:

As an OSINT researcher, I need to obtain data reported by the Ukrainian MoD in a processable format: Data is hosted at the MoD site, but in a text format. Obtaining it in a processable format (such as CSV or Parquet) will enable further analysis on it.

As we said in the introduction, domain knowledge is one of the keys to data science. We incorporate and internalize that domain knowledge using agile practices, and empathize with these clients in the first stages of design thinking.

Deliver working software frequently

This is precisely what MLOps is all about, in the same way as DevOps is about delivering conventional software frequently. There is no single “waterfall” process that finishes with the paper; it is rather a continuous improvement and development process that finishes when the customer is no longer interested.

In practice, this means that common software tools should be employed for developing the data science pipeline, including any reports or papers that result from it. That means using integrated source control and team management software such as GitHub or GitLab; in this course we have opted for GitHub.

GitLab is an excellent product that, being free software, you can self-host on premises. However, from the point of view of accessibility, and also ease of use in this context, GitHub is probably a better option (and also the one the professor is more familiar with).

Software delivery must be embedded in a process, and that process is the organization of the project into milestones; every milestone is an advance over the previous one, and it includes the delivery of a minimally viable product that solves all the problems (issues) that have been accepted for that specific release.

It also implies the continuous integration of improvements into the main branch, via automatically performed tests. This will be the focus of the next two sections; for the time being, let us focus on the planning part of it.

While all software that goes into the main branch needs to be tested, you will use milestones to organize development so that there’s a clear advance in the functionality of what you’re creating, and you can deliver something tangible that is proven to work: a minimally viable product.

Most pipelines will need to scrape and download data; in our Ukr MoD data, this is grouped in the first milestone. The second milestone is focused on performing data tests and making a first approximation to the analysis of the data. These milestones include workflows that automate their tasks; those workflows are the minimally viable product. However, for the time being we need to focus on laying out the different parts of the MLOps development as different milestones, which will then be developed one after the other.

Working ML pipelines are the primary measure of progress

Satisfying the client is the objective of any agile process, and in order to reach that state the client needs to be provided with working software. To prove that it is working, automated tests and other checks need to ensure that this is the case every time something is merged into main, and thus put into production.

What this mantra indicates is that finishing a run on our own computer is never the end of the process. Until that experiment is put into an MLOps pipeline that automates the whole process, we cannot claim that any progress has taken place.

Yes, even if the paper has been accepted. That is nice, but it is not what agile and MLOps are about.

A pipeline indicates that there is some automated procedure working, and a product being delivered on time. A working and complete pipeline is a complete product, which is the objective of the development process. But, in the meantime, every milestone must finish with this kind of pipeline.

For instance, the milestone indicated above should finish with a pipeline able to automatically download the web page from the Ukrainian MoD at the proper frequency. The result of running that pipeline should be green, to indicate that it is, effectively, working. The rest of the pipelines are shown, and should be green, in the module I’m using to illustrate this course.

To actually certify that they are working, you need to do at least two things: at one level, tests must be run automatically, so that the code is clean, follows common practices, and respects the unit specifications laid out in advance; at a different level, several pairs of eyes must be cast on the code.
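For the first level, here is a minimal sketch of what such an automated check could look like, written with pytest and reusing the hypothetical extract_table helper sketched earlier:

```python
"""Minimal pytest sketch for the data-extraction step; the pipeline
module and extract_table helper are the hypothetical ones sketched above."""

import pandas as pd

from pipeline import extract_table  # hypothetical module name


def test_extract_table_returns_expected_columns():
    # A tiny inline HTML table stands in for the downloaded page.
    html = (
        "<table>"
        "<tr><th>date</th><th>value</th></tr>"
        "<tr><td>2024-01-01</td><td>42</td></tr>"
        "</table>"
    )
    table = extract_table(html)
    assert isinstance(table, pd.DataFrame)
    assert list(table.columns) == ["date", "value"]
    assert len(table) == 1
```

Checks like this run on every pull request, so a broken scraper is caught before it reaches the main branch.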

Both levels are brought together in pull requests; never, or almost never (typos are one exception that comes to mind), should code be incorporated into the main branch directly. A pull request triggers the automated checks on the proposed changes, and gives reviewers a place to look at them before they are merged.

Remember, pull request reviews are not grades in an exam, and they are not obstacles to getting your code merged; they are simply a way to ensure the quality of the result, and to make it a collective endeavor, as opposed to a single person’s work.

Attention to technical excellence and good design

Good design is, in part, guaranteed by the different phases of the design thinking stage we have mentioned before; additionally, best practices in software design should be used.

Software design is a complicated affair and involves multiple layers: architectural design, modularization of the different components, as well as, eventually, API and maybe UI design. It’s essential to pay attention to all of these phases by first formulating these design problems as issues, and then reaching consensus on solutions at every step of the way.

There are many principles that can be followed during the design phase: SOLID is popular mainly in object-oriented design, but the twelve-factor methodology, hexagonal architecture, and CUPID are other approaches you can follow. Whatever you do, you need to be informed about these, and convince yourself that the objective of your work is to create a solid (no acronym here) and expandable product, not a single-shot script, paper, or Jupyter notebook.

In general, different modularization principles have been followed when designing the UKR MoD data module; for instance, the single responsibility principle has been applied throughout it; this leads to low coupling, with functions that perform a single task and change for a single reason.
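As a minimal sketch of what this looks like in practice (hypothetical names, not the actual module code), each function below has exactly one task, and coupling is confined to the function that composes them:

```python
"""Sketch of single-responsibility functions composed into a pipeline;
hypothetical names, not the actual UKR MoD module code."""

import json
from pathlib import Path


def parse_report(text: str) -> dict:
    """Single task: turn one raw report line into a record."""
    day, value = text.split(";")
    return {"date": day.strip(), "value": int(value)}


def validate(record: dict) -> dict:
    """Single task: reject records that would break later stages."""
    if record["value"] < 0:
        raise ValueError(f"negative value in {record}")
    return record


def persist(records: list[dict], path: Path) -> None:
    """Single task: write records to disk."""
    path.write_text(json.dumps(records, indent=2))


def run(lines: list[str], path: Path) -> None:
    """Coupling lives only here, in the composition of the stages."""
    persist([validate(parse_report(line)) for line in lines], path)


if __name__ == "__main__":
    run(["2024-01-01; 42", "2024-01-02; 17"], Path("records.json"))
```

If the report format changes, only parse_report changes; if storage moves from JSON to something else, only persist does.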

Simplicity is essential

This should be obvious, but it is even more so in the case of machine learning, where you have to justify the result of pipelines and often explain how you reached a conclusion. Simplicity is obtained through low coupling, and through simple, readable modules and functions that are well understood and perform a function that delivers value to a customer.

In the UKR MoD data module above, different functions are performed by different modules, written in different languages; this simplifies the code, since some languages are shorter and more expressive for a specific function. For instance, Python is very convenient for scraping, but R is much more convenient for reshaping data and creating charts. The R function that reshapes the data frame takes just a few lines; using another scripting language, even Python, would probably have required much more code.

Read also

As stated, (Merelo-Guervós and García-Valdez 2021) is one of the main documents that should be followed. Thoughtful Machine Learning (Kirk 2014) will probably help too. Thakur’s book (Thakur 2020) is a very general approach to machine learning, with examples in Python. It does not really tackle the problem-to-code part, but it can be helpful to get started.

Activity

Please refer to the corresponding session for non-coding activities.

Coding activities will consist of the creation of a GitHub team, with the addition of all the team members, and of a corresponding project repository shared by the whole team.

In this first MVP, the repository will contain the description of a problem that will be solved throughout the course, together with the methodology to obtain the dataset and the insights that need to be obtained from it. This methodology will include:

As in every other milestone, you will need to create a pull request *in your own repository*. This PR will include links to the different issues that are targeted, including user stories. It will be reviewed by the tutor, as well as by several members of your own team, before being accepted.

References

Fowler, Martin, Jim Highsmith, et al. 2001. “The Agile Manifesto.” Software Development 9 (8): 28–35.

Kirk, Matthew. 2014. Thoughtful Machine Learning: A Test-Driven Approach. O’Reilly Media, Inc.

Merelo-Guervós, Juan Julián, and Mario García-Valdez. 2021. “Agile (Data) Science: A (Draft) Manifesto.” arXiv. https://doi.org/10.48550/ARXIV.2104.12545.

Thakur, Abhishek. 2020. Approaching (Almost) Any Machine Learning Problem. Abhishek Thakur.