
Reproducible science using containers

TL;DR

In an open science context, containers are useful not only for recording the infrastructure used in MLOps, but also because they can easily be published, letting anyone re-use them. We will look at best practices for creating active and data containers that can easily be used in ML workflows.

Learning outcomes of this unit

Students will learn how to use containers for deployment of models, as well as to store and possibly publish data.

Acceptance criteria

Several data and active containers have been created and integrated into the publication schedule, as well as into different workflows.

Containers

Containers were one of the factors that brought the DevOps (quasi-)revolution to the software engineering world. The fact that they made it possible to isolate the execution of programs or services, that they could be packaged and stored, and that they came with a language that made it very easy to describe how to create them, was a real stunner.

Essentially, those are the three factors that also make them useful for data science: you can isolate an application from the rest of the system, in such a way that it is very easy to deploy it anywhere. That isolation comes at the price of precisely describing the infrastructure needed to run the application; the reward is that, once that is done, you can deploy your application, be it an MLOps stage or even data, wherever you want.

Let’s check out the concepts that go together with containers, or Docker containers as they are usually called, since Docker was the company that created and released the framework for using them in the first place.

First and foremost, what containers do is isolate applications; they do not virtualize a whole system. The applications still use the services of the host operating system, mainly those related to that isolation. Which services exactly does not matter too much; what matters is that the operating system must support them, and so far only Linux, Windows and a few other (IBM) operating systems do. Containers do not virtualize the processor instruction set either: they run executables directly on the host processor. That implies that you need a specific combination of operating system and processor architecture on the host to run a given container (and the other way round).
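
If you are not sure which combination your installation provides, you can ask Docker directly; a quick check (the --platform flag in the second command only works if your setup can emulate other architectures, e.g. via Docker Desktop or QEMU):

# Operating system and architecture the daemon runs containers for
docker info --format '{{.OSType}}/{{.Architecture}}'
# Request a specific platform; needs emulation if it differs from the host
docker run --rm --platform linux/amd64 alpine uname -m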

However, containers are a standard (from the OCI, the Open Container Initiative) and attempt to create a system that any developer can use and that can then be deployed anywhere. How does that work? The software you install on your operating system has two parts: a client (the docker command-line tool) that you use to issue commands, and a daemon (the engine) that actually builds, runs and manages the containers.
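
You can see the two parts with docker version, which prints one section for the client and another for the server (the daemon); the second command below is only a sketch of pointing the same client at a daemon on another machine, with user@remote-host as a placeholder:

# Prints a Client: section and a Server: section (the daemon)
docker version
# The same client can talk to a remote daemon over SSH
docker -H ssh://user@remote-host info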

Let’s see the different states a container can be in: it can be created (but not started), running, paused, stopped (exited) or removed.
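
A minimal sketch of that lifecycle, using a throwaway alpine container (the name state-demo is arbitrary):

# Create a container without starting it: its state is "created"
docker create --name state-demo alpine sleep 300
docker inspect --format '{{.State.Status}}' state-demo
# Start it and it becomes "running"; stop it and it becomes "exited"
docker start state-demo
docker stop state-demo
docker inspect --format '{{.State.Status}}' state-demo
# Removing it deletes it altogether
docker rm state-demo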

The general best practice is to use Dockerfiles in your project to create images, which are then stored in registries (Docker Hub, GitHub Container Registry, Quay.io).
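
The usual cycle looks roughly like this (your-user and my-image are placeholders; the registry could equally be Docker Hub or Quay.io):

# Build an image from the Dockerfile in the current directory
docker build -t my-image .
# Tag it for a registry, log in and push it so that others can pull it
docker tag my-image ghcr.io/your-user/my-image:latest
docker login ghcr.io
docker push ghcr.io/your-user/my-image:latest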

A good thing about containers is that there is a wide array of them you can use directly. All containers are built from a base, and these bases are published in registries, so you can just look for one and start running it.
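
For instance, without writing any Dockerfile you can already do something like this:

# Search Docker Hub for candidate images
docker search jupyter
# Pull an official base image and run a one-liner in it
docker run --rm python:3.12 python -c "print('hello from a container')"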

At this point, you should have an installation of Docker on your machine. Please follow the instructions for your specific platform. In general we will be working with Linux/Intel containers, so if your platform offers several options, use the one that is able to work with this kind of container.
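
Once Docker is installed, the classic smoke test checks that both the client and the daemon work:

# Check the client is on the PATH and the daemon answers
docker --version
docker run --rm hello-world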

For instance, we can start running a Jupyter Hub just by doing this:

docker run --rm opendatacube/jupyterhub  jupyterhub --ip="*"

There’s probably a GUI option to do more or less the same.

The first part, docker run, is what effectively tells the service we are going to run a container; this also checks whether the image is already on our machine, and downloads it if it is not. You will see a message about pulling, and several hexadecimal hashes that represent the different layers that compose the image. That does not matter a lot right now, but best practices advise minimizing the number of layers you use.

The rest of the command line is as follows: --rm deletes the container once it stops, so it does not keep taking up disk space; opendatacube/jupyterhub is the name of the image, which will be pulled from Docker Hub if it is not already present; and jupyterhub --ip="*" is the command that will be run inside the container.
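
If you are curious about the layers, you can list the ones that make up an image you have already pulled:

# Show the layers (and the steps that created them) of a local image
docker history opendatacube/jupyterhub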

Jupyter publishes a number of notebook images you can readily use, which already have a series of data science modules preinstalled. For instance, this one:

docker run --rm -p8888:8888 jupyter/scipy-notebook

will directly run a notebook with SciPy preinstalled; the lab will be accessible on port 8888 of the host, which has been mapped to the container’s internal port via the -p flag.
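
The host side of the mapping does not have to match the container side; for instance, this serves the same notebook on port 9999 of your machine:

# host port 9999 -> container port 8888, where the notebook server listens
docker run --rm -p 9999:8888 jupyter/scipy-notebook
# then browse to http://localhost:9999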

In most cases, containers are not created to be used directly; they will be the base for other applications, which add more layers to them, including the files and other dependencies your application needs. We will see how to build these images ourselves next.
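
As a sketch of what that looks like (requirements.txt and analysis.ipynb stand for files of your own project):

FROM jupyter/scipy-notebook

# Add our own dependencies and files on top of the base image
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY analysis.ipynb work/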

Creating our own containers

We can create our own containers for a multitude of purposes. One of the simplest ones we can create is a container that just holds data. There are three parts to designing and building a container: writing a Dockerfile that describes it, building an image from that Dockerfile, and running (or publishing) the resulting image.

For instance, this (relatively simple) container has been created to hold data from the Ukrainian Ministry of Defense on the war in Ukraine.

FROM denoland/deno:latest

LABEL version="1.0.0" maintainer="JJMerelo@GMail.com"

WORKDIR /app
COPY tools/serve-data.ts .
RUN mkdir resources
COPY resources/*.csv resources/

EXPOSE 31415
VOLUME resources

CMD ["run", "--allow-net", "--allow-read",  "serve-data.ts"]

Besides the data, it contains a Deno script that serves the data via a web server; this explains why we use a Deno container in the FROM statement.

LABEL is mainly for metadata, not really functional; the next four statements decide where the application is going to run (WORKDIR), copy the Deno script that serves the data, create a subdirectory for resources and copy them from the host (first argument) to the container (resources/, a directory that has already been created).

The next two statements are also metadata: they tell you the port the server is going to use (prize if you recognize the number) and the name of the shared directory you should use.

Finally, the last part is what is going to be run: using Deno, it starts the server that will serve the data. This image is published in the GitHub Container Registry, and you can use it directly like this:

docker run --rm -p31415:31415 ghcr.io/jj/ukr-mod-data:latest
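
That maps the exposed port to the same port on the host, so you can query the embedded web server; the exact routes depend on what serve-data.ts implements, so the path below is only an illustration:

# Run the data container in the background instead, then query it
docker run --rm -d --name ukr-data -p31415:31415 ghcr.io/jj/ukr-mod-data:latest
curl http://localhost:31415/
# Stop it when done; --rm removes it automatically
docker stop ukr-data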

In a Russian-doll manner, you can build your containers using other, existing containers as a base (that is, via the FROM statement), or, since they are in most cases open source, take inspiration from them to create your own. For instance, this one uses the code in the course itself to contain data and just print it when invoked:

FROM python:latest

LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"

COPY --chown=novamlops pyproject.toml poetry.lock .
RUN mkdir colares_project
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
    && poetry install \
    && poetry run testcsv \
    && rm -rf colares_project/ pyproject.toml poetry.lock

ENTRYPOINT cat Export_test.csv

It’s similar to the one above, except for a couple of details: here we create a non-privileged user (novamlops) and run everything as that user, and we use Poetry to install the dependencies and run the script that generates the CSV file, removing the sources afterwards so that only the generated data stays in the image.

The Dockerfile is just a plan for building an image, of course; you need to actually build the image to use it.

docker build -f first.Dockerfile -t jj/nova-mlops-first .

Then you can run it with

docker run -t jj/nova-mlops-first

Using -t tells Docker that it needs to allocate a console or terminal; basically, that there will be something to print and it should not keep it to itself. This will, effectively, print the content of the CSV file to the screen, and you can then redirect it to a file and use it however you want.
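
For instance, dropping -t (so that the terminal does not sneak carriage returns into the output), you could capture the data on the host like this:

# Print the CSV baked into the image and save it on the host
docker run --rm jj/nova-mlops-first > exported.csv
head exported.csv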

Using “active containers”

The containers we have seen so far are little more than data containers. You will generally want your containers to actually do something. Containers run services and applications, and they can also be very useful when running tests, encapsulating everything you need to effectively run them.

They can also be used for running periodic tasks such as downloading files. In this case we will need to set up a space that is shared between host and container, where the container will write whatever files are needed.
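
Scheduling the periodic run itself is a matter for the host; on Linux, for example, a crontab entry along these lines would do (the host path is a placeholder; the image is the one used further down):

# Run the container every day at 06:00, sharing /srv/mlops/data with it
0 6 * * * docker run --rm -v /srv/mlops/data:/home/novamlops/data jj/nova-mlops-second data/daily.csv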

FROM python:latest

LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"

COPY --chown=novamlops pyproject.toml poetry.lock .
RUN mkdir colares_project && mkdir data
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
    && poetry install

VOLUME /home/novamlops/data

ENTRYPOINT ["poetry", "run", "testcsv"]

This one is very similar to the previous one.

As a matter of fact, what we should have done is to create this one, and then the other one based on this one. We might still do it.

The main differences are that the project files are not deleted after installation, that a data directory is created and declared as a VOLUME, and that the ENTRYPOINT runs the script (poetry run testcsv) every time the container starts, instead of just printing a file that was generated at build time.

By declaring a directory as a VOLUME, we state our intention to interact with the container through it.

There are also subtle changes to the code, but that’s not really important.

We can run it like this:

docker run --rm -it -v `pwd`:/home/novamlops/data \
  jj/nova-mlops-second data/test.csv

The first difference is that we’re using -it. It’s not really needed, but it will help if we want to kill the container for some reason: -t prints to the terminal (as we have seen), -i makes it interactive. But the most important thing is -v, which mounts the current directory of the host (pwd: print working directory; the backticks run a command and substitute its output, a shell feature which probably has some equivalent in PowerShell) onto the VOLUME we declared. That directory needs to be empty in the container, since it will be “mounted” over by this external directory and will show its contents.

But what we want is for the container to produce a file with a specific name; this is added at the end of the command. Since the script runs from the container’s working directory, we need to precede the file name with data/ so that it ends up in the mounted volume. This will, effectively, create a test.csv file in data, which on the host is the directory we ran the command from.
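
You can check on the host that the file is indeed there once the container exits:

# The generated file appears in the mounted (current) directory
ls -l test.csv
head test.csv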

See also

The book Docker for Data Science is a good reference (Cook 2017).

References

Cook, Joshua. 2017. Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server. Apress.