
Reproducible science using containers

TL;DR

In an open science context, containers are useful not only for recording the infrastructure used in MLOps, but also because they can easily be published, letting anyone re-use them. We will look at best practices for creating active and data containers that can easily be used in ML workflows.

Learning outcomes of this unit

Students will learn how to use containers for deployment of models, as well as to store and possibly publish data.

Acceptance criteria

Several data and active containers have been created and integrated into the publication schedule, as well as into different workflows.

Containers

Containers were one of the factors that brought the DevOps (quasi-)revolution to the software engineering world. The fact that they made it possible to isolate the execution of programs or services, that they could be packaged and stored, and that they came with a language that made it very easy to describe how to create them, was a real stunner.

Essentially, those are the three factors that also make them useful for data science: you can isolate an application from the rest of the system, in such a way that it is very easy to deploy it anywhere. That isolation comes at the price of precisely describing the infrastructure needed to run the application; the reward is that, once that is done, you can deploy your application, be it an MLOps stage or even data, wherever you want.

Let’s check out the concepts that go together with containers, or Docker containers as they are usually called, since Docker was the company that created and released the framework for using them in the first place.

First and foremost, what containers do is isolate applications; they do not virtualize a whole system. The applications still use the services of the host operating system, mainly those related to that isolation. Which services exactly does not matter too much; what matters is that the operating system must support them, and so far only Linux, Windows and a few other (IBM) operating systems do. Containers do not virtualize the processor instruction set either: they run executables directly on the host processor. That implies that you need a specific combination of operating system and processor architecture on the host to run a given container (and the other way round).
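
If you are not sure which combination your installation provides, you can ask Docker directly; a quick check (the --platform flag in the second command only works if your setup can emulate other architectures, e.g. via Docker Desktop or QEMU):

# Operating system and architecture the daemon runs containers for
docker info --format '{{.OSType}}/{{.Architecture}}'
# Request a specific platform; needs emulation if it differs from the host
docker run --rm --platform linux/amd64 alpine uname -m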

However, containers are a standard (from the OCI, the Open Container Initiative) and attempt to create a system that any developer can use and that can then be deployed anywhere. How does that work? The software you install on your operating system has two parts: a client (the docker command-line tool) that you use to issue commands, and a daemon (the engine) that actually builds, runs and manages the containers.
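
You can see the two parts with docker version, which prints one section for the client and another for the server (the daemon); the second command below is only a sketch of pointing the same client at a daemon on another machine, with user@remote-host as a placeholder:

# Prints a Client: section and a Server: section (the daemon)
docker version
# The same client can talk to a remote daemon over SSH
docker -H ssh://user@remote-host info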

Let’s see the different states a container can be in: it can be created (but not started), running, paused, stopped (exited) or removed.
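
A minimal sketch of that lifecycle, using a throwaway alpine container (the name state-demo is arbitrary):

# Create a container without starting it: its state is "created"
docker create --name state-demo alpine sleep 300
docker inspect --format '{{.State.Status}}' state-demo
# Start it and it becomes "running"; stop it and it becomes "exited"
docker start state-demo
docker stop state-demo
docker inspect --format '{{.State.Status}}' state-demo
# Removing it deletes it altogether
docker rm state-demo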

The general best practice is to use Dockerfiles in your project to create images, which are then stored in registries (Docker Hub, GitHub Container Registry, Quay.io).
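
The usual cycle looks roughly like this (your-user and my-image are placeholders; the registry could equally be Docker Hub or Quay.io):

# Build an image from the Dockerfile in the current directory
docker build -t my-image .
# Tag it for a registry, log in and push it so that others can pull it
docker tag my-image ghcr.io/your-user/my-image:latest
docker login ghcr.io
docker push ghcr.io/your-user/my-image:latest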

A good thing about containers is that there is a wide array of them you can use directly. All containers are built from a base, and these bases are published in registries, so you can just look for one and start running it.
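
For instance, without writing any Dockerfile you can already do something like this:

# Search Docker Hub for candidate images
docker search jupyter
# Pull an official base image and run a one-liner in it
docker run --rm python:3.12 python -c "print('hello from a container')"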

At this point, you should have an installation of Docker on your machine. Please follow the instructions for your specific platform. In general we will be working with Linux/Intel containers, so if your platform offers several options, use the one that is able to work with this kind of container.
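
Once Docker is installed, the classic smoke test checks that both the client and the daemon work:

# Check the client is on the PATH and the daemon answers
docker --version
docker run --rm hello-world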

For instance, we can start running a Jupyter Hub just by doing this:

docker run --rm opendatacube/jupyterhub  jupyterhub --ip="*"

There’s probably a GUI option to do more or less the same.

The first part, docker run, is what effectively tells the service we are going to run a container; this also checks whether the image is already on our machine, and downloads it if it is not. You will see a message about pulling, and several hexadecimal hashes that represent the different layers that compose the image. That does not matter a lot right now, but best practices advise minimizing the number of layers you use.

The rest of the command line is as follows: --rm deletes the container once it stops, so it does not keep taking up disk space; opendatacube/jupyterhub is the name of the image, which will be pulled from Docker Hub if it is not already present; and jupyterhub --ip="*" is the command that will be run inside the container.
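
If you are curious about the layers, you can list the ones that make up an image you have already pulled:

# Show the layers (and the steps that created them) of a local image
docker history opendatacube/jupyterhub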

Jupyter publishes a number of notebook images you can readily use, which already have a series of data science modules preinstalled. For instance, this one:

docker run --rm -p8888:8888 jupyter/scipy-notebook

will directly run a notebook with SciPy preinstalled; the lab will be accessible on port 8888 of the host, which has been mapped to the container’s internal port via the -p flag.
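
The host side of the mapping does not have to match the container side; for instance, this serves the same notebook on port 9999 of your machine:

# host port 9999 -> container port 8888, where the notebook server listens
docker run --rm -p 9999:8888 jupyter/scipy-notebook
# then browse to http://localhost:9999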

In most cases, containers are not created to be used directly; they will be the base for other applications, which add more layers to them, including the files and other dependencies your application needs. We will see how to build these images ourselves next.
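
As a sketch of what that looks like (requirements.txt and analysis.ipynb stand for files of your own project):

FROM jupyter/scipy-notebook

# Add our own dependencies and files on top of the base image
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY analysis.ipynb work/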

Creating our own containers

We can create our own containers for a multitude of purposes. One of the simplest ones we can create is a container that just holds data. There are three parts to designing and building a container: writing a Dockerfile that describes it, building an image from that Dockerfile, and running (or publishing) the resulting image.

For instance, this (relatively simple) container has been created to hold data from the Ukrainian Ministry of Defense on the war in Ukraine.

FROM denoland/deno:latest

LABEL version="1.0.0" maintainer="JJMerelo@GMail.com"

WORKDIR /app
COPY tools/serve-data.ts .
RUN mkdir resources
COPY resources/*.csv resources/

EXPOSE 31415
VOLUME resources

CMD ["run", "--allow-net", "--allow-read",  "serve-data.ts"]

Besides the data, it contains a Deno script that serves the data via a web server; this explains why we use a Deno container in the FROM statement.

LABEL is mainly for metadata, not really functional; the next four statements decide where the application is going to run (WORKDIR), copy the Deno script that serves the data, create a subdirectory for resources and copy them from the host (first argument) to the container (resources/, a directory that has already been created).

The next two statements are also metadata: they tell you the port the server is going to use (prize if you recognize the number) and the name of the shared directory you should use.

Finally, the last part is what is going to be run: using Deno, it starts the server that will serve the data. This image is published in the GitHub Container Registry, and you can use it directly like this:

docker run --rm -p31415:31415 ghcr.io/jj/ukr-mod-data:latest
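
That maps the exposed port to the same port on the host, so you can query the embedded web server; the exact routes depend on what serve-data.ts implements, so the path below is only an illustration:

# Run the data container in the background instead, then query it
docker run --rm -d --name ukr-data -p31415:31415 ghcr.io/jj/ukr-mod-data:latest
curl http://localhost:31415/
# Stop it when done; --rm removes it automatically
docker stop ukr-data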

In a Russian-doll manner, you can build your containers using other, existing containers as a base (that is, via the FROM statement), or, since they are in most cases open source, take inspiration from them to create your own. For instance, this one uses the code in the course itself to contain data and just print it when invoked:

FROM python:latest

LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"

COPY --chown=novamlops pyproject.toml poetry.lock .
RUN mkdir colares_project
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
    && poetry install \
    && poetry run testcsv \
    && rm -rf colares_project/ pyproject.toml poetry.lock

ENTRYPOINT cat Export_test.csv

It’s similar to the one above, except for a couple of details: here we create a non-privileged user (novamlops) and run everything as that user, and we use Poetry to install the dependencies and run the script that generates the CSV file, removing the sources afterwards so that only the generated data stays in the image.

The Dockerfile is just a plan for building an image, of course; you need to actually build the image to use it.

docker build -f first.Dockerfile -t jj/nova-mlops-first .

Then you can run it with

docker run -t jj/nova-mlops-first

Using -t tells Docker that it needs to allocate a console or terminal; basically, that there will be something to print and it should not keep it to itself. This will, effectively, print the content of the CSV file to the screen, and you can then redirect it to a file and use it however you want.
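
For instance, dropping -t (so that the terminal does not sneak carriage returns into the output), you could capture the data on the host like this:

# Print the CSV baked into the image and save it on the host
docker run --rm jj/nova-mlops-first > exported.csv
head exported.csv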

Using “active containers”

The containers we have seen so far are little more than data containers. You will generally want your containers to actually do something. Containers run services and applications, and they can also be very useful when running tests, encapsulating everything you need to effectively run them.

They can also be used for running periodic tasks such as downloading files. In this case we will need to set up a space that is shared between host and container, where the container will write whatever files are needed.
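
Scheduling the periodic run itself is a matter for the host; on Linux, for example, a crontab entry along these lines would do (the host path is a placeholder; the image is the one used further down):

# Run the container every day at 06:00, sharing /srv/mlops/data with it
0 6 * * * docker run --rm -v /srv/mlops/data:/home/novamlops/data jj/nova-mlops-second data/daily.csv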

FROM python:latest

LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"

COPY --chown=novamlops pyproject.toml poetry.lock .
RUN mkdir colares_project && mkdir data
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
    && poetry install

VOLUME /home/novamlops/data

ENTRYPOINT ["poetry", "run", "testcsv"]

This one is very similar to the previous one.

As a matter of fact, what we should have done is to create this one, and then the other one based on this one. We might still do it.

The main differences are that the project files are not deleted after installation, that a data directory is created and declared as a VOLUME, and that the ENTRYPOINT runs the script (poetry run testcsv) every time the container starts, instead of just printing a file that was generated at build time.

By declaring a directory as a VOLUME, we state our intention to interact with the container through it.

There are also subtle changes to the code, but that’s not really important.

We can run it like this:

docker run --rm -it -v `pwd`:/home/novamlops/data \
  jj/nova-mlops-second data/test.csv

The first difference is that we’re using -it. It’s not really needed, but it will help if we want to kill the container for some reason: -t prints to the terminal (as we have seen), -i makes it interactive. But the most important thing is -v, which mounts the current directory of the host (pwd: print working directory; the backticks run a command and substitute its output, a shell feature which probably has some equivalent in PowerShell) onto the VOLUME we declared. That directory needs to be empty in the container, since it will be “mounted” over by this external directory and will show its contents.

But what we want is for the container to produce a file with a specific name; this is added at the end of the command. Since the script runs from the container’s working directory, we need to precede the file name with data/ so that it ends up in the mounted volume. This will, effectively, create a test.csv file in data, which on the host is the directory we ran the command from.
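
You can check on the host that the file is indeed there once the container exits:

# The generated file appears in the mounted (current) directory
ls -l test.csv
head test.csv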

See also

The book Docker for Data Science is a good reference (Cook 2017).

References

Cook, Joshua. 2017. Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server. Apress.