In an open science context, containers are good not only for recording the infrastructure used in MLOps, but also because they can be easily published, enabling anyone to re-use them. We will see best practices for creating active and data containers that can be easily used in ML workflows.
Students will learn how to use containers for the deployment of models, as well as to store and possibly publish data.
Several data and active containers have been created, published, and integrated into different workflows.
Containers were one of the factors that brought the DevOps (quasi) revolution to the software engineering world. They made it possible to isolate the execution of programs or services, to package and store them, and, on top of that, they included a language that made it very easy to describe how to create them.
Essentially, those are the same three factors that make them useful for data science: you can isolate an application from the rest of the system, in such a way that it is very easy to deploy it anywhere. That isolation comes at the price of precisely describing the infrastructure that will be needed to run the application; the reward is that, once that is done, you can deploy your application, be it an MLOps stage or even data, wherever you want.
Let’s check out the concepts that go together with containers, or Docker containers as they are often called, after the company that created and released the framework for using them in the first place.
First and foremost, what containers do is isolate applications, not virtualize whole machines. The applications still use the services of the host operating system, mainly those related to that isolation. Which services exactly does not matter too much; what matters is that the operating system must support them, and so far only Linux, Windows and a few IBM operating systems do. Containers do not virtualize the processor instruction set either: they run executables directly on the host processor. That implies that you will need a specific combination of operating system and processor architecture in the host to run a container (and the other way round).
However, containers are a standard (from the OCI, the Open Container Initiative) that attempts to create a system any developer can use and then deploy anywhere. How does that work? The software you install on your operating system has two parts: a client, the docker command you type at the terminal, and a server or daemon, which is the part that actually builds, stores and runs the containers.
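You can actually see those two parts in any working installation: the docker version subcommand reports the client and the server (daemon) separately, along with the operating system and architecture each one runs on.

# Shows a Client section (the CLI you type commands into) and a Server section
# (the daemon that actually builds and runs containers), including their OS/Arch
docker version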
Let’s see different states a container can be in.
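As a small sketch of that lifecycle, the commands below take a container through the created, running and exited states, and then remove it; the alpine image and the state-demo name are just examples used for illustration.

# Create a container without starting it (state: created)
docker create --name state-demo alpine sleep 60

# Start it in the background (state: running)
docker start state-demo

# Stop it (state: exited)
docker stop state-demo

# List containers in every state, not only the running ones
docker ps -a

# Remove the stopped container once we are done with it
docker rm state-demo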
The general best practice is to use Dockerfiles in your project to create images, which are then stored in registries (Docker Hub, GitHub Container Registry, Quay.io).
A good thing about containers is that there is a wide array of them ready to be used directly. All containers are built from a base image, and these bases are published in registries, so you can just look for one and start executing it.
At this point, you should have an installation of Docker on your machine. Please follow the instructions for your specific platform. In general we will be working with Linux/Intel containers, so if your platform offers several options, use the one that is able to work with this kind of container.
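As a quick check that your installation can pull a base image from a registry and run it, you can try something like the following; the python:3.12-slim tag is just one example of a slim base image published on Docker Hub.

# Download a base image from Docker Hub, the default registry
docker pull python:3.12-slim

# Run it once, ask the interpreter inside for its version, and discard the container
docker run --rm python:3.12-slim python --version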
For instance, we can start running a Jupyter Hub just by doing this:
docker run --rm opendatacube/jupyterhub jupyterhub --ip="*"
There’s probably a GUI option to do more or less the same.
The first part, `docker run`, is what effectively tells the service we are going to run a container; it will also check whether the image is already on our machine, and download it if it is not. You will see a message about pulling, and several hexadecimal hashes that represent the different layers that compose the image. That does not matter a lot right now, but best practices advise minimizing the number of layers you use.

The rest of the command line is as follows:

- `--rm` will delete the container when it finishes running. This is a very convenient way of not accumulating lots of stopped containers on your local hard disk, and it is the usual way to do this. It will still keep the image, so you will not have to download it again (in fact, it stores the layers, which will speed up downloads of images that depend on this one).
- `opendatacube/jupyterhub` is the name of the image, usually composed of a namespace (or publisher) and a specific name for the image; it can be followed by a tag in the shape `:tag`.
- `jupyterhub --ip="*"` is a command you issue, or an argument you give to the “executable” contained there.

Jupyter releases a number of notebook images you can readily use, which already have a series of modules used in data science preinstalled. For instance, this one:

docker run --rm -p8888:8888 jupyter/scipy-notebook

will directly run a notebook with `scipy` preinstalled; the lab will be accessible on port 8888, which has been mapped to an internal port via the `-p` option.
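If you want to check what `--rm` actually did, you can list the images that stay cached on disk and the containers that are (or were) running; for instance:

# The image (and its layers) stays on disk after the container has been removed
docker image ls jupyter/scipy-notebook

# No stopped container is left behind, thanks to --rm
docker ps -a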
In most cases, containers are not created to be used directly; they will be the base for other applications that will add more layers to them, including the files and other dependencies your application will need. We will see how to build these images ourselves right next.
We can create our own containers for a multitude of purposes. One of the simplest ones is a container that just holds data. There are three parts to designing and building a container:

- Choosing a base image, for instance `alpine` or one of the `slim` variants. Deciding on this base container will imply choices about which other tools to install on it and how to do it. At any rate, the choice of base image has nothing to do with your preferences or what you use on your machine; the main criteria are functionality and weight.
- Deciding what goes into the image: the files, tools and other dependencies the application will need.
- Deciding what will run when you do `docker run`. You have a single chance (there is only one command that can be invoked), so choose wisely.

For instance, this (relatively simple) container has been created to hold data from the Ministry of Defense about the war in Ukraine.
FROM denoland/deno:latest
LABEL version="1.0.0" maintainer="JJMerelo@GMail.com"
WORKDIR /app
COPY tools/serve-data.ts .
RUN mkdir resources
COPY resources/*.csv resources/
EXPOSE 31415
VOLUME resources
CMD ["run", "--allow-net", "--allow-read", "serve-data.ts"]
Besides the data, it contains a script in Deno that serves the data via a web server; this explains why we use a Deno container in the `FROM` statement.

`LABEL` is mainly for metadata, not really functional. The next four instructions decide where the application is going to run (`WORKDIR`), copy the Deno script that serves the data, and then create a subdirectory for resources and copy them from the host (first argument) to the container (`resources/`, a directory that will already have been created).

The next two instructions are also metadata: they tell you the port the server is going to use (a prize if you recognize the number) and the name of the shared directory you should use.

Finally, the last part is what is actually going to run: using Deno, it launches the server that will serve the data. This image is published in the GitHub Container Registry, and you can use it directly like this:
docker run -p31415:31415 ghcr.io/jj/ukr-mod-data:latest
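Once the container is running with the port mapped, you can query the web server from the host. The exact routes depend on what serve-data.ts implements, so the path below is only a placeholder; adjust it to whatever the script actually serves.

# Ask the server for data on the mapped port; the path is hypothetical
curl http://localhost:31415/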
In a Russian-doll manner, you can build your containers using other, existing containers as a base (that is, via the `FROM` instruction) or, since they are in most cases open source, take inspiration from them to create your own. For instance, this one uses the code in the course itself to contain data and just print it when invoked:
FROM python:latest
LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"
COPY --chown=novamlops pyproject.toml poetry.lock ./
RUN mkdir colares_project
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
&& poetry install \
&& poetry run testcsv \
&& rm -rf colares_project/ pyproject.toml poetry.lock
ENTRYPOINT ["cat", "Export_test.csv"]
It’s similar to the one above, except for a couple of details:

- It creates a user, `novamlops`, and runs everything (including `COPY`) as that user.
- It uses `ENTRYPOINT` instead of `CMD`. With this, it is more similar to a normal executable, being able to receive arguments and so on.

This is just a plan for building a Dockerfile, of course; you need to actually create the image to use it.
docker build -f first.Dockerfile -t jj/nova-mlops-first .
Then you can run it with
docker run -t jj/nova-mlops-first
Using `-t` tells Docker that it needs to use the console or terminal, basically that there will be something to print and it should not keep it to itself. This will, effectively, print the content of the CSV file to the screen, and you can then redirect it to a file and use it however you want.
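For instance, something along these lines will keep a local copy of the data; `-t` is dropped here because a pseudo-terminal is not needed when the output goes to a file.

# Dump the CSV printed by the container into a local file
docker run --rm jj/nova-mlops-first > Export_test.csv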
These containers are little more than data containers. You will generally want your containers to actually do something: containers run services and applications, and they can also be very useful when running tests, encapsulating everything you need to effectively run them.

Containers can also be used for running periodic tasks, such as downloading files. In that case we will need to set up a space that is shared between host and container, where the container will write whatever files are needed.
FROM python:latest
LABEL version=0.0.1 maintainer="jjmerelo@gmail.com"
RUN useradd -ms /bin/bash novamlops
USER novamlops
WORKDIR /home/novamlops
ENV PATH="/home/novamlops/.poetry/bin:/home/novamlops/.local/bin:${PATH}"
COPY --chown=novamlops pyproject.toml poetry.lock ./
RUN mkdir colares_project && mkdir data
COPY --chown=novamlops colares_project/ colares_project/
RUN pip install poetry \
&& poetry install
VOLUME /home/novamlops/data
ENTRYPOINT ["poetry", "run", "testcsv"]
This one is very similar to the previous one. As a matter of fact, what we should have done is create this one first, and then the other one based on it. We might still do it.

The main differences are:

- Creating a `data` directory and declaring it as a `VOLUME`. This is just an announcement, and does not have any function coupled with it.
- Using the `poetry` target as the entry point.

By declaring a directory as a `VOLUME`, what we state is the intention to interact with the container through it.
There are also subtle changes to the code, but that’s not really important.
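To try it, the image has to be built first; the Dockerfile name below is just an assumption, following the naming used for the previous one. docker image inspect then lets us check that the volume declaration ended up in the image metadata.

# Build the image (assuming the Dockerfile above is saved as second.Dockerfile)
docker build -f second.Dockerfile -t jj/nova-mlops-second .

# The declared volume shows up in the image configuration
docker image inspect --format '{{json .Config.Volumes}}' jj/nova-mlops-second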
We can run it like this:
docker run --rm -it -v `pwd`:/home/novamlops/data \
jj/nova-mlops-second data/test.csv
The first difference is that we are using `-it`. It is not really needed, but it will help if we want to kill the container for some reason: `t` attaches a terminal and prints to it (as we have seen), while `i` makes the session interactive. But the most important thing is `-v`, which mounts the `VOLUME` we have declared onto the current directory (`pwd` stands for present working directory; the backticks run a program and substitute its output, a shell feature that probably has some equivalent in PowerShell). That directory needs to be empty in the container, since it will be “mounted” onto this external directory and will show the contents of that directory.
But what we want is for the container to produce a file with a specific name; this is added at the end of the command line. Since the script is going to be running in the repository root directory, we need to precede the file name with the directory name. This will, effectively, create a `test.csv` file in `data`.
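Since `data` inside the container is mounted onto the current host directory, the file ends up right where the container was launched from, and you can check it with a plain listing.

# The file written to data/ inside the container appears in the host directory
ls -l test.csv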
The book Docker for Data Science is a good reference (Cook 2017).
Cook, Joshua. 2017. Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server. Apress.