MLOps course for Nova-Lisbon. Index | Github Repository

Deploy-to-paper

TL;DR

Data science and MLOps is very amenable to the creation of workflows that end with a continuously updated paper, poster or presentation. We will see in this last chapter how to do it.

Learning outcomes of this unit

Integrate ML workflow results in a report or paper that will be continuously updated when data changes. At least an example of such type of languages and tools will be learned.

This chapter corresponds to the fifth learning outcomes for this course,

Applying fundamental machine learning and scientific principles to the above design and reports.

Acceptance criteria

The updateable paper needs to be created, with workflows that trigger their change.

“Active” publishing artifacts

The last stage of a MLOps workflow is continuous deployment; eventually, your workflow will need to work. We need then to tool our way into doing so. Continuous deployment tools are already known, although we will probably need to show examples; however, what to deploy is a different issue.

There are obviously many options to perform deployment. Programs and APIs can be created around the trained data, and we can deploy to an endpoint.

One of the possible ways of deploying a workflow is creating a continuously-updated paper that is uploaded to GitHub or somewhere else. There are several technologies that can be employed for that, but one of the simplest is R Markdown, which is the one that is being used for this material.

In general, weaving text with code is called literary programming; in general also, how it works is by using tools called weave or unweave that pre-process the source, run the code and capture output to integrate it in the paper, and the final compilation of all source, original and generated, into a single artifact that can be a paper or can include active elements like interactive charts or even running programs.

The use case for this kind of artifact is, within the realm of open science, create non-static papers that present, visualize or even run whole workflows under request.

There are many different ways of doing this, as many as possible scripting languages and document description languages. As the latter, you have LaTeX and Markdown, basically; as the latter, R and Python are the two most popular ones (but we can include Julia recently too).

In our research group we use Knitr (similar to RMarkdown, for LaTeX) routinely. For instance, in (Juan Julián Merelo, Valdez, and Galeano 2020) we used it to include all experimental data and create charts on the fly. “Deployment to papers” is excellent to obtain immediate reports when data in a ML workflow is updated, but even if it a small pipeline that goes from experimental data to paper, it saves a lot of work and ties data to the paper that uses it to create charts, allowing re-generation very easily.

You are probably already familiar with Markdown, and we will have used examples with R along this course. At any rate, we will work with R since we can run the whole way from simple scripts to whole ML workflows, through simple data manipulation and visualization scripts which will be all we will be doing here.

We will be using RStudio for this part, at least the initial development; eventually it will be generated from a workflow, but the development part is quite conveniently done here.

Anatomy of RMarkdown files

These files have two different parts: the YAML front matter, and the weaved text and R code. YAML is a data serialization format that is used extensively throughout DevOps workflows. By now we will have already seen examples in the GitHub Actions, for instance. This is the metadata included in this very file:

---
title: "Agile development"
author: "JJ Merelo"
date: "27/5/2022"
output:
  pdf_document: default
  html_document: default
bibliography: ../mlops.bib
---

Besides including metadata like title or author, as well as instructions for RStudio, which is the IDE used for this task, it includes also a pointer to a bibliography file, placed in the upper directory. This is placed among the three dashed lines that mark the beginning and end of a YAML “document”. Right below, we will start the text+code itself.

You do not need to make any kind of change to Markdown in order to insert code. What will be executed will also follow the regular fenced syntax you use for code in Markdown; the only difference is that it will have some arguments that will help interpret it. For instance, this code is at the top of this file, invisible:

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(arrow)
df <- read_parquet("../data/ukr-mod-data.parquet")
tanks.data <-data.frame( value =as.integer(df[ df$` Item` == " tanks",]$` Total` )

These will be surrounded by code fencing marks, the three backticks. Not included here to avoid linting errors. Check the original at the repository.

The braced statement says that it is written in R, gives it a name, setup, and also says that the result is not going to be included in the document (include=FALSE); this is why it is silent.

The rest is simply the first part of the script. It loads a library, arrow, that interprets the Parquet format, and creates a new data structure we will use later on in the text. For instance, right below:

tanks.data$Total <- as.integer(tanks.data$` Total`)
tanks.data$` Total` <- NULL
tanks.data$Delta <- as.integer(tanks.data$` Delta`)
tanks.data$` Delta` <- NULL
tanks.data$` Item` <- NULL
summary(tanks.data$Delta)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.000   8.000   9.556  13.250  34.000

In this case, we tell the chunk to simply be visible. Besides doing a few changes of name and elimination of no-longer-needed columns, it is a summary of the changes in the number of invading tanks destroyed every day, showing how it changes, average and extremes.

We can create charts using ggplot2, for instance, and they will be automatically saved in place, like we will do below.

RMarkdown files also include other goodies; for instance, they can include bibliography that is extracted from .bib files, such as the one indicated in the front matter above and actually used in this paper.

It can also include charts, that are generated on the fly

library(ggplot2)
tanks.data$Date <- as.Date( paste0("2022.",tanks.data$Date), format="%Y.%d.%m")
ggplot(tanks.data, aes(x=Date, y=Total))+geom_point()

This chart uses ggplot2, which is the best known (and possibly best) package for graphical representation. Please bear in mind that this file will be generated automatically, so every time there is any change in this markdown file it will be generated anew.

However, there is going to be no change if data does change; we would have to change the text itself to trigger re-generation. This can change, however.

Processing these files automatically

This would not be MLOps if these files were not processed automatically. They can be triggered, for instance, every time data file changes.

As we have indicated several times, development hosting sites such as GitHub do have the complete set of tools that allow you to create MLOps workflows. You can opt for a monolithic solution, but looking at every step piecewise and looking for good solutions is all good, as long as they allow to automatically work from end to end.

We have already learned how to include Github workflows for processing our stuff. This is more of the same, except that in this case we will have to work with R and the rest of the tools (including pandoc) that are needed to generate these files.

Incidentally, workflows that are very similar to these ones can be used to automatically test and generate a LaTeX file from source, as well as upload it to the repository or somewhere else. Basically everything that can, should be automated.

In general, generating any kind of paper artifact will need a workflow that

… is triggered by changes in data, and
includes all libraries, in this case R libraries, needed to process and generate the file.

It’s convenient to have a script that does the generation. A R script will be the most useful in this case; generation is done in two phases: one renders the scripted part to the intermediate format (Markdown, LaTeX, whatever), the other does the final rendering to the published format (PDF or HTML). Interesting thing is that the report, itself, is more similar to a program (something that is run and produces a result) than to a static document. Anyway, this short R script does the rendering:

#!/usr/bin/env Rscript

library(rmarkdown)
library(knitr)
args = commandArgs(trailingOnly=TRUE)

print(args)

render(args[1],
       output_format = "md_document",
       output_dir = "text",
       output_file=paste0(args[1],".out"))

The render function applies needed logic to generate the file in the required output format: md_document indicates we will use Markdown as this intermediate format. render smartly will apply different procedures depending on the extension of the file: Markdown files will integrate the bibliography only, RMarkdown files will do that and render active chunks.

But this deals with a single file; we need to apply it to every text file in the repository. The bin/md2html script does that task:

We leave out some parts to highlight the gist of it

#!/bin/bash

for fichero in $( ls text/*.*md)
do
    bin/rmd-render.R ${fichero}
    pagetitle=$(cat $fichero | egrep '^#\s+(.*)' | tr '#' ' ')
    pandoc ${fichero}.out -f markdown -s -t html \
       --self-contained \
       -B docs/header.html \
       --metadata pagetitle="MlOps for Nova: $pagetitle" \
       --metadata lang=English  -c docs/mlops.css  -o  "${fichero%.Rmd}.html"
    mv  "${fichero%.Rmd}.html" docs
done

As you can see, the first part of the loop over all files is the one that actually renders to the intermediate Markdown representation; pandoc, a universal text format converter, will take that and render it to HTML including generated images (that will be integrated in the HTML file through the --self-contained flag, together with the CSS files.

You can run that by hand, but we were talking about automation. This will be done in a GitHub action, in the .github/workflows directory, that will need to look something like this:

name: "Check README, text and R code"

on:
  push:
    paths:
      - "**/*.*md"
      - "*.md"
      - "**/words.dic"
      - "words.dic"
      - "**/*.R"
      - "**/*.parquet"

jobs:
  text_actions:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-pandoc@v2
      - name: Generate pages
        run: make html
      - name: Check in result
        run: |
          if [ -n "$(git status -s)" ]; then
              bin/commit-push.sh "Re-generated HTML" && git push
          else
              echo "ℹ No new files"
          fi

The on → push → paths part established what files need to change for this action to be triggered; all Markdown, RMarkdown as well as scripts and Parquet files are included. A bunch of r-lib actions will take care of the rendering, which simply calls make html; the script for generating that will be included in the Makefile, that will have all needed tasks and will be the single source of truth for tasks needed for re-generation. If there’s some actual change in the output, the HTML files are committed through the indicated script (omitted for brevity), and finally integrated in the repository.

With this, you will have achieved a fully automated workflow, that will need no intervention other than editing the reports if you need to.

Publication

“Publish or perish” is the academic motto, and it is at the same time sad and true. From the point of view of open science, publishing a paper has the objective of fulfilling the desires of a customer, which in this case might be interested in the predictions or decisions that the machine learning model can help them make.

In order to achieve that, using Jupyter notebooks that people can copy in Google Colab is as good as any open/free initiative can be. It can be accessed by anyone. However, it is not going to be easy to reference in other works, and specifying a license is not a natural thing for these formats.

Publishing them in GitHub, as you have already done, is also a good idea; people can fork it, and it is easily reproductible.

However, having a published, rendered paper that is easy to reference is still a great bridge to the world of static, rendered-once papers. There are a number of places you can use to publish.

RPubs is associated with RMarkdown, and you can publish directly from RStudio or your R program. For instance, I published (J. J. Merelo 2020) and other reports during the pandemic, with a very straightforward path from creation to publication. It can be updated as many times as you want, but it does not generate a DOI or a BibTeX entry easily. Figshare can be used to upload any kind of artifacts, from papers through simple figures to datasets. Papers can be updated, bundles with papers + data + anything else (posters, presentations) can also be presented, and it generates nice BibTeX files; this is what I used in (Juan J. Merelo and Santos 2020). It’s got an additional nice feature: AltMetrics, which measure the impact of a paper in social networks. Next best to h-index. Papers also get a DOI. * ArXiV is also an option, at least if you have been vetted for publication in a certain section; this comes automatically if you have been coauthor of an already published paper. It is as close as “full service” as you will get. Papers are examined for relevance to the section they are published, they have a host of bibliographic tools, and they can be updated frequently. Lately they include demos, and links to code and source in GitHub. In this case, your paper can be outright rejected if it is an early report, so probably not suitable for this course; it is going to be published 3 days later anyway.

Other sites like Academia.edu or ResearchGate also offer the possibility of uploading papers; some of them include tools similar to these ones. For the purposes of this course, however, these three above are the ones that I consider the most interesting.

Activity

As indicated in the acceptance criteria, the team will have to elaborate a short report, poster or presentation that will be continuously updated when data changes. It can include (or not) training, as well as model registry. The report will have to be published in an open access repository by the end of the evaluation period (can be a few days later, but preferably during the course; most sites allow updating).

As usual, the URL of the report will be submitted as a PR to this site, in the page that will be indicated.

References

Merelo, J. J. 2020. “Finding the best predictor for the number of deceases in the COVID-19 pandemic in Spain.” https://rpubs.com/jjmerelo/covid-19-es-predictors.

Merelo, Juan J., and Victor Manuel Rivas Santos. 2020. “Estimating the infection fatality rate of COVID 19 in South Korea by using time series correlations,” April. https://doi.org/10.6084/m9.figshare.12083322.v3.

Merelo, Juan Julián, Mario Garcı́a Valdez, and Sergio Rojas Galeano. 2020. “Testing the Intermediate Disturbance Hypothesis in Concurrent Evolutionary Algorithms.” In Applied Computer Sciences in Engineering - 7th Workshop on Engineering Applications, WEA 2020, Bogota, Colombia, October 7-9, 2020, Proceedings, edited by Juan Carlos Figueroa-Garcı́a, Fabian Steven Garay Rairan, Germán Jairo Hernández-Pérez, and Yesid Dı́az Gutiérrez, 1274:3–15. Communications in Computer and Information Science. Springer. https://doi.org/10.1007/978-3-030-61834-6\_1.