MLOps course for Nova-Lisbon. Index | Github Repository

Data engineering through test driven development

TL;DR

Test-driven development, or simply development that uses extensive testing, is essential to ensure quality in all phases of a data science pipeline, from data extraction and munging to results. This chapter is an introduction to testing and how to profitably use it to save time and enhance adaptability. Before testing, we’ll need to interiorize best practices in project and model design.

At the same time, we will examine the first phases of machine learning workflows: data extraction, filtering, testing, versioning and storage.

Learning outcomes

The student will have learned the importance of quality control in all stages of machine learning, and learned to set up tests (whose running will then be automated).

Acceptance criteria

The student’s team will have identified criteria for testing downloaded and extracted data, and implemented ETL code that fulfills those tests.

Testing in ETL workflows

Machine learning starts with the ETL phase: Extract, Transform and Load where you need to grab the data that’s going to be used to train and test your model.

If your pipeline does not include training, you will still need to extract data that will be visualized.

In this phase, we are going to need already a test-driven approach to ascertain that all phases (extraction, transformation) occur according to our preconceptions.

In general, this is a part of the process that benchmark-based ML tutorials do not pay much attention to. ETL is reduced to loading your data in a Python data frame. However, in real-world ML ETL is the most important part, and also a part of the process that needs to be heavily automated so that it works as intended and as long as possible.

What we will do is to explain test-driven development along with ETL techniques, so that you start to familiarize yourself with them and put them to work later in the rest of the phases of ML. Testing is essential in agile development; since it puts client first, you need, at every level, to check that their requirements and desires are satisfied; user histories will be distilled to issues, and these grouped in milestones; issues will create features that you will need to test one by one (in unit tests), and these will be grouped in minimally viable products that will be released when a milestone is finished; the viability of these products will need to be checked too. In the case of data science, unit test will check for data issues or the correctness of transformations; viability tests might include a certain quality of the solutions. This will be examined later on in the course.

There are four phases in this part of the pipeline: extracting, updating, data munging and cleaning, and storage and versioning.

Extracting information for machine learning

Data is Out There. Generally, in a way that’s not too easy to obtain. It might be in one of these forms.

Non-processable formats, like PDF. Yes, it’s non-processable. Don’t even try. You might get lucky and get some numbers, but that’s that. As soon as the PDF contains tabular data is going to be hard, tabular data in fancy tables even harder. Not to mention some PDFs are just scanned from printed pages. Just forget it.
Websites in semi-structured form. This is going to be found generally in press releases, or even mandatory information. This can be obtained via scraping, that is, extracting information according to the web structure or the kind of words or symbols that surround it. Scraping, however, is an arcane art and you’d better allot it quite an amount of time, because it involves many tricks, even more so if the site does not like to be scraped (happens a lot in administration/government sites). We’ll devote a bit of time to this one, later on.
Processable formats: CSV and others. Sometimes you get lucky and find the precise data in open data portals. You will still need to transform it, but if you’re lucky it will always have the same URL and it will be easy to extract by simply downloading.
APIs: application interfaces are the best; in some cases open data portals will offer them, in others, like social networks or GitHub, they will offer you loads of data in easily accessible way. You will need to obtain a token, and there will be some limitations when downloading, but other than that, offers a semantic interface. The main issue here will be to respect terms of service regarding storing/republishing, as well as controlling the request or other (like download size) limits. You will need some caching/storage strategy, so that you don’t keep downloading the same stuff over and over.

In all cases, you will need testing; in the case of scraping, you need testing. Let’s devote some time to that one.

Scraping for fun and data science

Webs do not want to be scraped. If they had wanted to be scraped, they would have published data in a processable format to start with. And they do so for entirely legitimate reasons, like avoiding denial of service attacks, or simply making data a bit more obscure by having people work manually to extract it. Of course, avoiding copyright issues is also another reason.

In some cases, they do not want to be scraped, and do not want you to process that information, and even more so to publish it. So you might want to take a look at the terms of service (TOS) for the web you want to process before giving it a try. At any rate, scraping does not do anything that cannot be done manually, so personal or research use is OK in most cases. And you always mention source. If not for any other reason, to actually maintain the source, as it keep it flowing. It is not going to be useful if you can scrape it just once, so be polite about it by storing whatever you have already processed and don’t keep downloading all the stuff over and over again.

Scraping is defined as extracting structured information from web sources. That implies a series of steps, every one of which need some specific techniques, so we have to take a look at all the mechanisms that are implicated in the actual web.

It all starts with an HTTP request, generally a GET request, with a specific URL; this request goes through and is received by a server. In most cases, the page will be hosted by a CDN, a content delivery network; in some cases, this CDN will perform some checks on the requesting entity to rule out automatic bots. Eventually, the request is processed but it might include a redirection (HTTP status 3xx, generally 301), so the URL returned will be a different one (and which one will depend on many factors including the source of the request).

Eventually, you obtain a HTML page that’s part of your request. However, the modern web is essentially a platform for running programs, so the actual information might or might not be there. Pages will be rendered in the browser as a DOM, document object model, and this DOM will visually show the information we’re requesting, with a certain structure. But this information may, or may not, be contained in the same document (because, via so-called iframes, it might be actually in some other documents) or (hopefully) just contained in the DOM. This DOM might be the same every time a page is rendered, it might not.

Every one of these steps contains usually standard procedures to defeat scrapers, which is why scraping is an arcane art needing specific tools to take care of. We’ll devote a bit of time (and examples) to every one of them.

Obtaining an URL to download data

In some very lucky cases URLs follow some specific pattern, so you can just generate them serially and you’re done; in some cases, you need to find out what the URL is, but once you do it, you are good to go.

From now on, we are going to use different examples, either from the Portuguese INE or from Ukraine war data as published by the ministry of Defense of that country.

This fragment of a program in Deno, a framework for interpreting Javascript and typescript, will download a specific page from the INE in Portugal:

import puppeteer from "https://deno.land/x/puppeteer@9.0.0/mod.ts";
import { cheerio } from "https://deno.land/x/cheerio@1.0.4/mod.ts";

const url = 'https://www.ine.pt/xportal/xmain?xpid=INE&xpgid=ine_indicadores&userLoadSave=Load&userTableOrder=11804&tipoSeleccao=1&contexto=pq&selTab=tab1&submitLoad=true';

try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

The URL is contained in line 4, and contains a main section, and then a series of queries that redirect to the specific one page we are looking for.

Those queries might contain random variables, or, even worse, hashes or session tokens that have been generated by some other client or server program. You’ll have to use a different strategy in that case.

In the case of Ukraine MoD data, the URL can be generated this way:

def download(driver, day, month):
    url = f'https://www.mil.gov.ua/en/news/2022/{month}/{day}/the-total-combat-losses-of-the-enemy-from-24-02-to-{day:02}-{month}/'

It follows some specific pattern that includes the month and the day of the month, in one of the instances padded with zeros (day:02). This may be part of the URL, or part of the queries. In any case, if there is a systematic way of obtaining the URL, that is what should be tried first.

However, that is the case only some times. That data above, for instance, does not contain combat losses for every day. If you want that kind of data, you need to go to the original in Ukrainian. And in this case it is impossible to know, from the shape of the URLs, which ones contain the data we want.

But scraping is 90% technology, and the remaining 90%, ingenuity. It so happens that all URLs with Russian losses contains the word vtrati, which means losses. This is something that can be left for later, however. If some URL is easily available, and contains all information, go for it.

Installing and running tools

DevOps, and thus MLOps, is about including the description of the infrastructure in the code repository itself, developing it at the same time. This implies that whatever is needed to run the application needs to be described in a meaningful, but also usable way.

There are two kind of dependencies for any application: external dependencies, any prerequisite that the program will need, such as the compiler or any external library that is needed, and then the internal dependencies, written in the same language as the application, or at any rate one of them.

Modern languages all include one, or several, dependency managers. These have two functions

Read the file that declares those dependencies, and install them along with any second or any degree dependencies these need.
Create what is usually called a lockfile, which essentially speeds up deployment by recording all versions and modules that have been resolved by the previous installation.

In many cases, these dependence managers are integrated in the language, or even in the compiler; Deno, for instance, just downloads whatever is listed as an import in the source files; in some other cases, there’s a single dependency manager that is installed with the language itself, or is a commonly accepted standard.

Yet in other cases, like Javascript, there are several you can choose from: npm and yarn being the most popular.

Python, as usual, is an outlier in this area. Despite having an accepted way of declaring dependencies through pyproject.toml, pip is still installed by default, and most people do not use, or even know, dependency managers such as poetry that are the single, accepted standard.

This is such a file for examples here:

[tool.poetry]
name = "nova_mlops"
version = "0.1.0"
description = "Example for NOVA MLOps course"
authors = ["JJ Merelo <jjmerelo@gmail.com>"]
license = "GPL 3.0"

[tool.poetry.dependencies]
python = "^3.9"
neptune-client = "^0.16.2"
python-dotenv = "^0.20.0"
mlflow = "^1.27.0"
numpy = "^1.23.0"
pycaret = "^2.3.10"
pandas = "^1.4.3"
pyarrow = "^8.0.0"
fastparquet = "^0.8.1"
jupyter = "^1.0.0"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
train = 'nova_mlops.__init__:go'

As you can see, it includes a number of dependencies that will need to be installed later on in the course in a section that is specifically called that way. Besides, this section pins the Python version so that we use the one that makes numpy work (at the time of writing this, it is not compatible with Python 3.10 and newer). It adds at the top some metadata.

Boils down to: if you use Python and want to work with MLOps, please download and install poetry.

This tool avoids the need to set up virtual environments by taking care of it itself; it will set up one for you, including all the dependencies and Python versions declared in the project file.

Besides, at the bottom of the file, there is a section called scripts. This is also a common feature of dependency managers, which are able to perform some additional tasks in a limited way. poetry is quite limited in that sense: it can invoke only Python functions (although, of course, that function might call shell scripts if needed).

This does not turn it into a task manager: task management needs additional features, like for instance being able to express dependencies between tasks or when to trigger one task. make is the quintessential task manager, but some other languages have a wide variety of them; once again, certain languages like Julia or Go include some limited task running within their base toolchain.

For instance, we can use this tool called Ake to run different Raku tasks

use lib "lib";
use Data::UkraineWar::MoD::Scrape;

my $data = Data::UkraineWar::MoD::Scrape.new( "raw-pages" );
$data.expand;

task 'download', {
    shell "tools/scrape.py";
}

task 'prescrape', {
    if $data.invalid-files() {
        die " ❌ → Invalid files ⇒\n", $data.invalid-files();
    };
}

task 'CSV' => <prescrape>, {
    "resources/ukr-mod-data.csv".IO.spurt($data.CSV);
} ;

Besides being able to express dependencies (the CSV task depends on prescrape), it is written in Raku itself, so that we can actually do some processing in the file (called Akefile) itself. We can achieve similar capabilities using Python tools such as invoke.

Agile development demands that all these tasks should be in a single place; using a task manager is essential to avoid having to read instructions, and also, through dependencies, to establish a clear pipeline for data processing and machine learning.

These task managers, later, simplify the task of testing and deploying the application. By including tasks such as test, start or deploy, it is quite easy, to a certain point standard, and intuitive, to understand what different parts of the pipeline are about.

Initial tests

This chapter is about testing, so we will need to start doing so. Since in most cases the URL will be a computed value, and there is no guarantee that what you download will always show the same information, tests will need to be created already.

Choosing testing tools and strategies will never be straightforward. First, you need to understand, from the get go of your project, that what you are going to create is effectively a project, not a script that will be edited for different runs.

This implies that you will need to use a directory for the module, following the language best practices (for instance, src or lib in ones, the name of the module in others like Python) and a separate directory for tests (t in Raku and Perl, tests in others, module_name/tests in Python).

Then, you need to choose two different tools: a testing library and a test runner. A testing library, or assertion library, will include a series of functions that compare desired and obtained output in different ways (according to type, output, error produced). A test runner will find and run the different test files, collect the output, and report on successes and failures.

Many languages include a default assertion library; Python is one of those, with unittest. Many others include a test runner, some of them as part of the basic tooling: Deno, Go and Julia are one of these. Just write deno test will find and run the existing tests. Other languages, like Python or Raku, will need a separate test runner; Python uses pytest, which includes, besides a command-line tool, façades for creating fixtures (untested variables created at setup) and some assertions (basically, those that are not included in UnitTest).

On the other hand, some languages have test runners and assertions together in what is usually called a testing framework: jest for Javascript is one of these; it includes a test runner CLI, as well as an assertion library (which can be used separately if you want).

Boils down to: there are two separate tools, in many cases the choice will be obvious, in others, like Javascript, you will need to hunt the market for the correct tools.

This example, for instance, from UKR-Mod-Data, will test downloads for sanity:

import pytest
from .. import download, driver

@pytest.fixture
def page():
    return download(14,4)

def test_driver():
    assert driver

def test_download(page):
    assert page

We are checking here two things; first that the driver works; it could or could not, since it is a complicated one; second, we check via assert that we have been able to download something. Remember, we are not checking that the information that is in there is correct; just that we have been able to download something, we have not been banned or the setup continues working. These kind of checks will be left for later.

Check the code and test in the UKR-MoD-Data repository.

Processing data

Let’s assume you’ve managed to download the content that contains the data, and have it available somewhere, as a single one or different pages. First thing you have to take into account is that metadata for downloaded pages will probably be important. At least two different things

File paths contain information such as dates; some times additional information like tags or other. You will need to include it in your scraping strategy.
Headers such as last-modified need to be used to avoid downloading the same thing several times; also respected to avoid slamming a web site with many requests.

That means that you need to use some downloading strategy that gives that information, which might be important for understanding the data in the page; for instance, data might be included in an element that uses date in its ID.

Please do always download and store the raw web pages. First, to avoid downloading them twice; second, the single responsibility principle says that functions and classes should do only one thing: you will need one function for downloading (and storing), and a different one that uses stored raw information to extract that information.

Scraping will then retrieve a file from the hard disk (or wherever), and start to work on it. If the data is inside a structure, let’s say a HTML file or XML file and it’s nicely labeled, just go for it. Modules such as beautifulsoup look at the structure and can extract the information, as long as you have the path that leads to it or can identify it by ID or class, or any other thing really (for instance, being the first sibling after a certain element like h1), tools like XPath are able to perform queries on the structure of a XML document (such as KMZ or SVG) and extract any information from there.

In most cases, however, you will have to resort to using regular expressions, which are also useful even if the structure of the document is involved, since they are able to process any kind of text that has some regularities.

For instance, this will be the Raku code that extracts information on losses from the MoD web pages:

        my $match = $l ~~ /'p>'
            $<concept> = [ .+ ] \s+ ['‒'|'–'|'-'] \s* "about"? \s*
            $<total> = [\d+]
            \s* [\( "+"]?
            $<delta> = [[\d+]]? /;

Looks a bit complicated, and it’s because it’s got to take into account all possible variations, possibly done by hand, to the web page itself. It shows the line will start with a paragraph mark <p> (we are only interested in the last part, that anchors. Then it group of characters, separated by a dash (different types), the word “about” which might appear or not (only in personnel actually), and then the total. The rest is optional: if there’s a delta, an increment from the last day, it will show always in parenthesis.

The $ preceded variables are actually the three quantities we’re extracting here: concept, total, and increment from the last day. Eventually, we obtain a data structure that we can further process to obtain whatever feature vector we will feed our system.

Most programming languages have simple “find string” or “split string” commands. Except if you want to do something extremely simple, like, let’s say, a comma-separated text, I strongly discourage using them. It’s worth the while to spend some time learning regular expressions; it will save you a lot of time later.

This step needs to be tested too, ad different levels; you need to check for internal coherence in the data. For instance, the number of tanks lost in the illegal invasion of Ukraine will always increase, so after scraping and extracting it, we need to do this test (in Raku):

subtest "Data is OK", {
    my @dates = $war-data.dates();
    my %prev = $war-data.data{@dates.shift};
    for @dates -> $date {
        for <tanks personnel> -> $k {
            cmp-ok $war-data.data{$date}{$k}<total>, ">=", %prev{$k}<total>,
                    "$k for $date: {$war-data.data{$date}{$k}<total>} ≥ {%prev{$k}<total>}";
        }
        %prev = $war-data.data{$date};
    }

In this case, what we do is to examine every row in turn and check on the absolute number (total) for two of the features that we will work on later, the personnel and tank losses, is effectively increasing.

A different issue altogether is what to do when you effectively find an error. You can communicate it to the data source, of course; however, some errors cannot be obviously fixed and it can easily invalidate the source of data, which is something you may always find: fake or made up data. In other cases, they are simply corrections, but they will probably imply that the whole series needs to be thrown away and started from the very beginning. At any rate, an error in this case is not something you can fix at the project level.

Storing data

CSV (comma separated values) is the go-to format for storing data; it is easily editable, modifiable and readable from a number of tools, including desktop tools. It is a format you can use to produce reports, or to hand it to your PhD advisor or marketing department. Any language contains tools to work with it so it is the one you will find most often.

JSON is also a popular format; JavaScript Object notation obviously comes from Javascript, where it’s simply the syntax used for objects; unlike CSV, you need to read the whole file before understanding it; updating it is not so easy as in CSV, where you can just add a line; it is also not usable by desktop programs, but since it has got a rich internal structure, being in fact a serialization format, it is a good choice for static data that need not change too often.

These two have the advantage of simply being text, which means they can be easily storable in a repository, which is good for data versioning. However, they’re not efficient either in storage or when reading it, which is why there are alternatives for them: parquetand feather are formats that are also readable from a multitude of libraries, but are binary in nature and thus more compact than the text based JSON and CSV.

We can use any kind of program to do so, as long as it is able to read CSV and write any data structure to the parquet format, , but there is a very efficient converter from CSV written in Go that we can use for this, this way:

go install github.com/fraugster/parquet-go/cmd/csv2parquet@latest
csv2parquet -input ukr-mod-data.csv -output ukr-mod-data.parquet

Data versioning

In a data science pipeline, it is essential to have a traceability that takes from results to the training and testing data that have produced it. This is not only for reproducibility reasons, it might also be essential to compare results with datasets in different dates to find possible bias, or even systematic errors.

This is why all data downloaded must have a version; it does not do to simply say “I used whatever data was in my disk drive when I started training”.

Of course, if your data is open (and if you are doing science it should be), or if you work in a private repository for data (and you should always work with a repository), you have data versioning for free. Data can be versioned by the hash of the last commit that modified it.

however, that’s a longish hexadecimal string that does not convey any kind of information on what’s in there. You can then use tags, which are aliases for specific commits that can be used to quickly retrieve the state of the repository at a certain point in time (not only data, anything that’s in the repository). By checking out a tag, you can be totally sure that the data will be in the state it was when you slapped that tag on it.

In principle, tags can be arbitrary strings too, so it will help to follow a strategy for tagging.

Semantic versioning is very common: this uses the format major.minor.patchlevel, with major indicating changes in functionality or API, minor simple small additions that are still compatible, and patchlevel bugfixes or simple refactorings. Translated this to data, format will change major, anything else minor, if there’s some fixing you can change the patchlevel.
Date versioning: using date of creation, for instance 2022.07.01.

You can add data as a prefix, to differentiate it from code-only releases, which would use non-prefixed strings (or prefixed with v).

git has a plugin called lfs, as in large file system, that allows you to store efficiently large files (as is sometimes the case with training and testing files). It can be used more or less seamlessly from the command line, although if you use GitHub there will be a limit on the size of the files that you are able to store.

Beyond that, there are many data lake solutions that you can use. Talking about specific examples, however, goes beyond the scope of this course.

Activity

The main objective of this activity is to understand how tests are essential to ensure the quality of the resulting workflow. Modules to download and update whatever data is the objective of our project will have been created and tested at all levels:

Availability (Is the URL still there?)
Downloading (Can you download it correctly, and contains the information we’re looking for?)
Extracting (Is the information structured in the same way?)
Cleaning: Is there some error in data? Does it behave in a correct way?

References

Kirk, Matthew. 2014. Thoughtful Machine Learning: A Test-Driven Approach. O’Reilly Media, Inc.