Putting the science back in data science

Advanced computation for Data Science, Southern Connecticut University

`github.com/JJ/science-data-science`

The scientific method meets modern industrial development methodologies: we need to use agile methods for science, especially data science. It started with a (seemingly) simple problem: I wanted to fix a LaTeX error in an arXiv paper (arXiv allows you to submit new versions). Well, just edit it away and you're done. Or maybe not. The paper was written using Pweave, which allows you to mix Python code with LaTeX, including the data processing workflow in the paper itself. Great, right? Well, yes, but you need to install the Python toolchain to generate it. And this was not specified: `requirements.txt` was missing. After some trial and error, I finally obtained all the tools (and I needed to put them into `requirements.txt`). But then a data file was missing, and I needed to download it. And I didn't know where to find it... I needed to look it up... But then, I didn't have our own data file, and when I found one, I wasn't sure it was the latest version... Long story short: I managed to do it. But if this had happened in a company, and this was a production application, I'd be fired. Yet this happens in science, all the time. In this talk I'd like to propose some ways to make it a bit better.

Professor at the U. Granada

Programming since 1983 `github.com/JJ`

García-Sánchez P, Velez-Estevez A, Julián Merelo J, Cobo MJ (2021)

The Simpsons did it: Exploring the film trope space and its large scale structure.

PLoS ONE 16(3): e0248881.

In the beginning, we had tropes

From `tvtropes.org`

A trope is a repeated motif or pattern in fiction

To the left, the moon silhouette trope, popularized by ET. Tropes are instantly recognizable, and carry meaning and evocations way beyond their presence or dialogue. For instance, the wizard beard trope is also present in this mural: you instantly recognize someone with a big white beard as a wizard, and a good one. Tropes save a lot of exposition, and are the building blocks of fiction.

From Tropes

... through average reviews ...

... to predicted cohesiveness and quality

Train a neural net to predict rating from tropes

Use it to optimize "trope bag"

García‐Ortega, RH, García‐Sánchez, P, Merelo‐Guervós, JJ. StarTroper, a film trope rating optimizer using machine learning and evolutionary algorithms. Expert Systems. 2020; 37:e12525. https://doi.org/10.1111/exsy.12525

Houston, we have a problem, taken from
https://www.freejpg.com.ar/imagenes/premium/1192700007/senal-de-fallo-del-sitio-web-descoldos-houston-tenemos-un-problema-ilustracion-vectorial-obra-de-arte-de-ficcion

Science has a

data

problem

Science, nowadays, has many problems. Funding, of course, is one. Attracting talent in computer science is another: with all computer science jobs in high demand, it's almost impossible to attract anyone to low wages and an uncertain career. That uncertainty also brings mental health problems, which are aggravated by the managerial style of many people. But, above all, we have a data problem. I have been reviewing half a million projects lately, and they have a section devoted to IP and data management. The most imaginative thing I've found is "We'll buy a server with a big disk drive". C'mon, we're in 2021!

Of course, there's also COVID-19

Data science has solved that problem

as done in the industry

Why?

Because software development had a problem

And it was solved using the agile mindset

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

Data science in industry works because it's agile

Let's make science Agile

I never remember where the paper accepted at a conference is.

I use GitHub search now

First hypothesis

Open over closed

Interaction with the work via pull requests

Also issues, forks...

Free vulnerability scanning

A call from Hollywood

It should be easy to respond to evolving requirements

Second hypothesis

Stakeholder collaboration over vertical chains-of-command

This is a massive tool for data management. All stakeholders involved, up to and including the public at large, are aware of what's going on and can communicate with each other through widely accepted, industry-standard tools. The key word here is collaboration: all parties involved advance towards a common goal, in the same way that all dev teams work together to achieve excellence in a software product. This, of course, raises all kinds of possibilities. In the case of the research above, some local production company might be interested in using it to improve submitted scripts. That request might or might not be accommodated, but it's always going to be a case to consider (and you can use an issue in GitHub to raise that possibility).

Where are all my tropes?

García-Ortega, R.H., García-Sánchez, P., & Guervós, J.J. (2020). Tropes in films: an initial analysis. ArXiv, abs/2006.05380.

Sources change

This revealed a deeper problem in the library, tropescraper, that we had been using. The thing is, when there's an outside source for data, validation is always a problem, but it's only one of many possible problems: dependencies might evolve, or fail, and of course there are changing requirements that you might want to accommodate. Which is why we need:

Third hypothesis

Testing at all levels over hypotheses proved once
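Testing also applies to the data itself: when the source changes, a validation step can fail loudly instead of letting a silently different dataset reach the paper. The following is only a sketch. It assumes the scraper output is a JSON object mapping film names to lists of trope names, which is an assumption made for illustration and not necessarily the real format produced by tropescraper; `validate_tropes` is a hypothetical helper.

```python
import json


def validate_tropes(raw_json, min_films=1):
    """Check scraped data against an ASSUMED schema: {film name: [trope names]}.

    Returns a list of human-readable problems; an empty list means the data passed.
    """
    problems = []
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error}"]
    if not isinstance(data, dict):
        return ["top level is not an object mapping films to tropes"]
    if len(data) < min_films:
        problems.append(f"only {len(data)} films, expected at least {min_films}")
    for film, tropes in data.items():
        if not isinstance(tropes, list) or not all(isinstance(t, str) for t in tropes):
            problems.append(f"{film}: tropes must be a list of strings")
        elif not tropes:
            problems.append(f"{film}: no tropes scraped")
        elif len(set(tropes)) != len(tropes):
            problems.append(f"{film}: duplicated tropes")
    return problems
```

A check like this could run in continuous integration every time the data is scraped again, so that a change in the source shows up as a failed build rather than as a changed table.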

I needed to change `\ref` → `\cite`

Easy, amirite?

Or maybe not. The paper was written using Pweave, which allows you to mix Python code with LaTeX, including the data processing workflow in the paper itself. Great, right? Well, yes, but you need to install the Python toolchain to generate it. And this was not specified: `requirements.txt` was missing. After some trial and error, I finally obtained all the tools (and I needed to put them into `requirements.txt`). But then a data file was missing, and I needed to download it. And I didn't know where to find it... I needed to look it up... But then, I didn't have our own data file, and when I found one, I wasn't sure it was the latest version... Long story short: I managed to do it. But if this had happened in a company, and this was a production application, I'd be fired. Yet this happens in science, all the time. In this talk I'd like to propose some ways to make it a bit better.
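A missing or incomplete `requirements.txt` can be caught before the build starts, rather than by trial and error. Here is a minimal sketch using only the Python standard library. `requirement_names` and `missing_requirements` are hypothetical helpers, not part of Pweave or pip, and the parser only understands simple lines such as `pandas==1.2.3`.

```python
import re
from importlib import metadata


def requirement_names(text):
    """Extract distribution names from requirements.txt-style text.

    Only simple lines such as ``pandas==1.2.3`` or ``scipy>=1.6`` are handled;
    comments, blank lines and pip options (lines starting with ``-``) are skipped.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(text):
    """Return the listed distributions that are not installed in this environment."""
    missing = []
    for name in requirement_names(text):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_requirements(open("requirements.txt").read())` at the start of the build would then list whatever still needs to be installed. It does not check versions, only presence.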

✓ It was open

✓ It was developed using issues

✓ It was tested

Dude, where's my data?

import pandas as pd
import pygraphviz as pgv
from IPython.display import display, Latex
from scipy import stats

DBTROPES_GENERATED_FILE_PATH = '/Users/phd/Downloads/dbtropes/dbtropes-20160701.nt'
TROPESCRAPER_GENERATED_FILE_PATH = '/Users/phd/workspace/made/tropescraper/bin/tvtropes.json'

Despite doing everything right, these files, which were loaded from the paper itself, use hardcoded paths. The problem is not only that, but the fact that, in the second case, the file has a generic name and lives outside source control... and thus can be anything. Big risk: use another version, and the whole paper will change; this corresponds to the trope For want of a nail, BTW.
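One possible way to reduce that risk is to resolve data files relative to a directory inside the project and to pin each file to a checksum, so that a different version is rejected instead of silently used. This is a sketch under stated assumptions: `pinned_path` and `sha256_of` are hypothetical helpers, and the data directory and expected checksum would have to be supplied by the project.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pinned_path(data_dir, name, expected_sha256):
    """Resolve a data file inside data_dir and refuse to use an unexpected version."""
    path = Path(data_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"data file {path} is missing")
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"{path} has checksum {actual}, expected {expected_sha256}")
    return path
```

The checksums themselves are small enough to live in source control next to the paper, even when the data files are too large to be committed.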

Fourth and last hypothesis

Reproducibility and replicability over publishability

Reproducibility, or replicability, is the most important thing. You need to produce the same results over and over again. It's the most important thing for oneself: small changes in a paper should be doable without a total change in the tables and charts that could possibly change the results. But it's, of course, essential in science. Somebody might want to build on our results. Or replicate the paper, filtering the results in some way. Or simply check that JamesBond actually is the movie (or franchise) with the most tropes, ever. Anything at all. Replicability will go a long way towards solving the science (data) problem, and the rest of the problems associated with science as it's done today.
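A claim such as "this film has the most tropes" can be turned into an executable check, and a whole set of results can be fingerprinted so that two runs are easy to compare. The sketch below is illustrative only: `most_tropes` and `result_fingerprint` are hypothetical helpers, the input is assumed to be a mapping from film names to lists of tropes, and the toy data in the usage section is made up, not taken from the real dataset.

```python
import hashlib
import json


def result_fingerprint(results):
    """Hash a canonical JSON form of the results, so that key order does not matter."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def most_tropes(tropes_by_film):
    """Return the film with the most distinct tropes (ties broken alphabetically)."""
    return min(tropes_by_film, key=lambda film: (-len(set(tropes_by_film[film])), film))


if __name__ == "__main__":
    # Made-up toy data, NOT the real dataset.
    toy = {"FilmA": ["T1", "T2"], "FilmB": ["T1"]}
    summary = {"film_with_most_tropes": most_tropes(toy), "films": len(toy)}
    print(summary["film_with_most_tropes"], result_fingerprint(summary))
```

Storing the fingerprint alongside the paper gives a cheap way to notice when a rerun no longer produces the published results.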

Doing science in the 21st century

Just like it used to be

We need to prove those hypotheses

Science, heal thyself

➀ The product is a workflow

Papers and reports are side effects

☑ Reproducibility ☑ Openness ☑ Tests ☑ Collaboration

And opens up many possibilities

Dashboards, interactive charts, APIs...

Code the workflow

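# tasks.py
from invoke import run, task  # assumes the Invoke task runner


@task
def build_paper_latex_arxiv(context):
    """Weave the Pweave source into LaTeX, running the embedded Python code."""
    print("Building latex file and figures through pweave ...")
    command = 'cd papers && pweave -f texminted tropescraper_arxiv.texw'
    # hide=False echoes pweave's output; warn=True warns instead of raising on failure
    run(command, hide=False, warn=True)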

Using knitr/pweave/rmarkdown

We would need to find out first if this method really works....
%
\begin{figure}
<<generations.fs.table,echo=FALSE, results="asis">>=
library(ggplot2)
library(ggthemes)

generations <- read.csv("data/ng-spambase1-generations.csv")
ggplot(generations,aes(x=Generation,y=Average.F2,group=Generation,color="Average F2"))+geom_boxplot()+geom_boxplot(aes(x=Generation,y=Max.F2,group=Generation, color="Max F2", fill="Max F2"))+theme_tufte()
@
\caption{Boxplot of the best F2 (filled) and average F2 (clear, transparent) over 15 different runs for the spambase1 dataset partition.}
\label{fig:gen:f2}
\end{figure}

➁ You have the idea, you own the product

🔲 Reproducibility ☑ Openness ☑ Tests ☑ Collaboration

Doctoral student owning his thesis

➂ Use common software development tools and practices

☑ Reproducibility ☑ Openness ☑ Tests ☑ Collaboration

As well as common (industrial) data science best practices

CI/CD workflows, MLFlow...

Avoid smells

lint-python:
    runs-on: ubuntu-18.04
    name: Python source lint
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.7
      - name: Export necessary variables
        run: |
          echo "PYTHONPATH=${{ github.workspace }}" >> "$GITHUB_ENV"
          echo "KERAS_BACKEND=theano" >> "$GITHUB_ENV"
      - name: Install necessary tools
        run: pip install nox
      - name: Perform linting
        working-directory: ${{ github.workspace }}
        run: nox -e lint

Biggest hurdle?

Changing from pay-to-publish to pay-to-deploy-workflow

Join the Agile Science Manifesto

To make science more

✓ Open

✓ Adaptive

✓ Sustainable

Let's leave waterfalls for rainbows

Let's make science agile
