MLOps course for Nova-Lisbon. Index | Github Repository

Design thinking

TL;DR

Formulating a solution to a problem is never straightforward. Understanding the problem and the customer is essential for designing a successful machine learning workflow or application that meets what the customer wishes.

Learning outcomes of this unit

Understanding the process of kickstarting a machine learning project by focusing on a client needs, and working towards solving that problem through precise stages.

Acceptance criteria

The students will have pitched their ideas, and selected one to carry out during the rest of the course.

What are we talking about

Design thinking is a technique, taken from the design community, that tries to find out the needs of a specific target group in order to create a product that fulfills, at least in part, those needs (Lewrick, Link, and Leifer 2018). It spans all the process from product design to the creation of a startup, or how a whole company will carry out their objectives.

In data science, as in any software development project, one of the hardest parts is to find a way to deliver value to a client; even more so, to actually find if there’s a client for your product and who would that be.

Most data science and engineering tutorials start with already existing datasets; which is nice because you can easily compare the performance of your method, or your team, against others. However, no, no one is interested in a product that tells you who among the Titanic crew or travelers is going to croak or what kind of iris flower we are dealing with looking at the petals and sepals.

You are probably interested in problems or issues that are closer to yourself, your hobbies or your issues. Or you might be actually interested in creating a product in a startup. In any case, you need to go, find out who that person would be (not if it is yourself, of course) and find a way to deliver value. Design thinking will help you with that.

Design thinking in 5 easy steps

The design part of design thinking comes from actual design; it mimics the process uses in designing user interfaces or products. In general, science and software development could learn a lot from this process. While in science (and engineering) we are paper or product oriented, design is customer oriented.

Empathize

DT is user centered, and thus it needs you to understand, first and foremost, the needs of the user; in order to do that, the designer needs to put him or herself in the place of the user, and think what they would wish; how to add value to them.

Many of the examples in this course will be centered in, well, me, since I am the customer I have more handy. That will certainly be the case in most students of this course, with family and friends being closely seconds or thirds.

I have created data pipelines in several occasions: during the COVID pandemic, my main worry was to know when the curve was reaching its peak, and so when the incredibly (some people might say absurdly) strict lockdown we were suffering in Spain would be lifted. So my main objective was to try and analyze, daily, how the tide was turning and when it might definitely turn against COVID.

Currently my worries go more in the direction of the current state of affairs with the unprovoked attack on Ukraine and the subsequent genocidal actuation of Russian armed forces. The fog of war, however, makes it difficult to ascertain how far away from Russia’s defeat Ukraine is.

In general, the design process will start with some piece of news or a post in social networks, some concern shared by a friend or relative, or the obvious need of a paying client. In any of these cases, their stories need to be heard and you need to get in their shoes to be able to create a (data) product that satisfies them.

In a project we were involved, a company that was writing software for editorials asked us to come up with a way to find out how much a book was going to sell. They had hard data for every book sold in Spain for a few years; columns and columns of data. For instance, it included how many copies were sold in big department stores, and how many in other kind of book sellers.

Using our scientist goggles, we cried “Data!” and used every column they could give us. When we presented that result, the software company was not happy. We had simply eliminated sales from the input column, but many other quantities were correlated and caused by sales, such as the aforementioned sales in specialized bookstores or department stores.

We needed to empathize with the editorials and really understand what kind of data was available to them when they had a fresh manuscript in their hands and needed to send an order to the printing house and tell them to print 30, or 2000, or 10K. Printing, shipping, storing is expensive; lost sales when a book is in back order also cost money. So we needed to reduce all those beautiful, plentiful columns to just a few: genre, known author, things like that. Only by understanding and feeling the same pain as the (ultimate) customer, could they be satisfied.

And they were. This piece of software (that ultimately used a multilayer perceptron) was incorporated into a bigger editorial management software, and they lost less money printing the right amount of books; our customer sold more licenses to their software, and we… We managed to publish a few more papers (Castillo et al. 2017).

As indicated in the introduction, one of the keys to data science is domain knowledge. A data science team can never know everything; this stage guarantees that knowledge is incorporated into the process.

Define

From understanding the user and what they want, you need to define the product that you (in most cases, with the help of your team) will create. There are myriad different kind of products that you can create using computers, and every one will solve a different problem. But without a real problem to solve, it’s impossible to actually create a product that solves it.

In the issues stated above, for me one of the problems was not knowing when the lockdown would help.

In the case of the war in Ukraine, finding out if the situation was getting better or worse for the Ukrainians. That’s a defined problem. Thus, I will try (and use in this course) some data sources, mainly so-called OSINT (open source intelligence) sources, that try to shed some light on the subject. With the problem definition, in the case of a data science project, we will probably need to find out whether we have a data source to be used for it, or not; obviously, we will not be able to solve it in the latter case.

Besides, we need to find out what kind of data engineering or machine learning we have in our hands. We might want to predict (When will the war end?) or explore the data (Is the situation right now more similar to the initial one, or has something fundamental changed?) or maybe to just automatically visualize in a chart some specific quantity (How has the number of vehicles destroyed evolved?).

In a project we created for what is now part of a big financial prediction and assessment conglomerate, we had to predict the situation of a company up to three years down the line. We had data for most companies in Spain.

The main problem, however, was not whether the company would tank or not. It was to avoid an angry company CEO of the company calling our customer because they had been told that their company was going to go bust.

That problem was not totally avoided, however; companies go bust anyway.

What we needed then was to come up with a way to explain what was going on, besides actually making a black box prediction. We needed some sort of clustering or visualization. This clustering revealed what kind of features led to the company going belly-up; for instance, one of them included old companies that were part of a conglomerate and that had low churn of assets (or suchlike, I don’t really remember). In that case, apparently, the parent company just shuttered them and employees were distributed among other companies in the conglomerate. By actually defining the problem, which in this case was not only to classify (healthy, broke) but also to explain the reasons why this was happening, we got our customer satisfied (and also a bunch of papers Alfaro-Cid et al. (2009).

In the case of the Ukraine MoD data, it is not entirely clear what we would want to obtain from it. The general idea is to find something about the course of the war, and how much it will last, or simply how bad or good the situation for Ukraine is. Defining the problem will have to start with a general visualization of data, general correlations between the different columns. In general, anyway, we will have to obtain or transform data so that we get daily changes in them. Once we have these deltas, the actual result of the ML pipeline will have to be defined in the next phase.

Ideate

You obviously need to think the kind of product you are going to create to solve the problem. It can be a library, a set of charts, an app, or simply a periodically updated report that visualizes a specific aspect of what you are going to do. This rough idea will need to possibly go through a series of iterations, and will also depend on the time and resources you and your team have to do it. It will have, whatsoever, to consistently deliver value to the customer by solving their problem.

It is in this phase where you can use other methodologies to come up with a rough sketch of a product. Ikea, for instance, has 5 principles of what they call democratic design: Form, Function, Aesthetics, Sustainability and Price (Ikea Democratic Design 2018).

We will probably need to explain a bit further the two first ones: form relates to how some product embeds itself in the living environment; it applies to packaging too. For instance, you need to create couches that integrate well with low tables, or chairs that integrate themselves with high tables. A microwave oven has to be rectangular, and so on. In terms of software engineering, we need to create products that are well integrated within the current systems and processes of the company; in terms of data science, we need to use whatever resources we have available; if our lab (or company) does not have enough money for an Nvidia Tesla GPU, we will have to make do with our laptop or the server our lab has access to.

Function, or course, relates to what the product needs to do. It needs to be able to do so efficiently, and convey whatever information you need to convey, or use the data and make the decisions you need to make. For instance, I remember a project with a company where we needed to predict book sales. One of the inputs we had was correlated to sales; number of editions if I remember correctly. Obviously, using this as an input will make output depend exclusively on that; but also obviously, the number of editions is not going to be available when you are releasing a new book and need to find out how many units it’s going to move. The “Function” mantra is an user-centered one, and you need to follow it closely.

This is one of the reason why I shy away from using already established datasets when teaching data science. Those datasets already have inputs and outputs clearly mapped, hiding the complexity of the design of a real product under the rug and banning the user from actually having to figure out this actual part of the process.

The rest is more or less obvious; all five principles will act as constraints in this design thinking process. Which also has 5 phases:

In the cases mentioned above, several options were possible. For instance, a periodically updated visualization of the evolution of the number of cases.

Number of COVID cases in Spain

Several other publications in RPubs were updated manually, check out the page that lists them all.

In the case of the invasion of Ukraine, however, creating a library that allowed anyone to extract data seemed like a better idea, so that’s what we will be developing here. The concrete machine learning problem that we will try to solve will be related to predicting the course of the war.

In the already mentioned case above, trying to predict book sales, the main problem was that the software company wanted to integrate the algorithm into a desktop application. Our algorithm used a mixture of C++ and Perl to set up training and exploitation. We eventually packaged everything into a Windows executable that could be invoked from their program.

Prototyping + test and validation

Design thinking is a methodology that integrates the whole product design lifecycle. However, these last two steps are better served via agile development; so we will defer it to the rest of the course.

Read also

This one is in Spanish, but still: Qué es y para qué sirve Design Thinking is a short introduction that includes all the main points in the design thinking process. This one is effectively focused on interaction design, but describes the process with precision.

Activity

Design thinking is ideal to identify and start the design process. In this activity, we will perform two of the phases, with the following phase taking place during this same session, but in the framework of the agile thinking process.

References

Alfaro-Cid, E, AM Mora, JJ Merelo, AI Esparcia-Alcázar, and K Sharman. 2009. “Finding Relevant Variables in a Financial Distress Prediction Problem Using Genetic Programming and Self-Organizing Maps.” In Natural Computing in Computational Finance, 31–49. Springer.

Castillo, Pedro A., Antonio M. Mora, Hossam Faris, J. J. Merelo, Pablo García-Sánchez, Antonio J. Fernández-Ares, Paloma De las Cuevas, and María I. García-Arenas. 2017. “Applying Computational Intelligence Methods for Predicting the Sales of Newly Published Books in a Real Editorial Business Management Environment.” Knowledge-Based Systems 115: 133–51. https://doi.org/http://dx.doi.org/10.1016/j.knosys.2016.10.019.

Ikea Democratic Design. 2018. Ikea.

Lewrick, Michael, Patrick Link, and Larry Leifer. 2018. The Design Thinking Playbook: Mindful Digital Transformation of Teams, Products, Services, Businesses and Ecosystems. Wiley.

Mora, Antonio Miguel, Luis Javier Herrera, J Urquiza, Ignacio Rojas, and JJ Merelo. 2010. “Applying Support Vector Machines and Mutual Information to Book Losses Prediction.” In The 2010 International Joint Conference on Neural Networks (IJCNN), 1–7. IEEE.