Human-Centred Data Science

Background

While there are some efforts to automate the process of building machine learning models (commonly known as ‘AutoML’), many tasks in the different stages of a data science pipeline (see the diagram below) cannot be fully automated. For example, in ‘Data Wrangling’ someone has to decide which dataset to use. To some extent, building a machine learning model is also a sensemaking process (see the description here about sensemaking).

Currently there is very little support for these manual tasks: a user would start a Jupyter notebook and add code blocks for data wrangling, cleaning, and so on. The user would then use trial and error to test different options, such as features, models, and hyperparameters, until acceptable performance is achieved (which may never happen).

For example, hyperparameters are important to the performance of machine learning models. However, some models have many hyperparameters, and each can take many possible values, sometimes infinitely many. As a result, finding the best set of hyperparameter values can be time-consuming. Currently people rely on previous experience or on methods such as ‘grid search’, but these do not always work or provide good answers.
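As a rough illustration of the grid search mentioned above, the sketch below uses Scikit-Learn's `GridSearchCV`. The dataset (Iris), the classifier (k-nearest neighbours), and the grid values are assumptions chosen only to keep the example small; they are not part of the project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption): 4 x 2 = 8 candidate settings.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Every combination in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# cv_results_ keeps every tested setting, not only the best one.
tested = list(zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]))
for params, score in tested:
    print(params, round(float(score), 3))
print("best:", search.best_params_)
```

Even this small grid needs 8 settings times 5 folds, i.e. 40 model fits, and it only reports on the values listed in the grid, which is one reason grid search does not always give a good answer.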

However, hyperparameters are just one part of the data science pipeline. When a different model is used, all the hyperparameters have to be tested again. When a feature is changed or a few features are added, all the models and their hyperparameters have to be tested again. Manually tracking this exponential number of combinations is not practical.
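To make the growth concrete, the following sketch counts the runs for a small search space. The option names and sizes are hypothetical and serve only to show how the count multiplies across factors.

```python
from itertools import product
from math import prod

# Hypothetical option sets, chosen only to illustrate the growth.
search_space = {
    "feature_set": ["raw", "scaled", "scaled+pca"],
    "model": ["knn", "svm", "random_forest"],
    "hyperparameter_setting": list(range(20)),  # assume 20 settings per model
}

# The number of runs is the product of the option counts.
n_combinations = prod(len(options) for options in search_space.values())
print(n_combinations)  # 3 * 3 * 20 = 180

# Adding one feature set adds a full model x hyperparameter sweep.
search_space["feature_set"].append("scaled+pca+interactions")
n_after = prod(len(options) for options in search_space.values())
print(n_after)  # 4 * 3 * 20 = 240

# Enumerating every run explicitly gives the same count.
all_runs = list(product(*search_space.values()))
```

A single extra feature set adds 60 runs here, and each further factor multiplies the total again.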

This project aims to provide the much-needed support so that users can easily track and make sense of the results of all the data/feature/model/hyperparameter combinations. As a first step, we aim to capture the values and resulting performance for one factor (such as a hyperparameter; the top figure is a mockup):

  • What values have been tested
  • What are the evaluation results for those values
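One possible shape for this first step is a small log that stores each tested value next to its evaluation result. This is a minimal sketch, not the project's design: the class name `FactorLog`, its methods, and the scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FactorLog:
    """Records the tested values and results for a single factor."""
    factor: str   # e.g. the name of one hyperparameter
    metric: str   # e.g. "accuracy"
    runs: list = field(default_factory=list)

    def record(self, value, score):
        """Store one tested value together with its evaluation result."""
        self.runs.append({"value": value, "score": score})

    def tested_values(self):
        """Answer: what values have been tested?"""
        return [run["value"] for run in self.runs]

    def best(self):
        """Return the run with the highest score."""
        return max(self.runs, key=lambda run: run["score"])


log = FactorLog(factor="n_neighbors", metric="accuracy")
# Placeholder scores for illustration only, not real results.
for value, score in [(1, 0.90), (3, 0.94), (5, 0.93)]:
    log.record(value, score)

print(log.tested_values())
print(log.best())
```

Keeping the raw runs rather than only the best one is what would let a visualisation show the whole tested range for the factor.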

The ultimate goal is to consider all the components of a data science workflow together.

Required knowledge and skills

  • Data Visualisation: A User Interface (UI) is needed to:
    • Present the different data/feature/model/hyperparameter combinations and their results to users in an easier-to-understand way;
    • Help users make sense of the effect of different options, such as which types of classifiers perform better and why.
  • Machine Learning can be used together with the visualisation to make the support even more effective:
    • Machine learning can be used to infer which part of the code relates to dataset/feature/model/hyperparameter and automatically record the changes (this can be done in an interactive fashion as described in the Human-AI Teaming project);
    • Machine Learning can then provide more proactive support, such as recommendations that may lead to better performance.
  • Programming
    • Python is the recommended language for Machine Learning and NLP;
    • Python libraries such as Scikit-Learn are recommended for the machine learning.
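As a simple baseline for automatically recording changes, the sketch below uses Python's standard `ast` module to pull keyword arguments out of calls in a notebook cell and compare two versions of the cell. This is a rule-based stand-in, not the machine learning approach described above, and the helper names `extract_call_kwargs` and `diff_kwargs` are hypothetical.

```python
import ast


def extract_call_kwargs(source):
    """Return {callee_name: {kwarg: literal_value}} for calls in source."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name is None:
            continue
        kwargs = {}
        for kw in node.keywords:
            if kw.arg is None:  # skip **kwargs expansion
                continue
            try:
                kwargs[kw.arg] = ast.literal_eval(kw.value)
            except (ValueError, TypeError):
                pass  # value is not a plain literal; ignored in this sketch
        if kwargs:
            found[name] = kwargs  # a repeated callee overwrites the earlier one
    return found


def diff_kwargs(old, new):
    """List (callee, kwarg, before, after) for values that changed."""
    changes = []
    for name, kwargs in new.items():
        for key, value in kwargs.items():
            before = old.get(name, {}).get(key)
            if before != value:
                changes.append((name, key, before, value))
    return changes


cell_v1 = "clf = RandomForestClassifier(n_estimators=100, max_depth=5)"
cell_v2 = "clf = RandomForestClassifier(n_estimators=300, max_depth=5)"
changes = diff_kwargs(extract_call_kwargs(cell_v1), extract_call_kwargs(cell_v2))
print(changes)
```

The heuristic cannot tell a hyperparameter from any other keyword argument, and it misses values passed through variables, which is the kind of gap an interactive or learned approach would need to fill.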

Related Work

  • MLProvLab is a JupyterLab extension that can capture the changes in a notebook;
  • The JupyterLab Notebook Provenance extension visualises the changes in a notebook (it tracks code changes, not hyperparameters or performance).