Provenance for Sensemaking

What is Provenance?

The word ‘provenance’ was originally used mainly for works of art, and refers to ‘the history of ownership of a valued object or work of art or literature’. For a painting, this includes information about who painted it and when, and how the painting changed hands over time.

More recently, the concept has been expanded to other areas, including information and data analysis, so terms like data provenance and analytical provenance have started to appear. In this context, the word is used closer to its other common meaning: origin or source.

Provenance = History + Context

To make things simpler, I will define ‘provenance’ as ‘history + context’. Here the ‘history’ refers to what happened and when, such as when a set of data was collected and then how it was pre-processed, and the ‘context’ includes information such as ‘where’ the data was collected, ‘who’ did the pre-processing, ‘what’ method was used for handling missing data, ‘why’ that method was chosen, etc.

This can be applied to analytical reasoning and decision making, too: ‘who’ made ‘what’ decision, ‘when’, based on ‘what’ information, and ‘why’ that decision was made, etc. This is when things start to get interesting.
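As a concrete illustration, a provenance record under this ‘history + context’ definition could be sketched as a simple data structure. This is only my own sketch; all the field names are illustrative, not any standard provenance schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sketch of a provenance record: 'history' (what/when)
# plus 'context' (where/who/how/why). Field names are illustrative only.
@dataclass
class ProvenanceRecord:
    what: str                 # the event, e.g. "imputed missing values"
    when: datetime            # when it happened
    where: str = ""           # e.g. where the data was collected
    who: str = ""             # who performed the step
    how: str = ""             # method used, e.g. "median imputation"
    why: str = ""             # rationale for choosing that method

# A dataset's provenance is then simply an ordered history of such records.
history = [
    ProvenanceRecord("collected survey data", datetime(2023, 1, 5),
                     where="online survey", who="analyst A"),
    ProvenanceRecord("imputed missing ages", datetime(2023, 1, 6),
                     who="analyst B", how="median imputation",
                     why="ages are skewed, so median is more robust than mean"),
]
```

The same structure works for decisions as well: the ‘what’ becomes the decision made, and the ‘why’ records the assumptions behind it, which is exactly what is usually lost.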

Why is provenance important?

Often decisions are recorded but not their history or complete provenance. Over time, the assumptions a decision was based on may have changed completely, yet the decision itself is still blindly followed. For example, the QWERTY keyboard layout was designed in the 1870s to slow down typing so typewriters would not jam (which happens when a second key is pressed before the previous one has returned). This is not an issue at all for computer keyboards, but the layout stays on.

The same applies to data analysis and decision making: it is important to know whether the data used is reliable and whether the right analysis method or decision process was used.

Provenance for Sensemaking

Please see here for an explanation of what sensemaking is. As discussed earlier, provenance is very important for knowing whether sensemaking is based on reliable data and whether the right reasoning process is followed. However, provenance can also be very useful in supporting sensemaking:

  1. The SenseMap tool mentioned in the Visual Analytics for Sensemaking project essentially visualises the user’s browsing history, a form of provenance, and this can be very useful in helping users organise their searches and information foraging.
  2. For use cases such as sensemaking sharing and collaborative sensemaking (RQ3 and RQ4 in Visual Analytics for Sensemaking project), the sensemaking provenance, e.g., the sequences of steps taken and the data/analysis used at each step, is essential for sharing and for others to understand and then continue sensemaking.
  3. Most of the machine learning tasks in the Visual Analytics for Sensemaking project also rely on provenance. For example, inferring what tasks users are performing needs to be based on the user actions and the information visited, which are part of the sensemaking provenance.

I believe that provenance is essential for all the research questions mentioned in the Visual Analytics for Sensemaking project. If you want to find out more, the best place to start is the Survey on the Analysis of User Interactions and Visualization Provenance (website, presentation).

Below I want to describe two other sensemaking use cases that are not covered in the Visual Analytics for Sensemaking project.

Provectories and Large Provenance Model (LPM)

Provectories

Provectories is a method to encode each step in sensemaking as a ‘provenance vector’, which includes all the information necessary to reconstruct the visualisation state. This includes information such as what data are displayed, how they are visualised, and the user interactions. A sensemaking session becomes a sequence of such vectors, which are called ‘Provectories’ (provenance + vector + stories).
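A minimal sketch of this idea (my own simplification, not the paper’s actual encoding): each visualisation state is flattened into a fixed-length feature vector, and a session becomes a sequence of such vectors.

```python
# Toy sketch of encoding a visualisation state as a 'provenance vector'.
# The features chosen here are illustrative; the real Provectories
# encoding is defined in the paper and its GitHub implementation.

CHART_TYPES = ["scatter", "bar", "line"]
ACTIONS = ["select", "filter", "zoom", "hover"]

def encode_state(chart_type: str, n_points_shown: int, action: str) -> list[float]:
    """One-hot encode the chart type and last user action, plus a numeric feature."""
    vec = [1.0 if c == chart_type else 0.0 for c in CHART_TYPES]
    vec.append(float(n_points_shown))
    vec += [1.0 if a == action else 0.0 for a in ACTIONS]
    return vec

# A sensemaking session is then a sequence of provenance vectors.
session = [
    encode_state("scatter", 500, "zoom"),
    encode_state("scatter", 120, "select"),
    encode_state("bar", 12, "hover"),
]
```

Because every vector has the same fixed length, standard sequence analysis and machine learning tools can be applied directly to a session.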

An online demo is available on this page. The source code of the system is available on GitHub. This is the recording of the paper presentation at VIS 2022.

While Provectories provides a universal way to describe and capture interactive visualisation provenance, there are still many open problems in how to use it to analyse and support sensemaking:

Research Questions (Project Ideas)

Q1. Sensemaking Pattern Discovery

Once the sensemaking process is captured, it can be analysed, by either visualisation or machine learning, to identify any interesting patterns. For a single user, such patterns can be:

  • Are there any frequent patterns, i.e., a sequence of actions that appears multiple times? If so, what does this mean, i.e., why was the user doing this?
  • Is it possible to infer the analysis actions, such as comparing two similar states or analysing the clustering in the data?
  • Is it possible to infer what strategy the user is using, such as a depth- or breadth-first search?
  • Is it possible to show what data have been explored and what haven’t?
  • Is it possible to infer when a user is stuck and can use some help with further analysis?
  • Did the user find the answer, and what is it?

Here is a previous student project on this topic.
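As a sketch of the first question above, frequent action subsequences can be found by simply counting n-grams over a logged action sequence. This is a minimal illustration only; real sequential pattern mining would use dedicated algorithms such as PrefixSpan:

```python
from collections import Counter

def frequent_ngrams(actions: list[str], n: int, min_count: int = 2) -> dict:
    """Count contiguous action n-grams, keeping those seen min_count+ times."""
    counts = Counter(tuple(actions[i:i + n]) for i in range(len(actions) - n + 1))
    return {seq: c for seq, c in counts.items() if c >= min_count}

# A hypothetical logged session: the user repeatedly filters then compares.
log = ["filter", "compare", "filter", "compare", "zoom", "filter", "compare"]
patterns = frequent_ngrams(log, n=2)
# ("filter", "compare") occurs three times, hinting at a recurring strategy.
```

The interesting (and open) part is the interpretation step: mapping a frequent pattern like ‘filter then compare’ back to the analysis action or strategy the user was pursuing.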

There are also many interesting questions about a group of users, such as:

  • What are the differences and similarities among the user sequences?
  • Is it possible to tell who are the experts and who are the novices?

As mentioned, visualisation and/or machine learning can be used to answer these questions.

Q2. Large Provenance Model (LPM)

If you are familiar with Large Language Models or LLMs (more information in the Human-AI teaming project), such as ChatGPT, you would know that LLMs are essentially based on a vector representation of text. The provenance vector (or Provectories) is a vector representation of sensemaking. While the data is different (text vs. provenance), the ways they are constructed are similar, so maybe they would have similar properties?

LLMs are trained to predict the next word in a sentence. However, they exhibit ‘intelligence’ way beyond this simple task, appearing to have an almost human-like understanding of text. So the question is: is it possible to do something similar with provenance embeddings? That is, can we train a model to predict the next step in sensemaking, which is very useful in itself, but more importantly, would such a model exhibit more advanced ‘intelligence’ equivalent to LLMs’ ability to understand natural language, such as understanding sensemaking as a human would?
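The analogy with next-word prediction can be illustrated with the simplest possible next-step model: a bigram frequency predictor over action sequences. This toy counter only stands in for the training objective; an actual ‘Large Provenance Model’ would need a neural sequence model trained over provenance vectors:

```python
from collections import Counter, defaultdict

# Toy next-step predictor over sensemaking actions, analogous to
# next-word prediction in language models. Sessions are hypothetical.

def train_bigram(sessions: list[list[str]]) -> dict:
    """Count, for each action, how often each other action follows it."""
    model = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model: dict, action: str) -> str:
    """Predict the most frequently observed next action."""
    return model[action].most_common(1)[0][0]

sessions = [
    ["search", "filter", "compare", "annotate"],
    ["search", "filter", "zoom", "compare"],
    ["filter", "compare", "annotate"],
]
model = train_bigram(sessions)
```

The open question is whether scaling this objective up, as LLMs did for text, would yield a model that does more than predict the next action, i.e., one that appears to ‘understand’ the sensemaking process itself.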

Resources