Human-machine teaming for document coding

Extracting required information from documents is a common task in many research areas such as Human-Computer Interaction (HCI), Psychology, Sociology, Law, and Business. For example, in a study to understand the decision-making process in the appeal court in England, the researchers need to analyse court documents like this. The needed information includes the name of the defendant and the education level of the defendant at the time of the offence.

Currently, such information has to be extracted manually: someone needs to read through a document and find all the relevant information. It can take a few hours to ‘code’ one document (find all the required information), depending on the document's length and complexity and the number of pieces of information required. This makes such research very time consuming (weeks or months just to code the documents) and significantly limits the power of the analysis (some statistical analyses require a minimum number of samples).

The goal of this project is to improve this process through ‘human-machine teaming’, in which researchers and algorithms work together to improve efficiency and accuracy. While in theory it may be possible to automate such analysis with machine learning, there are some obstacles:

  1. Most models require a large number of training samples, which are difficult and time consuming to create.
  2. A machine learning model trained for one dataset or research question (such as appeal court documents) may not work well with a different dataset or research question (such as criminal court documents).
  3. Finally, analysts rely heavily on machine learning experts or programmers to create the model and system for them. They do not understand how the tool works or how to change it to better meet their needs.

The ‘human-machine teaming’ approach can potentially address all these issues:

  1. Through an interactive user interface (such as the one shown above), analysts can continually provide new samples to the model. The model can improve itself with this feedback and ask the analyst to label the most ambiguous examples, which reduces the total number of samples needed.
  2. By using the latest ‘transfer learning’ approach, the machine learning model can be quickly adapted to a different dataset and research question, requiring far fewer samples than training a model from scratch.
  3. Finally, the tool will be designed to minimise the need for a machine learning expert or programmer when adapting it to a new dataset and research question.
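
The idea in the first point, asking the analyst about the most ambiguous examples, is often implemented as uncertainty sampling. The following is only a minimal sketch under stated assumptions, not the project's actual method: the document ids, the probability values, and the helper names `entropy` and `most_ambiguous` are all hypothetical, and prediction entropy is just one possible measure of ambiguity.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_ambiguous(predictions, k=1):
    """Return the ids of the k documents whose predictions have the highest entropy.

    `predictions` maps a document id to the model's predicted probabilities
    over the candidate labels (hypothetical format, for illustration only).
    """
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up model output: probabilities over three candidate labels per document.
predictions = {
    "doc_a": [0.90, 0.05, 0.05],  # model is fairly confident
    "doc_b": [0.40, 0.35, 0.25],  # model is unsure
    "doc_c": [0.60, 0.30, 0.10],
}

# The analyst would be asked to code this document next.
to_label = most_ambiguous(predictions, k=1)
print(to_label)
```

In this made-up data the model is least certain about `doc_b`, so that document would be shown to the analyst first. Once it has been coded, the model would be retrained and the ranking recomputed.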

Beyond these, the project will also follow the ‘explainable AI’ approach to make the underlying machine learning model more approachable to researchers, helping them understand why it makes certain decisions and recommendations and decide whether these are suitable for their research.