Background
Extracting required information from documents is a common task in many research areas, such as Human-Computer Interaction (HCI), Psychology, Sociology, Law, and Business. For example, in a study to understand the decision-making process in the English appeal court, the researchers need to analyse court documents like this. The information needed includes the name of the defendant and the defendant's education level at the time of the offence.
Currently, such information has to be extracted manually: someone needs to read through each document and find all the relevant information. It can take a few hours to ‘code’ one document (that is, to find all the required information), depending on the length and complexity of the document and the number of pieces of information needed. This makes such research very time consuming (weeks or months just to code the documents) and significantly limits the power of the analysis (some statistical analyses require a minimum number of samples).
The goal of this project is to improve this process through ‘human-AI teaming’, in which researcher and algorithm work together to improve efficiency and accuracy. While in theory it may be possible to automate such analysis completely with machine learning, there are some obstacles:
- Most models require a large number of training samples, which are time consuming to create;
- A machine learning model trained for one dataset or research question (such as appeal court documents) may not work well with a different dataset or research question (such as criminal court documents);
- Finally, analysts rely heavily on machine learning experts or programmers to create the model and system for them. They don’t understand how the tool works or how to change it to better meet their needs.
The ‘human-AI teaming’ approach can potentially address all these issues:
- Through an interactive user interface (such as the one shown above), analysts can keep providing new samples to the model. The model can continuously improve itself with this feedback and ask for answers on the most ambiguous examples, which reduces the total number of samples needed;
- By using the latest ‘transfer learning’ approaches, the machine learning model can be quickly adapted to a different dataset and research question, requiring far fewer samples than training a model from scratch;
- Finally, the tool will be designed to minimise the need for a machine learning expert or programmer when adapting it to a new dataset and research question.
In addition, the project will follow the ‘explainable AI’ approach to make the underlying machine learning model more approachable to domain experts, helping them understand why it makes certain decisions and recommendations and decide whether these are suitable for their research.
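As an illustration of how the model might ‘ask for answers on the most ambiguous examples’, the sketch below ranks unlabelled passages by the entropy of the model's predicted class probabilities (a common active learning strategy known as uncertainty sampling). It is only a minimal sketch: the passage ids, the probabilities and the function names are assumptions made for this example, not part of the project specification.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a list of class probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_ambiguous(predictions, k=1):
    """Return the ids of the k passages the model is least certain about.

    predictions maps a passage id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda pid: entropy(predictions[pid]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:k]


# Hypothetical model output for three unlabelled passages: the probability
# that each passage does / does not state the defendant's education level.
predictions = {
    "passage-1": [0.98, 0.02],
    "passage-2": [0.55, 0.45],
    "passage-3": [0.80, 0.20],
}

# The analyst is asked to label the passage the model finds hardest.
print(most_ambiguous(predictions, k=1))  # prints ['passage-2']
```

In an interactive tool, the analyst's answer for the selected passage would be added to the training set, the model would be updated, and the ranking would be recomputed.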
Required knowledge and skills
- Machine Learning and Natural Language Processing (NLP), more specifically:
- Active Learning, which allows the model to improve continuously with user feedback through interaction;
- Deep Learning and Transfer Learning:
- The idea is to build on the latest large language models, such as BERT and GPT-3. This is likely to give better performance than creating a new model from scratch;
- Transfer learning will then be used to ‘adapt’ these large language models to the new tasks that the domain experts need.
- Data Visualisation:
- A User Interface (UI) is needed for the domain experts to interact with the model and provide feedback;
- A visualisation is also needed to explain to the domain experts why the model makes certain recommendations, and perhaps how the model changes over time in response to user feedback. This can improve the transparency and explainability of the model, potentially increasing experts’ trust in the model and helping them interact with it more effectively.
- Programming:
- Python is the recommended language for the Machine Learning and NLP work;
- Python libraries such as TensorFlow and PyTorch are recommended for the deep learning related work;
- JavaScript is the recommended language for the UI and data visualisation;
- JavaScript libraries such as Vue and React are recommended to build the UI, and libraries such as d3.js are recommended to build the data visualisation.
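To make the transfer learning item in the list above more concrete, the following sketch shows one common pattern: keep a pretrained encoder frozen and train only a small task-specific ‘head’ on a handful of labelled examples. It is a toy sketch under stated assumptions. The `frozen_encoder` function is a hypothetical keyword counter standing in for a real pretrained model such as BERT, and the example sentences and labels are invented. In practice the same structure would be built with PyTorch or TensorFlow.

```python
import math


def frozen_encoder(text):
    """Hypothetical stand-in for a pretrained language model.

    It maps text to a fixed feature vector and is never updated during
    training, which is what 'frozen' means here.
    """
    words = text.lower().split()
    education = {"degree", "school", "university", "college"}
    legal = {"appeal", "sentence", "convicted", "court"}
    return [sum(w in education for w in words), sum(w in legal for w in words)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(text, weights, bias):
    """Probability that the text mentions the defendant's education."""
    features = frozen_encoder(text)
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_head(examples, epochs=200, lr=0.5):
    """Train only the logistic-regression head with stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = frozen_encoder(text)
            error = predict(text, weights, bias) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Four invented labelled sentences: 1 = mentions education, 0 = does not.
examples = [
    ("He left school at sixteen", 1),
    ("She holds a university degree", 1),
    ("The court dismissed the appeal", 0),
    ("He was convicted in 2019", 0),
]

weights, bias = train_head(examples)
print(predict("He attended college", weights, bias) > 0.5)  # prints True
```

Because only the head's three parameters are learned, very few labelled examples are needed. That is the property the project hopes to exploit when adapting a large language model to a new dataset or research question.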