"Building trustworthy AI with people in the loop”

My research, “Human-In-the-Loop AI”, develops human-in-the-loop methods and tools for building trustworthy AI. Tackling the trustworthiness of AI requires new methodologies that address both epistemic and ethical challenges: what the machine needs to know and what purpose it should serve. My proposition is to keep humans in the loop, that is, to increase the involvement of stakeholders throughout all stages of the AI lifecycle. My research answers the following questions: What data do we need, and how do we collect it? How do machine learning models work, and when do they fail? How do we formulate the learning objective and assess the model from the stakeholders' perspective and from a systemic perspective? As its contribution, my research creates a new human-in-the-loop methodology for machine learning, with a set of methods and tools for
  • Human-Enhanced Data Management: data creation, cleaning, and debiasing;
  • Human-Centered Machine Learning: model explanation and debugging;
  • Value-Sensitive Assessment: systemic evaluation and improvement.

My research methodology is both empirical and theoretical, with primary activities characterized by the design, implementation, and analysis of human studies, computational algorithms, and human-in-the-loop systems. I particularly value collaboration with scientists from other disciplines and experts from various application domains, together with whom my research team develops usable tools that make a real-world impact.


    [Model Debugging] ARCH: Know What Your Machine Doesn't Know

    Despite their impressive performance, machine learning systems remain largely unreliable in safety-, trust-, and ethically sensitive domains. Recent discussions in several subfields of AI have reached a consensus that machines need knowledge, but few have addressed how to diagnose what knowledge is needed. This project aims to develop human-in-the-loop methods and tools for the diagnosis of machine unknowns. We consider humans to be essential in understanding the knowns and unknowns of intelligent machines, through human interpretation of machine behaviour and creation of knowledge requirements. We also see computational algorithms as vital tools that can assist humans in knowledge reasoning at scale and under uncertainty.
    Knowing machine unknowns is essential in any context both for making AI (debugging the machine) and for using AI (deciding when to trust the machine output). We envision that this project will have a tremendous scientific and practical impact, across all areas where AI and machine learning are applied.

  • What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition (WWW2022)
  • Ready Player One! Eliciting Diverse Knowledge Using A Configurable Game (WWW2022)

    [Model Explanation] SECA: Know How Your Machine Works

    State-of-the-art machine learning systems employ neural models that generally operate as “black boxes”. The opaqueness of these models has become a major obstacle to deploying, debugging, and tuning them. To understand how those models work, it is essential to explain model behavior in human-understandable language. This project introduces a scalable human-in-the-loop approach for global interpretability of machine learning models. We employ local interpretability methods to highlight salient input units and leverage human intelligence to annotate such units with semantic concepts. Those semantic concepts are then aggregated into a tabular representation of images to facilitate automatic statistical analysis of model behavior. Our approach supports the need for multi-concept interpretability in both model validation and exploration.
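    The aggregation step described above can be illustrated with a minimal sketch. It is not the SECA implementation: the annotation records, the concept names, and the helpers `build_concept_table` and `concept_rate_per_class` are all hypothetical, and the statistic shown (how often a concept appears in the salient regions of each predicted class) is only one simple example of an analysis such a table makes possible.

```python
from collections import defaultdict

# Hypothetical crowd annotations: for each image, the concepts that annotators
# assigned to its salient regions, plus the model's predicted class.
annotations = [
    {"image": "img1", "predicted": "ambulance", "concepts": {"siren", "red cross"}},
    {"image": "img2", "predicted": "ambulance", "concepts": {"siren", "wheel"}},
    {"image": "img3", "predicted": "van", "concepts": {"wheel", "window"}},
    {"image": "img4", "predicted": "van", "concepts": {"wheel"}},
]

def build_concept_table(rows):
    """Return the sorted concept list and one binary row per image."""
    concepts = sorted(set().union(*(r["concepts"] for r in rows)))
    table = [
        {"image": r["image"], "predicted": r["predicted"],
         **{c: int(c in r["concepts"]) for c in concepts}}
        for r in rows
    ]
    return concepts, table

def concept_rate_per_class(concepts, table):
    """Fraction of images in each predicted class that show each concept."""
    counts = defaultdict(int)
    sums = defaultdict(lambda: defaultdict(int))
    for row in table:
        counts[row["predicted"]] += 1
        for c in concepts:
            sums[row["predicted"]][c] += row[c]
    return {cls: {c: sums[cls][c] / n for c in concepts}
            for cls, n in counts.items()}

concepts, table = build_concept_table(annotations)
rates = concept_rate_per_class(concepts, table)
```

    In this toy data, "siren" appears in every image predicted as "ambulance" and in none predicted as "van", which is the kind of pattern a validation or exploration query over the table would surface.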

  • How can Explainability Methods be Used to Support Bug Identification in Computer Vision Models? (CHI2022)
  • What do You Mean? Interpreting Image Classification with Crowdsourced Concept Extraction and Analysis (WWW2021)
  • MARTA: Leveraging Human Rationales for Explainable Text Classification (AAAI2021)

    [Data Creation] OpenCrowd: Large-scale People Engagement for Data Creation

    Large-scale training data is the cornerstone of machine learning. Creating such data is a long, laborious, and usually expensive process, especially when it requires domain knowledge. Crowdsourcing provides a cost-effective way to quickly find a large number of contributors who, as a whole, possess broad knowledge in specific domains. This project aims to develop OpenCrowd, a large-scale expert finding and engagement platform for training data creation. The platform models relevant participant properties such as expertise, location, and cultural background, and employs peer routing techniques to scale out the search for experts. It employs peer grading techniques for effective and efficient outcome aggregation and assessment. OpenCrowd further allows the data creation process to be steered towards preferred data properties using only a small amount of seed data.

  • OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation (WWW2020)
  • Training Data Augmentation for Detecting Adverse Drug Reactions in User-Generated Content (EMNLP2019)
  • Peer Grading the Peer Reviews: A Dual-Role Approach for Lightening Scholarly Paper Review Processes (WWW2021)

    [Data Denoise] Scalpel-CD: Reducing Label Noise in Training Data

    The success of machine learning techniques heavily relies on not only the quantity but also the quality of labeled training data. Incorrect labels in the training data are generally difficult to identify and have become a main obstacle for developing, deploying, and improving machine learning models. This project introduces Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels from the training data of machine learning systems. Our system identifies potentially wrong labels by exploiting data distributions in the underlying latent feature space, and employs a data sampler which selects data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd.
    Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural).
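    The three steps above (flag suspicious labels in feature space, send the most suspicious items to the crowd, propagate verified labels to similar items) can be sketched as follows. This is an illustrative stand-in, not the Scalpel-CD algorithm: it swaps in a plain k-nearest-neighbour disagreement score and a fixed-radius propagation rule, and the function names, toy features, and radius are assumptions made for the example.

```python
import math

def nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def disagreement(points, labels, i, k=3):
    """Share of the k nearest neighbours whose label differs from labels[i]."""
    nbrs = nearest(points, i, k)
    return sum(labels[j] != labels[i] for j in nbrs) / k

def rank_suspicious(points, labels, k=3):
    """Item indices ordered from most to least likely mislabelled."""
    return sorted(range(len(points)),
                  key=lambda i: -disagreement(points, labels, i, k))

def propagate(points, labels, i, verified_label, radius):
    """Copy a crowd-verified label to item i and to every item within radius."""
    new_labels = list(labels)
    for j, p in enumerate(points):
        if math.dist(points[i], p) <= radius:
            new_labels[j] = verified_label
    return new_labels

# Toy latent features: two tight clusters, with one "dog" label inside the cat cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]

ranking = rank_suspicious(points, labels)
# Suppose the crowd inspects the top-ranked item and verifies it as "cat".
cleaned = propagate(points, labels, ranking[0], "cat", radius=0.2)
```

    Here the item at index 3 is ranked first because all of its neighbours disagree with its label, and the verified label then spreads only within its own cluster.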

  • Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data (WWW2019)

    [Data Debiasing] Perspective: Identifying and Characterizing Data Atypicality

    High-quality data plays an important role in developing reliable machine learning models. Incompleteness of training data or inherent biases can lead to negative and sometimes damaging effects, particularly in critical domains such as transport, finance, or medicine. It also implies that high-quality test data is vital for reliable and trustworthy evaluation. Despite that, what makes an item hard to classify remains an unstudied topic. This project aims to provide a first-of-its-kind, model-agnostic characterization of data atypicality based on human understanding. We consider the setting of data classification “in the wild”, where a large number of unlabeled items are accessible, and introduce a scalable and effective human computation approach for proactive identification and characterization of atypical data items.
    Our approach enables the creation of a feedback loop in the lifecycle of a machine learning model, thereby enabling a never-ending learning scenario where model performance can continuously improve. It is effective, cost-efficient, and generic to any machine learning task.

    [Systemic Evaluation] ValuableML: Co-Design of Value Metrics for ML

    For decades, the primary way to develop and assess machine learning (ML) models has been based on accuracy metrics (e.g. precision, recall, F1, AUC). We have largely forgotten that ML models are applied in an organisational or societal context because they provide value to people. This leads to a significant disconnect between the rapid progress of ML research, with correspondingly sky-high expectations among professionals in every field, and the limited adoption of ML. We see the need for new value-based metrics for the development and evaluation of ML models. These will cater to the actual needs and desires of users and relevant stakeholders, and will be tailored to the cost structure in specific use cases.
    This project aims to introduce proper value metrics and processes. We will create them by answering two fundamental questions: what makes a model 'good'? and what is the value of a model? We use a co-design methodology emphasising the importance of involving stakeholders in the creation of metrics, so they represent the collective interest of all involved.
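    To make the gap between accuracy and value concrete, here is a minimal sketch of a metric tailored to a cost structure. It is an assumption-laden illustration rather than a metric proposed by the project: the function `model_value`, the two toy models, and the numbers in `values` are invented for the example, whereas in the project such values would come out of the co-design process with stakeholders.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def model_value(y_true, y_pred, values):
    """Average value per prediction, given a value for each (true, predicted) outcome."""
    total = sum(values[(t, p)] for t, p in zip(y_true, y_pred))
    return total / len(y_true)

# Toy binary task with two positives among ten items.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses both positives
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # catches both, with three false alarms

# Hypothetical stakeholder values per outcome: a missed positive is costly,
# a false alarm is a minor nuisance.
values = {(1, 1): 10, (1, 0): -50, (0, 1): -2, (0, 0): 0}

acc_a, acc_b = accuracy(y_true, model_a), accuracy(y_true, model_b)
val_a, val_b = model_value(y_true, model_a, values), model_value(y_true, model_b, values)
```

    Under these assumed values, model A has the higher accuracy (0.8 against 0.7) while model B has the higher value per prediction (1.4 against -10.0), so the two metrics rank the models in opposite order.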

  • On the Value of ML Models (WHMD@NeurIPS2021)
  • The Science of Rejection: A Research Area for Human Computation (HCOMP2021)

    [Systemic Improvement] ContiLearn: Continual Learning from People

    To create a sustainable human-AI ecosystem where AI continuously serves its purpose and acts to the benefit of people, it is essential that machine learning models can actively and continuously learn from users and stakeholders over time. This project aims to introduce such an approach, referred to as ContiLearn, which creates a never-ending learning scenario where machine learning models actively and continuously learn from relevant stakeholders, while maximizing the diversity of viewpoints and the coverage and representativeness of real-world situations.

  • A Human-AI Loop Approach for Joint Keyword Discovery and Expectation Estimation in Micropost Event Detection (AAAI2020)
  • ActiveLink: Deep Active Learning for Link Prediction in Knowledge Graphs (WWW2019)
  • Leveraging Crowdsourcing Data For Deep Active Learning - An Application: Learning Intents in Alexa (WWW2018)