Data Annotation at Scale: Active and Semi-Supervised Learning

Gokhan Ciflikli, PhD

19 September @ ODSC2020

Workshop Structure

  • Intro
    • Motivation
    • Supervised Learning
  • Part I
    • Active Learning
  • Part II
    • Semi-Supervised Learning
    • Annotation Pipeline
  • Q&A

Intro: Motivation

My Background

  • Computational social scientist by training (PhD, post-doc @LSE; researcher @UCL)

  • Specialisation in predictive modelling & methodology

  • Worked as a research scientist on an NLP active learning project conducted by Uppsala University (Sweden)

My Work

  • I’m a senior data scientist at Attest—a customer growth platform

  • London start-up aiming to disrupt the market research industry

  • We work in cross-functional teams (à la Spotify)

  • I am a member of the Audience Quality squad (data quality, fraud detection)

Problem Statement

  • Good: We are a survey company; we generate (store) a lot of data every day

  • Bad: The data do not come with labels for:

    • Good/bad quality answers (relevance)
    • Open-text validation
    • Speeding/flatlining
    • Impossible/inconsistent demographics
    • etc.

New Frontier: Data Annotation

  • Initially, the bottleneck was obtaining data at scale

  • Now, unlabelled data is widely accessible:

    • Web scraping
    • Large corpora (text repositories)
    • ‘New’ types of data (audio, image, video)
    • Industries that generate streaming data (e.g. frequent transactions)

Annotation Trade-Off

  • In-house: employees have the all-important context, but they are costly to scale

  • Outsourced: e.g. MTurk; scales at a price; but context is mostly lost

  • Inter-coder reliability: having several annotators label the same items adds robustness to labels at the expense of increased costs
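
One common way to quantify inter-coder reliability for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal stdlib-only illustration; the function name `cohen_kappa` and the example labels are hypothetical and not taken from the workshop materials.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical labels from two annotators on six open-text answers.
annotator_1 = ["good", "good", "bad", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, so a low kappa suggests the labelling guidelines need work before paying for more annotations.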

Intro: Supervised Learning

What is ML?

Per Tom Mitchell:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

  • Very CS/SWE oriented definition

Statistics vs. ML

  • What is the difference between traditional statistical models and machine learning algorithms?

    • e.g. Logistic regression vs. Random Forest classifier?
  • Not much IMO; I subscribe to Leo Breiman’s distinction between data modelling and algorithmic modelling

    • i.e. are you trying to explain in-sample variance or to predict out-of-sample?
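
To make the explain-vs-predict distinction concrete, the sketch below scores one and the same fitted line in both ways: in-sample R² (variance explained) and out-of-sample mean squared error (predictive accuracy). The data-generating process `y = 2x + noise`, the sample sizes, and all function names are assumptions made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def simulate(n):
    """Hypothetical DGP: y = 2x + Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = simulate(50)
test_x, test_y = simulate(50)
intercept, slope = fit_line(train_x, train_y)

# "Explain": share of in-sample variance accounted for by the fitted line.
mean_train_y = sum(train_y) / len(train_y)
var_train_y = sum((y - mean_train_y) ** 2 for y in train_y) / len(train_y)
r_squared = 1.0 - mse(train_x, train_y, intercept, slope) / var_train_y

# "Predict": error on data the model never saw during fitting.
out_of_sample_mse = mse(test_x, test_y, intercept, slope)
print(f"in-sample R^2 = {r_squared:.3f}, out-of-sample MSE = {out_of_sample_mse:.4f}")
```

The model is identical in both cases; only the question asked of it changes.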

Model Selection

  • Not the technical aspect (e.g. cross-validation) but the more qualitative one: your assumptions about the data-generating process (DGP)

    • You have a hypothesis X -> y

    • And your model is your hypothesis about how -> comes about (e.g. linear, non-linear)

Correlation vs. Causation

  • Everyone has heard the maxim ‘correlation does not imply causation’

  • Causal inference (broadly defined here to include path analysis, SEM, and SCM) takes correlation one step further, assuming the given causal structure is appropriate

  • Ladder of causation

  • CI practitioners such as Judea Pearl argue all of ML is mere curve-fitting

Bias-Variance Trade-off


  • True function complexity vs. size of the training data

    • If the DGP is simple -> an inflexible algorithm (high bias, low variance) and a small training set suffice

    • If the DGP is complex -> a flexible algorithm (low bias, high variance) and a very large training set are needed


  • Curse of dimensionality/sparsity

    • Low variance, high bias algorithms can minimise the effect of high (but irrelevant) dimensions

    • Also dimensionality reduction techniques (e.g. PCA, regularisation, feature selection)


  • Stochastic and deterministic noise

    • High bias, low variance algorithms can be used to minimise the effect of noise

    • Also early stopping criteria; outlier/anomaly detection (risky!)
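
The trade-off in the points above is usually summarised by the standard decomposition of expected squared error. The notation is an assumption introduced here, not taken from the slides: the DGP is written as y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Inflexible algorithms shrink the variance term at the cost of the bias term, and flexible ones do the reverse; neither can reduce σ².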

Fundamental Problem of Inference

A simplified example from market research—where things can go wrong w.r.t. inference

  • Data -> Sample
  • Sample -> Population
  • Population -> Superpopulation

Part I: Active Learning

AL Resources


Consider the conventional (passive) ML pipeline

  • Gather (presumably unlabelled) data
  • Manually label a fraction -> training set
  • Cross-validate hyper-parameters/model selection
  • Predict on test set and report performance metrics
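
The four steps above can be sketched end to end in a few lines. Everything here is a toy assumption for illustration: the one-dimensional feature, the `annotate` function standing in for a human labeller (with a made-up 0.6 cut-off), the k-nearest-neighbour classifier, and a single hold-out split used in place of full cross-validation.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# 1. Gather unlabelled data (hypothetical 1-D feature per respondent).
pool = [rng.random() for _ in range(1000)]

def annotate(x):
    """Stand-in for a human annotator; the 0.6 cut-off is made up."""
    return int(x > 0.6)

# 2. Manually label a random fraction and split it three ways.
labelled = [(x, annotate(x)) for x in rng.sample(pool, 100)]
train, validation, test = labelled[:60], labelled[60:80], labelled[80:]

def knn_predict(rows, x, k):
    """Majority vote among the k labelled points closest to x."""
    nearest = sorted(rows, key=lambda row: abs(row[0] - x))[:k]
    return int(2 * sum(label for _, label in nearest) > k)

def accuracy(rows, eval_rows, k):
    """Share of eval_rows classified correctly using rows as training data."""
    hits = sum(knn_predict(rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# 3. Model selection: a single hold-out split stands in for cross-validation.
best_k = max((1, 3, 5), key=lambda k: accuracy(train, validation, k))

# 4. Predict on the test set and report a performance metric.
test_accuracy = accuracy(train, test, best_k)
print(f"k={best_k}, test accuracy={test_accuracy:.2f}")
```

Note that step 2 picks the items to label at random, so the model has no say in which labels it receives; that is what makes this pipeline passive.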


  • Given the enormous influence of the training set on the accuracy of the final model,

  • and the fact that labelling is costly,

  • is there a better way of obtaining data labels that can scale?