Data Annotation at Scale: Active and Semi-Supervised Learning

Gokhan Ciflikli, PhD

19 September @ ODSC2020

Workshop Structure

  • Intro
    • Motivation
    • Supervised Learning
  • Part I
    • Active Learning
  • Part II
    • Semi-Supervised Learning
    • Annotation Pipeline
  • Q&A

Reveal.js Basics

  • F to enter full-screen

    • ESC to see the full layout while not in full-screen
  • < > to switch between sections

    • When you switch back to a section, it remembers where you left off
  • ∧ ∨ to go back and forth within sections

Intro: Motivation

My Background

  • Computational social scientist by training (PhD, post-doc @LSE; researcher @UCL)

  • Specialisation in predictive modelling & methodology

  • Worked as a research scientist on an NLP active learning project conducted by Uppsala University (Sweden)

My Work

  • I’m a senior data scientist at Attest—a customer growth platform

  • London start-up aiming to disrupt the market research industry

  • We work in cross-functional teams (a la Spotify)

  • I am a member of the Audience Quality squad (data quality, fraud detection)

Problem Statement

  • Good: We are a survey company; we generate (store) a lot of data every day

  • Bad: The data do not come with labels:

    • Good/bad quality answers (relevance)
    • Open-text validation
    • Speeding/flatlining
    • Impossible/inconsistent demographics
    • etc.

New Frontier: Data Annotation

  • Initially, the bottleneck was obtaining data at scale

  • Now, unlabelled data is widely accessible:

    • Web scraping
    • Large corpora (text repositories)
    • ‘New’ types of data (audio, image, video)
    • Industries that generate streaming data (e.g. frequent transactions)

Annotation Trade-Off

  • In-house: employees have the all-important context, but they don’t scale (or are costly to scale)

  • Outsourced: e.g. MTurk; scales at a price; but context is mostly lost

  • Inter-coder reliability: Adds robustness to labels at the expense of increased costs

Intro: Supervised Learning

What is ML?

Per Tom Mitchell:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

  • Very CS/SWE oriented definition

Statistics vs. ML

  • What is the difference between traditional statistical models and machine learning algorithms?

    • e.g. Logistic regression vs. Random Forest classifier?
  • Not much IMO; I subscribe to Leo Breiman’s separation of data vs. algorithmic models

    • i.e. are you trying to explain in-sample variance or to predict out-of-sample?

Model Selection

  • Not the technical (i.e. cross-validation) but the more qualitative aspect w.r.t. the data-generating process (DGP)

    • You have a hypothesis X -> y

    • And your model is your hypothesis about how -> comes about (e.g. linear, non-linear)

Correlation vs. Causation

  • Everyone has heard of the maxim correlation does not imply causation

  • Causal inference—broadly defined here to include path analysis, SEM, SCM—takes correlation one step further (assuming the given causal structure is appropriate)

  • Ladder of causation

  • CI practitioners such as Judea Pearl argue all of ML is mere curve-fitting

Bias-Variance Trade-off

Complexity

  • True function complexity vs. size of the training data

    • If the DGP is simple -> an inflexible algorithm (high bias, low variance) and a small training set will do

    • If the DGP is complex -> a flexible algorithm (low bias, high variance) and a very large training set are needed

Dimensionality

  • Curse of dimensionality/sparsity

    • Low variance, high bias algorithms can minimise the effect of high (but irrelevant) dimensions

    • Dimensionality reduction techniques also help (e.g. PCA, regularisation, feature selection)

Noise

  • Stochastic and deterministic

    • High bias, low variance algorithms can be used to minimise the effect of noise

    • Also early stopping criteria; outlier/anomaly detection (risky!)

Fundamental Problem of Inference

A simplified example from market research—where things can go wrong w.r.t. inference

  • Data -> Sample
  • Sample -> Population
  • Population -> Superpopulation

Part I: Active Learning

AL Resources

Motivation

Consider the conventional (passive) ML pipeline

  • Gather (presumably unlabelled) data
  • Manually label a fraction -> training set
  • Cross-validate hyper-parameters/model selection
  • Predict on test set and report performance metrics

Motivation

  • Given the enormous influence of the training set on the accuracy of the final model,

  • and the fact that labelling is costly,

  • is there a better, scalable way of obtaining data labels?

Source: modAL documentation

Given limited resources, which data points would you query for their labels?

  • AL framework posits some data points (instances) are more informative than others

  • By learning the true labels for the instances that the model is least confident about (in the figure, the cluster in the centre), the model will then generalise to the remaining instances more easily

  • The idea is that we can achieve high model performance by only labelling a fraction of the available data

AL Framework

Source: Settles, 2009

AL Cycle

A learner begins with a small number of instances in \(\mathcal{L}\) (labelled training set)

  1. Request labels for one or more instances (query)
  2. Oracle (human annotator) provides labels for the queried instances
  3. Learn from the query results (append the labelled instances to the training set and refit the model)
  4. Repeat steps 1-3 until a stopping criterion is reached (model performance, empty \(\mathcal{U}\), cost); a minimal sketch of this loop follows below
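
A minimal sketch of the cycle in Python, assuming numpy arrays X_pool and y_pool already exist and that y_pool stands in for the oracle (in practice the label would come from a human annotator); the learner and the 30-query budget are placeholder choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Seed the labelled set L with a handful of random instances
# (make sure the seed contains every class before fitting).
labelled = list(rng.choice(len(X_pool), size=10, replace=False))
unlabelled = [i for i in range(len(X_pool)) if i not in labelled]

model = LogisticRegression()

for _ in range(30):                                   # stopping criterion: query budget
    model.fit(X_pool[labelled], y_pool[labelled])

    # 1. Query: the instance the model is least confident about
    proba = model.predict_proba(X_pool[unlabelled])
    query = unlabelled[int(np.argmin(proba.max(axis=1)))]

    # 2. Oracle provides the label (here we simply look it up in y_pool)
    # 3. Learn: move the instance from U to L and refit on the next pass
    labelled.append(query)
    unlabelled.remove(query)
```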

AL–Performance Example

Source: Settles, 2009

Toy data generated from two Gaussians centred at (-2,0) and (2,0) with \(\sigma=1\)

  • 200 instances sampled from both classes represented in 2D space (a)
  • performance after 30 randomly selected instances are labelled (b, 70%)
  • performance after 30 instances selected by AL using uncertainty sampling (c, 90%)
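
This toy setting is easy to reproduce. Below is a rough sketch using modAL with a logistic regression learner rather than the exact model in Settles (2009); the seed indices and query budget are arbitrary choices, so the accuracy will not match the figures above exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner

rng = np.random.default_rng(0)

# Two Gaussians centred at (-2, 0) and (2, 0) with sigma = 1
X = np.vstack([rng.normal([-2, 0], 1.0, size=(200, 2)),
               rng.normal([2, 0], 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

# A few labelled instances from each class; the rest is the unlabelled pool
seed = np.array([0, 1, 2, 200, 201, 202])
pool = np.setdiff1d(np.arange(len(X)), seed)

learner = ActiveLearner(estimator=LogisticRegression(),
                        X_training=X[seed], y_training=y[seed])

for _ in range(30):
    query_idx, _ = learner.query(X[pool])     # default strategy: uncertainty sampling
    learner.teach(X[pool][query_idx], y[pool][query_idx])
    pool = np.delete(pool, query_idx)

print("accuracy after 30 queries:", learner.score(X, y))
```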

Sampling Strategies

There are three main sampling scenarios in AL

  • Membership query synthesis
  • Stream-based selective sampling
  • Pool-based sampling
Source: Settles, 2009

Membership Query Synthesis

  • Learner queries instances that it generates de novo

    • i.e. the learner generates new samples instead of using existing data points
  • More useful in certain domains than others

    • Can generate unintelligible/gibberish samples for humans (but evidently not for the machine)

Stream-based Selective Sampling

  • Learner assumes obtaining (not labelling) a data instance is free

  • So it can be sampled from the actual distribution

    • Useful if the distribution is non-uniform or unknown
  • The learner then decides whether to query or discard the sample (sequential)

Pool-based Sampling

  • Learner assumes there is a small set of labelled data \(\mathcal{L}\) and a large pool of unlabelled data \(\mathcal{U}\)

  • Queries are drawn from \(\mathcal{U}\), which is assumed to be closed (static)

    • The pool can be dynamic depending on the design
  • All instances in \(\mathcal{U}\) are ranked based on an informativeness metric, which can then be queried in that order
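
A tiny sketch of the ranking step, assuming a fitted probabilistic classifier model and an unlabelled pool X_unlabelled (both hypothetical names); least-confidence is used as the informativeness metric here.

```python
import numpy as np

proba = model.predict_proba(X_unlabelled)      # posterior probabilities over classes
informativeness = 1 - proba.max(axis=1)        # least-confidence score per instance
ranking = np.argsort(informativeness)[::-1]    # pool indices, most informative first
```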

Query Strategies

We will consider two main approaches to evaluating the informativeness of an unlabelled instance:

  • Uncertainty sampling

    • Least confident

    • Margin

    • Entropy

  • Query-by-committee

Least Confident

  • Simplest query strategy: the learner queries the instances about which it is the least certain

  • In binary classification, instances whose posterior probability of being positive is closest to 0.5

Least Confident–Multi-Class

\[x^{*}_{LC}=\underset{x}{\operatorname{argmax}}1-P_{\theta}(\hat{y}|x)\]

where

  • \(x^{*}_{LC}\) is the most informative instance (query) under strategy \(LC\)

  • \(\hat{y}=\text{argmax}_yP_{\theta}(y|x)\) is the class label with the highest posterior probability under model \(\theta\)

Least Confident

  • In multi-class classification, this can be thought of as the expected 0/1 loss: the model’s degree of belief that it will mislabel \(x\).

  • However, there is information loss:

    • Only the information about the most probable label is utilised; the rest is discarded

Margin Sampling

\[x^{*}_{M}=\underset{x}{\operatorname{argmin}}P_{\theta}(\hat{y}_{1}|x)-P_{\theta}(\hat{y}_{2}|x)\]

where

  • \(\hat{y}_{1}\) and \(\hat{y}_{2}\) are the first and second most probable class labels under \(\theta\)

Margin Sampling

  • The inclusion of the posterior of the second most likely class addresses the \(LC\) shortcoming

  • Intuitively:

    • Large margins denote the model is confident in differentiating classes

    • Small margins indicate that the model is ambiguous (and knowing the true label would increase its performance)

Entropy Sampling

\[x^{*}_{H}=\underset{x}{\operatorname{argmax}}-\sum_{i} P_{\theta}(y_{i}|x) \log P_{\theta}(y_{i}|x)\]

where \(y_{i}\) ranges over all class labels and \(H\) denotes entropy.

Entropy Sampling

  • Margin sampling partially addresses the information loss of least confident

  • Entropy generalises margin sampling to all class labels

  • For binary classification, all three approaches are identical; a sketch implementing all three follows below
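
A sketch computing all three scores on a made-up matrix of posterior probabilities (rows are instances, columns are classes); nothing here comes from a real model.

```python
import numpy as np

proba = np.array([[0.10, 0.80, 0.10],    # confident prediction
                  [0.40, 0.35, 0.25],    # ambiguous between the top two classes
                  [0.34, 0.33, 0.33]])   # near-uniform posterior

least_confident = 1 - proba.max(axis=1)                 # query the argmax
sorted_p = np.sort(proba, axis=1)[:, ::-1]
margin = sorted_p[:, 0] - sorted_p[:, 1]                # query the argmin
entropy = -(proba * np.log(proba)).sum(axis=1)          # query the argmax

print(np.argmax(least_confident), np.argmin(margin), np.argmax(entropy))
```

With two classes the three rankings always coincide; with three or more they can disagree, which is where the choice of strategy starts to matter.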

Uncertainty Sampling–Example

Source: Settles, 2009
  • Three classes, each ‘occupying’ a corner of the triangle

  • Red indicates more informative regions; blue denotes that the model is confident

  • For all three uncertainty strategies, the centre of the triangle is the most informative (as the posterior probability is uniform in that region)

  • Similarly, the least informative regions are the corners

  • In LC (a), the information slightly diffuses to the class boundaries from the centre

  • In M (b), the information primarily diffuses from the class boundaries

  • In H (c), the information diffuses from the centre

Query-by-Committee

  • Construct a committee \(\mathcal{C}=\{\theta^{(1)},\dots,\theta^{(C)}\}\) of models that are trained on labelled data \(\mathcal{L}\)

  • Models represent different hypotheses in a version space

  • Models vote on the labels of query candidates

  • The instance with the most disagreement is the most informative query (see the sketch below)
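
A sketch of QbC using vote entropy as the disagreement measure; X_labelled, y_labelled and X_pool are assumed to exist, class labels are assumed to be integers 0..K-1, and the three committee members are arbitrary choices (modAL also ships a Committee wrapper for this).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

committee = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(n_estimators=50),
             GaussianNB()]
for member in committee:
    member.fit(X_labelled, y_labelled)

votes = np.stack([m.predict(X_pool) for m in committee])   # shape: (members, instances)
n_classes = len(np.unique(y_labelled))

def vote_entropy(instance_votes):
    """Entropy of the committee's vote distribution for one instance."""
    freq = np.bincount(instance_votes, minlength=n_classes) / len(committee)
    freq = freq[freq > 0]
    return -(freq * np.log(freq)).sum()

disagreement = np.apply_along_axis(vote_entropy, 0, votes.astype(int))
query_idx = int(np.argmax(disagreement))     # the most disagreed-upon instance
```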

Version Spaces

Source: Settles, 2009
  • The idea of QbC is to minimise the version space: the set of hypotheses that are consistent with \(\mathcal{L}\)

  • In the example, linear (a) and axis-parallel box (b) classifiers are shown with their version spaces

  • We want to identify the ‘best’ model within the version space, so the task of AL is to constrain the size of this space as much as possible

Problem Variants

We will now discuss some generalisations and extensions of AL

  • Active Feature Acquisition and Classification

  • Active Class Selection

  • Active Clustering

Feature Acquisition

  • In some domains, instances can have incomplete (but retrievable) features

    • Medical records, credit card history from another provider, purchase habits etc.
  • Active feature acquisition operates under the assumption that additional data features can be obtained at a cost

Feature Acquisition

Zheng and Padmanabhan (2002) propose two ‘single-pass’ approaches

  • Impute missing values, then acquire the ones about which the model is least certain

  • Alternatively: train models on imputed instances, only acquire feature values for the misclassified instances

Active Class Selection

  • AL assumes instances are free but labelling is costly

  • Opposite scenario: learner can query a (known) class label, but obtaining instances is costly

    • e.g. you know the class labels and want to teach the model how to differentiate between them

Active Clustering

i.e. AL for unsupervised learning

  • Counter-intuitive?

  • Sample the unlabelled instances in a way that they form self-organised clusters

  • The idea is that this can produce clusters with less overlap (noise) than those identified by random sampling

Practical Considerations

So far, so good. But what are our assumptions?

  • Is there always a single oracle?

  • Is the oracle always correct?

  • Is the cost for labelling constant?

  • Is there a determinable endpoint to stop learning?

Batch-Mode AL

  • Traditional AL approaches serially (i.e. one at a time) select instances to be queried

  • In parallel learning environments, this is not a desirable characteristic

  • Batch-mode allows a learner to query instances in groups

Batch-Mode AL

  • Assume you have 100 instances. You label 10 instances and rank the remaining 90 instances based on their informativeness in one go.

  • Do you think this initial ranking would hold if you continue ranking after labelling another 10, 20 etc. instances?

Batch-Mode AL

  • In the case of SVMs, several batch-mode AL approaches exist in the literature

  • The idea is to introduce a distance metric that measures diversity among the instances within a batch (see the sketch below)
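
A sketch of one such heuristic: greedily build a batch that balances uncertainty against distance to what is already in the batch. X_pool and a per-instance uncertainty vector are assumed to exist; the 0.5 weighting is arbitrary, and the two terms should really be put on a comparable scale first.

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_batch(X_pool, uncertainty, batch_size=10, alpha=0.5):
    """Greedy batch selection mixing informativeness and diversity (sketch)."""
    selected = [int(np.argmax(uncertainty))]          # start with the most uncertain
    while len(selected) < batch_size:
        # distance from each candidate to its nearest already-selected instance
        diversity = cdist(X_pool, X_pool[selected]).min(axis=1)
        score = alpha * uncertainty + (1 - alpha) * diversity
        score[selected] = -np.inf                     # never pick the same instance twice
        selected.append(int(np.argmax(score)))
    return selected
```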

Noisy Oracles

  • How reliable are human annotators?

    • Fatigue, distractions, biases all lower annotation performance
  • Traditional AL formulates its cost function purely from a labelling cost standpoint; the obtained label is the ground truth

  • What if the oracle is noisy?

Noisy Oracles

  • Trade-off: Should the learner query a potentially noisy label for a new instance OR query repeated labels to de-noise an existing labelled instance that it is not confident about?

  • What if one oracle is almost always correct and others are noisy?

  • Open questions!

Variable Labelling Costs

  • Traditional AL assumes the costs of obtaining a label are uniform

  • If known, the varying costs of labelling certain instances can be added to the cost function

  • Then, the most informative instance to be labelled is a function of both the labelling cost and its marginal utility

  • Costs can be formulaic, e.g. the length of the text to be labelled (see the sketch below)
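
A sketch of cost-aware querying for a text task, assuming a list of raw texts and a matching informativeness vector (both hypothetical names); text length in words stands in for the labelling cost.

```python
import numpy as np

costs = np.array([len(t.split()) for t in texts])   # proxy cost: words the oracle must read
utility = informativeness / costs                   # information gained per unit of cost
query_idx = int(np.argmax(utility))                 # most 'bang for the buck' instance
```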

Stopping Criteria

  • Another cost-related issue is termination—e.g. when to stop learning

  • Suggestions include cost/utility functions (learn as long as it is beneficial) and a pre-determined model performance threshold

  • In real life, people tend to stop when their allocated budget runs out

Further Caveats

  • We have challenged some of the assumptions of AL from a cost/benefit perspective (i.e. is it worth it?)

  • Another question is: but does it work?

  • Settles (2009) provides ample empirical evidence from multiple domains showing that AL does work

  • However, there are caveats

Path Dependency

  • If an AL project is built on a learner and an unlabelled dataset \(\mathcal{U}\), it is inherently tied to the hypothesis of its learner

    • i.e. the labelled instances in \(\mathcal{L}\) are not i.i.d. but come from a biased distribution
  • What if we change the model later in the process?

    • No guarantee that \(\mathcal{L}\) will be useful

    • The larger the distance between model families, the higher the risk

Inefficiency

  • In some cases, AL has been shown to require more labelled instances than passive learning, even when using the same model class

  • If the most appropriate model class and feature set are known a priori, using AL is safe

    • Otherwise, random sampling may be more appropriate until the above is established
  • Heterogeneous model ensembles/feature sets are also advisable in such cases

Part II: SSL + Data Annotation Pipeline

Semi-Supervised Learning

Semi-Supervised Learning tackles the annotation problem from the opposite direction

  • AL focuses on exploring the unknown from the perspective of a learner

  • SSL, in contrast, takes advantage of the instances that the model is highly confident about

  • Literature review (Zhu, 2008)

SSL

  • SSL is situated between supervised and unsupervised (e.g. clustering) learning paradigms
Source: Wikipedia

SSL

  • Recall the predicted probability figure with red and blue contours in Part I

  • SSL posits that, instead of querying the most informative regions (the centre), we can expand the size of \(\mathcal{L}\) by appending the instances that the model is fairly certain about

SSL

  • Set an arbitrary threshold; say 95%. Then, every instance in \(\mathcal{U}\) that the model predicts with a probability of 95% or higher will be assigned to that class label (as if annotated by a human expert)

  • In the figure, this would correspond to the innermost instances at the top left and bottom right being assigned blue and red labels, respectively
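
A sketch of this self-training step, assuming a fitted probabilistic classifier model and numpy arrays X_l, y_l (labelled) and X_u (unlabelled); the 0.95 threshold is the arbitrary choice from the slide. scikit-learn's SelfTrainingClassifier wraps essentially this loop if you prefer not to hand-roll it.

```python
import numpy as np

proba = model.predict_proba(X_u)
confidence = proba.max(axis=1)
mask = confidence >= 0.95                       # keep only the most confident predictions

X_l = np.vstack([X_l, X_u[mask]])               # append pseudo-labelled instances to L
y_l = np.concatenate([y_l, proba[mask].argmax(axis=1)])   # synthetic labels, not ground truth
X_u = X_u[~mask]                                # remove them from U
```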

Differences in Label Quality

  • However, there is a qualitative difference between the labels obtained by AL and SSL

  • Labels annotated by an oracle in AL are thought of as ground truth; they are the true labels (oracle quality issues notwithstanding)

  • Labels predicted by a model in SSL are thought of as synthetic labels; they are forecasts

Complete Data Annotation Pipeline

As AL and SSL are naturally compatible, it is common to see them employed together in large data annotation projects

  • The idea is to attack the problem from two opposite directions using a complementary framework

  • See Tomanek & Hahn (2009) for a Semi-Supervised Active Learning pipeline (SeSAL)

1 Cold Start

A general-purpose pipeline combining both AL and SSL assuming cold start—i.e. \(\mathcal{L}\) is an empty set; otherwise skip to the next step

  • Manually label a small number of instances—can be up to 20-30 data points

  • Adjust according to domain/feasibility

  • Query strategies available in modAL will merely give you the first \(n\) instances if you start cold

    • Make sure the data is shuffled! (a seeding sketch follows below)
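
A minimal sketch of cold-start seeding, assuming an unlabelled numpy array X_pool; the seed size of 25 is simply within the 20-30 range suggested above.

```python
import numpy as np

rng = np.random.default_rng(0)
seed_idx = rng.permutation(len(X_pool))[:25]    # shuffle, then take the first n
X_seed = X_pool[seed_idx]                       # hand these to the annotator(s)
```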

1.5 SSL

Although it may not be viable in the first iteration, when you are repeating the loop:

  • Predict class labels (with probabilities) for instances in \(\mathcal{U}\)

  • Choose a threshold at which you are comfortable making a judgement call (>= \(n\)% predicted probability)

  • Append those instances to \(\mathcal{L}\) and delete them from \(\mathcal{U}\)

2 Model Fit

  • Fit a learner on \(\mathcal{L}\)

    • Pay attention to your qualitative hypothesis and the hypothesis implied by your model/algorithm family selection

    • Remember path dependency; are you likely to make drastic changes to the model family?

    • If yes, consider an ensemble of learners (query-by-committee)

3 Sampling Strategy

  • How do you obtain data—streaming, fetched in chunks, or do you already have all of it?

  • How heavy is your model? How fast can it predict a single data point?

  • Does your model have expectations w.r.t. the data structure?

    • Can it handle one observation or does it require multiple observations?

4 Query Strategy

  • Query an instance from \(\mathcal{U}\)

    • For binary classification, query strategy is relatively trivial

    • For multi-class classification, consider learning goals and expected information loss

    • No free lunch!

5 Human-in-the-loop

  • Oracle provides labels to queries -> appended to \(\mathcal{L}\)

  • Pay attention to labelling costs—are they uniform? If not, account for them in your query strategy

5 Human-in-the-loop

  • Single oracle vs. multiple oracles?

    • How crucial is context for your annotation task?

    • Are the annotators equal w.r.t. labelling?

    • If the answer is no to both, consider multiple annotators for robustness (funds permitting)
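
If you do use multiple annotators, it is worth checking how much they agree before treating the labels as ground truth. A sketch, assuming two annotators' label arrays labels_a and labels_b (hypothetical names) over the same instances:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(labels_a, labels_b)    # 1 = perfect agreement, 0 = chance level

# Keep agreed-upon labels; flag disagreements (-1) for adjudication by a third coder
resolved = np.where(labels_a == labels_b, labels_a, -1)
```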

6 Rinse and Repeat

  • Continue iterating until a stopping criterion is met

    • Usually money!
  • It is normal that the first iterations are the most volatile

    • Performance might regress, but should pick up soon after

    • Otherwise something is not right!

When to Implement SSL

  • The SSL stage can be moved around depending on what you want to get out of it

  • AL can be used to verify some of the synthetic labels predicted by SSL

  • You can experiment with SSL predictions using different thresholds:

    • Which labels would be queried by AL in the next round if you append instances with 99% predicted probability vs. 90%?

Experiment with Ensembles

  • There are ensemble approaches other than Query-by-Committee

  • You can use the same model, but have three learners using least confident, margin, and entropy sampling

  • You can have one pure AL stream and an AL + SSL stream

Think Ahead

  • It is possible to be led astray using this approach (i.e. performance degradation over time)

  • Create a separate database for keeping track of the labels supplied by AL and SSL

  • If model performance deteriorates in the future, revert back to an earlier stage and start again

  • This also makes sure you know which labels are provided by a human and which ones are synthetically labelled

Things to Consider

It literally pays to plan ahead:

  • What is your budget?

  • How much does an individual label cost?

  • Are all classes known a priori?

  • What is an acceptable model performance to stop learning?

Outsourcing

In addition to crowdsourced annotation solutions like MTurk, the whole AL + SSL enterprise can be outsourced

  • If your company is already using AWS:

    • SageMaker GroundTruth is a frontend for data annotation

      • Supports private (employees), public (MTurk), and third-party annotators

      • Has a built-in SSL component

    • Augmented AI implements AL in a fully managed workflow

Thanks!