F to enter full-screen
ESC to see the full layout while not in full-screen
< > to switch between sections
∧ ∨ to go back and forth within sections
Computational social scientist by training (PhD, post-doc @LSE; researcher @UCL)
Specialisation in predictive modelling & methodology
Worked as a research scientist on an NLP active learning project conducted by Uppsala University (Sweden)
I’m a senior data scientist at Attest—a customer growth platform
London start-up aiming to disrupt the market research industry
We work in cross-functional teams (a la Spotify)
I am a member of the Audience Quality squad (data quality, fraud detection)
Good: We are a survey company; we generate (store) a lot of data every day
Bad: The data do not come with labels
Initially, the bottleneck was obtaining data at scale
Now, unlabelled data is widely accessible:
In-house: employees have the all-important context, but they don’t scale (or are costly to scale)
Outsourced: e.g. MTurk; scales at a price; but context is mostly lost
Inter-coder reliability: Adds robustness to labels at the expense of increased costs
Per Tom Mitchell:
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
What is the difference between traditional statistical models and machine learning algorithms?
Not much IMO; I subscribe to Leo Breiman’s separation of data vs. algorithmic models
Not the technical aspect (e.g. whether you cross-validate) but the more qualitative one: the assumptions about the data-generating process (DGP)
You have a hypothesis \(X \rightarrow y\)
And your model is your hypothesis about how the \(\rightarrow\) comes about (e.g. linear, non-linear)
Everyone has heard of the maxim correlation does not imply causation
Causal inference (broadly defined here to include path analysis, SEM, and SCM) takes correlation one step further (assuming the given causal structure is appropriate)
CI practitioners such as Judea Pearl argue all of ML is mere curve-fitting
True function complexity vs. size of the training data
If the DGP is simple -> an inflexible algorithm (high bias, low variance) and a small training set suffice
If the DGP is complex -> a flexible algorithm (low bias, high variance) and a very large training set are needed
Curse of dimensionality/sparsity
Low variance, high bias algorithms can minimise the effect of high (but irrelevant) dimensions
Dimensionality reduction techniques also help (e.g. PCA, regularisation, feature selection)
Stochastic and deterministic noise
High bias, low variance algorithms can be used to minimise the effect of noise
Also early stopping criteria; outlier/anomaly detection (risky!)
A simplified example from market research—where things can go wrong w.r.t. inference
Amazing literature review (Settles, 2009)
AL resources at http://active-learning.net/
MEAP book by Robert Munro Human-in-the-Loop Machine Learning
modAL: a highly modular, scikit-learn-compatible AL framework for Python (a minimal usage sketch follows this list)
And a lot of great YouTube videos: Microsoft Research, ICML 2019, UoM, Prendki
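As a taster of the modAL interface listed above, a minimal pool-based loop might look like the sketch below; the RandomForest estimator, the synthetic pool, and the ten-instance seed are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 2))        # unlabelled pool (toy stand-in)
y_pool = (X_pool[:, 0] > 0).astype(int)    # oracle answers, hidden from the learner

# Seed the learner with a handful of labelled instances
seed = rng.choice(len(X_pool), size=10, replace=False)
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_pool[seed], y_training=y_pool[seed],
)

for _ in range(20):                                       # 20 queries
    query_idx, _ = learner.query(X_pool)                  # default: uncertainty sampling
    learner.teach(X_pool[query_idx], y_pool[query_idx])   # oracle supplies the label
    X_pool = np.delete(X_pool, query_idx, axis=0)         # drop the queried instance from the pool
    y_pool = np.delete(y_pool, query_idx)
```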
Consider the conventional (passive) ML pipeline
Given the enormous influence of the training set on the accuracy of the final model,
and the fact that labelling is costly,
is there a better way of obtaining data labels that can scale?
Given limited resources, which data points would you query for their labels?
AL framework posits some data points (instances) are more informative than others
By learning the true labels for the instances that the model is least confident about (in the figure, the cluster in the centre), the model will then generalise to the remaining instances more easily
The idea is that we can achieve high model performance by only labelling a fraction of the available data
A learner begins with a small number of instances in \(\mathcal{L}\) (labelled training set)
Toy data generated from two Gaussians centred at (-2,0) and (2,0) with \(\sigma=1\)
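For reference, the toy data in the figure can be reproduced roughly as follows (the sample size and random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)                            # seed is an arbitrary choice
n = 200                                                    # points per class (assumed)
blue = rng.normal(loc=(-2, 0), scale=1.0, size=(n, 2))     # Gaussian centred at (-2, 0), sigma = 1
red = rng.normal(loc=(2, 0), scale=1.0, size=(n, 2))       # Gaussian centred at (2, 0), sigma = 1
X = np.vstack([blue, red])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
```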
There are three main sampling scenarios in AL
Membership query synthesis: the learner queries instances that it generates de novo
More useful in certain domains than others
Stream-based selective sampling: the learner assumes obtaining (not labelling) a data instance is free
So it can be sampled from the actual distribution
The learner then decides whether to query or discard the sample (sequential)
Pool-based sampling: the learner assumes there is a small set of labelled data \(\mathcal{L}\) and a large pool of unlabelled data \(\mathcal{U}\)
Queries are drawn from \(\mathcal{U}\), which is assumed to be closed (static)
All instances in \(\mathcal{U}\) are ranked based on an informativeness metric, which can then be queried in that order
We will consider two main approaches to evaluating the informativeness of an unlabelled instance:
Uncertainty sampling
Least confident
Margin
Entropy
Query-by-committee
Simplest query strategy: the learner queries the instances about which it is the least certain
In binary classification, instances whose posterior probability of being positive is closest to 0.5
\[x^{*}_{LC}=\underset{x}{\operatorname{argmax}}1-P_{\theta}(\hat{y}|x)\]
where
\(x^{*}_{LC}\) is the most informative instance (query) under strategy \(LC\)
\(\hat{y}=\text{argmax}_yP_{\theta}(y|x)\) is the class label with the highest posterior probability under model \(\theta\)
In multi-class classification, this can be thought of as the expected 0/1-loss, i.e. the model’s belief that it will mislabel \(x\).
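A direct translation of the \(LC\) formula, assuming a fitted scikit-learn-style classifier `model` and an unlabelled pool `X_pool` (both placeholders), might be:

```python
import numpy as np

def least_confident_query(model, X_pool):
    """Return the index of the pool instance whose top posterior is lowest (LC strategy)."""
    proba = model.predict_proba(X_pool)        # shape: (n_instances, n_classes)
    uncertainty = 1.0 - proba.max(axis=1)      # 1 - P(y_hat | x)
    return int(np.argmax(uncertainty))         # argmax over x of 1 - P(y_hat | x)
```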
However, \(LC\) discards information about the rest of the posterior distribution; margin sampling addresses this:
\[x^{*}_{M}=\underset{x}{\operatorname{argmin}}P_{\theta}(\hat{y}_{1}|x)-P_{\theta}(\hat{y}_{2}|x)\]
where \(\hat{y}_{1}\) and \(\hat{y}_{2}\) are the first and second most likely class labels under model \(\theta\), respectively
The inclusion of the posterior of the second most likely class addresses the \(LC\) shortcoming
Intuitively:
Large margins denote the model is confident in differentiating classes
Small margins indicate that the model is ambiguous (and knowing the true label would increase its performance)
\[x^{*}_{H}=\underset{x}{\operatorname{argmax}}-\sum_{i} P_{\theta}(y_{i}|x) \log P_{\theta}(y_{i}|x)\]
where \(y_{i}\) ranges over all class labels and \(H\) denotes entropy.
Margin sampling partially addresses the information loss of least confident
Entropy generalises margin sampling to all class labels
For binary classification, all three approaches are identical
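A minimal sketch comparing the three measures, assuming we already have posterior probabilities for the pool; the toy matrix at the bottom illustrates that the induced rankings coincide in the binary case:

```python
import numpy as np

def uncertainty_scores(proba):
    """Least confident, margin (negated so larger = more informative), and entropy scores."""
    sorted_p = np.sort(proba, axis=1)[:, ::-1]            # posteriors, descending per instance
    least_confident = 1.0 - sorted_p[:, 0]                # 1 - P(y_hat | x)
    margin = -(sorted_p[:, 0] - sorted_p[:, 1])           # smaller margins are more informative
    entropy = -np.sum(proba * np.log(np.clip(proba, 1e-12, None)), axis=1)
    return least_confident, margin, entropy

# For binary problems all three order the pool identically
proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
lc, m, h = uncertainty_scores(proba)
assert np.array_equal(np.argsort(lc), np.argsort(m))
assert np.array_equal(np.argsort(lc), np.argsort(h))
```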
Three classes, each ‘occupying’ a corner of the triangle
Red indicates more informative regions; blue denotes regions where the model is confident
For all three uncertainty strategies, the centre of the triangle is the most informative (as the posterior probability is uniform in that region)
Similarly, the least informative regions are the corners
In LC (a), the information slightly diffuses to the class boundaries from the centre
In M (b), the information primarily diffuses from the class boundaries
In H (c), the information diffuses from the centre
Construct a committee \(\mathcal{C}=\{\theta^{(1)},\dots,\theta^{(C)}\}\) of models that are trained on labelled data \(\mathcal{L}\)
Models represent different hypotheses in a version space
Models vote on the labels of query candidates
The instance on which the committee disagrees most is the most informative query
The idea of QbC is to minimise the version space, i.e. the set of hypotheses consistent with \(\mathcal{L}\)
In the example, linear (a) and axis-parallel box (b) classifiers are shown with their version spaces
We want to identify the ‘best’ model within the version space, so the task of AL is to constrain the size of this space as much as possible
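A minimal query-by-committee sketch using modAL’s Committee wrapper; the two base estimators, the vote-entropy disagreement measure, and the synthetic stand-ins for \(\mathcal{L}\) and \(\mathcal{U}\) are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner, Committee
from modAL.disagreement import vote_entropy_sampling

# Toy stand-ins for the labelled set L and the unlabelled pool U
rng = np.random.default_rng(1)
X_init, y_init = rng.normal(size=(20, 2)), rng.integers(0, 2, 20)
X_pool, y_pool = rng.normal(size=(500, 2)), rng.integers(0, 2, 500)

# Two committee members trained on the same small labelled set; vote entropy measures disagreement
members = [
    ActiveLearner(estimator=RandomForestClassifier(), X_training=X_init, y_training=y_init),
    ActiveLearner(estimator=LogisticRegression(), X_training=X_init, y_training=y_init),
]
committee = Committee(learner_list=members, query_strategy=vote_entropy_sampling)

query_idx, _ = committee.query(X_pool)                   # the instance the members disagree on most
committee.teach(X_pool[query_idx], y_pool[query_idx])    # oracle label appended to L for every member
```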
We will now discuss some generalisations and extensions of AL
Active Feature Acquisition and Classification
Active Class Selection
Active Clustering
In some domains, instances can have incomplete (but retrievable) features
Active feature acquisition operates under the assumption that additional data features can be obtained at a cost
Zheng and Padmanabhan (2002) propose two ‘single-pass’ approaches
Impute missing values, then acquire the ones about which the model is least certain
Alternatively: train models on imputed instances, and only acquire feature values for the misclassified instances
AL assumes instances are free but labelling is costly
Opposite scenario: the learner can request an instance of a given (known) class, but obtaining instances is costly
i.e. AL for unsupervised learning
Counter-intuitive?
Sample the unlabelled instances in a way that they form self-organised clusters
The idea is that this can produce clusters with less overlap (noise) than those identified by random sampling
So far, so good. But what are our assumptions?
Is there always a single oracle?
Is the oracle always correct?
Is the cost for labelling constant?
Is there a determinable endpoint to stop learning?
Traditional AL approaches serially (i.e. one at a time) select instances to be queried
In parallel learning environments, this is not a desirable characteristic
Batch-mode allows a learner to query instances in groups
Assume you have 100 instances. You label 10 instances and rank the remaining 90 instances based on their informativeness in one go.
Do you think this initial ranking would hold if you continue ranking after labelling another 10, 20 etc. instances?
In the case of SVM, several batch-mode AL approaches exist in the literature
The idea is to introduce a distance metric that measures diversity among instances within a batch
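One simple way to combine the two criteria (an illustrative heuristic, not one of the SVM-specific approaches from the literature) is to rank by uncertainty and greedily skip candidates that sit too close to instances already in the batch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def diverse_batch(X_pool, uncertainty, batch_size, min_dist):
    """Greedily pick uncertain instances that are at least `min_dist` apart (Euclidean)."""
    order = np.argsort(-uncertainty)              # most uncertain first
    batch = []
    for idx in order:
        if len(batch) == batch_size:
            break
        if not batch or cdist(X_pool[[idx]], X_pool[batch]).min() >= min_dist:
            batch.append(idx)
    return np.array(batch)
```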
How reliable are human annotators?
Traditional AL formulates its cost function purely from a labelling cost standpoint; the obtained label is the ground truth
What if the oracle is noisy?
Trade-off: should the learner query a potentially noisy label for a new instance OR query for repeated labels to de-noise an existing labelled instance that it is not confident about?
What if one oracle is almost always correct and others are noisy?
Open questions!
Traditional AL assumes the costs of obtaining a label are uniform
If known, the varying costs of labelling certain instances can be added to the cost function
Then, the most informative instance to be labelled is a function of both the labelling cost and its marginal utility
Costs can be formulaic—e.g. length of the text to be labelled
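For instance, if the cost is assumed to be proportional to text length, a cost-aware ranking could weight informativeness by cost (the cost model and `cost_per_token` are purely hypothetical):

```python
import numpy as np

def cost_aware_query(uncertainty, texts, cost_per_token=0.01):
    """Rank candidates by informativeness per unit labelling cost (cost model is an assumption)."""
    costs = np.array([cost_per_token * len(t.split()) for t in texts])  # longer texts cost more
    return int(np.argmax(uncertainty / costs))
```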
Another cost-related issue is termination—e.g. when to stop learning
Suggestions include cost/utility functions (learn as long as it is beneficial) and a pre-determined model performance threshold
In real life, people tend to stop when their allocated budget runs out
We have challenged some of the assumptions of AL from a cost/benefit perspective (i.e. is it worth it?)
Another question is: but does it work?
Settles (2009) provides ample empirical evidence from multiple domains showing that AL does work
However, there are caveats
If an AL project is built on a learner and an unlabelled dataset \(\mathcal{U}\), it is inherently tied to the hypothesis of its learner
What if we change the model later in the process?
No guarantee that \(\mathcal{L}\) will be useful
The larger the distance between model families, the higher the risk
In some cases, it is shown that AL requires more labelled instances than passive learning even when using the same model class
If the most appropriate model class and feature set are known a priori, using AL is safe
If they are not, heterogeneous model ensembles/feature sets are advisable
Semi-Supervised Learning tackles the annotation problem from the opposite direction
AL focuses on exploring the unknown from the perspective of a learner
While SSL takes advantage of the instances that the model is highly confident about
Literature review (Zhu, 2008)
Recall the predicted probability figure with red and blue contours in Part I
SSL posits that, instead of querying the most informative regions (the centre), we can expand the size of \(\mathcal{L}\) by appending the instances that the model is fairly certain about
Set an arbitrary threshold; say 95%. Then, every instance in \(\mathcal{U}\) that the model predicts with a probability of 95% or higher will be assigned to that class label (as if annotated by a human expert)
In the figure, this would correspond to the innermost instances at the top left and bottom right being assigned blue and red labels, respectively
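A self-training step along these lines, assuming a fitted probabilistic classifier and the 95% threshold mentioned above, could be sketched as:

```python
import numpy as np

def pseudo_label(model, X_labelled, y_labelled, X_pool, threshold=0.95):
    """Move instances the model predicts with >= threshold confidence from U into L as synthetic labels."""
    proba = model.predict_proba(X_pool)
    confident = proba.max(axis=1) >= threshold
    y_pseudo = model.classes_[proba[confident].argmax(axis=1)]      # synthetic labels
    X_new = np.vstack([X_labelled, X_pool[confident]])
    y_new = np.concatenate([y_labelled, y_pseudo])
    return X_new, y_new, X_pool[~confident]                         # augmented L and the reduced pool U
```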
However, there is a qualitative difference between the labels obtained by AL and SSL
Labels annotated by an oracle in AL are thought of as ground truth; they are the true labels (oracle quality issues notwithstanding)
Labels predicted by a model in SSL are thought of as synthetic labels; they are forecasts
As AL and SSL are naturally compatible, it is common to see them employed together in large data annotation projects
The idea is to attack the problem from two opposite directions using a complementary framework
See Tomanek & Hahn (2009) for a Semi-Supervised Active Learning pipeline (SeSAL)
A general-purpose pipeline combining both AL and SSL assuming cold start—i.e. \(\mathcal{L}\) is an empty set; otherwise skip to the next step
Manually label a small number of instances—can be up to 20-30 data points
Adjust according to domain/feasibility
The query strategies available in modAL will merely give you the first \(n\) instances if you start cold (with no fitted model, every instance looks equally informative)
Although it may not be viable in the first iteration, when you are repeating the loop:
Predict class labels (with probabilities) for instances in \(\mathcal{U}\)
Choose a threshold at which you are comfortable making a judgement call (e.g. a predicted probability of at least \(n\)%)
Append those instances to \(\mathcal{L}\) and delete them from \(\mathcal{U}\)
Fit a learner on \(\mathcal{L}\)
Pay attention to your qualitative hypothesis and the hypothesis implied by your model/algorithm family selection
Remember path dependency; are you likely to make drastic changes to the model family?
If yes, consider an ensemble of learners (query-by-committee)
How do you obtain data: streaming, fetched in chunks, or do you already have all of it?
How heavy is your model? How fast can it predict a single data point?
Does your model have expectations w.r.t. the data structure?
Query an instance from \(\mathcal{U}\)
For binary classification, query strategy is relatively trivial
For multi-class classification, consider learning goals and expected information loss
No free lunch!
Oracle provides labels to queries -> appended to \(\mathcal{L}\)
Pay attention to labelling costs—are they uniform? If not, account for them in your query strategy
Single oracle vs. multiple oracles?
How crucial is context for your annotation task?
Are the annotators equal w.r.t. labelling?
If the answer is no to both, consider multiple annotators for robustness (funds permitting)
Continue iterating until a stopping criterion is met
It is normal that the first iterations are the most volatile
Performance might regress, but should pick up soon after
Otherwise something is not right!
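Putting the steps together, one iteration of the combined AL + SSL loop could be sketched as below; the margin-sampling strategy, the 95% SSL threshold, the RandomForest estimator, and the synthetic data standing in for the oracle are all assumptions to adjust per domain:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import margin_sampling

def al_ssl_iteration(learner, X_pool, y_oracle, ssl_threshold=0.95, n_queries=10):
    """One pass of the pipeline: SSL augmentation first, then AL queries answered by the oracle.

    `y_oracle` stands in for the human annotator; in practice labels arrive interactively.
    """
    # SSL step: append instances the model is highly confident about as synthetic labels
    proba = learner.predict_proba(X_pool)
    confident = proba.max(axis=1) >= ssl_threshold
    if confident.any():
        synthetic = learner.estimator.classes_[proba[confident].argmax(axis=1)]
        learner.teach(X_pool[confident], synthetic)
        X_pool, y_oracle = X_pool[~confident], y_oracle[~confident]

    # AL step: query the oracle for the most informative remaining instances, one at a time
    for _ in range(min(n_queries, len(X_pool))):
        idx, _ = learner.query(X_pool)
        learner.teach(X_pool[idx], y_oracle[idx])
        X_pool, y_oracle = np.delete(X_pool, idx, axis=0), np.delete(y_oracle, idx)
    return learner, X_pool, y_oracle

# Cold start: a handful of manually labelled instances (synthetic stand-ins here), then iterate
rng = np.random.default_rng(7)
X, y = rng.normal(size=(1000, 2)), rng.integers(0, 2, 1000)
X_seed, y_seed, X_pool, y_oracle = X[:20], y[:20], X[20:], y[20:]
learner = ActiveLearner(estimator=RandomForestClassifier(), query_strategy=margin_sampling,
                        X_training=X_seed, y_training=y_seed)
for _ in range(5):   # in practice, stop when the budget runs out or performance plateaus
    learner, X_pool, y_oracle = al_ssl_iteration(learner, X_pool, y_oracle)
```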
The SSL stage can be moved around depending on what you want to get out of it
AL can be used to verify some of the synthetic labels predicted by SSL
You can experiment with SSL predictions using different thresholds:
There are ensemble approaches other than Query-by-Committee
You can use the same model, but have three learners using least confident, margin, and entropy sampling
You can have one pure AL stream and an AL + SSL stream
It is possible to be led astray using this approach (i.e. performance degradation over time)
Create a separate database for keeping track of the labels supplied by AL and SSL
If model performance deteriorates in the future, revert back to an earlier stage and start again
This also makes sure you know which labels are provided by a human and which ones are synthetically labelled
It literally pays to plan ahead:
What is your budget?
How much does an individual label cost?
Are all classes known a priori?
What is an acceptable model performance to stop learning?
In addition to crowdsourced annotation solutions like MTurk, the whole AL + SSL enterprise can be outsourced
If your company is already using AWS:
SageMaker Ground Truth is a frontend for data annotation
Supports private (employees), public (MTurk), and third-party annotators
Has a built-in SSL component
Augmented AI implements AL in a fully managed workflow