F to enter full-screen
ESC to see the full layout while not in full-screen
< > to switch between sections
∧ ∨ to go back and forth within sections
Computational social scientist by training (PhD, post-doc @LSE; researcher @UCL)
Specialisation in predictive modelling & methodology
Worked as a research scientist on an NLP active learning project conducted by Uppsala University (Sweden)
I’m a senior data scientist at Attest—a customer growth platform
London start-up aiming to disrupt the market research industry
We work in cross-functional teams (a la Spotify)
I am a member of the Audience Quality squad (data quality, fraud detection)
Good: We are a survey company; we generate (store) a lot of data every day
Bad: The data do not come with labels
Initially, the bottleneck was obtaining data at scale
Now, unlabelled data is widely accessible:
In-house: employees have the all-important context, but they don’t scale (or are costly to scale)
Outsourced: e.g. MTurk; scales at a price; but context is mostly lost
Inter-coder reliability: Adds robustness to labels at the expense of increased costs
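A minimal sketch of how inter-coder reliability might be quantified, using Cohen’s kappa from scikit-learn on toy labels (the coders, labels, and data are illustrative assumptions, not from the talk):

```python
# Sketch (toy labels): measuring inter-coder reliability with Cohen's kappa,
# which corrects raw agreement for the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

coder_a = ["spam", "ok", "ok", "spam", "ok", "spam", "ok", "ok"]
coder_b = ["spam", "ok", "spam", "spam", "ok", "ok", "ok", "ok"]

print(f"Cohen's kappa = {cohen_kappa_score(coder_a, coder_b):.2f}")
```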
Per Tom Mitchell:
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
What is the difference between traditional statistical models and machine learning algorithms?
Not much IMO; I subscribe to Leo Breiman’s separation of data vs. algorithmic models
Not the technical aspects (e.g. cross-validation) but the more qualitative aspects w.r.t. the data-generating process (DGP)
You have a hypothesis X -> y
And your model is your hypothesis about how -> comes about (e.g. linear, non-linear)
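A minimal sketch of this idea, assuming scikit-learn and synthetic data (all names and numbers here are illustrative): two competing hypotheses about how -> comes about, one linear and one non-linear, fitted to the same data.

```python
# Two competing hypotheses about how X -> y comes about, fitted to the same
# synthetic data; the DGP below is deliberately non-linear.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)  # non-linear DGP + noise

hypotheses = [
    ("linear", LinearRegression()),               # "-> is linear"
    ("non-linear", GradientBoostingRegressor()),  # "-> is non-linear"
]
for name, model in hypotheses:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name} hypothesis: mean CV R^2 = {score:.2f}")
```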
Everyone has heard of the maxim correlation does not imply causation
Causal inference—broadly defined here to include path analysis, SEM, and SCM—takes correlation one step further (assuming the given causal structure is appropriate)
CI practitioners such as Judea Pearl argue all of ML is mere curve-fitting
True function complexity vs. size of the training data
If the DGP is simple -> inflexible algorithm (high bias, low variance) and a small training set
If the DGP is complex -> flexible algorithm (low bias, high variance) and a very large training set
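A sketch of this matching logic (illustrative assumptions throughout: the DGPs, sample sizes, and models are made up for the demo). Errors are measured against the noiseless true function so the comparison is not swamped by irreducible noise.

```python
# Matching algorithm flexibility to DGP complexity and training-set size.
import numpy as np
from sklearn.linear_model import LinearRegression       # inflexible: high bias, low variance
from sklearn.ensemble import RandomForestRegressor      # flexible: low bias, high variance
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X_test = np.linspace(-3, 3, 1000).reshape(-1, 1)

def run(true_fn, n_train, noise_sd, label):
    X_train = rng.uniform(-3, 3, size=(n_train, 1))
    y_train = true_fn(X_train).ravel() + rng.normal(scale=noise_sd, size=n_train)
    y_true = true_fn(X_test).ravel()
    for name, model in [("linear", LinearRegression()),
                        ("forest", RandomForestRegressor(random_state=0))]:
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_true, model.predict(X_test))
        print(f"{label:<28} {name:<7} MSE vs true fn = {mse:.3f}")

run(lambda x: 2 * x, n_train=50, noise_sd=1.0, label="simple DGP, small n")
run(lambda x: np.sin(2 * x), n_train=5000, noise_sd=1.0, label="complex DGP, very large n")
# Expected pattern: the linear model wins on the simple DGP with little data;
# the flexible forest wins on the complex DGP once data are plentiful.
```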
Curse of dimensionality/sparsity
Low variance, high bias algorithms can minimise the effect of a large number of (irrelevant) dimensions
Also dimensionality reduction techniques (e.g. PCA, regularisation, feature selection)
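A sketch of these options on a synthetic, mostly-irrelevant feature set (data, hyper-parameters, and pipeline choices are all illustrative assumptions):

```python
# Sketch: taming many irrelevant dimensions with L1 regularisation,
# feature selection, and PCA, compared against a plain model.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 500 samples, 200 features, only 10 of which carry signal
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)

candidates = {
    "plain logistic":      make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=2000)),
    "L1 regularisation":   make_pipeline(StandardScaler(),
                                         LogisticRegression(penalty="l1",
                                                            solver="liblinear", C=0.1)),
    "feature selection":   make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10),
                                         LogisticRegression(max_iter=2000)),
    "PCA (20 components)": make_pipeline(StandardScaler(), PCA(n_components=20),
                                         LogisticRegression(max_iter=2000)),
}
# With only 10 informative features out of 200, the regularised/reduced
# pipelines typically match or beat the plain model.
for name, pipe in candidates.items():
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name:<22} CV accuracy = {acc:.3f}")
```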
Stochastic and deterministic noise
High bias, low variance algorithms can be used to minimise the effect of noise
Also early stopping criteria; outlier/anomaly detection (risky!)
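A sketch of the outlier-detection route using scikit-learn’s IsolationForest (synthetic data and contamination rate are assumptions for the demo):

```python
# Sketch: flagging suspected noise/outliers with IsolationForest before training.
# Risky, as noted above: you may discard rare but genuine cases.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
X_clean = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
X_outliers = rng.uniform(low=-8, high=8, size=(50, 2))   # injected "noise" points
X = np.vstack([X_clean, X_outliers])

iso = IsolationForest(contamination=0.05, random_state=0)  # ~5% assumed contamination
flags = iso.fit_predict(X)                                 # -1 = outlier, 1 = inlier

X_filtered = X[flags == 1]
print(f"kept {len(X_filtered)} of {len(X)} rows; "
      f"dropped {np.sum(flags == -1)} suspected outliers")
```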
A simplified example from market research—where things can go wrong w.r.t. inference
Amazing literature review (Settles, 2009)
AL resources at http://active-learning.net/
MEAP book by Robert Munro: Human-in-the-Loop Machine Learning
modAL—a highly modular, scikit-learn compatible AL framework for Python
And a lot of great YouTube videos: Microsoft Research, ICML 2019, UoM, Prendki
Consider the conventional (passive) ML pipeline
Given the enormous influence of the training set on the accuracy of the final model,
and the fact that labelling is costly,
is there a more efficient way of obtaining data labels that can scale?
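A minimal active-learning loop with modAL (mentioned in the resources above), using pool-based uncertainty sampling. The dataset, model, seed size, and labelling budget are illustrative assumptions; in practice the hidden pool labels would be replaced by a human oracle.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

X, y = load_digits(return_X_y=True)

# Start with a tiny labelled seed set; treat the rest as the unlabelled pool.
rng = np.random.default_rng(0)
seed_idx = rng.choice(len(X), size=20, replace=False)
X_seed, y_seed = X[seed_idx], y[seed_idx]
X_pool = np.delete(X, seed_idx, axis=0)
y_pool = np.delete(y, seed_idx, axis=0)   # hidden labels, standing in for the oracle

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=0),
    query_strategy=uncertainty_sampling,
    X_training=X_seed,
    y_training=y_seed,
)

# Query the most uncertain instance, "ask the oracle", retrain, and repeat.
for _ in range(30):                                   # labelling budget of 30 queries
    query_idx, _query_instance = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

print(f"accuracy after 50 labels: {learner.score(X, y):.3f}")
```

The design point: instead of labelling a random sample, the learner spends the (costly) labelling budget on the instances it is least certain about, which is the answer active learning offers to the question above.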