Yet Another Caret Workshop

Apr 19, 2018 · 3441 words · 17 minutes read caret caretEnsemble skimr xray

Intro

Yesterday I gave a workshop on applied predictive modelling¹ with caret at the 1st LSE Computational Social Science hackathon. Organiser privileges. I put together some introductory code and started a simple GitHub repo for the participants, so I thought I’d share it here as well. This is not supposed to cover all aspects of caret (plus there is already this), but more of a starter-pack for those who might be migrating from Python or another machine learning library like mlr. I have also saved the environment as caret.rdata, so that the participants can load it up during the workshop (insert harrowing experience about live coding) and follow through—that’s included in the repo too if you rather have a test run first.

The Data

Let’s start by creating some synthetic data using caret. The twoClassSim generates a dataset suitable for binary-outcomes:

dat <- twoClassSim(n = 1000, #number of rows
                   linearVars = 2, #linearly important variables
                   noiseVars = 5, #uncorrelated irrelevant variables
                   corrVars = 2, #correlated irrelevant variables
                   mislabel = .01) #percentage possibly mislabeled
colnames(dat)

##  [1] "TwoFactor1" "TwoFactor2" "Linear1"    "Linear2"    "Nonlinear1"
##  [6] "Nonlinear2" "Nonlinear3" "Noise1"     "Noise2"     "Noise3"    
## [11] "Noise4"     "Noise5"     "Corr1"      "Corr2"      "Class"

The above chunk simulates a dataframe with 1000 rows containing 15 variables:

Class: Binary outcome (Class)
TwoFactor: Correlated multivariate normal predictors (TwoFactor1, TwoFactor2)
Nonlinear: Uncorrelated random uniform predictors (NonLinear1, …, Nonlinear3)
Linear: (Optional) uncorrelated standard normal predictors (Linear1, Linear2)
Noise: (Optional) uncorrelated standard normal predictors (Noise1, … , Noise5)
Correlated: (Optional) correlated multivariate normal predictors (Corr1, Corr2)

We can take a closer look at the variables using two packages: skimr and xray. Both have functions that provide a snapshot of your covariates in an easy-to-understand output:

skim(dat) #unintended

## Skim summary statistics
##  n obs: 1000 
##  n variables: 15 
## 
## Variable type: factor 
##  variable missing complete    n n_unique                top_counts ordered
##     Class       0     1000 1000        2 Cla: 567, Cla: 433, NA: 0   FALSE
## 
## Variable type: numeric 
##    variable missing complete    n    mean   sd          p0   p25  median
##       Corr1       0     1000 1000 -0.078  1       -3.19    -0.76 -0.11  
##       Corr2       0     1000 1000 -0.016  1       -2.91    -0.71 -0.014 
##     Linear1       0     1000 1000  0.016  1       -3.29    -0.68  0.054 
##     Linear2       0     1000 1000  0.023  1.02    -3.31    -0.66  0.053 
##      Noise1       0     1000 1000 -0.022  1       -3.46    -0.71 -0.023 
##      Noise2       0     1000 1000 -0.048  0.99    -3.52    -0.74 -0.042 
##      Noise3       0     1000 1000  0.016  0.97    -3.11    -0.64  0.014 
##      Noise4       0     1000 1000 -0.015  1.02    -3.48    -0.71 -0.028 
##      Noise5       0     1000 1000  0.036  0.94    -3.03    -0.59  0.072 
##  Nonlinear1       0     1000 1000  0.0086 0.58    -1       -0.49  0.038 
##  Nonlinear2       0     1000 1000  0.49   0.29     7.8e-05  0.25  0.5   
##  Nonlinear3       0     1000 1000  0.51   0.29 5e-04        0.26  0.52  
##  TwoFactor1       0     1000 1000 -0.086  1.36    -4.28    -0.96 -0.085 
##  TwoFactor2       0     1000 1000 -0.042  1.39    -4.63    -1.03  0.0095
##   p75 p100     hist
##  0.58 3.58 ▁▂▆▇▇▂▁▁
##  0.67 3.57 ▁▂▅▇▆▂▁▁
##  0.71 3.02 ▁▁▃▇▇▅▂▁
##  0.69 3.88 ▁▂▅▇▇▂▁▁
##  0.64 3.22 ▁▁▃▇▇▅▁▁
##  0.61 3.39 ▁▁▅▇▇▅▁▁
##  0.64 3.77 ▁▁▅▇▅▂▁▁
##  0.67 3.17 ▁▁▃▇▇▅▁▁
##  0.71 3.1  ▁▁▅▇▇▅▁▁
##  0.5  1    ▇▇▇▇▇▇▇▇
##  0.73 1    ▇▇▇▇▇▇▇▆
##  0.76 1    ▇▇▇▆▇▇▇▇
##  0.81 3.83 ▁▂▃▇▇▅▂▁
##  0.89 4.79 ▁▁▅▇▇▃▁▁

You should also try out xray::anomalies(dat) and see which output you prefer. Because our data is synthetic, we have these nice bell curves and normal distributions that are harder to locate in the wild.

Let’s split the data into train/test using an index of row numbers:

index <- createDataPartition(y = dat$Class, p = .7, list = FALSE)
training <- dat[index, ]
test <- dat[-index, ]

First, we supply the outcome variable y so that caret can take it into account when creating the split (in terms of class-balance). We use 70% of the data for training and hold out the remaining 30% for testing later. We want a vector instead of a list so we convey this to R by overriding the default behaviour. The actual splitting happens when we subset using the index we just created; the selected row numbers generate the training data whereas the rest goes to the test (using negative indexing).

trainControl

The magic of caret happens in the control arguments. Default arguments tend to cater to regression problems; given our focus on classification, I only briefly mention the former here:

reg.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, allowParallel = TRUE)

We now have a trainControl object that will signal a ten-k fold (repeated 5 times; so 50 resamples in total) to the train function. Classification controls require several more arguments:

cls.ctrl <- trainControl(method = "repeatedcv", #boot, cv, LOOCV, timeslice OR adaptive etc.
                         number = 10, repeats = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary,
                         savePredictions = "final", allowParallel = TRUE)

There is a good variety of cross-validation methods you can choose in caret, which I will not cover here. classProbs computes class probabilities for each resample. We need to set the summary function to twoClassSummary for binary classification. Finally, we set save predictions to TRUE—note that this is not a classification-specific argument; we didn’t have it in the regression controls because we won’t be covering them here in detail.

For future reference, there are several other useful arguments that you can call within trainControl. For example, you can evoke subsampling using sampling if you have class-imbalance. You can set seeds for each resample for perfect reproducibility. You can also define your own indices (index) for resampling purposes.

Model Fitting

We’ll start with a place-holder regression example for completeness. You should always set the seed before calling train. caret accepts the formula interface if you supply the data later. Below, we arbitrarily select one of the linear variables as the outcome, and fit the rest of the variables as predictors using the dot indicator:

set.seed(1895)
lm.fit <- train(Linear1 ~ ., data = training, trControl = reg.ctrl, method = "lm")
lm.fit

## Linear Regression 
## 
## 701 samples
##  14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 630, 632, 632, 630, 631, 630, ... 
## Resampling results:
## 
##   RMSE       Rsquared    MAE      
##   0.9995116  0.01280926  0.8039185
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Probably not the most amazing \(R^2\) value you have ever seen, but that’s alright. Note that calling the model fit displays the most crucial information in a succinct way.

Let’s move on to a classification algorithm. It’s good practice to start with a logistic regression and take it from there. In R, logistic regression is in the glm framework and can be specified by calling family = "binomial". We set the performance metric to ROC as the default metric is accuracy. Accuracy tends to be unreliable when you have class-imbalance. A classic example is having one positive outcome and 99 negative outcomes; any lazy algorithm predicting all zeros would be 99% accurate (but it would be uninformative as a result). The Receiver Operating Characteristic—a proud member of the good ol’ WWII school of naming things—provides a better performance metric by taking into account the rate of true positives and true negatives. Finally, we apply some pre-processing to our data by passing a set of strings: we drop near-zero variance variables, as well as centring and scaling all covariates:

set.seed(1895)
glm.fit <- train(Class ~ ., data = training, trControl = cls.ctrl,
                 method = "glm", family = "binomial", metric = "ROC",
                 preProcess = c("nzv", "center", "scale"))

For reference, you can also vectorise your x and y if you find it easier to read:

y <- training$Class
predictors <- training[,which(colnames(training) != "Class")]

#Same logit fit
set.seed(1895)
glm.fit <- train(x = predictors, y = y, trControl = cls.ctrl,
                 method = "glm", family = "binomial", metric = "ROC",
                 preProcess = c("nzv", "center", "scale"))
glm.fit

## Generalized Linear Model 
## 
## 701 samples
##  14 predictor
##   2 classes: 'Class1', 'Class2' 
## 
## Pre-processing: centered (14), scaled (14) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 630, 630, 630, 631, 631, 631, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8639053  0.8565385  0.7206237

You can quickly find out which variables contribute to predictive accuracy:

varImp(glm.fit)

## glm variable importance
## 
##            Overall
## TwoFactor1 100.000
## TwoFactor2  97.267
## Linear2     73.712
## Nonlinear1  25.964
## Linear1     12.900
## Nonlinear2  12.261
## Corr1        8.758
## Noise2       7.424
## Corr2        6.116
## Nonlinear3   5.798
## Noise1       5.371
## Noise3       4.306
## Noise5       2.668
## Noise4       0.000

plot(varImp(glm.fit))

Let’s fit a couple of other models before moving on. One common choice would be the elastic net. Elastic net relies on L1 and L2 regularisations and it’s basically a mix of both: the former shrinks some variable coefficients to zero (so that they are dropped out; i.e. feature selection/dimensionality reduction), whereas the latter penalises coefficient size. In R, it has two hyper-parameters that can be tuned: alpha and lambda. Alpha controls the type of regression; 0 representing Ridge and 1 denoting LASSO (Least Absolute Shrinkage and Selector Operator)². Lambda, on the other hand, determines the penalty amount. Note that the expand.grid function actually just creates a dataset with two columns called alpha and lambda, which are then used for the model fit based on the value-pairs in each row.

set.seed(1895)
glmnet.fit <- train(x = predictors, y = y, trControl = cls.ctrl,
                    method = "glmnet", metric = "ROC",
                    preProcess = c("nzv", "center", "scale"),
                    tuneGrid = expand.grid(alpha = 0:1,
                                           lambda = seq(0.0001, 1, length = 20)))

Because it has tune-able parameters, we can visualise their performance by calling plot on the model fit:

plot(glmnet.fit)

where the two colours denote the alpha level and the dots are the specified lambda values.

Finally, let’s fit a Random Forest using the ranger package, which is a fast C++ implementation of the original algorithm in R:

set.seed(1895)
rf.fit <- train(Class ~ ., data = training, trControl = cls.ctrl,
                method = "ranger", metric = "ROC",
                preProcess = c("nzv", "center", "scale"))
confusionMatrix(rf.fit)

## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction Class1 Class2
##     Class1   51.2    7.2
##     Class2    5.4   36.1
##                             
##  Accuracy (average) : 0.8739

This is all good, as we are fitting different algorithms using a simple interface without needing to memorise the idiosyncrasies of each package. However, we are still writing lots of redundant code. The caretEnsemble package provides this functionality via caretList:

set.seed(1895)
models <- caretList(Class ~ ., data = training, trControl = cls.ctrl, metric = "ROC",
                    tuneList = list(logit = caretModelSpec(method = "glm", family = "binomial"),
                                    elasticnet = caretModelSpec(method = "glmnet",
                                                                tuneGrid = expand.grid(alpha = 0:1,
                                                                                       lambda = seq(0.0001, 1, length = 20))),
                                    rf = caretModelSpec(method = "ranger")),
                    preProcess = c("nzv", "center", "scale"))

We basically just merged the first three model fits into a single call using tuneList, which requires a list of model specifications. If we want to predict using unseen data, we can now get predictions from all three models:

models.preds <- lapply(models, predict, newdata = test) #add type = "prob" for class probabilities
models.preds <- data.frame(models.preds)
head(models.preds, 10)

##     logit elasticnet     rf
## 1  Class1     Class1 Class1
## 2  Class2     Class2 Class2
## 3  Class1     Class1 Class2
## 4  Class2     Class2 Class1
## 5  Class1     Class1 Class2
## 6  Class2     Class2 Class2
## 7  Class2     Class2 Class2
## 8  Class1     Class1 Class2
## 9  Class1     Class1 Class1
## 10 Class1     Class1 Class1

The resamples function collects all the resampling data from all models and allows you to easily assess in-sample performance metrics:

bwplot(resamples(models)) #try dotplot as well

Averaged over all resamples, the Random Forest algorithm has the highest ROC value, however the whiskers overlap in all three categories—perhaps a larger number of resamples are needed for significant separation. It also outperforms other two algorithms when it comes to detecting true positives and true negatives. Note that often the results will not be this clear; it’s common for an algorithm to do really well in one area and perform terribly in the other.

We could also create a simple linear ensemble using the three model fits. You can check whether the model predictions are linearly correlated:

modelCor(resamples(models))

##                logit elasticnet        rf
## logit      1.0000000  0.9996123 0.1135033
## elasticnet 0.9996123  1.0000000 0.1141322
## rf         0.1135033  0.1141322 1.0000000

Seems like logit and elastic net predictions are more or less identical, meaning one of them is redundant:

xyplot(resamples(models))

And now for the ensemble:

set.seed(1895)
greedy_ensemble <- caretEnsemble(models, metric = "ROC", trControl = cls.ctrl)
summary(greedy_ensemble)

## The following models were ensembled: logit, elasticnet, rf 
## They were weighted: 
## -4.9237 44.2633 -42.2176 8.0329
## The resulting ROC is: 0.9517
## The fit for each individual model on the ROC is: 
##      method       ROC      ROCSD
##       logit 0.8618236 0.03919733
##  elasticnet 0.8614454 0.03923875
##          rf 0.9469568 0.02237265

The ROC of the ensemble (0.9517) is higher than any individual model, however the Random Forest algorithm by itself provides similar levels of accuracy (0.9469).

Feature Selection

As an extra, I’ll briefly cover several feature selection wrapper functions that are available in caret.

Recursive Feature Elimination

RFE works by passing a vector of subsets consisting of different number of variables to be used in model fitting. For example, because we only have 14 variables in our dataset (excluding the outcome), we can try all numbers from one to 14. With all three feature selection algorithms, we will need to change the summary function to twoClassSummary for classification purposes.

subsets <- c(1:length(training))

lrFuncs$summary <- twoClassSummary

We need to pass an additional control function specifically for the RFE. We select the linear regression wrapper (lrFuncs),³ and choose bootstrapped cross-validation (25). After that, we call the rfe function:

rfe.ctrl = rfeControl(functions = lrFuncs,
                      method = "boot",
                      number = 25,
                      allowParallel = TRUE, verbose = TRUE)

set.seed(1895)
rfe <- rfe(x = predictors, y = y, sizes = subsets,
           metric = "ROC", rfeControl = rfe.ctrl)
rfe

## 
## Recursive feature selection
## 
## Outer resampling method: Bootstrapped (25 reps) 
## 
## Resampling performance over subset size:
## 
##  Variables    ROC   Sens   Spec   ROCSD  SensSD  SpecSD Selected
##          1 0.6383 0.8529 0.4462 0.03546 0.03912 0.05763         
##          2 0.8123 0.8351 0.6355 0.03603 0.04063 0.04609         
##          3 0.8606 0.8723 0.6905 0.01693 0.03575 0.04339         
##          4 0.8610 0.8634 0.6958 0.01955 0.03571 0.04740        *
##          5 0.8596 0.8595 0.7011 0.01893 0.03560 0.04360         
##          6 0.8582 0.8574 0.7010 0.01925 0.03567 0.04386         
##          7 0.8565 0.8565 0.6983 0.01996 0.03595 0.04805         
##          8 0.8565 0.8557 0.7017 0.01907 0.03565 0.04726         
##          9 0.8561 0.8590 0.6993 0.01988 0.03484 0.04806         
##         10 0.8560 0.8569 0.7052 0.02019 0.03384 0.04739         
##         11 0.8560 0.8592 0.7048 0.02003 0.03353 0.04939         
##         12 0.8561 0.8558 0.7047 0.02014 0.03430 0.04612         
##         13 0.8561 0.8561 0.7051 0.02015 0.03413 0.04721         
##         14 0.8562 0.8555 0.7047 0.01999 0.03407 0.04688         
## 
## The top 4 variables (out of 4):
##    TwoFactor1, TwoFactor2, Linear2, Nonlinear1

We see that the model with four variables (indicated with an asterisk) resulted in the highest ROC value and thus selected as the best model. The top variables are also displayed at the end. One approach would be fitting a model on the test data using these four variables only.

Simulated Annealing

SA works by starting with a certain number of variables and introducing small changes to them along the way. If the change results in an `upgrade’ (i.e. higher predictive accuracy), the initial candidate is abandoned in favour of the new solution. Unlike RFE, which is greedy—meaning, it only assesses the subset sizes once and moves on forever—SA can be programmed to go back and try again if it doesn’t find an improvement within a certain number of iterations (below we set this limit to 5). We can also pass which performance metrics to be used for both the internal and external processes, as well as defining the amount that will be held out (20%):

caretSA$fitness_extern <- twoClassSummary

safs.ctrl = safsControl(functions = caretSA, method = "boot", number = 10,
                        metric = c(internal = "ROC", external = "ROC"),
                        maximize = c(internal = TRUE, external = TRUE),
                        holdout = .2, improve = 5,
                        allowParallel = TRUE, verbose = TRUE)

We can then fit the algorithm by calling safs:

sa <- safs(x = predictors, y = y,
           iters = 10, method = "glm", family = "binomial", metric = "ROC",
           trControl = cls.ctrl,
           safsControl = safs.ctrl)

## + final SA
##   1 0.4597138 (3)
##   2 0.4597138->0.4663124 (3+1, 75.0%) *
##   3 0.4663124->0.4626936 (4-1, 75.0%)  0.9769874  A
##   4 0.4663124->0.6563141 (4+0, 60.0%) *
##   5 0.6563141->0.657809 (4-1, 75.0%) *
##   6 0.657809->0.7425797 (3+1, 75.0%) *
##   7 0.7425797->0.7398033 (4+1, 80.0%)  0.9741674  A
##   8 0.7425797->0.7417647 (4+0, 60.0%)  0.991258  A
##   9 0.7425797->0.7407108 (4+1, 50.0%)  0.9776039  A
##  10 0.7425797->0.7427323 (4+0, 60.0%) *
## + final model

Calling the object returns an informative summary of the whole process:

sa

## 
## Simulated Annealing Feature Selection
## 
## 701 samples
## 14 predictors
## 2 classes: 'Class1', 'Class2' 
## 
## Maximum search iterations: 10 
## Restart after 5 iterations without improvement (0.3 restarts on average)
## 
## Internal performance values: ROC, Sens, Spec
## Subset selection driven to maximize internal ROC 
## 
## External performance values: ROC, Sens, Spec
## Best iteration chose by maximizing external ROC 
## External resampling method: Bootstrapped (10 reps) 
## Subsampling for internal fitness calculation: 20%
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 14):
##     Linear2 (70%), TwoFactor1 (70%), TwoFactor2 (70%), Nonlinear1 (60%), Corr1 (40%)
##   * on average, 5.6 variables were selected (min = 3, max = 8)
## 
## In the final search using the entire training set:
##    * 4 features selected at iteration 10 including:
##      TwoFactor1, Linear2, Noise3, Noise4  
##    * external performance at this iteration is
## 
##         ROC        Sens        Spec 
##      0.7539      0.8022      0.5891

Genetic Algorithm

Last but not least, we will cover genetic algorithms. Here, variables are put through pressures similar to that of natural selection. We keep the iteration and population sizes really, really low as the code chunks are only supposed to give you a working example of the process. These algorithms fit a lot of models, so always start with a small value and gradually increase the number of iterations/generations.

caretGA$fitness_extern <- twoClassSummary

gafs.ctrl = gafsControl(functions = caretGA, method = "boot", number = 10,
                        metric = c(internal = "ROC", external = "ROC"),
                        maximize = c(internal = TRUE, external = TRUE),
                        holdout = .2,
                        allowParallel = TRUE, genParallel = TRUE, verbose = TRUE)

set.seed(1895)
ga <- gafs(x = predictors, y = y, iters = 5, popSize = 2, elite = 0,
           differences = TRUE, method = "glm", family = "binomial", metric = "ROC",
           trControl = cls.ctrl,
           gafsControl = gafs.ctrl)

## + final GA
##  1 0.8197197 (12)
##  2 0.8197197->0.820586 (12->12, 100.0%) *
##  3 0.820586->0.8211813 (12->12, 100.0%) *
##  4 0.8211813->0.8218709 (12->12, 100.0%) *
##  5 0.8218709->0.8215376 (12->12, 100.0%)
## + final model

Similar to the previous algorithms, calling the final object provides a summary:

ga

## 
## Genetic Algorithm Feature Selection
## 
## 701 samples
## 14 predictors
## 2 classes: 'Class1', 'Class2' 
## 
## Maximum generations: 5 
## Population per generation: 2 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: ROC, Sens, Spec
## Subset selection driven to maximize internal ROC 
## 
## External performance values: ROC, Sens, Spec
## Best iteration chose by maximizing external ROC 
## External resampling method: Bootstrapped (10 reps) 
## Subsampling for internal fitness calculation: 20%
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 14):
##     Linear2 (100%), Noise3 (100%), Nonlinear3 (100%), TwoFactor2 (100%), Corr2 (90%)
##   * on average, 12.4 variables were selected (min = 10, max = 14)
## 
## In the final search using the entire training set:
##    * 12 features selected at iteration 1 including:
##      TwoFactor1, TwoFactor2, Linear1, Nonlinear1, Nonlinear2 ... 
##    * external performance at this iteration is
## 
##         ROC        Sens        Spec 
##      0.8475      0.8481      0.7055

Given the small number of iterations used for SA and GA, we can’t really judge the quality of their results. However, running the algorithms for hundreds or thousands of iterations is not necessarily the best option either. As these algorithms focus on maximising in-sample ROC, given enough iterations, they will perfectly learn the specific noise of your dataset and will not generalise to unseen data (i.e. over-fitting). As always, aim to leverage your domain knowledge and gradually increase the number of iterations until you see a divergence between training and test validation results.

Big fan of coming up with original names.↩
Seriously, who names these things? Backronyms everywhere.↩
Read more here.↩