Overview

Today’s goal: to apply supervised learning algorithms to the cattle time series data.

Resources
Packages

tidymodels (including rsample, recipes, parsnip and yardstick), as well as kernlab and randomForest

Main functions

group_vfold_cv, analysis, assessment, recipe, prep, bake, svm_rbf, rand_forest, set_engine, fit, predict, metrics, pr_curve

1 Introduction

Yesterday we applied some unsupervised machine learning algorithms; today we will apply two types of supervised machine learning algorithms: Support Vector Machines (SVM) and Random Forests (RF). We will continue with the cattle movement data that we have worked with over the last 2 days. Because the dataset contained the column label, with the annotated behaviour based on direct visual observation, we have a labelled dataset and thus the possibility to use supervised learning.

Recall from the lecture on the data science workflow that in predictive modelling it is necessary to split the data into a train and a validation set. This can be done randomly, but only with great care. As we are dealing with time series data on individual animals, which generally show high degrees of temporal autocorrelation, randomly splitting the data into train and validation sets does not ensure that both sets are independent! In tutorial 9 we called non-independent train and validation sets a form of data leakage, in which information is inadvertently shared between the training and validation data sets. In general, by dividing a set of data into training and validation sets, the goal is to ensure that no information is shared between the two.

Thus, it may be better to keep all the data of a specific individual together, and allocate all data from a single individual (or time-series) to either the train dataset or the validation dataset. Such partitioning of the data into train and validation sets can be done once, or using a k-fold cross validation approach. We will use both here.

Exercise 13.1a
Download the file dat_ts2.rds from Brightspace > Time series analyses > datasets > cattle, and store it in an appropriate folder (e.g. data/processed/cattle/). Load the file into an object called dat. Note that it is an .rds file, which can be read into R with the function readRDS (see tutorial 1). Also load (and install if needed) the packages tidyverse, tidymodels and kernlab.

library(tidyverse)
library(tidymodels)
library(kernlab)
dat <- readRDS("data/processed/cattle/dat_ts2.rds")
dat
## # A tibble: 2,159 × 12
##    IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##    <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1 1_1     TRUE          -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8 0.199   0.983
##  2 1_1     TRUE          -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2 0.109  -0.250
##  3 1_1     TRUE          -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9 0.0892  0.619
##  4 1_1     TRUE          -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5 0.204  -0.894
##  5 1_1     TRUE          -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7 0.0486 -0.444
##  6 1_1     TRUE          -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5 0.167   0.236
##  7 1_1     TRUE          -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6 0.453   0.958
##  8 1_1     TRUE          -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6 0.379   0.977
##  9 1_1     TRUE          -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3 0.243   0.996
## 10 1_1     TRUE         -11.4     9.62     -19.8    17.4     -5.07    13.8       31.3     12.1 0.220   0.976
## # ℹ 2,149 more rows

The dataset dat does not contain the column label that we used in the previous 2 tutorials; here it has been recoded into the new column isGrazing, a factor with 2 levels indicating whether or not the cows were grazing.

For your information, the dataset loaded above was created from the dataset “dat_ts1.rds” loaded yesterday via the following chain of dplyr functions:

readRDS("data/processed/cattle/dat_ts1.rds") %>%
  select(-c(dttm,t,x,y,dx,dy,head)) %>%
  unite(IDburst, c(ID, burst)) %>%
  drop_na(angle) %>%
  mutate(angle = cos(angle),
         IDburst = as.factor(IDburst),
         isGrazing = as.factor(label == "grazing")) %>%
  select(-label) %>%
  select(IDburst, isGrazing, everything()) # just for reordering columns

2 Splitting the data

The rsample package, part of tidymodels, offers functions to split the data into train and validation sets. In tutorial 9 we already saw the simple initial_split function, which partitions the data randomly into train and validation data. However, because this function randomly allocates rows (thus records) to either of the datasets, rather than grouped sets of records (e.g. belonging to the same individual), we cannot use it here. Instead we can use the group_vfold_cv function, which creates splits of the data based on some grouping variable (which may have more than a single row associated with it), in a k-fold (here called V-fold) cross-validation approach.
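
To make the data-leakage problem concrete, here is a small sketch (not part of the exercises; it uses the dat object loaded above) that checks which IDburst time series would end up in both parts of a purely random initial_split: with a row-wise random split this is typically (nearly) all of them, which is exactly the non-independence we want to avoid.

set.seed(1234567)
# A purely random 80/20 split of the rows (records), ignoring IDburst
rnd_split <- initial_split(dat, prop = 0.8)
# IDburst levels that occur in both the train and the validation part
intersect(unique(training(rnd_split)$IDburst),
          unique(testing(rnd_split)$IDburst))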

Exercise 13.2a
Check the helpfile of the group_vfold_cv function and create a train-validation split with 5 folds based on grouping variable IDburst. Assign the result to an object called dat_split and check the object by printing it to the screen.

dat_split <- group_vfold_cv(dat, group = IDburst, v = 5)
dat_split
## # Group 5-fold cross-validation 
## # A tibble: 5 × 2
##   splits             id       
##   <list>             <chr>    
## 1 <split [1707/452]> Resample1
## 2 <split [1647/512]> Resample2
## 3 <split [1812/347]> Resample3
## 4 <split [1650/509]> Resample4
## 5 <split [1820/339]> Resample5

You’ll see that the output of group_vfold_cv is a tibble, here with 5 rows (as we set it to 5 folds), in which the first column, splits, is a named list with the split data. It shows the number of records in the train dataset and in the validation dataset, for each fold. Using the function analysis we can extract the data to be used for training, whereas with the assessment function we can extract the data that should be used for validation. To retrieve the first record/fold of the data splits, we can use dat_split$splits[[1]].

Exercise 13.2b
Extract the train and validation data for the first fold, and assign these to the objects dat_train and dat_val, respectively. Print the objects to the console so you can inspect them.

dat_train <- analysis(dat_split$splits[[1]])
dat_val   <- assessment(dat_split$splits[[1]])
dat_train
## # A tibble: 1,707 × 12
##    IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##    <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1 1_2     TRUE         -0.183    16.5    -18.1     18.6    -2.56     17.5       32.3     14.4 0.134   0.467
##  2 1_2     TRUE         -5.39     21.9    -15.7     17.6     1.06     14.7       32.6     14.4 0.371  -0.938
##  3 1_2     TRUE         -2.47     11.8    -17.6     14.1    -2.05     13.3       26.3     11.8 0.506   0.983
##  4 1_2     TRUE         -4.04     18.6    -17.6     24.4     4.37     22.9       40.1     13.7 0.160   0.999
##  5 1_2     TRUE         -2.35     17.8    -12.3     17.8     0.600    14.8       29.4     11.7 0.0665  0.976
##  6 1_2     TRUE         -2.18     18.5    -10.9     17.7     3.04     15.8       29.6     12.5 0.165   0.951
##  7 1_2     TRUE         -0.766    17.7     -9.78    21.4     2.02     17.7       31.9     12.5 0.243  -0.340
##  8 1_2     TRUE         -4.59     21.6    -18.5     32.0    -1.12     25.3       46.0     18.9 0.317   0.280
##  9 1_2     TRUE         -7.08     21.4    -20.2     25.1    -5.84     17.9       39.7     17.7 0.131  -0.118
## 10 1_2     TRUE         -2.18     11.9    -17.4     15.6    -4.12     18.5       30.3     11.4 0.190  -0.833
## # ℹ 1,697 more rows
dat_val
## # A tibble: 452 × 12
##    IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##    <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1 1_1     TRUE          -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8 0.199   0.983
##  2 1_1     TRUE          -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2 0.109  -0.250
##  3 1_1     TRUE          -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9 0.0892  0.619
##  4 1_1     TRUE          -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5 0.204  -0.894
##  5 1_1     TRUE          -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7 0.0486 -0.444
##  6 1_1     TRUE          -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5 0.167   0.236
##  7 1_1     TRUE          -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6 0.453   0.958
##  8 1_1     TRUE          -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6 0.379   0.977
##  9 1_1     TRUE          -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3 0.243   0.996
## 10 1_1     TRUE         -11.4     9.62     -19.8    17.4     -5.07    13.8       31.3     12.1 0.220   0.976
## # ℹ 442 more rows

3 Create data recipe

Using the rsample package we have split the data into k folds for training and validation; now we will proceed and use the recipes package to set up a data recipe. Recall from tutorial 9 that a data recipe is a description of the steps that should be applied to a dataset in order to get it ready for analysis. Most steps in the recipe are defined using functions that start with the prefix step_.

Exercise 13.3a

Create a data recipe called dataRecipe that:

  • takes the training data as input;
  • specifies that the formula for the analyses is isGrazing ~ .;
  • removes the column IDburst from the data;
  • centers all predictors;
  • scales all predictors;
  • prepares the data recipe using the training data.
Create the recipe with these steps in the order shown above.

To specify a formula, use the function recipe; to remove columns use step_rm; to center use step_center; to scale use step_scale. To prepare the data recipe use the prep function (this will automatically use the data passed to the recipe function for training the data recipe; thus, here, the training data!). Put all steps behind each other in one sequence using pipes.


dataRecipe <- dat_train %>%
  recipe(isGrazing ~ .) %>%
  step_rm(IDburst) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>% 
  prep()
dataRecipe

4 Bake data

Now that we have split the data into train and validation sets, and now that we’ve set up and prepared a data recipe, let’s bake some data: the recipes metaphor for applying the preprocessing computations to new data.

Exercise 13.4a
Bake the dataset dat_train and do the same for the dat_val dataset. Assign the results to objects called baked_train and baked_val, respectively.

baked_train  <- dataRecipe %>%
  bake(dat_train)
baked_train
## # A tibble: 1,707 × 11
##    xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd    step  angle isGrazing
##        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>   <dbl>  <dbl> <fct>    
##  1    0.0386   0.567    -0.851  0.324     -0.305  0.198      -0.152   0.481  -0.458   0.153 TRUE     
##  2   -0.852    1.54     -0.748  0.185      0.202 -0.175      -0.129   0.485  -0.120  -1.95  TRUE     
##  3   -0.353   -0.297    -0.830 -0.298     -0.234 -0.352      -0.667  -0.0532  0.0736  0.927 TRUE     
##  4   -0.621    0.955    -0.829  1.13       0.668  0.901       0.515   0.343  -0.421   0.951 TRUE     
##  5   -0.332    0.801    -0.603  0.209      0.138 -0.153      -0.401  -0.0649 -0.554   0.916 TRUE     
##  6   -0.303    0.928    -0.545  0.194      0.480 -0.0205     -0.387   0.0972 -0.413   0.879 TRUE     
##  7   -0.0612   0.776    -0.496  0.705      0.337  0.217      -0.191   0.0902 -0.302  -1.06  TRUE     
##  8   -0.715    1.49     -0.870  2.17      -0.103  1.21        1.01    1.42   -0.197  -0.126 TRUE     
##  9   -1.14     1.45     -0.943  1.22      -0.766  0.254       0.477   1.17   -0.463  -0.723 TRUE     
## 10   -0.302   -0.266    -0.820 -0.0900    -0.525  0.331      -0.327  -0.132  -0.379  -1.80  TRUE     
## # ℹ 1,697 more rows
baked_val  <- dataRecipe %>%
  bake(dat_val)
baked_val
## # A tibble: 452 × 11
##    xacc_mean  xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd     step  angle isGrazing
##        <dbl>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>    <dbl>  <dbl> <fct>    
##  1    -0.966  0.0186     -1.56    0.779    -1.77   0.499       1.03    1.18   -0.365    0.927 TRUE     
##  2    -0.257  0.975      -1.38    1.58     -1.04   0.850       1.09    1.68   -0.493   -0.922 TRUE     
##  3    -1.23   0.239      -1.08    1.25     -0.389  0.717       0.546   1.00   -0.522    0.381 TRUE     
##  4    -1.45  -0.0122     -0.882  -0.499    -0.551 -0.543      -0.473  -0.326  -0.358   -1.89  TRUE     
##  5    -0.229  1.64       -0.916   1.02     -0.661  0.668       0.532   0.962  -0.580   -1.21  TRUE     
##  6    -0.717  0.620      -1.07    0.911    -0.203 -0.130       0.142   1.12   -0.411   -0.193 TRUE     
##  7    -1.05   0.315      -0.728   0.714    -0.247  0.0313     -0.281   0.928  -0.00321  0.890 TRUE     
##  8    -0.963  0.582      -1.23    1.64     -1.34   1.08        0.983   1.76   -0.108    0.917 TRUE     
##  9    -1.14   0.00460    -1.08    1.66     -0.808  0.702       0.600   1.50   -0.303    0.947 TRUE     
## 10    -1.88  -0.685      -0.924   0.154    -0.658 -0.281      -0.236   0.0165 -0.336    0.917 TRUE     
## # ℹ 442 more rows

5 Define and train the model

Now that all data is properly pre-processed we can proceed to train a model/algorithm. Recall from tutorial 9 that a model is trained using the parsnip package, by first defining the model, then specifying the engine that will execute the model estimation for us, and finally fitting the model. Which models are currently supported, and via which function they should be defined, can be seen here. We will focus on a Support Vector Machine (SVM), specifically a SVM with a Gaussian radial basis function.

A SVM with a Gaussian radial basis function can be specified in the parsnip package via the svm_rbf function. For a classification problem (i.e. our machine learning task of classifying the annotated behaviour), it takes 3 arguments: mode which should be set to "classification", as well as the hyperparameters cost and rbf_sigma. Check the helpfile for information on these parameters.

Exercise 13.5a

Fit a SVM to the train data baked_train. Do this by first specifying the model: specify it to be in classification mode, with the cost parameter set to 1.0 and the rbf_sigma parameter set to 0.001. Then, set the engine to "kernlab". Finally, fit the model using the formula isGrazing ~ . (i.e. isGrazing is the response, and all other columns are predictors) with data baked_train. Assign the fitted model to an object called dat_SVMfit.

NOTE: If you have problems installing the tidymodels package, you can do this exercise a bit differently: in the box below this exercise there is information on how to fit the SVM to the data directly using the kernlab package, thus without the interface that parsnip supplies.

Specify the model using svm_rbf, then set the engine using set_engine, and fit the model using fit.


dat_SVMfit <- svm_rbf(mode = "classification",
                      cost = 1.0,
                      rbf_sigma = 0.001) %>%
  set_engine("kernlab") %>%
  fit(isGrazing ~ ., data = baked_train)
dat_SVMfit
## parsnip model object
## 
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.001 
## 
## Number of Support Vectors : 478 
## 
## Objective Function Value : -404.1226 
## Training error : 0.086116 
## Probability model included.

If you have problems installing tidymodels, then you can instead use the following code to fit a SVM model with a Gaussian radial basis function:

library(kernlab)
dat_SVMfit <- kernlab::ksvm(isGrazing ~ .,
                            data = baked_train,
                            kernel = "rbfdot",
                            kpar = list(sigma = 0.001),
                            C = 1.0,
                            prob.model = TRUE)

6 Making predictions

With the fitted model stored in object dat_SVMfit, we can now use the predict function to generate predictions for the validation dataset baked_val. By default the function returns the predicted class, but we can also retrieve the predicted class probabilities by adding the function argument type = "prob" (when fitted using parsnip; the column names will then be .pred_FALSE and .pred_TRUE), or type = "probabilities" (when fitted directly using kernlab; the column names will then be FALSE and TRUE).
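
If you fitted the SVM directly with kernlab (see the box in section 5), a hedged sketch of the equivalent prediction step is shown below; kernlab’s predict method returns a matrix of class probabilities (columns FALSE and TRUE), so we convert it to a tibble before binding the validation data to it.

# Only if dat_SVMfit was created with kernlab::ksvm() rather than via parsnip
dat_SVMfit %>%
  predict(baked_val, type = "probabilities") %>%
  as_tibble() %>%
  bind_cols(dat_val)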

Exercise 13.6a
Predict the fitted SVM on the baked_val validation dataset: predict the class probabilities, and save to an object dat_preds. Also add the columns of dat_val using the function bind_cols. Because we predict probabilities, and because our response variable isGrazing is a factor with levels "TRUE" and "FALSE", we get as predictions the columns ".pred_FALSE" and ".pred_TRUE".

dat_preds <- dat_SVMfit %>% 
  predict(baked_val, type="prob") %>%
  bind_cols(dat_val)
dat_preds
## # A tibble: 452 × 14
##    .pred_FALSE .pred_TRUE IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##          <dbl>      <dbl> <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1     0.0103       0.990 1_1     TRUE          -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8 0.199   0.983
##  2     0.00517      0.995 1_1     TRUE          -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2 0.109  -0.250
##  3     0.0139       0.986 1_1     TRUE          -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9 0.0892  0.619
##  4     0.0470       0.953 1_1     TRUE          -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5 0.204  -0.894
##  5     0.00619      0.994 1_1     TRUE          -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7 0.0486 -0.444
##  6     0.0184       0.982 1_1     TRUE          -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5 0.167   0.236
##  7     0.0265       0.973 1_1     TRUE          -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6 0.453   0.958
##  8     0.00640      0.994 1_1     TRUE          -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6 0.379   0.977
##  9     0.0110       0.989 1_1     TRUE          -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3 0.243   0.996
## 10     0.0483       0.952 1_1     TRUE         -11.4     9.62     -19.8    17.4     -5.07    13.8       31.3     12.1 0.220   0.976
## # ℹ 442 more rows
Exercise 13.6b

Let’s also add not just the predicted probabilities of both classes, but the predicted class itself. You can do so by adding the prediction column using

dat_preds <- dat_preds %>% 
  bind_cols(predict(dat_SVMfit, baked_val, type="class"))
Because here we predict the response variable isGrazing, which is a factor with levels "TRUE" and "FALSE", the newly created column “.pred_class” also is a factor with these levels!

dat_preds <- dat_preds %>% 
  bind_cols(predict(dat_SVMfit, baked_val, type="class"))
dat_preds
## # A tibble: 452 × 15
##    .pred_FALSE .pred_TRUE IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##          <dbl>      <dbl> <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1     0.0103       0.990 1_1     TRUE          -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8 0.199   0.983
##  2     0.00517      0.995 1_1     TRUE          -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2 0.109  -0.250
##  3     0.0139       0.986 1_1     TRUE          -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9 0.0892  0.619
##  4     0.0470       0.953 1_1     TRUE          -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5 0.204  -0.894
##  5     0.00619      0.994 1_1     TRUE          -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7 0.0486 -0.444
##  6     0.0184       0.982 1_1     TRUE          -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5 0.167   0.236
##  7     0.0265       0.973 1_1     TRUE          -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6 0.453   0.958
##  8     0.00640      0.994 1_1     TRUE          -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6 0.379   0.977
##  9     0.0110       0.989 1_1     TRUE          -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3 0.243   0.996
## 10     0.0483       0.952 1_1     TRUE         -11.4     9.62     -19.8    17.4     -5.07    13.8       31.3     12.1 0.220   0.976
## # ℹ 442 more rows
## # ℹ 1 more variable: .pred_class <fct>

7 Evaluating model predictions

After model fitting and prediction using parsnip, we can use the functions from the yardstick package to compute model evaluation metrics. The metrics function can be used for this, taking the input arguments truth and estimate (the latter should be the predicted class, not the predicted class probability!). The exercise below uses it to compute evaluation metrics for our classification problem.

Exercise 13.7a
Compute the metrics for the fitted model, using column “isGrazing” as the true label, and the column with the predicted class as the estimate.

dat_preds %>%
  metrics(truth = isGrazing,
          estimate = .pred_class)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.819
## 2 kap      binary         0.465

The results are estimates of the accuracy and the kappa statistic for the model predictions. The accuracy is quite high: around 82% of the labels have been correctly predicted! To see how the predictions perform for each class, we can compute the confusion matrix with the conf_mat function:

dat_preds %>%
  conf_mat(truth = isGrazing,
           estimate = .pred_class)
##           Truth
## Prediction FALSE TRUE
##      FALSE    52    6
##      TRUE     76  318

We see that most of the records where the cattle did or did not graze were correctly classified as such. In this case, 318 of the 324 records where the cattle were in fact grazing were correctly classified as such (recall: ca 98%), whereas for 318 of the 394 records where cattle were predicted to be grazing they were indeed doing so (precision: ca 81%). As mentioned in the lectures, instead of being interested in accuracy only, we are often interested in these precision and recall metrics of model predictions. See here for a good overview of what these (and other) measures mean (reading material for the lecture on the data science workflow). The yardstick package offers the function pr_curve to compute the precision and recall along a gradient of threshold values.
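
Before computing a full precision-recall curve, you can also compute these two metrics directly for the class predictions; a small sketch using yardstick’s precision and recall functions (assuming the .pred_class column added in exercise 13.6b), with event_level = "second" so that TRUE (grazing) is the class of interest:

dat_preds %>%
  precision(truth = isGrazing, estimate = .pred_class, event_level = "second")
dat_preds %>%
  recall(truth = isGrazing, estimate = .pred_class, event_level = "second")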

Exercise 13.7b
Compute the precision and recall using the function pr_curve. Do this based on the data dat_preds (in 13.6 you have added the class predictions), using isGrazing as the true label, and .pred_TRUE (or TRUE if you did not use parsnip for model fitting but rather kernlab directly) as the predicted value, and set the “event_level” argument to value “second” (namely, from FALSE and TRUE, the second value is the focus). Assign the result to an object called PR, inspect it by printing it to the screen, and plot PR using the autoplot function.

PR <- dat_preds %>%
  pr_curve(truth = isGrazing, .pred_TRUE, event_level = "second")
PR
## # A tibble: 453 × 3
##    .threshold  recall precision
##         <dbl>   <dbl>     <dbl>
##  1    Inf     0               1
##  2      0.997 0.00309         1
##  3      0.996 0.00617         1
##  4      0.995 0.00926         1
##  5      0.995 0.0123          1
##  6      0.995 0.0154          1
##  7      0.995 0.0185          1
##  8      0.995 0.0216          1
##  9      0.995 0.0247          1
## 10      0.995 0.0278          1
## # ℹ 443 more rows
autoplot(PR)

A very helpful performance measure is the F1 score: the harmonic mean of precision and recall. This score therefore takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. F1 can be computed using:

\[F1 = 2 * \frac{precision * recall}{precision + recall}\]
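
For example, using the approximate precision (0.81) and recall (0.98) from the confusion matrix above:

\[F1 = 2 * \frac{0.81 * 0.98}{0.81 + 0.98} \approx 0.89\]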

Exercise 13.7c
Add the F1 score as a new column F1 to PR. Show the row of data that corresponds to the maximum F1. Classify the predictions based on the predicted probability .pred_TRUE and the threshold value corresponding to the maximum F1 score.

PR <- PR %>%
  mutate(F1 = 2 * (precision * recall) / (precision + recall))
PR
## # A tibble: 453 × 4
##    .threshold  recall precision      F1
##         <dbl>   <dbl>     <dbl>   <dbl>
##  1    Inf     0               1 0      
##  2      0.997 0.00309         1 0.00615
##  3      0.996 0.00617         1 0.0123 
##  4      0.995 0.00926         1 0.0183 
##  5      0.995 0.0123          1 0.0244 
##  6      0.995 0.0154          1 0.0304 
##  7      0.995 0.0185          1 0.0364 
##  8      0.995 0.0216          1 0.0423 
##  9      0.995 0.0247          1 0.0482 
## 10      0.995 0.0278          1 0.0541 
## # ℹ 443 more rows
F1max <- PR %>%
  filter(F1 == max(F1, na.rm=TRUE))
F1max
## # A tibble: 1 × 4
##   .threshold recall precision    F1
##        <dbl>  <dbl>     <dbl> <dbl>
## 1      0.760  0.935     0.950 0.942
dat_preds %>%
  mutate(prediction = as.factor(.pred_TRUE >= pull(F1max, .threshold))) %>%
  metrics(truth = isGrazing, estimate = prediction)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.918
## 2 kap      binary         0.801
Exercise 13.7d

Another helpful metric is the AUC (area under the ROC curve), which indicates how well the probabilities from the positive classes are separated from the negative classes (see link to the metrics above; ROC is the receiver operating characteristic, see e.g. this wikipedia page).

Here, compute the area under the precision-recall curve (a closely related AUC measure) using the pr_auc function with the data from dat_preds, where isGrazing is the true class and .pred_TRUE the predicted value, and set the “event_level” argument to the value “second” (namely, of FALSE and TRUE, the second value is the focus).

dat_preds %>%
  pr_auc(truth = isGrazing, .pred_TRUE, event_level = "second")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 pr_auc  binary         0.983
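
For comparison, the area under the ROC curve mentioned above can be obtained with the roc_auc function, which takes the same arguments (a hedged sketch, not part of the exercise):

dat_preds %>%
  roc_auc(truth = isGrazing, .pred_TRUE, event_level = "second")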

8 Hyperparameter tuning

Above, we have split the dataset, specified (and prepared) a data recipe, set up a model specification, trained the model, predicted on the left-out validation dataset, and evaluated the predictive performance using several performance metrics.

Where the set of tidymodels tools really starts to shine is when we combine things into a workflow, and then use the tidymodels tools (from the tune and dials packages) that allow us to efficiently perform hyperparameter tuning!

Before we can do that, let’s first repeat a few steps we’ve been doing above, yet without actually training the data recipe (using the prep function) or the model (using the fit function).

Exercise 13.8a
Create a specification of a data recipe using all steps as done above (see section 3 above), yet without the prep function. Assign it to an object called dat_rec.

# Define specification of data recipe (without preparing it)
dat_rec <- dat_train %>%
  recipe(isGrazing ~ .) %>%
  step_rm(IDburst) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
Exercise 13.8b
Create a model specification using all steps as done above (see section 5 above), yet without the fit function. Assign it to an object called svm_spec. However, instead of giving the “cost” and “rbf_sigma” a value, this time only specify that these will be hyperparameters that need to be tuned! Do so by giving them the value tune().

# Defining model specification (without fitting it to data)
svm_spec <-  svm_rbf(mode = "classification",
                     cost = tune(),
                     rbf_sigma = tune()) %>%
  set_engine("kernlab")
Exercise 13.8c
Now that you’ve created a data recipe specification (dat_rec) as well as a model specification (svm_spec), let’s combine them into a workflow object: do so using the workflow function to create an empty workflow, then adding the model specification svm_spec using the add_model function, and then add the data recipe specification dat_rec using the add_recipe function. Combine these into a single object called svm_wf using the pipe operator %>%. Inspect the workflow object svm_wf by printing it to the console.

# Bundle specification of data recipe and model into workflow
svm_wf <- workflow() %>%
  add_model(svm_spec) %>%
  add_recipe(dat_rec)
svm_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_rm()
## • step_center()
## • step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = tune()
##   rbf_sigma = tune()
## 
## Computational engine: kernlab
Exercise 13.8d

You’re almost ready to start tuning the hyperparameters! But first you’ll need to specify a strategy for the values of the hyperparameters that you’ll be assessing.

Using the functions rbf_sigma and cost, you can set the range of values for which the parameters will be evaluated during tuning. Set parameter “rbf_sigma” to be sampled in a range from -5 to 0 (this is on a log10 scale by default!), which you can do using rbf_sigma(range = c(-5,0)). Similarly, specify that the “cost” parameter should be sampled in a range from -4 to 8 (on a log2 scale!), using cost(range = c(-4,8)).

Use these two hyperparameters at their given scales to set up a regular grid for hyperparameter tuning: do this using the grid_regular function, in which both hyperparameter tuning specifications (as shown above) are included, and in which you will also need to supply a value for the argument levels: the number of values to generate for each hyperparameter. Setting levels to a high value will lead to many combinations of hyperparameters that need to be tuned, so for now set it to a low value, e.g. 5 (later, if everything runs fine and within a manageable amount of time, you can increase it to e.g. 10 or higher).

Assign the resultant tuning grid to an object called svm_grid.

After you’ve created svm_grid, inspect it by printing it to the console, and count the number of distinct values of rbf_sigma as well as cost (using the count function).

Note that in the code below, we use levels = 20, which will lead to a fine grid for hyperparameter tuning, but also a long computation time!

# Specify grid of hyperparameter values for which to assess fit
svm_grid <- grid_regular(rbf_sigma(range = c(-5,0)),
                         cost(range = c(-4,8)),
                         levels = 20)
svm_grid
## # A tibble: 400 × 2
##    rbf_sigma   cost
##        <dbl>  <dbl>
##  1 0.00001   0.0625
##  2 0.0000183 0.0625
##  3 0.0000336 0.0625
##  4 0.0000616 0.0625
##  5 0.000113  0.0625
##  6 0.000207  0.0625
##  7 0.000379  0.0625
##  8 0.000695  0.0625
##  9 0.00127   0.0625
## 10 0.00234   0.0625
## # ℹ 390 more rows
# Explore
svm_grid %>% count(rbf_sigma)
## # A tibble: 20 × 2
##    rbf_sigma     n
##        <dbl> <int>
##  1 0.00001      20
##  2 0.0000183    20
##  3 0.0000336    20
##  4 0.0000616    20
##  5 0.000113     20
##  6 0.000207     20
##  7 0.000379     20
##  8 0.000695     20
##  9 0.00127      20
## 10 0.00234      20
## 11 0.00428      20
## 12 0.00785      20
## 13 0.0144       20
## 14 0.0264       20
## 15 0.0483       20
## 16 0.0886       20
## 17 0.162        20
## 18 0.298        20
## 19 0.546        20
## 20 1            20
svm_grid %>% count(cost)
## # A tibble: 20 × 2
##        cost     n
##       <dbl> <int>
##  1   0.0625    20
##  2   0.0968    20
##  3   0.150     20
##  4   0.232     20
##  5   0.360     20
##  6   0.558     20
##  7   0.864     20
##  8   1.34      20
##  9   2.07      20
## 10   3.21      20
## 11   4.98      20
## 12   7.71      20
## 13  12.0       20
## 14  18.5       20
## 15  28.7       20
## 16  44.4       20
## 17  68.8       20
## 18 107.        20
## 19 165.        20
## 20 256         20
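
As a check on the scales used above, you can inspect the dials parameter objects directly; a small sketch (not needed for the exercise) using the value_seq function to translate the requested ranges back to the original scale:

cost(range = c(-4, 8)) %>% value_seq(n = 5)        # log2 scale: 2^-4 ... 2^8, i.e. 0.0625 ... 256
rbf_sigma(range = c(-5, 0)) %>% value_seq(n = 5)   # log10 scale: 10^-5 ... 10^0, i.e. 0.00001 ... 1
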
Exercise 13.8e

Now you’re all set to tune the hyperparameters! For this use the function tune_grid, which takes a workflow specification (the object svm_wf as created above) as input, as well as inputs to the arguments resamples (the data-split object dat_split created above) and grid (a tuning grid specification such as svm_grid created above). Save the result to an object called svm_res.

Note that running tune_grid can take a long time (minutes), so this may be an ideal moment to stretch your legs, get some water, tea or coffee, etc.! And, as mentioned above, start with a low value of the levels argument in the specification of the tuning grid (in the function grid_regular), and only increase this value if the command runs without error and yields promising outcomes.



# Tune the hyperparameters with the elements prepared above
svm_res <- svm_wf %>% 
  tune_grid(resamples = dat_split,
            grid = svm_grid)
Exercise 13.8f

After having evaluated the model performance for all combinations of hyperparameter values listed in the svm_grid object, we can now extract the model performance metrics using the collect_metrics function. By default, this function computes some performance metrics per combination of hyperparameter values; which ones depends on the machine learning problem at hand (classification or regression). Here we are faced with a classification problem, so collect_metrics evaluates model performance using accuracy and roc_auc. By default, collect_metrics summarises the multiple values of a performance metric for each combination of hyperparameter values into a mean value (column mean) and its standard error (column std_err), with information on how many values were summarised (column n). In the object svm_res we indeed get these summary statistics, as it is based on 5-fold cross-validation (as specified in the data-split object dat_split), so we have 5 performance metric values per combination of hyperparameter values.

Extract the performance metrics for the svm_res object. Also, get the combination of hyperparameter values that results in the highest roc_auc. Then, plot the mean value of the roc_auc metric as function of cost and rbf_sigma (for this you could opt for geom_tile or geom_contour_filled).

# Collect performance metrics
svm_res %>% 
  collect_metrics(summarize  = TRUE)
## # A tibble: 800 × 8
##      cost rbf_sigma .metric  .estimator  mean     n std_err .config               
##     <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                 
##  1 0.0625 0.00001   accuracy binary     0.781     5 0.0735  Preprocessor1_Model001
##  2 0.0625 0.00001   roc_auc  binary     0.936     5 0.00690 Preprocessor1_Model001
##  3 0.0625 0.0000183 accuracy binary     0.781     5 0.0735  Preprocessor1_Model002
##  4 0.0625 0.0000183 roc_auc  binary     0.936     5 0.00688 Preprocessor1_Model002
##  5 0.0625 0.0000336 accuracy binary     0.781     5 0.0735  Preprocessor1_Model003
##  6 0.0625 0.0000336 roc_auc  binary     0.936     5 0.00691 Preprocessor1_Model003
##  7 0.0625 0.0000616 accuracy binary     0.781     5 0.0735  Preprocessor1_Model004
##  8 0.0625 0.0000616 roc_auc  binary     0.936     5 0.00678 Preprocessor1_Model004
##  9 0.0625 0.000113  accuracy binary     0.781     5 0.0735  Preprocessor1_Model005
## 10 0.0625 0.000113  roc_auc  binary     0.936     5 0.00679 Preprocessor1_Model005
## # ℹ 790 more rows
# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- svm_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  filter(mean == max(mean))
best_hyperparms
## # A tibble: 1 × 8
##    cost rbf_sigma .metric .estimator  mean     n std_err .config               
##   <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1  18.5   0.00234 roc_auc binary     0.958     5 0.00694 Preprocessor1_Model270
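# Alternatively (a hedged sketch, not required here), the tune package offers
# convenience helpers to inspect the best hyperparameter combinations
svm_res %>% show_best(metric = "roc_auc", n = 5)   # 5 best combinations
svm_res %>% select_best(metric = "roc_auc")        # single best combination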
# Plot the mean roc_auc as function of both hyperparameter values
svm_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = rbf_sigma, y = cost, z = mean)) +
  geom_contour_filled(bins = 20) +
  geom_point(data = best_hyperparms,
             mapping = aes(x = rbf_sigma, y = cost)) +
  scale_y_log10() + 
  scale_x_log10() + 
  labs(x = "Sigma", y = "Cost")

The plot above shows the model’s performance (mean AUC value over the 5 folds) as a function of the hyperparameters cost and sigma. The plotted point indicates the combination with the highest performance.

Note that if you tuned the hyperparameters on a coarser grid, then the plot will be different (coarser).

Exercise 13.8g
Using the best combination of hyperparameters, update the model and its predictions as done in the exercises of sections 2 - 7. Evaluate its predictive performance, e.g. using the metrics and conf_mat functions. Did the model predictions improve due to the hyperparameter-tuning?

# Retrieve the best hyperparameters
best_hyperparms
## # A tibble: 1 × 8
##    cost rbf_sigma .metric .estimator  mean     n std_err .config               
##   <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1  18.5   0.00234 roc_auc binary     0.958     5 0.00694 Preprocessor1_Model270
# Predict with the best hyperparameters
svm_fit <- svm_rbf(mode = "classification",
                   cost = 18.5,
                   rbf_sigma = 0.00234) %>%
  set_engine("kernlab") %>% 
  fit(isGrazing ~ ., data = baked_train)

# Preds
dat_preds2 <- svm_fit %>% 
  predict(baked_val, type="class") %>%
  bind_cols(dat_val)
dat_preds2
## # A tibble: 452 × 13
##    .pred_class IDburst isGrazing xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd   step  angle
##    <fct>       <fct>   <fct>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>  <dbl>  <dbl>
##  1 TRUE        1_1     TRUE          -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8 0.199   0.983
##  2 TRUE        1_1     TRUE          -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2 0.109  -0.250
##  3 TRUE        1_1     TRUE          -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9 0.0892  0.619
##  4 TRUE        1_1     TRUE          -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5 0.204  -0.894
##  5 TRUE        1_1     TRUE          -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7 0.0486 -0.444
##  6 TRUE        1_1     TRUE          -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5 0.167   0.236
##  7 TRUE        1_1     TRUE          -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6 0.453   0.958
##  8 TRUE        1_1     TRUE          -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6 0.379   0.977
##  9 TRUE        1_1     TRUE          -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3 0.243   0.996
## 10 TRUE        1_1     TRUE         -11.4     9.62     -19.8    17.4     -5.07    13.8       31.3     12.1 0.220   0.976
## # ℹ 442 more rows
# Metrics
dat_preds2 %>% 
  metrics(truth = isGrazing,
          estimate = .pred_class)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.896
## 2 kap      binary         0.728
# Confusion matrix
dat_preds2 %>% 
  conf_mat(truth = isGrazing,
           estimate = .pred_class)
##           Truth
## Prediction FALSE TRUE
##      FALSE    92   11
##      TRUE     36  313

Indeed, the predictions now are better than they were before tuning the hyperparameters! Especially the predictions of the records where the cattle were not grazing (i.e. the minority class) improved!

Note that in the exercises above we have conducted k-fold cross-validation with 5 folds; however, we did not actually set aside a true test dataset! It would thus be best to perform a true and final test of the model’s predictive performance on a test dataset that ‘the model has not seen yet’. For example, we could do so by first splitting the data into a test set and a train/validation set, the latter of which we could then further split, e.g. using k-fold cross-validation, and use for model training and hyperparameter tuning.
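
A hedged sketch of such a nested splitting strategy is shown below; it assumes the group_initial_split function (available in recent versions of rsample), which performs a single grouped train/test split analogous to group_vfold_cv:

# One-time grouped split: hold out ca 20% of the IDburst time series as test set
set.seed(84735)
dat_initsplit <- group_initial_split(dat, group = IDburst, prop = 0.8)
dat_trainval  <- training(dat_initsplit)  # for k-fold CV, training and tuning
dat_test      <- testing(dat_initsplit)   # only for the final model assessment
# Grouped 5-fold cross-validation within the train/validation part
dat_folds <- group_vfold_cv(dat_trainval, group = IDburst, v = 5)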

9 Challenge

Challenge

Above we have only focussed on a kernel SVM using the “kernlab” implementation as the engine. As a challenge, try using a different algorithm, e.g. a neural network (e.g. with the “nnet” or “keras” implementation, i.e. engine) or a random forest (e.g. with the “randomForest” or “ranger” implementation, i.e. engine). See here for examples. You can focus on mlp for neural networks (a multilayer perceptron model, a.k.a. a single-layer, feed-forward neural network) or rand_forest for random forests, respectively.

When tuning their hyperparameters, start simple, e.g. with tuning max 2 hyperparameters using a regular grid that is not having many levels.

A random forest could be fitted and tuned as follows:

# Load library for engine
library(ranger)

# Specify model
rf_spec <- rand_forest(trees = 1000,
                       mtry = tune(),
                       min_n = tune(),
                       mode = "classification") %>%
  set_engine("ranger", seed = 1234567)

# Bundle specification of data recipe and model into workflow
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(dat_rec)
rf_wf

# Specify grid of hyperparameter values for which to assess fit
rf_grid <- grid_regular(mtry(range = c(1,5)),
                        min_n(range = c(1,10)),
                        levels = 10)
rf_grid

# Explore
rf_grid %>% count(mtry)
rf_grid %>% count(min_n)

# Tune the hyperparameters with the elements prepared above
rf_res <- rf_wf %>% 
  tune_grid(resamples = dat_split,
            grid = rf_grid,
            control = control_grid(save_pred = TRUE))
rf_res

# Collect performance metrics
rf_res %>% 
  collect_metrics(summarize  = TRUE)

# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- rf_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  filter(mean == max(mean))
best_hyperparms

# Plot the mean roc_auc as function of both hyperparameter values
rf_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = mtry, y = min_n, z = mean)) +
  geom_contour_filled(bins = 50, show.legend = FALSE) +
  geom_point(data = best_hyperparms,
             mapping = aes(x = mtry, y = min_n)) +
  labs(x = "mtry", y = "min_n")

# Collect predictions
rf_preds <- rf_res %>% 
  collect_predictions() %>% 
  filter(mtry == best_hyperparms$mtry[[1]],
         min_n == best_hyperparms$min_n[[1]])

# Get metrics
rf_preds %>%
  metrics(truth = isGrazing,
          estimate = .pred_class)

# confusion matrix
rf_preds %>% 
  conf_mat(truth = isGrazing,
           estimate = .pred_class)

Tuning and fitting a neural network using Keras could be done using:

# Load library for engine
library(keras)
# see installation instructions in ?install_keras()

# Specify model
nn_spec <- mlp(hidden_units = tune(), 
               penalty = tune(), 
               dropout = 0.0, 
               epochs = 20, 
               activation = "softmax") %>%
  set_engine("keras", seed = 1234567) %>% 
  set_mode("classification")

# Bundle specification of data recipe and model into workflow
nn_wf <- workflow() %>%
  add_model(nn_spec) %>%
  add_recipe(dat_rec)
nn_wf

# Specify grid of hyperparameter values for which to assess fit
nn_grid <- grid_regular(hidden_units(range = c(1,10)),
                        penalty(range = c(-10,0)),
                        levels = 10)
nn_grid

# Explore
nn_grid %>% count(hidden_units)
nn_grid %>% count(penalty)

# Tune the hyperparameters with the elements prepared above
nn_res <- nn_wf %>% 
  tune_grid(resamples = dat_split,
            grid = nn_grid,
            control = control_grid(save_pred = TRUE))
nn_res

# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- nn_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  filter(mean == max(mean))
best_hyperparms

### Plot the mean roc_auc as function of both hyperparameter values
nn_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = hidden_units, y = penalty, z = mean)) +
  geom_contour_filled(bins = 50, show.legend = FALSE) +
  geom_point(data = best_hyperparms,
             mapping = aes(x = hidden_units, y = penalty)) +
  labs(x = "hidden_units", y = "penalty") +
  scale_y_log10()

### Collect predictions
nn_preds <- nn_res %>% 
  collect_predictions() %>% 
  filter(hidden_units == best_hyperparms$hidden_units[[1]],
         penalty == best_hyperparms$penalty[[1]])

### Get metrics
nn_preds %>%
  metrics(truth = isGrazing,
          estimate = .pred_class)

# confusion matrix
nn_preds %>% 
  conf_mat(truth = isGrazing,
           estimate = .pred_class)

10 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 13.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

11 Recap

Today, we’ve further explored the tidymodels packages, including the workflows, dials and tune packages that we used to perform hyperparameter tuning.

An R script of today’s exercises can be downloaded here

12 Further reading