Yesterday we applied some unsupervised machine learning algorithms;
today we will apply two types of supervised machine learning
algorithms: Support Vector Machines (SVM) and Random Forests (RF). We
will continue with the cattle movement data that we worked with over the last
2 days. Because the dataset contained the column label,
holding the annotated behaviour based on direct visual
observation, we have a labelled dataset and thus the possibility to use
supervised learning.
Recall from the lecture on the data science workflow that in
predictive modelling it is necessary to split the data into a train and a
validation set. This can be done randomly, but only
with great care. As we are dealing with time series data on individual
animals, which generally show high degrees of temporal autocorrelation,
randomly splitting the data into train and validation sets does not
ensure that both sets are independent! In tutorial 9 we called
non-independent train and validation sets a form of data
leakage, in which information is inadvertently shared between the
training and validation data sets. In general, when dividing a set of data
into training and validation sets, the goal is to ensure that no
information is shared between the two.
Thus, it may be better to keep all the data of a specific individual
together, and allocate all data from a single individual (or
time-series) to either the train dataset or the validation dataset. Such
partitioning of the data into train and validation sets can be done
once, or using a k-fold cross validation approach. We will use
both here.
Exercise 13.1a
Download the file dat_ts2.rds from Brightspace >
Time series analyses > datasets > cattle, and store it in an
appropriate folder (e.g. data / processed / cattle /). Load the
file into an object called dat. Note that it is an .rds
file, which can be read into R with the function readRDS
(see tutorial 1). Also load (and install if needed) the packages
tidyverse, tidymodels and kernlab.
# Load packages
library(tidyverse)
library(tidymodels)
library(kernlab)

# Read the preprocessed cattle dataset
dat <- readRDS("data/processed/cattle/dat_ts2.rds")
dat
The data dat does not have the column label
that we had in the previous 2 tutorials; here it has been reclassified into the
new column isGrazing, which is a factor with 2
levels indicating whether or not the cows were grazing.
For your information, the dataset loaded above was created from the
dataset “dat_ts1.rds” loaded yesterday via a chain of dplyr
functions.
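The exact chain is not reproduced here; a hypothetical sketch (assuming the annotated behaviour in label used the value "grazing" for grazing bouts; the actual values may differ) could look like:
# Hypothetical sketch: reclassify label into the factor isGrazing
dat <- readRDS("data/processed/cattle/dat_ts1.rds") %>%
  mutate(isGrazing = factor(label == "grazing")) %>%
  select(-label)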
The rsample package, part of tidymodels,
offers functions to split the data into train and validation sets. In
tutorial 9 we already saw the simple initial_split function
that partitions the data randomly into train and validation data.
However, because this function randomly allocates rows (thus records) to
either of the datasets, and not grouped sets of records (e.g. belonging
to the same individual), we cannot use it here. Instead, we can
use the group_vfold_cv function, which
creates splits of the data based on some grouping variable (which may
have more than a single row associated with it), in a k-fold
(here called V-fold) cross-validation approach.
Exercise 13.2a
Check the helpfile of the group_vfold_cv function and
create a train-validation split with 5 folds based on grouping variable
IDburst. Assign the result to an object called
dat_split and check the object by printing it to the
screen.
# Create a 5-fold grouped split: all records of an IDburst end up in the same fold
dat_split <- group_vfold_cv(dat, group = IDburst, v = 5)
dat_split
You’ll see that the output of group_vfold_cv is a
tibble, here with 5 rows (as we set it to 5 folds), in which the first
column, splits, is a list column with the split data. For each fold it
shows the number of records in the train dataset and in the validation
dataset. Using the function analysis, we can
extract the data to be used for training, whereas with the
assessment function we can extract the data that should be
used for validation. To retrieve the first record/fold of the data
splits, we can use dat_split$splits[[1]].
Exercise 13.2b
Extract the train and validation data for the first fold, and assign
these to the objects dat_train and dat_val,
respectively. Print the objects to the console so you can inspect them.
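A possible solution, using the analysis and assessment functions on the first fold:
# Training and validation data of the first fold
dat_train <- analysis(dat_split$splits[[1]])
dat_val <- assessment(dat_split$splits[[1]])
dat_train
dat_val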
Using the rsample package we have split the data into
k folds used for training and validation; now we will proceed
and use the recipes package to set up a data recipe.
Recall from tutorial 9 that a data recipe is a description of the steps that
should be applied to a dataset in order to get it ready for data
analysis. Most steps in the recipe are defined using functions that
start with the prefix step_.
Exercise 13.3a
Create a data recipe called dataRecipe that:
takes the training data as input;
specifies that the formula for the analyses is
isGrazing ~ .;
removes the column IDburst from the data;
centers all predictors;
scales all predictors;
prepares the data recipe using the training data.
Create the recipe with these steps in the order shown above.
To specify a formula, use the function recipe; to remove
columns use step_rm; to center use
step_center; to scale use step_scale. To
prepare the data recipe, use the prep function (this will
automatically use the data passed to the recipe
function for training the data recipe: thus, here, the
training data!). Chain all steps together in one
sequence using pipes.
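A possible solution, chaining all steps with pipes (assuming the training data of the first fold is stored in dat_train):
# Data recipe: formula, drop IDburst, centre and scale predictors, then prep
dataRecipe <- recipe(isGrazing ~ ., data = dat_train) %>%
  step_rm(IDburst) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep()
dataRecipe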
Now that we have split the data into train and validation sets, and
now that we’ve set up and prepared a data recipe, let’s
bake some data: the recipes metaphor for applying the
preprocessing computations to new data.
Exercise 13.4a
Bake the dataset dat_train and do the same for the
dat_val dataset. Assign the results to objects called
baked_train and baked_val, respectively.
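A possible solution, using the bake function with the prepared recipe:
# Apply the prepared recipe to the train and validation data
baked_train <- bake(dataRecipe, new_data = dat_train)
baked_val <- bake(dataRecipe, new_data = dat_val)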
Now that all data is properly pre-processed, we can proceed to train a
model/algorithm. Recall from tutorial 9 that a model is trained using
the parsnip package, by first defining the
model, then specifying the engine that will execute the model
estimation for us, and finally fitting the model. Which models
are currently supported, and via which functions they are defined,
can be seen
here.
We will focus on a Support Vector Machine, SVM, and specifically a SVM
with a Gaussian radial basis function.
A SVM with a Gaussian radial basis function can be specified in the
parsnip package via the svm_rbf function. For a
classification problem (i.e. our machine learning task of classifying
the annotated behaviour), it takes 3 arguments: mode, which
should be set to "classification", as well as the
hyperparameters cost and rbf_sigma. Check the
helpfile for information on these parameters.
Exercise 13.5a
Fit a SVM to the train data baked_train. Do this by
first specifying the model: specify it to be a classification model, with
the cost parameter set to 1.0 and the rbf_sigma
parameter set to 0.001. Then, set the engine to
"kernlab". Finally, fit the model using the formula
isGrazing ~ . (i.e. isGrazing is the response,
and all other columns are predictors) with data
baked_train. Assign the fitted model to an object called
dat_SVMfit.
NOTE: If you have problems installing the tidymodels
package, you can do this exercise a bit differently: in the box below this
exercise there is information on how to fit the SVM to the data directly
using the kernlab package, thus without the
interface that parsnip supplies.
Specify the model using svm_rbf, then set the engine
using set_engine, and fit the model using
fit.
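A possible solution via the parsnip interface:
# SVM with a Gaussian radial basis function, fitted via parsnip
dat_SVMfit <- svm_rbf(mode = "classification",
                      cost = 1.0,
                      rbf_sigma = 0.001) %>%
  set_engine("kernlab") %>%
  fit(isGrazing ~ ., data = baked_train)
dat_SVMfit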
## parsnip model object
##
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.001
##
## Number of Support Vectors : 478
##
## Objective Function Value : -404.1226
## Training error : 0.086116
## Probability model included.
If you have problems installing tidymodels, then you can instead fit a
SVM model with a Gaussian radial basis function directly with
kernlab.
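A sketch of such a direct kernlab fit (the exact call in the original materials may differ slightly):
# Fit the SVM directly with kernlab (no parsnip interface)
dat_SVMfit <- ksvm(isGrazing ~ .,
                   data = baked_train,
                   kernel = "rbfdot",
                   kpar = list(sigma = 0.001),
                   C = 1,
                   prob.model = TRUE)
dat_SVMfit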
With the fitted model stored in object dat_SVMfit, we
can now predict with the model on the validation dataset
baked_val using the predict function. By
default the function returns the predicted class, but we can also
retrieve the predicted class probabilities by adding the function
argument type = "prob" (when fitted using parsnip; the column
names will then be .pred_FALSE and
.pred_TRUE), or type = "probabilities" (when
fitted directly using kernlab; the column names will then be
FALSE and TRUE).
Exercise 13.6a
Predict the fitted SVM on the baked_val validation dataset:
predict the class probabilities, and save to an object
dat_preds. Also add the columns of dat_val
using the function bind_cols. Because we predict
probabilities, and because our response variable isGrazing
is a factor with levels "TRUE" and
"FALSE", we get as predictions the columns
".pred_FALSE" and ".pred_TRUE".
Let’s also add not just the predicted probabilities of both classes,
but also the predicted class itself. You can do so by adding the
prediction column as sketched below.
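For example (a sketch, again assuming a parsnip fit):
# Add the predicted class (default type = "class") as column .pred_class
dat_preds <- dat_preds %>%
  bind_cols(predict(dat_SVMfit, new_data = baked_val))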
Because here we predict the response variable isGrazing,
which is a factor with levels "TRUE" and
"FALSE", the newly created column “.pred_class” also is a
factor with these levels!
After model fitting and prediction using parsnip, we can use
the functions from the yardstick package to compute
model evaluation metrics. The metrics function can be used
for this, taking input arguments truth and
estimate (which should be the predicted class, and not the
predicted class probability!). For example, code along the following lines
will compute model evaluation metrics for a classification problem.
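A sketch, using the dat_preds object from the previous exercise (the exact example may differ):
# Accuracy and Cohen's kappa based on the predicted classes
dat_preds %>%
  metrics(truth = isGrazing, estimate = .pred_class)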
Exercise 13.7a
Compute the metrics for the fitted model, using column “isGrazing” as
the true label, and the column with the predicted class as the estimate.
The results are estimates of the accuracy and
kappa statistic for the model predictions. The accuracy is quite
high: around 82% of the labels have been correctly predicted! To see how
the predictions fare for both classes, we can compute the confusion
matrix using the conf_mat function.
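For example (a sketch, again using dat_preds):
# Confusion matrix of observed vs predicted classes
dat_preds %>%
  conf_mat(truth = isGrazing, estimate = .pred_class)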
We see that most of the records where the cattle did or
did not graze were correctly classified as such. In this case,
318 of the 324 records where the cattle were in fact grazing were
correctly classified as such (recall: ca 98%), whereas for 318 of
the 394 records where cattle were predicted to be grazing they were
indeed doing so (precision: ca 86%). As mentioned in the
lectures, instead of being interested in accuracy only,
we are often interested in these precision and recall
metrics of model predictions. See
here
for a good overview of what these (and other) measures mean (reading
material for the lecture on the data science workflow). The yardstick
package offers the function pr_curve to compute the
precision and recall along a gradient of threshold values.
Exercise 13.7b
Compute the precision and recall using the function
pr_curve. Do this based on the data dat_preds
(in 13.6 you have added the class predictions), using
isGrazing as the true label, and .pred_TRUE
(or TRUE if you did not use parsnip for model fitting but
rather kernlab directly) as the predicted value, and set the
“event_level” argument to value “second” (namely, from FALSE and TRUE,
the second value is the focus). Assign the result to an object called
PR, inspect it by printing it to the screen, and plot
PR using the autoplot function.
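A possible solution:
# Precision and recall along a gradient of probability thresholds;
# event_level = "second" makes the level "TRUE" the event of interest
PR <- dat_preds %>%
  pr_curve(truth = isGrazing, .pred_TRUE, event_level = "second")
PR
autoplot(PR)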
A very helpful performance measure is the F1 score: the
weighted average (harmonic mean) of precision and recall. This score therefore
takes both false positives and false negatives into account. Intuitively it is
not as easy to understand as accuracy, but F1 is usually more useful
than accuracy, especially if you have an uneven class distribution. F1
can be computed from the precision and recall.
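A minimal sketch of the computation, using the approximate precision and recall values reported above:
# F1 as the harmonic mean of precision and recall
precision <- 0.86
recall <- 0.98
2 * (precision * recall) / (precision + recall)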
Add the F1 score as a new column F1 to PR.
Show the row of data that corresponds to the maximum F1. Then classify the
predictions based on the predicted probability .pred_TRUE and
the threshold value corresponding to the maximum F1 score.
Another helpful metric is the AUC (area under the ROC
curve), which indicates how well the probabilities of the
positive class are separated from those of the negative class (see link to
the metrics above; ROC is the receiver operating
characteristic, see e.g. this
wikipedia page).
Compute the AUC, using the pr_auc function with the data
from dat_preds, where isGrazing is the true
class, and .pred_TRUE as the predicted value, and set the
“event_level” argument to value “second” (namely, from FALSE and TRUE,
the second value is the focus).
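A possible solution, mirroring the pr_curve call above:
# Area under the precision-recall curve, with "TRUE" as the event of interest
dat_preds %>%
  pr_auc(truth = isGrazing, .pred_TRUE, event_level = "second")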
Above, we have split the dataset, specified (and prepared) a data
recipe, set up a model specification, trained the model, predicted on
the left-out validation dataset, and evaluated the predictive
performance using several performance metrics.
Where the tidymodels tools really start shining is when
we combine these steps into a workflow, and then use the
tune and dials packages to efficiently tune the hyperparameters!
Before we can do that, let’s first repeat a few steps we’ve been
doing above, yet without actually training the data recipe (using the
prep function) or the model (using the fit
function).
Exercise 13.8a
Create a specification of a data recipe using all steps as done
above (see section 3 above), yet without the prep function.
Assign it to an object called dat_rec (a sketch is given below).
Create a model specification using all steps as done above (see
section 5 above), yet without the fit function. Assign it
to an object called svm_spec. However, instead of giving
the “cost” and “rbf_sigma” a value, this time only specify that these
will be hyperparameters that need to be tuned! Do so by giving them the
value tune().
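A sketch of the data recipe specification, mirroring section 3 but without the prep step:
# Defining data recipe specification (without prepping it)
dat_rec <- recipe(isGrazing ~ ., data = dat_train) %>%
  step_rm(IDburst) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())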
# Defining model specification (without fitting it to data)
svm_spec <- svm_rbf(mode = "classification",
cost = tune(),
rbf_sigma = tune()) %>%
set_engine("kernlab")
Exercise 13.8c
Now that you’ve created a data recipe specification
(dat_rec) as well as a model specification
(svm_spec), let’s combine them into a workflow
object: do so using the workflow function to create an
empty workflow, then adding the model specification
svm_spec using the add_model function, and
then add the data recipe specification dat_rec using the
add_recipe function. Combine these into a single object
called svm_wf using the pipe operator %>%.
Inspect the workflow object svm_wf by printing it to the
console.
# Bundle specification of data recipe and model into workflow
svm_wf <- workflow() %>%
add_model(svm_spec) %>%
add_recipe(dat_rec)
svm_wf
You’re almost ready to start tuning the hyperparameters! But first
you’ll need to specify a strategy for the values of the hyperparameters
that you’ll be assessing.
Using the functions rbf_sigma and cost, you
can set the range of values for which the parameters will be evaluated
during tuning. Set parameter “rbf_sigma” to be sampled in a range from
-5 to 0 (this is on a log10 scale by default!), which you can do using
rbf_sigma(range = c(-5,0)). Similarly, specify that the
“cost” parameter should be sampled in a range from -4 to 8 (on a log2
scale!), using cost(range = c(-4,8)).
Use these two hyperparameters at their given scales to set up a
regular grid for hyperparameter tuning: do this using the
grid_regular function, in which both hyperparameter tuning
specifications (as shown above) are included, and where you will also
need to provide a value for the argument levels, which is
the number of values each hyperparameter will get. Setting
levels to a high value will lead to many combinations of
hyperparameters that need to be tuned, so for now set it to a low
value, e.g. 5 (later, if everything runs fine and in a manageable amount
of time, you can increase it to e.g. 10 or higher).
Assign the resultant tuning grid to an object called
svm_grid.
After you’ve created svm_grid, inspect it by printing it to
the console, and count the number of distinct values of
rbf_sigma as well as cost (using the
count function).
Note that in the code below, we use levels = 20, which
will lead to a fine grid for hyperparameter tuning, but also long
computation time!
# Specify grid of hyperparameter values for which to assess fit
svm_grid <- grid_regular(rbf_sigma(range = c(-5,0)),
cost(range = c(-4,8)),
levels = 20)
svm_grid
Now you’re all set to tune the hyperparameters! For this use the
function tune_grid, which takes a workflow
specification (the object svm_wf as created above) as
input, as well as inputs to the arguments resamples (the
data-split object dat_split created above) and
grid (a tuning grid specification such as
svm_grid created above). Save the result to an object
called svm_res.
Note that running tune_grid can take a long time
(minutes), so this may be an ideal moment to stretch your legs,
get some water, tea or coffee, etc.! And, as mentioned above, start with
a low value of the levels argument in the specification of
the tuning grid (in the function grid_regular), and only
increase this value if the command runs without error and yields
promising outcomes.
# Tune the hyperparameters with the elements prepared above
svm_res <- svm_wf %>%
tune_grid(resamples = dat_split,
grid = svm_grid)
Exercise 13.8f
After having evaluated the model performance for all combinations of
hyperparameter values listed in the svm_grid object, we
can now extract the model performance metrics using the
collect_metrics function. By default, this function
computes some performance metrics per combination of hyperparameter
values; which ones depends on the machine learning problem at hand
(classification or regression). Here we are faced with a classification
problem, and the collect_metrics function thus evaluates
model performance using accuracy and roc_auc. By default, the
collect_metrics function summarises multiple values of
a performance metric for each combination of hyperparameter values into
a mean value (column mean) and its standard error (column
std_err), together with information on how many values were
summarised (column n). In object svm_res we
indeed get these summary statistics: it is based on 5-fold cross
validation (as specified in the data split object
dat_split), so we have 5 performance metric
values per combination of hyperparameter values.
Extract the performance metrics for the svm_res object.
Also, get the combination of hyperparameter values that results in the
highest roc_auc. Then, plot the mean value of the
roc_auc metric as function of cost and
rbf_sigma (for this you could opt for
geom_tile or geom_contour_filled).
# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- svm_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
filter(mean == max(mean))
best_hyperparms
# Plot the mean roc_auc as function of both hyperparameter values
svm_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
ggplot(mapping = aes(x = rbf_sigma, y = cost, z = mean)) +
geom_contour_filled(bins = 20) +
geom_point(data = best_hyperparms,
mapping = aes(x = rbf_sigma, y = cost)) +
scale_y_log10() +
scale_x_log10() +
labs(x = "Sigma", y = "Cost")
The plot above shows the model’s performance (mean AUC value over the
5 folds) as function of the hyperparameters cost and sigma. The plotted
point shows the value with the highest performance.
Note that if you tuned the hyperparameters on a coarser grid, then
the plot will be different (coarser).
Exercise 13.8g
Using the best combination of hyperparameters, update the model and its
predictions as done in the exercises of sections 2 - 7. Evaluate its
predictive performance, e.g. using the metrics and
conf_mat functions. Did the model predictions improve due
to the hyperparameter-tuning?
# Retrieve the best hyperparameters
best_hyperparms
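One possible way to do this (a sketch; svm_final and final_preds are names introduced here for illustration):
# Refit the SVM with the best hyperparameter values
svm_final <- svm_rbf(mode = "classification",
                     cost = best_hyperparms$cost[[1]],
                     rbf_sigma = best_hyperparms$rbf_sigma[[1]]) %>%
  set_engine("kernlab") %>%
  fit(isGrazing ~ ., data = baked_train)

# Predict on the validation data and evaluate
final_preds <- predict(svm_final, new_data = baked_val) %>%
  bind_cols(baked_val)
final_preds %>% metrics(truth = isGrazing, estimate = .pred_class)
final_preds %>% conf_mat(truth = isGrazing, estimate = .pred_class)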
Indeed, the predictions now are better than they were before tuning
the hyperparameters! Especially the predictions of the records where the
cattle were not grazing (i.e. the minority class) improved!
Note that in the exercises above we have conducted k-fold
cross validation with 5 folds; however, we did not actually set aside a
true test dataset! It would thus be best to perform a true and
final test of the model’s predictive performance with a test dataset
that ‘the model has not seen yet’. For example, we could do so by first
splitting the data into a test and a train/validation
dataset, the latter of which we could then further split, e.g. using
k-fold cross validation, and use for model training and
hyperparameter tuning (see the sketch below).
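One possible sketch of such a setup, assuming rsample’s group_initial_split function is available (rsample >= 1.0.0); dat_trainval, dat_test and dat_folds are names introduced here for illustration:
# Grouped hold-out test set (all records of an IDburst stay together)
dat_split0 <- group_initial_split(dat, group = IDburst, prop = 0.8)
dat_trainval <- training(dat_split0)
dat_test <- testing(dat_split0)

# Grouped 5-fold cross-validation on the remaining train/validation data
dat_folds <- group_vfold_cv(dat_trainval, group = IDburst, v = 5)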
9 Challenge
Above we have only focussed on a kernel SVM using the “kernlab”
implementation as the engine. As a challenge, try using a different
algorithm, e.g. a neural network (e.g. in the “nnet” or “keras”
implementation, i.e. engine) or a random forest (e.g. in the
“randomForest” or “ranger” implementation, i.e. engine). See
here
for examples. You can focus on mlp for neural networks (a
multilayer perceptron model, a.k.a. a single-layer, feed-forward neural
network) or rand_forest for random forests,
respectively.
When tuning their hyperparameters, start simple, e.g. by tuning at most 2
hyperparameters using a regular grid that does not have many levels.
A random forest could be fitted and tuned as follows:
# Load library for engine
library(ranger)
# Specify model
rf_spec <- rand_forest(trees = 1000,
mtry = tune(),
min_n = tune(),
mode = "classification") %>%
set_engine("ranger", seed = 1234567)
# Bundle specification of data recipe and model into workflow
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(dat_rec)
rf_wf
# Specify grid of hyperparameter values for which to assess fit
rf_grid <- grid_regular(mtry(range = c(1,5)),
min_n(range = c(1,10)),
levels = 10)
rf_grid
# Explore
rf_grid %>% count(mtry)
rf_grid %>% count(min_n)
# Tune the hyperparameters with the elements prepared above
rf_res <- rf_wf %>%
tune_grid(resamples = dat_split,
grid = rf_grid,
control = control_grid(save_pred = TRUE))
rf_res
# Collect performance metrics
rf_res %>%
collect_metrics(summarize = TRUE)
# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- rf_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
filter(mean == max(mean))
best_hyperparms
# Plot the mean roc_auc as function of both hyperparameter values
rf_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
ggplot(mapping = aes(x = mtry, y = min_n, z = mean)) +
geom_contour_filled(bins = 50, show.legend = FALSE) +
geom_point(data = best_hyperparms,
mapping = aes(x = mtry, y = min_n)) +
labs(x = "mtry", y = "min_n")
# Collect predictions
rf_preds <- rf_res %>%
collect_predictions() %>%
filter(mtry == best_hyperparms$mtry[[1]],
min_n == best_hyperparms$min_n[[1]])
# Get metrics
rf_preds %>%
metrics(truth = isGrazing,
estimate = .pred_class)
# confusion matrix
rf_preds %>%
conf_mat(truth = isGrazing,
estimate = .pred_class)
Tuning and fitting a neural network using Keras could be done
using:
# Load library for engine
library(keras)
# see installation instructions in ?install_keras()
# Specify model
nn_spec <- mlp(hidden_units = tune(),
penalty = tune(),
dropout = 0.0,
epochs = 20,
activation = "softmax") %>%
set_engine("keras", seed = 1234567) %>%
set_mode("classification")
# Bundle specification of data recipe and model into workflow
nn_wf <- workflow() %>%
add_model(nn_spec) %>%
add_recipe(dat_rec)
nn_wf
# Specify grid of hyperparameter values for which to assess fit
nn_grid <- grid_regular(hidden_units(range = c(1,10)),
penalty(range = c(-10,0)),
levels = 10)
nn_grid
# Explore
nn_grid %>% count(hidden_units)
nn_grid %>% count(penalty)
# Tune the hyperparameters with the elements prepared above
nn_res <- nn_wf %>%
tune_grid(resamples = dat_split,
grid = nn_grid,
control = control_grid(save_pred = TRUE))
nn_res
# Get best set of hyperparameters (in terms of roc_auc)
best_hyperparms <- nn_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
filter(mean == max(mean))
best_hyperparms
# Plot the mean roc_auc as function of both hyperparameter values
nn_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
ggplot(mapping = aes(x = hidden_units, y = penalty, z = mean)) +
geom_contour_filled(bins = 50, show.legend = FALSE) +
geom_point(data = best_hyperparms,
mapping = aes(x = hidden_units, y = penalty)) +
labs(x = "hidden_units", y = "penalty") +
scale_y_log10()
# Collect predictions
nn_preds <- nn_res %>%
collect_predictions() %>%
filter(hidden_units == best_hyperparms$hidden_units[[1]],
penalty == best_hyperparms$penalty[[1]])
# Get metrics
nn_preds %>%
metrics(truth = isGrazing,
estimate = .pred_class)
# confusion matrix
nn_preds %>%
conf_mat(truth = isGrazing,
estimate = .pred_class)
10 Submit your last plot
Submit your script file as well as a plot: either your last created
plot, or a plot that best captures your solution to the challenge.
Submit the files on Brightspace via Assessment > Assignments >
Skills day 13.
Note that your submission will not be graded or
evaluated. It is used only to facilitate interaction and to get insight
into progress.
11 Recap
Today, we’ve further explored the tidymodels packages,
including the workflows, dials and tune
packages that we used to perform hyperparameter tuning.
An R script of today’s exercises can be downloaded
here