Overview

Today’s goal: to get acquainted with using the tidymodels packages for model training and model evaluation.

Resources
Packages

tidymodels, consisting of: rsample, recipes, parsnip, yardstick, tune and other packages.

Main functions

initial_split, training, testing, linear_reg, set_engine, fit, predict, metrics, recipe, step_log, step_center, step_scale, step_normalize, prep, bake

Cheat sheets

No cheatsheets available yet

1 Introduction

We hope that by now the tidyverse package has given you some fluency in data transformation and visualization, as well as some tools to model data (as we saw yesterday in the tutorial). But do you recall that in the introduction lecture we resembled R to the wild west, as anything goes? Also in data modelling/machine learning, R contains an expansive universe of packages and functions, each with their own interfaces (function names, arguments names) and types of objects returned. For example, to run a simple linear regression model in R, you can choose to use the function lm, but also glm, glmnet or other functions and packages (and for other machine learning algorithms it is no different: for example a Random Forest can be run using either randomForest, ranger or xgboost). Altogether, this complicates the modelling step in a data science project.

Developing models can thus benefit from a more unified modelling syntax and a tidy philosophy. This is where the tidymodels package comes in: it provides a tidyverse-friendly unified interface to the modelling part of a data science project: from preprocessing, model training, model prediction to model validation. Tidymodels provides the tools needed to iterate and explore modelling tasks with a tidy philosophy, and shares a common philosophy (and a few libraries) with the tidyverse. The packages in tidymodels do not implement the machine learning algorithms themselves; rather they provide the unified interface to it. In doing so, they make the tasks around model fitting much easier.

Like tidyverse, the package tidymodels is actually a metapackage: when tidymodels is installed or loaded, a set of packages is installed or loaded, including rsample, recipes, parsnip and yardstick. These packages all have their own specific role in the modelling component of data science:

  • rsample can be used for different types of data sampling and subsetting (e.g. in train and test sets);
  • recipes is a package for data preprocessing: it allows you to record data recipes: the steps that need to be taken in the preprocessing of data;
  • parnip is a package that offers a common interface for model training: many algorithms implemented in R have different interfaces and arguments names and parsnip standardizes the interface for fitting models as well as the return values;
  • yardstick offers various model performance measures;
  • dials and tune can be used for (hyper-)parameter tuning

The gentle introduction to tidymodels places the tidymodels packages as follows in the workflow as shown in the R4DS book and in the lectures:

Tidymodels

In a way, the model step itself has sub-steps, for which tidymodels thus provides one or several packages:

Tidymodels workflow

Tidymodels is still young and in development (most packages still in version 0.x), so at present the amount of documentation is not very abundant, and supported models are not very many yet (but still many good ones are supported!). However, the tidy-philosophy, and also the tidymodels framework has large momentum, so it is evolving to be the central set of packages when it comes to modelling in R.

Some may now be wondering: what about the caret package? Many of the packages in tidymodels contain improved versions of the older, large and famous caret functions, made by the same authors of the tidymodels packages. The caret package does not adhere to a tidy-philosophy, and is a large package that tries to accommodate everything in 1 place. The tidymodels packages, however, have split up the various tasks in smaller functions in separate packages, all based on the same tidy principles.

2 Forest structure data

We will explore the tidymodels functions using the dataset that we used in tutorial 2: the dataset of aboveground biomass in forests across Europe and Asia (Schepaschenko et al. 2017).

Exercise 9.2a
You’ve already used the dataset stored in “Forest biomass data.csv” before, so load the tidyverse and tidymodels packages (install first if needed), and load the forest data in a tibble called dat.

library(tidyverse)
library(tidymodels)
dat <- read_csv("data/raw/forest/Forest biomass data.csv")

In tutorial 2, we’ve subsetted the data to aspen trees only, and plotted the aboveground biomass (variable AGB) as function of stand age (variable Stand_age), adding a straight line (geom_smooth with method = "lm") through the data:

dat_aspen <- subset(dat, Species == "aspen")
ggplot(data = dat_aspen) +
  geom_point(mapping = aes(x = Precip_annual, y = AGB, size = Stand_age)) +
  geom_smooth(mapping = aes(x = Precip_annual, y = AGB), method = "lm", se = FALSE)

Below we will be using this subset of data for aspen to explore some of the functions of tidymodels.

Since the focus today is on the modelling workflow, we are keeping the steps of data exploration and preparation to a bare minimum, and only use a very simple model: a linear regression as in the figure created above. Some steps of the modelling workflow using the tidymodels toolset we will not explore yet today, but rather next week.

3 Data splits

As discussed in the lectures, for predictive modelling we have to train the model on a separate part of the data and evaluate the model’s predictive performance on the withheld data (which was thus not used during model training). The rsample package provides several functions to automate such data splits. For example, the function initial_split can be used to apply the most basic train/test set splitting: randomising the part allocated to each of these sets. By default, 75% of the data is allocated to the training dataset and the remainder to the test dataset, but this can be manually using the prop function argument (which needs a proportion, thus value between 0 and 1, and not a percentage). The object returned by the function initial_split is not a tibble: it is an object of class rsplit, which holds all data but also the information on how the records are allocated to the train and test data splits.

Exercise 9.3a
Create a train/test split using the dat_aspen data as created above, and call it dat_split. Allocate 60% of the records to the training dataset. Have a look at the output of the created object dat_split: what do you see? Check the structure of the object dat_split using the functions names and glimpse.

dat_split <- initial_split(dat_aspen, prop = 0.6)
dat_split
## <Training/Testing/Total>
## <55/38/93>
names(dat_split)
## [1] "data"   "in_id"  "out_id" "id"
glimpse(dat_split)
## List of 4
##  $ data  : tibble [93 × 15] (S3: tbl_df/tbl/data.frame)
##   ..$ Plot_ID      : num [1:93] 24 439 440 963 1810 ...
##   ..$ Tree_species : chr [1:93] "Populus tremula" "Populus tremula" "Populus tremula" "Populus tremula" ...
##   ..$ Species_code : chr [1:93] "potr" "potr" "potr" "potr" ...
##   ..$ Species      : chr [1:93] "aspen" "aspen" "aspen" "aspen" ...
##   ..$ Origin       : chr [1:93] "Natural" "Natural" "Natural" "Natural" ...
##   ..$ Site_index   : num [1:93] 8 5 5 6 7 7 6 5 5 5 ...
##   ..$ Stand_age    : num [1:93] 30 55 51 26 9 19 32 38 49 57 ...
##   ..$ Height       : num [1:93] 9 23 22.2 14 4.5 10.2 16.5 21.2 27.2 27.1 ...
##   ..$ DBH          : num [1:93] 10 20 19 12.8 1.9 6.8 10.5 17.2 22.5 29.5 ...
##   ..$ Tree_density : num [1:93] 2331 1458 1004 4999 15560 ...
##   ..$ Growing_stock: num [1:93] 96 430 292 248 14 ...
##   ..$ AGB          : num [1:93] 52.4 299.1 208 118 11.4 ...
##   ..$ Temp_annual  : num [1:93] 0.7875 -0.5458 -0.5458 0.0667 5.3625 ...
##   ..$ Precip_annual: num [1:93] 700 510 510 425 630 608 630 630 630 608 ...
##   ..$ CEC_Soil     : num [1:93] 41 64 64 38 25 32 25 25 25 32 ...
##  $ in_id : int [1:55] 82 62 47 85 68 61 16 77 52 58 ...
##  $ out_id: logi NA
##  $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
##   ..$ id: chr "Resample1"
##  - attr(*, "class")= chr [1:3] "initial_split" "mc_split" "rsplit"

To retrieve the data allocated to the training set, use the training function. Idem, use testing for retrieving the data allocated to the testing set.

Exercise 9.3b
Have a glimpse at the training data.

dat_split %>%
  training() %>%
  glimpse()
## Rows: 55
## Columns: 15
## $ Plot_ID       <dbl> 9135, 5784, 3347, 9146, 9112, 5783, 1823, 9122, 4950, 5778, 1816, 1818, 9126, 3325, 5775, 3361, 3333, 9118, 3331, 91…
## $ Tree_species  <chr> "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "P…
## $ Species_code  <chr> "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "pot…
## $ Species       <chr> "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen",…
## $ Origin        <chr> "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natur…
## $ Site_index    <dbl> 5, 6, 7, 5, 6, 6, 6, 7, 5, 5, 6, 7, 7, 7, 6, 6, 5, 7, 5, 5, 5, 6, 7, 8, 8, 6, 6, 8, 7, 5, 8, 6, 9, 5, 8, 5, 6, 6, 5,…
## $ Stand_age     <dbl> 26, 41, 26, 26, 55, 35, 33, 11, 40, 36, 2, 16, 10, 13, 26, 25, 28, 55, 8, 55, 67, 54, 90, 11, 26, 75, 9, 35, 15, 42,…
## $ Height        <dbl> 12.0, 19.7, 12.2, 14.0, 24.0, 17.0, 19.0, 5.0, 21.0, 19.0, 1.7, 8.3, 4.9, 7.1, 13.7, 10.2, 16.6, 20.0, 4.9, 26.3, 28…
## $ DBH           <dbl> 12.6, 23.2, 9.8, 10.0, 29.6, 18.4, 11.4, 2.5, 20.0, 18.7, 1.1, 4.9, 2.2, 4.6, 10.9, 8.3, 12.6, 17.0, 2.6, 19.2, 30.5…
## $ Tree_density  <dbl> 2645, 526, 4730, 2011, 500, 800, 2600, 32200, 790, 1883, 92000, 10080, 11630, 8530, 2210, 5550, 1412, 1654, 22788, 9…
## $ Growing_stock <dbl> 184.0, 192.0, 193.0, 112.0, 375.0, 178.0, 234.0, 33.0, 208.0, 255.0, 9.7, 92.9, 12.1, 56.2, 134.0, 186.0, 128.0, 258…
## $ AGB           <dbl> 112.5500, 100.6000, 92.2000, 70.8000, 193.5000, 92.2000, 111.4280, 21.2400, 97.1800, 130.0200, 9.3941, 52.3977, 5.70…
## $ Temp_annual   <dbl> 5.5166667, 2.1625000, 0.7875001, 2.4375001, 4.1500000, 2.1625000, 3.6583333, 3.8291668, 5.4333334, 1.3125001, 3.6583…
## $ Precip_annual <dbl> 530, 319, 700, 604, 700, 319, 635, 593, 674, 357, 635, 635, 547, 515, 357, 477, 387, 635, 387, 695, 520, 555, 548, 3…
## $ CEC_Soil      <dbl> 57.0, 29.0, 41.0, 30.0, 29.0, 29.0, 41.0, 30.5, 32.5, 31.5, 41.0, 41.0, 29.5, 52.0, 31.5, 29.5, 33.0, 25.5, 33.0, 36…
Exercise 9.3c
Retrieve the training and testing data, and assign both sets of data to new objects, called dat_train and dat_test, reespectively.

dat_train <- dat_split %>% training()
dat_test  <- dat_split %>% testing()

Avoid data leakage! Data leakage, according to quora, refers to an error made by the creator of a machine learning model in which accidentally information is shared between the training and test data sets. In general, by dividing a set of data into training and test sets, the goal is to ensure that no data is shared between the two.

The rsample package contains more complex and advanced methods for splitting data in training and testing sets (for example, in the function initial_split you can use the argument strata to conduct stratified sampling; the function initial_time_split takes the first prop samples for training, instead of a random selection; the function vfold_cv randomly splits the data into V groups of roughly equal size (called folds), thus useful for V-fold cross validation, whereas loo_cv can be used for leave-one-out cross validation. We will see some of these other forms of data splits later, for now we stick to the random allocation of all data in train and test sets.

4 Model training

As introduced above, the tidymodels package, and here in particular the parsnip package, aims at providing a unified interface to the various machine leaning algorithms offered in R (currently a selection, but this is increasing). So instead of different functions (possibly from different packages) that implements a similar algorithm (e.g. Random Forest, linear regression or Support Vector Machine), parsnip provides a single interface (i.e. function with standardized argument names) for each of the classes of models. See this list of models currently implemented in parsnip.

For example, to execute a linear regression model using tidymodels, we can thus use the function linear_reg.

Exercise 9.4a
Check the help file of the linear_reg function from the parsnip package.

?parsnip::linear_reg

We see that the linear_reg function has input arguments: mode, engine, penalty, and mixture. The mode can be set to “regression” (default) or “classification”, thus specifying the task that the algorithm should perform. The engine specifies the computational engine that should be used for fitting: these are the different implementations of an algorithm (here linear model) that can be used to fit the model to data. The arguments penalty and mixture can be used to specify the type and amount of regularization in the model (e.g. the regularization parameter for lasso and ridge regression).

Although you can thus specify the engine when defining the model (here e.g. via the linear_reg function), below we will specify the engine separately, which allows for more options, e.g. setting engine arguments separately.

What the function linear_reg allows us to do is to specify the model before we are going to fit it to data.

Exercise 9.4b
Use the linear_reg function to specify a linear regression model.

linear_reg(mode = "regression")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Before we can fit the model, we have to specify which engine is going to do the fitting for us. The function set_engine is used to specify which package or system will be used to fit the model, along with any arguments specific to that software. See in the list of models that for a linear regression (linear_reg) we can for example choose the engine lm, which will use the lm function in base R to estimate the specified linear regression for us. Like linear_reg specifies the type of model we want to fit to the data, set_engine specifies the engine that is doing the actual estimation. However, it is not yet going to conduct the estimation, i.e. it is not yet fitting the model to the data. For that, we need to call the fit function.

Exercise 9.4c
Check the help files of the set_engine and fit functions

?set_engine
?fit

Do you notice that both functions start with the argument object? This is convenient (and deliberately chosen), because we can now combine these functions in a pipe (like in tidyverse where the first argument in most functions is .data so that we can combine them in a pipe). In the fit function, we can specify the linear regression model just as we would do using the lm model (recall that a linear regression is carried out using the lm function using the syntax lm(y ~ x, data = dat, where y is the response variable, x the predictor variable, and dat the data).

Exercise 9.4d
Try to combine the 3 separate steps of model specification and fitting into a single pipe, in which you fit a linear regression with the data dat_train, where the column AGB is the response variable, the column Stand_age the predictor variable). Set the engine used for fitting to lm. Assign the fitted model to the object with name dat_lm. Check the output by printing it to the console

dat_lm <- linear_reg(mode = "regression") %>%
  set_engine("lm") %>%
  fit(AGB ~ 1 + Stand_age,
      data = dat_train)
dat_lm
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = AGB ~ 1 + Stand_age, data = data)
## 
## Coefficients:
## (Intercept)    Stand_age  
##      -2.794        3.056

Do you notice that you now have fitted a “parnip model object”? This is very convenient: not only the interface to the engine that is fitting the specified model is unified, also the output returned is always in a unified format. Parsnip, and thus tidymodels, thus provides a unified way of fitting models to data in R!

5 Model predictions

Using the predict function, we can predict a fitted model on (new) data. In tidymodels, also the data that is returned by the predict function is standardized: it returns a tibble with the column .pred containing the model predictions. It can be merged with the data simply using the bind_cols function.

Exercise 9.5a
Predict the fitted model on the test dataset (dat_test)

dat_lm %>% 
  predict(dat_test)
## # A tibble: 38 × 1
##    .pred
##    <dbl>
##  1 165. 
##  2  76.7
##  3  24.7
##  4  55.3
##  5 171. 
##  6  61.4
##  7  70.6
##  8 126. 
##  9 138. 
## 10 181. 
## # ℹ 28 more rows
Exercise 9.5b
Repeat the previous excercise, but now bind the output predictions to the dat_test data and assign the result to a new object called dat_testfit

dat_testfit <- dat_lm %>% 
  predict(dat_test) %>%
  bind_cols(dat_test)
dat_testfit
## # A tibble: 38 × 16
##    .pred Plot_ID Tree_species    Species_code Species Origin  Site_index Stand_age Height   DBH Tree_density Growing_stock   AGB Temp_annual
##    <dbl>   <dbl> <chr>           <chr>        <chr>   <chr>        <dbl>     <dbl>  <dbl> <dbl>        <dbl>         <dbl> <dbl>       <dbl>
##  1 165.      439 Populus tremula potr         aspen   Natural          5        55   23    20           1458          430. 299.      -0.546 
##  2  76.7     963 Populus tremula potr         aspen   Natural          6        26   14    12.8         4999          248  118.       0.0667
##  3  24.7    1810 Populus tremula potr         aspen   Natural          7         9    4.5   1.9        15560           14   11.4      5.36  
##  4  55.3    1811 Populus tremula potr         aspen   Natural          7        19   10.2   6.8         3020           60   25.4      5.25  
##  5 171.     1815 Populus tremula potr         aspen   Natural          5        57   27.1  29.5          690          364  121.       5.25  
##  6  61.4    1820 Populus tremula potr         aspen   Natural          7        21   10.7   6.5         6700          127   58.4      3.66  
##  7  70.6    1822 Populus tremula potr         aspen   Natural          6        24   13.7   8.5         3800          144   71.1      3.66  
##  8 126.     1825 Populus tremula potr         aspen   Natural          6        42   19.9  14.8         1600          251  116.       3.66  
##  9 138.     1826 Populus tremula potr         aspen   Natural          5        46   22.9  20.6          880          316  157.       3.66  
## 10 181.     2434 Populus tremula potr         aspen   Natural          8        60   18    20           1633          454  263.       2.51  
## # ℹ 28 more rows
## # ℹ 2 more variables: Precip_annual <dbl>, CEC_Soil <dbl>
Exercise 9.5c
Visualise the predictions of the fitted model as function of stand age.

dat_testfit %>%
  ggplot(mapping = aes(x = Stand_age, y = .pred)) + 
  geom_line(col = "red") + 
  geom_point(mapping = aes(y = AGB))

6 Model validation

The yardstick package offers the function metrics which returns several measures of model performance.

Exercise 9.6a
Check the help function of the metrics function, and pay attention to the first function arguments.

?metrics

Notice the first 3 function arguments: data, truth, and estimate, where data should be a data.frame/tibble, and truth and estimate the names of the columns containing the actual and predicted data in data. Which metrics are returned is dependent on the type of task performed (e.g. classification vs. regression), and the type of algorithm used.

Exercise 9.6b
Compute the metrics for the model predictions on the test set

dat_testfit %>%
  metrics(truth = AGB, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      79.0  
## 2 rsq     standard       0.307
## 3 mae     standard      46.4

7 Preprocessing recipes

The recipes package uses a cooking metaphor to handle all the data preprocessing, like missing values imputation, removing predictors, centring and scaling, one-hot-encoding, and more1. Above, we have not used it yet, as we did no preprocessing of the data. Using the recipes package, we can store all the preprocessing steps in a single object, so that it can later be applied to new (raw) input data to result in a preprocessed version of the raw data. In that way, using a data recipe we can preprocess raw data that after preprocessing is ready for use in a machine learning algorithm.

The recipes package contains many possible transformations that you can use to preprocess the data, all functions start with step_. Using these steps, you can define a recipe for the transformations that you want to apply to the data. Then, you can prep the recipe, like you would be mixing the ingredients.

Recall from above the problem of data leakage (e.g: transferring information from the train set into the test set): data should be only be prepped with the training data only!

Finally, to continue with the cooking metaphor, you can bake the recipe to apply all preprocessing to the data sets: indeed this can be either the training dataset or the testing dataset!

To practice, let’s prepare a data recipe that formulates a model structure, and performs some transformations (centering and scaling) on the predictor variable.

Exercise 9.7a
Create a data recipe using the training data dat_train, that specifies that the response is AGB and the predictor variable is Stand_age, where the predictor variable is (1) transformed via a log-transformation, (2) centered (i.e. the mean of the variable is subtracted from all values), and (3) scaled (i.e. all values are divided by the standard deviation, so that the scale of the data is in units variance). Assign the data recipe to an object called lmrecipe, and look at the object by printing it to the console.

lmrecipe <- dat_train %>%
  recipe(AGB ~ 1 + Stand_age) %>%
  step_log(Stand_age) %>%
  step_center(Stand_age) %>%
  step_scale(Stand_age)
lmrecipe
Exercise 9.7b
Now, prepare the recipe. Save the prepared recipe under the same name: lmrecipe and have a look at the output.

lmrecipe <- lmrecipe %>%
  prep()
lmrecipe

You now see that some more information is added to the data recipe: it not only states that some columns are transformed (log, centering, scaling), but also where the information for the transformations is taken from: here from [trained].

Exercise 9.7c
Now, as a final step of data preprocessing, bake your data recipe. ‘Bake’ both datasets (dat_train and dat_test), and store the resultant preprocessed dataset as preppedTrain and preppedTest, respectively. Have a look at both datasets.

preppedTrain <- lmrecipe %>%
  bake(dat_train)
preppedTest  <- lmrecipe %>%
  bake(dat_test)
preppedTrain
## # A tibble: 55 × 2
##    Stand_age   AGB
##        <dbl> <dbl>
##  1    -0.186 113. 
##  2     0.462 101. 
##  3    -0.186  92.2
##  4    -0.186  70.8
##  5     0.880 194. 
##  6     0.237  92.2
##  7     0.153 111. 
##  8    -1.41   21.2
##  9     0.427  97.2
## 10     0.277 130. 
## # ℹ 45 more rows
preppedTest
## # A tibble: 38 × 2
##    Stand_age   AGB
##        <dbl> <dbl>
##  1     0.880 299. 
##  2    -0.186 118. 
##  3    -1.69   11.4
##  4    -0.632  25.4
##  5     0.931 121. 
##  6    -0.489  58.4
##  7    -0.299  71.1
##  8     0.496 116. 
##  9     0.626 157. 
## 10     1.00  263. 
## # ℹ 28 more rows

Now, in each dataset you see only 2 columns: where Stand_age is transformed using the above data recipe! Isn’t this cool?! Once you’ve set up a data recipe, you can bake your raw data in 1 statement to a format that is suitable input for machine learning algorithms. Here, we’ve only done some very small data transformations (log, center and scale), but the list of step_ functions is very long: for example we can convert factor columns into dummy columns (columns with value 0 or 1, so that each level in the factor gets a unique combination of zeros and ones).

8 Challenge

Challenge
With the preprocessed data preppedTrain and preppedTest, and with the parsnip model specification and fitting as shown above, now train a linear regression model with AGB as response and the transformed Stand_age as predictor. Predict the fitted model to the test dataset, and plot the results in graph (plot both the raw datapoints, as well as the fitted regression line). Plot the original values for Stand_age, and not the transformed values!

# Fit
new_lm <- linear_reg(mode = "regression") %>%
  set_engine("lm") %>%
  fit(AGB ~ 1 + Stand_age,
      data = preppedTrain)

# Predict
new_testfit <- new_lm %>% 
  predict(preppedTest) %>%
  bind_cols(dat_test)

# Plot
new_testfit %>%
  ggplot(aes(x=Stand_age, y=AGB)) +
  geom_point() +
  geom_line(aes(y=.pred), col="red")

9 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 9.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

10 Recap

Today, we’ve entered the tidymodels ecosystem (itself thus a collection of packages akin to tidyverse). Tomorrow, we’ll practice making reproducible reports using RMarkdown with a challenge to build an interactive Shiny app!

An R script of today’s exercises can be downloaded here

11 Further reading