Today’s goal: to get acquainted with using the tidymodels packages for model training and model evaluation.
Packages: tidymodels, consisting of rsample, recipes, parsnip, yardstick, tune and other packages.
Functions: initial_split, training, testing, linear_reg, set_engine, fit, predict, metrics, recipe, step_log, step_center, step_scale, step_normalize, prep, bake
Cheatsheets: no cheatsheets available yet
We hope that by now the tidyverse package has given you some fluency in data transformation and visualization, as well as some tools to model data (as we saw yesterday in the tutorial). But do you recall that in the introduction lecture we likened R to the wild west, as anything goes? Also in data modelling/machine learning, R contains an expansive universe of packages and functions, each with their own interfaces (function names, argument names) and types of objects returned. For example, to run a simple linear regression model in R, you can choose to use the function lm, but also glm, glmnet or other functions and packages (and for other machine learning algorithms it is no different: a Random Forest, for example, can be run using randomForest, ranger or xgboost). Altogether, this complicates the modelling step in a data science project.
Developing models thus benefits from a unified modelling syntax and a tidy philosophy. This is where the tidymodels package comes in: it provides a tidyverse-friendly, unified interface to the modelling part of a data science project, from preprocessing and model training to model prediction and validation. Tidymodels provides the tools needed to iterate and explore modelling tasks, and shares a common philosophy (and a few libraries) with the tidyverse. The packages in tidymodels do not implement the machine learning algorithms themselves; rather, they provide a unified interface to them. In doing so, they make the tasks around model fitting much easier.
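As a small illustration (a sketch only; it assumes the ranger and randomForest packages are installed, and uses functions we will meet below), one and the same parsnip model specification can be handed to different engines:
rf_spec <- rand_forest(mode = "regression", trees = 500)
rf_spec %>% set_engine("ranger")        # fit via the ranger package
rf_spec %>% set_engine("randomForest")  # or via the randomForest package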
Like tidyverse, the package tidymodels is actually a metapackage: when tidymodels is installed or loaded, a set of packages is installed or loaded, including rsample, recipes, parsnip and yardstick. These packages all have their own specific role in the modelling component of data science:
The gentle introduction to tidymodels places the tidymodels packages as follows in the workflow as shown in the R4DS book and in the lectures:
In a way, the model step itself has sub-steps, for which tidymodels thus provides one or several packages:
Tidymodels is still young and in development (most packages are still at version 0.x), so at present documentation is not yet abundant and the number of supported models is still limited (though many good ones are supported!). However, the tidy philosophy and the tidymodels framework have large momentum, so tidymodels is evolving into the central set of packages for modelling in R.
Some may now be wondering: what about the caret package? Many of the packages in tidymodels contain improved versions of the older, large and famous caret functions, made by the same authors. The caret package does not adhere to a tidy philosophy and is a large package that tries to accommodate everything in one place. The tidymodels packages, by contrast, split the various tasks into smaller functions in separate packages, all based on the same tidy principles.
We will explore the tidymodels functions using the dataset that we used in tutorial 2: the dataset of aboveground biomass in forests across Europe and Asia (Schepaschenko et al. 2017).
Load the tidyverse and tidymodels packages, and read the data into an object called dat.
library(tidyverse)
library(tidymodels)
dat <- read_csv("data/raw/forest/Forest biomass data.csv")
In tutorial 2, we've subsetted the data to aspen trees only, and plotted the aboveground biomass (variable AGB) as a function of stand age (variable Stand_age), adding a straight line (geom_smooth with method = "lm") through the data:
dat_aspen <- subset(dat, Species == "aspen")
ggplot(data = dat_aspen) +
  geom_point(mapping = aes(x = Stand_age, y = AGB)) +
  geom_smooth(mapping = aes(x = Stand_age, y = AGB), method = "lm", se = FALSE)
Below we will be using this subset of data for aspen to explore some of the functions of tidymodels.
Since the focus today is on the modelling workflow, we keep the steps of data exploration and preparation to a bare minimum, and use only a very simple model: a linear regression as in the figure created above. Some steps of the modelling workflow using the tidymodels toolset we will not explore today, but rather next week.
As discussed in the lectures, for predictive modelling we have to train the model on one part of the data and evaluate the model's predictive performance on the withheld data (which was thus not used during model training). The rsample package provides several functions to automate such data splits. For example, the function initial_split can be used to apply the most basic train/test split: randomly allocating records to each of these sets. By default, 75% of the data is allocated to the training dataset and the remainder to the test dataset, but this can be set manually using the prop argument (which needs a proportion, thus a value between 0 and 1, and not a percentage). The object returned by initial_split is not a tibble: it is an object of class rsplit, which holds all data as well as the information on how the records are allocated to the train and test splits.
Use the function initial_split to split the dat_aspen data as created above, and call the result dat_split. Allocate 60% of the records to the training dataset. Have a look at the output of the created object dat_split: what do you see? Check the structure of the object dat_split using the functions names and glimpse.
dat_split <- initial_split(dat_aspen, prop = 0.6)
dat_split
## <Training/Testing/Total>
## <55/38/93>
names(dat_split)
## [1] "data" "in_id" "out_id" "id"
glimpse(dat_split)
## List of 4
## $ data : tibble [93 × 15] (S3: tbl_df/tbl/data.frame)
## ..$ Plot_ID : num [1:93] 24 439 440 963 1810 ...
## ..$ Tree_species : chr [1:93] "Populus tremula" "Populus tremula" "Populus tremula" "Populus tremula" ...
## ..$ Species_code : chr [1:93] "potr" "potr" "potr" "potr" ...
## ..$ Species : chr [1:93] "aspen" "aspen" "aspen" "aspen" ...
## ..$ Origin : chr [1:93] "Natural" "Natural" "Natural" "Natural" ...
## ..$ Site_index : num [1:93] 8 5 5 6 7 7 6 5 5 5 ...
## ..$ Stand_age : num [1:93] 30 55 51 26 9 19 32 38 49 57 ...
## ..$ Height : num [1:93] 9 23 22.2 14 4.5 10.2 16.5 21.2 27.2 27.1 ...
## ..$ DBH : num [1:93] 10 20 19 12.8 1.9 6.8 10.5 17.2 22.5 29.5 ...
## ..$ Tree_density : num [1:93] 2331 1458 1004 4999 15560 ...
## ..$ Growing_stock: num [1:93] 96 430 292 248 14 ...
## ..$ AGB : num [1:93] 52.4 299.1 208 118 11.4 ...
## ..$ Temp_annual : num [1:93] 0.7875 -0.5458 -0.5458 0.0667 5.3625 ...
## ..$ Precip_annual: num [1:93] 700 510 510 425 630 608 630 630 630 608 ...
## ..$ CEC_Soil : num [1:93] 41 64 64 38 25 32 25 25 25 32 ...
## $ in_id : int [1:55] 82 62 47 85 68 61 16 77 52 58 ...
## $ out_id: logi NA
## $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ id: chr "Resample1"
## - attr(*, "class")= chr [1:3] "initial_split" "mc_split" "rsplit"
To retrieve the data allocated to the training set, use the training function. Likewise, use testing to retrieve the data allocated to the testing set.
dat_split %>%
training() %>%
glimpse()
## Rows: 55
## Columns: 15
## $ Plot_ID <dbl> 9135, 5784, 3347, 9146, 9112, 5783, 1823, 9122, 4950, 5778, 1816, 1818, 9126, 3325, 5775, 3361, 3333, 9118, 3331, 91…
## $ Tree_species <chr> "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "Populus tremula", "P…
## $ Species_code <chr> "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "potr", "pot…
## $ Species <chr> "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen", "aspen",…
## $ Origin <chr> "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natural", "Natur…
## $ Site_index <dbl> 5, 6, 7, 5, 6, 6, 6, 7, 5, 5, 6, 7, 7, 7, 6, 6, 5, 7, 5, 5, 5, 6, 7, 8, 8, 6, 6, 8, 7, 5, 8, 6, 9, 5, 8, 5, 6, 6, 5,…
## $ Stand_age <dbl> 26, 41, 26, 26, 55, 35, 33, 11, 40, 36, 2, 16, 10, 13, 26, 25, 28, 55, 8, 55, 67, 54, 90, 11, 26, 75, 9, 35, 15, 42,…
## $ Height <dbl> 12.0, 19.7, 12.2, 14.0, 24.0, 17.0, 19.0, 5.0, 21.0, 19.0, 1.7, 8.3, 4.9, 7.1, 13.7, 10.2, 16.6, 20.0, 4.9, 26.3, 28…
## $ DBH <dbl> 12.6, 23.2, 9.8, 10.0, 29.6, 18.4, 11.4, 2.5, 20.0, 18.7, 1.1, 4.9, 2.2, 4.6, 10.9, 8.3, 12.6, 17.0, 2.6, 19.2, 30.5…
## $ Tree_density <dbl> 2645, 526, 4730, 2011, 500, 800, 2600, 32200, 790, 1883, 92000, 10080, 11630, 8530, 2210, 5550, 1412, 1654, 22788, 9…
## $ Growing_stock <dbl> 184.0, 192.0, 193.0, 112.0, 375.0, 178.0, 234.0, 33.0, 208.0, 255.0, 9.7, 92.9, 12.1, 56.2, 134.0, 186.0, 128.0, 258…
## $ AGB <dbl> 112.5500, 100.6000, 92.2000, 70.8000, 193.5000, 92.2000, 111.4280, 21.2400, 97.1800, 130.0200, 9.3941, 52.3977, 5.70…
## $ Temp_annual <dbl> 5.5166667, 2.1625000, 0.7875001, 2.4375001, 4.1500000, 2.1625000, 3.6583333, 3.8291668, 5.4333334, 1.3125001, 3.6583…
## $ Precip_annual <dbl> 530, 319, 700, 604, 700, 319, 635, 593, 674, 357, 635, 635, 547, 515, 357, 477, 387, 635, 387, 695, 520, 555, 548, 3…
## $ CEC_Soil <dbl> 57.0, 29.0, 41.0, 30.0, 29.0, 29.0, 41.0, 30.5, 32.5, 31.5, 41.0, 41.0, 29.5, 52.0, 31.5, 29.5, 33.0, 25.5, 33.0, 36…
Retrieve the training and testing data from dat_split, and assign them to dat_train and dat_test, respectively.
dat_train <- dat_split %>% training()
dat_test <- dat_split %>% testing()
Avoid data leakage! Data leakage, according to Quora, refers to an error made by the creator of a machine learning model in which information is accidentally shared between the training and test data sets. In general, by dividing a set of data into training and test sets, the goal is to ensure that no information is shared between the two.
The rsample package contains more complex and advanced methods for splitting data into training and testing sets. For example, in the function initial_split you can use the argument strata to conduct stratified sampling; the function initial_time_split takes the first prop samples for training, instead of a random selection; the function vfold_cv randomly splits the data into V groups of roughly equal size (called folds), which is useful for V-fold cross-validation; and loo_cv can be used for leave-one-out cross-validation. We will see some of these other forms of data splits later; for now we stick to the random allocation of all data into train and test sets.
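For illustration only (a sketch; these splits are not needed for today's exercises), these functions are called in much the same way as initial_split:
set.seed(123)                                             # make the random splits reproducible
initial_split(dat_aspen, prop = 0.6, strata = Stand_age)  # stratified sampling
initial_time_split(dat_aspen, prop = 0.6)                 # first 60% of rows as training data
vfold_cv(dat_aspen, v = 5)                                # 5 folds for V-fold cross-validation
loo_cv(dat_aspen)                                         # leave-one-out cross-validation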
As introduced above, the tidymodels package, and here in particular the parsnip package, aims to provide a unified interface to the various machine learning algorithms offered in R (currently a selection, but this is increasing). So instead of different functions (possibly from different packages) that implement a similar algorithm (e.g. Random Forest, linear regression or Support Vector Machine), parsnip provides a single interface (i.e. a function with standardized argument names) for each class of models. See this list of models currently implemented in parsnip.
For example, to execute a linear regression model using tidymodels, we can use the function linear_reg.
Look at the help file of the linear_reg function from the parsnip package.
?parsnip::linear_reg
We see that the linear_reg function has the input arguments mode, engine, penalty, and mixture. The mode can be set to "regression" (the default) or "classification", thus specifying the task that the algorithm should perform. The engine specifies the computational engine that should be used for fitting: these are the different implementations of an algorithm (here a linear model) that can be used to fit the model to data. The arguments penalty and mixture can be used to specify the type and amount of regularization in the model (e.g. the regularization parameter for lasso and ridge regression).
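For example (a hedged sketch, not needed today; it assumes the glmnet package is installed), a lasso regression could be specified as:
linear_reg(mode = "regression", engine = "glmnet",
           penalty = 0.1, mixture = 1)  # mixture = 1 gives lasso, mixture = 0 gives ridge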
Although you can thus specify the engine when defining the model (here e.g. via the linear_reg function), below we will specify the engine separately, which allows for more options, e.g. setting engine-specific arguments separately.
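For example (a sketch; the extra argument shown is an illustration, not something we need below), set_engine can pass arguments on to the underlying fitting function:
linear_reg(mode = "regression") %>%
  set_engine("lm", model = FALSE)  # model = FALSE is passed on to base R's lm()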
What the function linear_reg allows us to do is to specify the model before we fit it to data.
Use the linear_reg function to specify a linear regression model.
linear_reg(mode = "regression")
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Before we can fit the model, we have to specify which engine is going to do the fitting for us. The function set_engine is used to specify which package or system will be used to fit the model, along with any arguments specific to that software. See in the list of models that for a linear regression (linear_reg) we can, for example, choose the engine lm, which will use the lm function in base R to estimate the specified linear regression for us. Whereas linear_reg specifies the type of model we want to fit to the data, set_engine specifies the engine that does the actual estimation. However, it does not yet conduct the estimation, i.e. it does not yet fit the model to the data. For that, we need to call the fit function.
Check the help files of the set_engine and fit functions.
?set_engine
?fit
Do you notice that both functions start with the argument object? This is convenient (and deliberately chosen), because we can now combine these functions in a pipe (like in tidyverse, where the first argument in most functions is .data so that we can combine them in a pipe). In the fit function, we can specify the linear regression model just as we would using lm (recall that a linear regression is carried out using the lm function with the syntax lm(y ~ x, data = dat), where y is the response variable, x the predictor variable, and dat the data).
Specify and fit a linear regression model to the training data (dat_train, where the column AGB is the response variable and the column Stand_age the predictor variable). Set the engine used for fitting to lm. Assign the fitted model to an object with the name dat_lm. Check the output by printing it to the console.
dat_lm <- linear_reg(mode = "regression") %>%
  set_engine("lm") %>%
  fit(AGB ~ 1 + Stand_age,
      data = dat_train)
dat_lm
## parsnip model object
##
##
## Call:
## stats::lm(formula = AGB ~ 1 + Stand_age, data = data)
##
## Coefficients:
## (Intercept) Stand_age
## -2.794 3.056
Do you notice that you have now fitted a "parsnip model object"? This is very convenient: not only is the interface to the engine that fits the specified model unified, the returned output is also in a unified format. Parsnip, and thus tidymodels, provides a unified way of fitting models to data in R!
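A small aside (a sketch; tidy comes with the broom package, which tidymodels loads): the unified model object also works with generic helpers such as tidy, which returns the estimated coefficients as a tibble:
tidy(dat_lm)  # one row per coefficient, with estimates and standard errors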
Using the predict function, we can predict a fitted model on (new) data. In tidymodels, the data returned by the predict function is also standardized: it returns a tibble with the column .pred containing the model predictions. It can be merged with the data simply using the bind_cols function.
Predict the fitted model on the test data (dat_test).
dat_lm %>%
  predict(dat_test)
## # A tibble: 38 × 1
## .pred
## <dbl>
## 1 165.
## 2 76.7
## 3 24.7
## 4 55.3
## 5 171.
## 6 61.4
## 7 70.6
## 8 126.
## 9 138.
## 10 181.
## # ℹ 28 more rows
Merge the predictions with the dat_test data and assign the result to a new object called dat_testfit.
dat_testfit <- dat_lm %>%
  predict(dat_test) %>%
  bind_cols(dat_test)
dat_testfit
## # A tibble: 38 × 16
## .pred Plot_ID Tree_species Species_code Species Origin Site_index Stand_age Height DBH Tree_density Growing_stock AGB Temp_annual
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 165. 439 Populus tremula potr aspen Natural 5 55 23 20 1458 430. 299. -0.546
## 2 76.7 963 Populus tremula potr aspen Natural 6 26 14 12.8 4999 248 118. 0.0667
## 3 24.7 1810 Populus tremula potr aspen Natural 7 9 4.5 1.9 15560 14 11.4 5.36
## 4 55.3 1811 Populus tremula potr aspen Natural 7 19 10.2 6.8 3020 60 25.4 5.25
## 5 171. 1815 Populus tremula potr aspen Natural 5 57 27.1 29.5 690 364 121. 5.25
## 6 61.4 1820 Populus tremula potr aspen Natural 7 21 10.7 6.5 6700 127 58.4 3.66
## 7 70.6 1822 Populus tremula potr aspen Natural 6 24 13.7 8.5 3800 144 71.1 3.66
## 8 126. 1825 Populus tremula potr aspen Natural 6 42 19.9 14.8 1600 251 116. 3.66
## 9 138. 1826 Populus tremula potr aspen Natural 5 46 22.9 20.6 880 316 157. 3.66
## 10 181. 2434 Populus tremula potr aspen Natural 8 60 18 20 1633 454 263. 2.51
## # ℹ 28 more rows
## # ℹ 2 more variables: Precip_annual <dbl>, CEC_Soil <dbl>
dat_testfit %>%
ggplot(mapping = aes(x = Stand_age, y = .pred)) +
geom_line(col = "red") +
geom_point(mapping = aes(y = AGB))
The yardstick package offers the function metrics, which returns several measures of model performance.
Check the help file of the metrics function, and pay attention to the first function arguments.
?metrics
Notice the first 3 function arguments: data, truth, and estimate, where data should be a data.frame/tibble, and truth and estimate are the names of the columns in data containing the actual and predicted values. Which metrics are returned depends on the type of task performed (e.g. classification vs. regression) and the type of algorithm used.
dat_testfit %>%
metrics(truth = AGB, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 79.0
## 2 rsq standard 0.307
## 3 mae standard 46.4
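If you prefer a custom selection of metrics, yardstick's metric_set can bundle them into a single function (a sketch for illustration; my_metrics is our own name, not part of the exercise):
my_metrics <- metric_set(rmse, mae)          # only RMSE and MAE
dat_testfit %>%
  my_metrics(truth = AGB, estimate = .pred)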
The recipes package uses a cooking metaphor to handle all the data preprocessing, like missing value imputation, removing predictors, centering and scaling, one-hot encoding, and more. Above, we have not used it yet, as we did no preprocessing of the data. Using the recipes package, we can store all the preprocessing steps in a single object, so that they can later be applied to new (raw) input data to produce a preprocessed version of the raw data. In that way, using a data recipe we can preprocess raw data so that it is ready for use in a machine learning algorithm.
The recipes package contains many possible transformations that you can use to preprocess the data; all these functions start with step_. Using these steps, you can define a recipe for the transformations that you want to apply to the data. Then, you can prep the recipe, like you would mix the ingredients.
Recall from above the problem of data leakage (e.g. information about the test set leaking into the training procedure): a recipe should be prepped with the training data only!
Finally, to continue with the cooking metaphor, you can bake the recipe to apply all preprocessing to the data sets: this can be either the training dataset or the testing dataset!
To practice, let’s prepare a data recipe that formulates a model structure, and performs some transformations (centering and scaling) on the predictor variable.
Create a data recipe, based on dat_train, that specifies that the response is AGB and the predictor variable is Stand_age, where the predictor variable is (1) transformed via a log-transformation, (2) centered (i.e. the mean of the variable is subtracted from all values), and (3) scaled (i.e. all values are divided by the standard deviation, so that the data have unit variance). Assign the data recipe to an object called lmrecipe, and look at the object by printing it to the console.
lmrecipe <- dat_train %>%
  recipe(AGB ~ 1 + Stand_age) %>%
  step_log(Stand_age) %>%
  step_center(Stand_age) %>%
  step_scale(Stand_age)
lmrecipe
Prep the recipe lmrecipe and have a look at the output.
lmrecipe <- lmrecipe %>%
  prep()
lmrecipe
You now see that some more information has been added to the data recipe: it not only states that some columns are transformed (log, centering, scaling), but also that the information for these transformations has been estimated from the training data: the steps are now flagged as [trained].
Bake the prepped recipe with both datasets (dat_train and dat_test), and store the resultant preprocessed datasets as preppedTrain and preppedTest, respectively. Have a look at both datasets.
preppedTrain <- lmrecipe %>%
  bake(dat_train)
preppedTest <- lmrecipe %>%
bake(dat_test)
preppedTrain
## # A tibble: 55 × 2
## Stand_age AGB
## <dbl> <dbl>
## 1 -0.186 113.
## 2 0.462 101.
## 3 -0.186 92.2
## 4 -0.186 70.8
## 5 0.880 194.
## 6 0.237 92.2
## 7 0.153 111.
## 8 -1.41 21.2
## 9 0.427 97.2
## 10 0.277 130.
## # ℹ 45 more rows
preppedTest
## # A tibble: 38 × 2
## Stand_age AGB
## <dbl> <dbl>
## 1 0.880 299.
## 2 -0.186 118.
## 3 -1.69 11.4
## 4 -0.632 25.4
## 5 0.931 121.
## 6 -0.489 58.4
## 7 -0.299 71.1
## 8 0.496 116.
## 9 0.626 157.
## 10 1.00 263.
## # ℹ 28 more rows
Now, in each dataset you see only 2 columns, where Stand_age is transformed using the above data recipe! Isn't this cool?! Once you've set up a data recipe, you can bake your raw data in one statement into a format that is suitable input for machine learning algorithms. Here, we've only done some very small data transformations (log, center and scale), but the list of step_ functions is very long: for example, we can convert factor columns into dummy columns (columns with value 0 or 1, so that each level in the factor gets a unique combination of zeros and ones).
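For example (a sketch for illustration only; the choice of the Species column here is ours, not part of the exercise), step_dummy converts such a column into dummy columns:
recipe(AGB ~ Stand_age + Species, data = dat) %>%
  step_dummy(Species) %>%  # one 0/1 column per non-reference species
  prep() %>%
  bake(dat)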
Using the baked datasets preppedTrain and preppedTest, and with the parsnip model specification and fitting as shown above, now train a linear regression model with AGB as response and the transformed Stand_age as predictor. Predict the fitted model on the test dataset, and plot the results in a graph (plot both the raw data points and the fitted regression line). Plot the original values for Stand_age, and not the transformed values!
# Fit
new_lm <- linear_reg(mode = "regression") %>%
  set_engine("lm") %>%
  fit(AGB ~ 1 + Stand_age,
      data = preppedTrain)
# Predict
new_testfit <- new_lm %>%
  predict(preppedTest) %>%
  bind_cols(dat_test)
# Plot
new_testfit %>%
  ggplot(aes(x = Stand_age, y = AGB)) +
  geom_point() +
  geom_line(aes(y = .pred), col = "red")
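Optionally (mirroring the metrics step above), you could also quantify the performance of this model on the test set:
new_testfit %>%
  metrics(truth = AGB, estimate = .pred)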
Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 9.
Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.
Today, we've entered the tidymodels ecosystem (itself a collection of packages, akin to the tidyverse). Tomorrow, we'll practice making reproducible reports using RMarkdown, with a challenge to build an interactive Shiny app!
An R script of today’s exercises can be downloaded here