Overview

Today’s goal: to start fitting models to data, were we focus on fitting linear regression models.

Resources

R4DS: the modelling part: introduction, basics, model building, many models

Packages

modelr, purrr, broom

Main functions

data_grid add_predictions add_residuals map nest glance

Cheat sheets

purrr

1 Introduction

After formatting and tidying data yesterday and the day before, we will finally start modelling! We will use the tools that you learned yesterday (including the map function from the purrr package) to fit a series of linear models to groups of a large dataset.

We will further explore the data from the Chronicles of Nature dataset, as recently made publicly available in Ovaskainen et al. 2020 On Monday, we already used this dataset, although for only 1 nature reserve. Now we are going to use all data: the full dataset consists of more than half a million records from 471 localities in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan, in which nature reserves recorded the timing of specified climatic and phenological events. The data goes back to as early as 1890, but most data is post 1960. We will look at a phenological trend in the the onset of blooming of the plant species Coltsfoot, Tussilago farfara.

First, we will fit just one linear model to data from one site only (Kivach; where we focussed on in tutorial 6) using modelr. Second, we will fit models for various study areas at once, by creating a list-column in a nested data.frame/tibble, a very nice option in tidyr and modelr.

Exercise 8.1a
Download the dataset for all nature reserves (“chroniclesofnature_v2.csv”) from Brightspace > Skills > Datasets > Phenology. Remember to save the data file in your raw data folder (for example: “data/raw/phenology/”). Load (and install if need be) the tidyverse, modelr, and broom packages. Import the data into a tibble called dat. When loading the data from the csv file, make sure that the column “studysite” is loaded as a character column, while all other columns are loaded as numeric values!

When loading the csv file using the read_csv function, the column types are guessed by the function. However, you can specify the type of a column using the col_types argument, where you can specify the column types using the function cols. In this function, you can set the type of a particular column, as well as set the default column type. For example, the following function call loads in the file “input.csv” while specifying that the default column type is “character”, while the column “weight” and “length” are “numeric”, and the column “age” is “integer”:

dat <- read_csv("input.csv",
                col_types = cols(weight = col_double(),
                                 length = col_double(),
                                 age = col_integer(),
                                 .default = col_character()))

# Preliminaries
library(tidyverse)
library(broom)
library(modelr)

# Load data
dat <- read_csv("data/raw/phenology/chroniclesofnature_v2.csv",
                col_types = cols(studysite = col_character(),
                                 .default = col_double()))
dat
## # A tibble: 4,532 × 10
##     year studysite Bombus_1st_occurrence Bombus_sylvarum_1st_occurrence Bombus_terrestris_1s…¹ Salix_caprea_onset_o…² Scolopax_rusticola_s…³
##    <dbl> <chr>                     <dbl>                          <dbl>                  <dbl>                  <dbl>                  <dbl>
##  1  1923 FC_0062                      NA                             NA                     NA                     NA                     NA
##  2  1923 FC_0064                      NA                             NA                     NA                     NA                     NA
##  3  1924 FC_0053                      NA                             NA                     NA                     NA                     NA
##  4  1924 FC_0062                      NA                             NA                     NA                     NA                     NA
##  5  1924 FC_0064                      NA                             NA                     NA                     NA                     NA
##  6  1924 FC_0124                      NA                             NA                     NA                     NA                     NA
##  7  1925 FC_0053                      NA                             NA                     NA                     NA                     NA
##  8  1925 FC_0062                      NA                             NA                     NA                     NA                     NA
##  9  1925 FC_0064                      NA                             NA                     NA                     NA                     NA
## 10  1926 FC_0003                      NA                             NA                     NA                     NA                     NA
## # ℹ 4,522 more rows
## # ℹ abbreviated names: ¹​Bombus_terrestris_1st_occurrence, ²​Salix_caprea_onset_of_blooming, ³​Scolopax_rusticola_start_of_display_flights
## # ℹ 3 more variables: Temperature_transition_above_0 <dbl>, Tussilago_farfara_onset_of_blooming <dbl>,
## #   Zootoca_vivipara_1st_occurrence <dbl>

2 Fitting a linear model

Exercise 8.2a
Select a subset of the dataset that contains the Kivach site only (based on the “studysite” column), which was the part of the dataset that you explored in tutorial 6. Fit a linear model of the onset of blooming of Tussilago farfara (column “Tussilago_farfara_onset_of_blooming”) as function of year. Make a scatterplot to show the results, and include a line to show the predicted values using the predictions of a linear model.

Define the linear model with lm, and use data_grid and add_predictions. Like many other functions in R, the lm function uses a formula notation to specify the response and predictor variable(s). For example, the formula notation y ~ x specifies that “y” is a dependent function of “x” (and by default an intercept), thus “y” here is the response variable, and “x” the predictor variable.


dat2 <- dat %>%
  filter(studysite == "Kivach")

mod1 <- lm(Tussilago_farfara_onset_of_blooming ~ year,
           data = dat2)

grid <- dat2 %>% 
  data_grid(year) %>%
  add_predictions(mod1) 

ggplot() +
  geom_point(data = dat2,
             mapping = aes(x = year,
                           y = Tussilago_farfara_onset_of_blooming)) +
  geom_line(data = grid,
            mapping = aes(x = year,
                          y = pred),
            colour = "red",
            size = 1)

3 Fitting multiple linear models

Multiple linear models can be fitted, by first creating a nested data.frame/tibble, and then fitting a model to each element in the list-column containing the data.

Exercise 8.3a
Exclude all rows from the dat tibble where the column “Tussilago_farfara_onset_of_blooming” contains values that are NA. Then, nest the data by studysite. Assign the result to a new object called by_site. At how many sites does Tussilago farfara occur?

To filter out all records for which a column contains values that are NA, use the filter function in combination with is.na to exclude records that are NA is a specific column. Alternatively, the function drop_na can be used, that automatically drops records based on NA values in specified column(s). Create a grouped data.frame/tibble with group_by and nest it with nest.


by_site <- dat %>%
  filter(!is.na(Tussilago_farfara_onset_of_blooming)) %>%
  group_by(studysite) %>% 
  nest()
nrow(by_site)
## [1] 180

Thus, at 180 sites. Alternatively, using the drop_na function:

# Equivalently with drop_na() function
by_site <- dat %>%
  drop_na(Tussilago_farfara_onset_of_blooming) %>%
  group_by(studysite) %>% 
  nest()
Exercise 8.3b
You will be fitting a linear model with the built-in lm function to each site, thus to each element in the list-column “data” in the object created above. Write your own function to fit a linear model, for example with the name site_model, which takes a data.frame/tibble is input, and returns a fitted linear model, with “year” as predictor and “Tussilago_farfara_onset_of_blooming” as response.

You can use the function function. Check tutorial 7 as well.


site_model <- function(df) {
  y = lm(Tussilago_farfara_onset_of_blooming ~ year,
         data = df)
  return(y)
}
Exercise 8.3c
Now use the map function from the purrr package to make a new column in your nested tibble with the name model with the model fits per site. Thus, use the function created in the previous exercise to fit the linear model to each site. The new column should thus contain the fitted model object for each site!

You can use mutate to create a new column, in combination with map and the function created above.


by_site <- by_site %>% 
  mutate(model = map(data, site_model))

Like many other operations in R, the above steps can also be done in a different way, e.g. by using group_by and then within the summarise function create list-columns explicitly via the list function, where the entire data.frame of the current group can be accessed via the cur_data function. Thus, the current set of data can be assigned to each element in the list-column ‘data’, and likewise a fitted linear model can be assigned to each element in a list-column ‘model’.

dat %>%
  drop_na(Tussilago_farfara_onset_of_blooming) %>%
  group_by(studysite) %>%
  summarize(data = list(cur_data()),
            model = list(lm(Tussilago_farfara_onset_of_blooming ~ year,
                            data = cur_data())))
## # A tibble: 180 × 3
##    studysite                                          data              model 
##    <chr>                                              <list>            <list>
##  1 Altajskij_Bele                                     <tibble [26 × 9]> <lm>  
##  2 Altajskij_Yailu                                    <tibble [28 × 9]> <lm>  
##  3 Altajskij_Yailu_No2                                <tibble [23 × 9]> <lm>  
##  4 Bashkiriya                                         <tibble [5 × 9]>  <lm>  
##  5 Brjanskij les                                      <tibble [7 × 9]>  <lm>  
##  6 Bryanskiy Les_Plant_Loc-17                         <tibble [7 × 9]>  <lm>  
##  7 Bryanskiy Les_Plant_Loc-4                          <tibble [3 × 9]>  <lm>  
##  8 CarpathianBiosphereReserve_Kuzi-Tribushansky_Masyv <tibble [29 × 9]> <lm>  
##  9 Central'no-Lesnoj                                  <tibble [20 × 9]> <lm>  
## 10 Darvinskij                                         <tibble [45 × 9]> <lm>  
## # ℹ 170 more rows

Note, however, that the cur_data function has now been deprecated in favour of the pick function, which can now be used equivalently by calling pick(everything()):

# Equivalently with the pick function
dat %>%
  drop_na(Tussilago_farfara_onset_of_blooming) %>%
  group_by(studysite) %>%
  summarize(data = list(pick(everything())),
            model = list(lm(Tussilago_farfara_onset_of_blooming ~ year,
                            data = pick(everything()))))
## # A tibble: 180 × 3
##    studysite                                          data              model 
##    <chr>                                              <list>            <list>
##  1 Altajskij_Bele                                     <tibble [26 × 9]> <lm>  
##  2 Altajskij_Yailu                                    <tibble [28 × 9]> <lm>  
##  3 Altajskij_Yailu_No2                                <tibble [23 × 9]> <lm>  
##  4 Bashkiriya                                         <tibble [5 × 9]>  <lm>  
##  5 Brjanskij les                                      <tibble [7 × 9]>  <lm>  
##  6 Bryanskiy Les_Plant_Loc-17                         <tibble [7 × 9]>  <lm>  
##  7 Bryanskiy Les_Plant_Loc-4                          <tibble [3 × 9]>  <lm>  
##  8 CarpathianBiosphereReserve_Kuzi-Tribushansky_Masyv <tibble [29 × 9]> <lm>  
##  9 Central'no-Lesnoj                                  <tibble [20 × 9]> <lm>  
## 10 Darvinskij                                         <tibble [45 × 9]> <lm>  
## # ℹ 170 more rows
Exercise 8.3d
Add a new column preds with model predictions to your nested tibble as well, using the map2 function from the purrr package.

You can use mutate to create a new column, and then use map2 within mutate.


by_site <- by_site %>% 
  mutate(
    preds = map2(data, model, add_predictions)
  )
Exercise 8.3e
To be able to plot predictions into a graph we need to unnest the nested tibble. Use the unnest function to obtain the predictions, and include lines for all models in one graph.

preds <- unnest(by_site, preds)
preds
## # A tibble: 3,589 × 13
## # Groups:   studysite [180]
##    studysite data              model   year Bombus_1st_occurrence Bombus_sylvarum_1st_occurr…¹ Bombus_terrestris_1s…² Salix_caprea_onset_o…³
##    <chr>     <list>            <list> <dbl>                 <dbl>                        <dbl>                  <dbl>                  <dbl>
##  1 FC_0062   <tibble [60 × 9]> <lm>    1923                    NA                           NA                     NA                     NA
##  2 FC_0062   <tibble [60 × 9]> <lm>    1924                    NA                           NA                     NA                     NA
##  3 FC_0062   <tibble [60 × 9]> <lm>    1925                    NA                           NA                     NA                     NA
##  4 FC_0062   <tibble [60 × 9]> <lm>    1926                    NA                           NA                     NA                     NA
##  5 FC_0062   <tibble [60 × 9]> <lm>    1927                    NA                           NA                     NA                     NA
##  6 FC_0062   <tibble [60 × 9]> <lm>    1928                    NA                           NA                     NA                     NA
##  7 FC_0062   <tibble [60 × 9]> <lm>    1929                    NA                           NA                     NA                     NA
##  8 FC_0062   <tibble [60 × 9]> <lm>    1930                    NA                           NA                     NA                     NA
##  9 FC_0062   <tibble [60 × 9]> <lm>    1931                    NA                           NA                     NA                     NA
## 10 FC_0062   <tibble [60 × 9]> <lm>    1932                    NA                           NA                     NA                     NA
## # ℹ 3,579 more rows
## # ℹ abbreviated names: ¹​Bombus_sylvarum_1st_occurrence, ²​Bombus_terrestris_1st_occurrence, ³​Salix_caprea_onset_of_blooming
## # ℹ 5 more variables: Scolopax_rusticola_start_of_display_flights <dbl>, Temperature_transition_above_0 <dbl>,
## #   Tussilago_farfara_onset_of_blooming <dbl>, Zootoca_vivipara_1st_occurrence <dbl>, pred <dbl>
preds %>% 
  ggplot(aes(year, pred)) +
  geom_line(aes(group = studysite))

Exercise 8.3f
To assess how much variation in the onset of blooming of Tussilago is explained by year, look at the R2 of the models. The glance function from the broom package can be used to calculate model performance metrics for all models at once, and to add them to the nested tibble What is the range in R2 across the models for all study sites?

Use mutate(), and glance() from the broom package


R2_values <- by_site %>% 
               mutate(glance = map(model, broom::glance)) %>% 
               unnest(glance)

range(R2_values$r.squared)
## [1] 0 1
mean(R2_values$r.squared)
## [1] 0.1851501
R2_values %>%
  ggplot(aes(y = r.squared)) + 
  geom_boxplot()

4 Challenge

Challenge
Combine the results from the previous 2 exercises and colour the predictions per study site (the lines in the graphs) by the R2 of the model.

You need to combine preds and R2_values to link R2 values to the predictions.

preds %>%
  left_join(R2_values, by = "studysite") %>%
  ggplot(aes(year, pred)) +
  geom_line(aes(group = studysite, color = r.squared))

5 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 8.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

6 Recap

Today, we’ve fitted many models to data using the modelr and broom packages, while making clever use of list-columns. Tomorrow, we will enter a new realm, the metapackage tidymodels (itself thus a collection of packages akin to tidyverse).

An R script of today’s exercises can be downloaded here

7 Further reading