WUR | WEC | CSA | PPS | GRS

Overview

Today’s goal: to start fitting models to data, were we focus on fitting linear regression models.

Resources

R4DS: the modelling part: introduction, basics, model building, many models

Packages

modelr, purrr, broom

Main functions

data_grid add_predictions add_residuals map nest glance

Cheat sheets

purrr

1 Introduction

After formatting and tidying data yesterday and the day before, we will finally start modelling! We will use the tools that you learned yesterday (including the map function from the purrr package) to fit a series of linear models to groups of a large dataset.

We will further explore the data from the Chronicles of Nature dataset, as recently made publicly available in Ovaskainen et al. 2020 On Monday, we already used this dataset, although for only 1 nature reserve. Now we are going to use all data: the full dataset consists of more than half a million records from 471 localities in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan, in which nature reserves recorded the timing of specified climatic and phenological events. The data goes back to as early as 1890, but most data is post 1960. We will look at a phenological trend in the the onset of blooming of the plant species Coltsfoot, Tussilago farfara.

First, we will fit just one linear model to data from one site only (Kivach; where we focussed on in tutorial 6) using modelr. Second, we will fit models for various study areas at once, by creating a list-column in a nested data.frame/tibble, a very nice option in tidyr and modelr.

Exercise 8.1a

Download the dataset for all nature reserves (“chroniclesofnature_v2.csv”) from Brightspace > Skills > Datasets > Phenology. Remember to save the data file in your raw data folder (for example: “data/raw/phenology/”). Load (and install if need be) the tidyverse, modelr, and broom packages. Import the data into a tibble called dat. When loading the data from the csv file, make sure that the column “studysite” is loaded as a character column, while all other columns are loaded as numeric values!

When loading the csv file using the read_csv function, the column types are guessed by the function. However, you can specify the type of a column using the col_types argument, where you can specify the column types using the function cols. In this function, you can set the type of a particular column, as well as set the default column type. For example, the following function call loads in the file “input.csv” while specifying that the default column type is “character”, while the column “weight” and “length” are “numeric”, and the column “age” is “integer”:

dat <- read_csv("input.csv",
                col_types = cols(weight = col_double(),
                                 length = col_double(),
                                 age = col_integer(),
                                 .default = col_character()))

# Preliminaries
library(tidyverse)
library(broom)
library(modelr)

# Load data
dat <- read_csv("data/raw/phenology/chroniclesofnature_v2.csv",
                col_types = cols(studysite = col_character(),
                                 .default = col_double()))
dat

## # A tibble: 4,532 × 10
##     year studysite Bombus_1st_occurrence Bombus_sylvarum_1st_occurr…¹ Bombus_terrestris_1s…² Salix_caprea_onset_o…³ Scolopax_rusticola_s…⁴ Temperature_transiti…⁵ Tussilago_farfara_on…⁶ Zootoca_vivipara_1st…⁷
##    <dbl> <chr>                     <dbl>                        <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>
##  1  1923 FC_0062                      NA                           NA                     NA                     NA                     NA                     NA                     91                     NA
##  2  1923 FC_0064                      NA                           NA                     NA                     NA                     NA                     NA                    113                     NA
##  3  1924 FC_0053                      NA                           NA                     NA                     NA                     NA                     NA                    120                     NA
##  4  1924 FC_0062                      NA                           NA                     NA                     NA                     NA                     NA                    102                     NA
##  5  1924 FC_0064                      NA                           NA                     NA                     NA                     NA                     NA                    105                     NA
##  6  1924 FC_0124                      NA                           NA                     NA                     NA                     NA                     NA                    107                     NA
##  7  1925 FC_0053                      NA                           NA                     NA                     NA                     NA                     NA                    100                     NA
##  8  1925 FC_0062                      NA                           NA                     NA                     NA                     NA                     NA                    102                     NA
##  9  1925 FC_0064                      NA                           NA                     NA                     NA                     NA                     NA                     95                     NA
## 10  1926 FC_0003                      NA                           NA                     NA                     NA                     NA                     NA                    146                     NA
## # ℹ 4,522 more rows
## # ℹ abbreviated names: ¹Bombus_sylvarum_1st_occurrence, ²Bombus_terrestris_1st_occurrence, ³Salix_caprea_onset_of_blooming, ⁴Scolopax_rusticola_start_of_display_flights, ⁵Temperature_transition_above_0,
## #   ⁶Tussilago_farfara_onset_of_blooming, ⁷Zootoca_vivipara_1st_occurrence

2 Fitting a linear model

Exercise 8.2a

Select a subset of the dataset that contains the Kivach site only (based on the “studysite” column), which was the part of the dataset that you explored in tutorial 6. Fit a linear model of the onset of blooming of Tussilago farfara (column “Tussilago_farfara_onset_of_blooming”) as function of year. Make a scatterplot to show the results, and include a line to show the predicted values using the predictions of a linear model.

Define the linear model with lm, and use data_grid and add_predictions. Like many other functions in R, the lm function uses a formula notation to specify the response and predictor variable(s). For example, the formula notation y ~ x specifies that “y” is a dependent function of “x” (and by default an intercept), thus “y” here is the response variable, and “x” the predictor variable.

dat2 <- dat %>%
  filter(studysite == "Kivach")

mod1 <- lm(Tussilago_farfara_onset_of_blooming ~ year,
           data = dat2)

grid <- dat2 %>% 
  data_grid(year) %>%
  add_predictions(mod1) 

ggplot() +
  geom_point(data = dat2,
             mapping = aes(x = year,
                           y = Tussilago_farfara_onset_of_blooming)) +
  geom_line(data = grid,
            mapping = aes(x = year,
                          y = pred),
            colour = "red",
            size = 1)

3 Fitting multiple linear models

Multiple linear models can be fitted, by first creating a nested data.frame/tibble, and then fitting a model to each element in the list-column containing the data.

Exercise 8.3a

Exclude all rows from the dat tibble where the column “Tussilago_farfara_onset_of_blooming” contains values that are NA.

Then, nest the data by “studysite”, where the nested data is in a column called “data”. Assign the result to a new object called by_site. At how many sites does Tussilago farfara occur?

To filter out all records for which a column contains values that are NA, use the filter function in combination with is.na to exclude records that are NA in a specific column (to exclude these use the ! operator). Alternatively, the function drop_na can be used, that automatically drops records based on NA values in specified column(s).

Create a nested tibble with function nest.

by_site <- dat %>%
  drop_na(Tussilago_farfara_onset_of_blooming) %>%
  nest(.by = studysite)
nrow(by_site)

## [1] 180

Thus, Tussilago farfara (coltsfoot) occurs at 180 sites.

Alternatively, using the drop_na function:

# Equivalently with the filter() and is.na() functions
by_site <- dat %>%
  filter(!is.na(Tussilago_farfara_onset_of_blooming)) %>%
  nest(.by = studysite)

Exercise 8.3b

You will be fitting a linear model with the built-in lm function to each site, thus to each element in the list-column “data” in the object created above. Write your own function to fit a linear model, for example with the name site_model, which takes a data.frame/tibble is input, and returns a fitted linear model, with “year” as predictor and “Tussilago_farfara_onset_of_blooming” as response.

You can use the function function. Check tutorial 7 as well.

site_model <- function(df) {
  y <- lm(Tussilago_farfara_onset_of_blooming ~ year,
          data = df)
  return(y)
}

Exercise 8.3c

Now use the map function from the purrr package to make a new column in your nested tibble with the name model with the model fits per site. Thus, use the function created in the previous exercise to fit the linear model to each site. The new column should thus contain the fitted model object for each site!

You can use mutate to create a new column, in combination with map and the function created above.

by_site <- by_site %>% 
  mutate(model = map(data, .f = site_model))

Like many other operations in R, the above steps can also be done in a different way, e.g. by using group_by and then within the summarise function create list-columns explicitly via the list function, where the entire data.frame of the current group can be accessed via the pick(everything()) function. Thus, the current set of data can be assigned to each element in the list-column ‘data’, and likewise a fitted linear model can be assigned to each element in a list-column ‘model’.

# Equivalently with the pick function
dat %>%
  drop_na(Tussilago_farfara_onset_of_blooming) %>%
  group_by(studysite) %>%
  summarize(data = list(pick(everything())),
            model = list(lm(Tussilago_farfara_onset_of_blooming ~ year,
                            data = pick(everything()))))

## # A tibble: 180 × 3
##    studysite                                          data              model 
##    <chr>                                              <list>            <list>
##  1 Altajskij_Bele                                     <tibble [26 × 9]> <lm>  
##  2 Altajskij_Yailu                                    <tibble [28 × 9]> <lm>  
##  3 Altajskij_Yailu_No2                                <tibble [23 × 9]> <lm>  
##  4 Bashkiriya                                         <tibble [5 × 9]>  <lm>  
##  5 Brjanskij les                                      <tibble [7 × 9]>  <lm>  
##  6 Bryanskiy Les_Plant_Loc-17                         <tibble [7 × 9]>  <lm>  
##  7 Bryanskiy Les_Plant_Loc-4                          <tibble [3 × 9]>  <lm>  
##  8 CarpathianBiosphereReserve_Kuzi-Tribushansky_Masyv <tibble [29 × 9]> <lm>  
##  9 Central'no-Lesnoj                                  <tibble [20 × 9]> <lm>  
## 10 Darvinskij                                         <tibble [45 × 9]> <lm>  
## # ℹ 170 more rows

Exercise 8.3d

Add a new column preds with model predictions to your nested tibble as well, using the map2 function from the purrr package.

You can use mutate to create a new column, and then use map2 within mutate.

by_site <- by_site %>% 
  mutate(
    preds = map2(data, model, .f = add_predictions)
  )

Exercise 8.3e

To be able to plot predictions into a graph we need to unnest the column preds in the nested tibble. Note that the list-column preds contains the original data as stored in the list-column data plus an extra column holding the model predictions (from the fitted objects in the list-column model). Thus, in a single pipeline first remove the data and model columns after which you unnest the preds column (using the unnest function) to obtain the predictions. Save the output to a separate object (e.g. call it preds), and plot the model predictions (y-axis) as function of year (x-axis), for all of the models (study areas) in one graph.

# Unnest the predictions after removing the data and model columns
preds <- by_site %>% 
  select(-data, -model) %>% 
  unnest(preds)
preds

## # A tibble: 3,589 × 11
##    studysite  year Bombus_1st_occurrence Bombus_sylvarum_1st_…¹ Bombus_terrestris_1s…² Salix_caprea_onset_o…³ Scolopax_rusticola_s…⁴ Temperature_transiti…⁵ Tussilago_farfara_on…⁶ Zootoca_vivipara_1st…⁷  pred
##    <chr>     <dbl>                 <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl> <dbl>
##  1 FC_0062    1923                    NA                     NA                     NA                     NA                     NA                     NA                     91                     NA  105.
##  2 FC_0062    1924                    NA                     NA                     NA                     NA                     NA                     NA                    102                     NA  105.
##  3 FC_0062    1925                    NA                     NA                     NA                     NA                     NA                     NA                    102                     NA  104.
##  4 FC_0062    1926                    NA                     NA                     NA                     NA                     NA                     NA                    107                     NA  104.
##  5 FC_0062    1927                    NA                     NA                     NA                     NA                     NA                     NA                     91                     NA  104.
##  6 FC_0062    1928                    NA                     NA                     NA                     NA                     NA                     NA                     98                     NA  104.
##  7 FC_0062    1929                    NA                     NA                     NA                     NA                     NA                     NA                    110                     NA  104.
##  8 FC_0062    1930                    NA                     NA                     NA                     NA                     NA                     NA                     96                     NA  103.
##  9 FC_0062    1931                    NA                     NA                     NA                     NA                     NA                     NA                    106                     NA  103.
## 10 FC_0062    1932                    NA                     NA                     NA                     NA                     NA                     NA                    112                     NA  103.
## # ℹ 3,579 more rows
## # ℹ abbreviated names: ¹Bombus_sylvarum_1st_occurrence, ²Bombus_terrestris_1st_occurrence, ³Salix_caprea_onset_of_blooming, ⁴Scolopax_rusticola_start_of_display_flights, ⁵Temperature_transition_above_0,
## #   ⁶Tussilago_farfara_onset_of_blooming, ⁷Zootoca_vivipara_1st_occurrence

# plot the predictions
preds %>% 
  ggplot(aes(year, pred)) +
  geom_line(aes(group = studysite))

Exercise 8.3f

To assess how much variation in the onset of blooming of Tussilago is explained by year, look at the R² of the models. The glance function from the broom package can be used to calculate model performance metrics for all models at once, and to add them to the nested tibble What is the range in R² across the models for all study sites?

Use mutate(), and glance() from the broom package

R2_values <- by_site %>% 
  mutate(glance = map(model, broom::glance)) %>% 
  unnest(glance)

range(R2_values$r.squared)

## [1] 0 1

mean(R2_values$r.squared)

## [1] 0.1851501

R2_values %>%
  ggplot(aes(y = r.squared)) + 
  geom_boxplot()

4 Challenge

Challenge

In Monday’s exercises, we studied the synchrony between phenological events of pollinators (e.g. the first appearance of bumblebees, Bombus spp.) and plants (e.g. the onset of blooming in goat willows, Salix caprea, and coltsfoot, Tussilago farfara), using long-term data collected across many locations in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan (see Ovaskainen et al. 2013). In Exercise 6.2e we saw that the first occurrence of bumblebees has a higher level of synchrony with the onset of flowering of goat willows than that of coltsfoot. This may suggest that in these areas, the blooming of the goat willow might represent the basic source of food for bumblebees in the first days after their emergence in early spring. Thus, the results from Monday’s analyses might suggest that the appearance of bumblebees has shifted in parallel with the blossoming of goat willow, whereas the blooming of coltsfoot has shifted faster.

In this challenge, let’s thus model the synchrony between the first appearance of bumblebees and the blooming onset of goat willows further:

from the original dataset dat, keep the relevant columns year, studysite, Bombus_1st_occurrence and Salix_caprea_onset_of_blooming;
remove rows that contain NA;
nest the data per studysite;
keep only those study sites with 10 records or more;
for each study site, fit a linear model where Bombus_1st_occurrence is modelled as function of Salix_caprea_onset_of_blooming;
for each fitted model, retrieve the slope of the relationship, as well as the r² value of the regression: for a fitted linear regression model x this can be done with the functions coefficients(x) and summary(x)$r.squared.

After fitting all models and extracting the slopes and r² values, proceed by visualizing the results:

plot the onset of blooming of the goat willows on the x-axis, the first appearance of the bumblebees on the y-axis;
add a linear regression line for each study area;
show 3 facet panels based on the r² values of the regression models for each study site: r² <= 0.2; 0.2 < r² <= 0.5; and r² > 0.5.

# Prepare data and fit models per study site
datmodels <- dat %>%
  # keep only relevant columns
  select(year,
         studysite,
         Bombus_1st_occurrence,
         Salix_caprea_onset_of_blooming) %>% 
  # drop rows on with NAs
  drop_na() %>% 
  # nest data per study site
  nest(.by = studysite) %>% 
  # get nr records per study site
  mutate(n = map_int(data, nrow)) %>% 
  # Keep only study areas with 10 records or more
  filter(n >= 10) %>% 
  mutate(
    # Fit models
    lmfit = map(data,
                .f = function(x){
                  lm(Bombus_1st_occurrence ~ Salix_caprea_onset_of_blooming, data = x)
                }),
    # get beta coefficient for the slope
    slope = map_dbl(lmfit,
                    .f = function(x){
                      coefficients(x)[2]
                    }),
    # get r2 values for the model fits
    r2 = map_dbl(lmfit,
                 .f = function(x){
                   summary(x)$r.squared
                 })
  )

# Explore data
datmodels

## # A tibble: 18 × 6
##    studysite                         data                  n lmfit    slope       r2
##    <chr>                             <list>            <int> <list>   <dbl>    <dbl>
##  1 Nikel (Murmansk)                  <tibble [11 × 3]>    11 <lm>    0.915  0.586   
##  2 Laplandskij                       <tibble [60 × 3]>    60 <lm>    0.593  0.370   
##  3 Pechoro-Ilychskij                 <tibble [62 × 3]>    62 <lm>    0.742  0.441   
##  4 Prioksko-Terrasnyj                <tibble [32 × 3]>    32 <lm>    0.0215 0.000586
##  5 Pinega (Arkhangelsk)              <tibble [19 × 3]>    19 <lm>    0.705  0.439   
##  6 Kirovsk (Murmansk)                <tibble [12 × 3]>    12 <lm>   -0.0632 0.0125  
##  7 Georgiyevskoye (Arkhangelsk)      <tibble [18 × 3]>    18 <lm>    0.874  0.577   
##  8 Kurtsevo (Arkhangelsk)            <tibble [21 × 3]>    21 <lm>    1.37   0.542   
##  9 Kivach                            <tibble [36 × 3]>    36 <lm>    0.829  0.422   
## 10 Manushkino-Nyandoma (Arkhangelsk) <tibble [31 × 3]>    31 <lm>    0.815  0.427   
## 11 Mezen (Arkhangelsk)               <tibble [10 × 3]>    10 <lm>    0.981  0.700   
## 12 Altajskij_Yailu                   <tibble [18 × 3]>    18 <lm>    0.731  0.432   
## 13 Konevo (Arkhangelsk)              <tibble [15 × 3]>    15 <lm>    0.843  0.420   
## 14 Kargopol (Arkhangelsk)            <tibble [11 × 3]>    11 <lm>    0.745  0.321   
## 15 Pinezhskij                        <tibble [33 × 3]>    33 <lm>    0.603  0.627   
## 16 Kostomukshskij                    <tibble [23 × 3]>    23 <lm>    0.277  0.0960  
## 17 Kenozerskij                       <tibble [10 × 3]>    10 <lm>    0.950  0.369   
## 18 Altajskij_Bele                    <tibble [12 × 3]>    12 <lm>    0.150  0.0143

# Plot
datmodels %>% 
  select(studysite, data, slope, r2) %>% 
  unnest(data) %>% 
  mutate(r2_class = 
           case_when(r2 <= 0.2 ~ "a", # or, use function cut()
                     r2 > 0.2 & r2 <= 0.5 ~ "b",
                     r2 > 0.5 ~ "c"),
         r2_class = factor(r2_class,
                           levels = c("a","b","c"),
                           labels = c("r2 <= 0.2", 
                                      "0.2 < r2 <= 0.5", 
                                      "r2 > 0.5"))) %>% 
  ggplot(mapping = aes(x = Salix_caprea_onset_of_blooming, 
                       y = Bombus_1st_occurrence, 
                       col = studysite)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  geom_abline(intercept = 0, slope = 1, # dashed line with 1:1 relationship
              col = "black", linetype = "dashed") + 
  labs(x = "Salix caprea onset of blooming",
       y = "Bombus 1st occurrence",
       col = "Study area") + 
  facet_wrap(~r2_class)

We see that for those study sites that have a strong relationship between the first occurrence of bumblebees and the onset of blooming in willows (i.e. those models where r² > 0.5), these two phenological events are tightly linked and highly synchronous! Indeed, it seems that in many study sites, goat willows may provide the food for the first appearing bumblebees in early spring!

5 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 8.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

6 Recap

Today, we’ve fitted many models to data using the modelr and broom packages, while making clever use of list-columns. Tomorrow, we will enter a new realm, the metapackage tidymodels (itself thus a collection of packages akin to tidyverse).

An R script of today’s exercises can be downloaded here

Modelling data