Overview

Today’s goal: combining and exploring datasets that are part of a relational database.

Resources
Packages

dplyr, magrittr

Main functions

inner_join, left_join, right_join, full_join, semi_join, anti_join

Cheat sheets

data transformation with dplyr

1 Introduction

Often, data is stored in relational databases, where each table contains a few variables, and to be able to analyse the data, the tables have to be combined. The package dplyr includes tools for combining tables with data.

Tools for working with relational databases can be divided into three families:

Mutating joins: these joins add columns from y to x, matching rows based on the keys:

  • inner_join(x,y): includes all rows in x and y;
  • left_join(x,y): includes all rows in x;
  • right_join(x,y): includes all rows in y;
  • full_join(x,y): includes all rows in x or y.

Filtering joins: filter rows from x based on the presence or absence of matches in y:

  • semi_join(x,y): returns all rows from x with a match in y;
  • anti_join(x,y): returns all rows from x without a match in y.

Moreover, there are nest joins which return all rows and columns in x with a new nested-tibble column that contains all matches from y. We will not use them today, as nesting of data.frames/tibbles will only be covered next week.

We will explore a global dataset of grain legume yield (Cernay et al. 2016), consisting of grain legume yield data published in scientific articles. Data on grain yield, crop biomass, environmental conditions, and management practices from various studies across the globe have been collected. Because the design of all individual studies was different, there is a lot of missing data in the dataset, and many variables are included. Today, we will use a subset of the dataset that includes yield data for chickpea, fababean, garden pea, lentil, peanut, and soybean. The data is stored in multiple csv files: “Crop.csv”, “Site.csv”, “Fertilization.csv” and “Irrigation.csv”. To be able to explore the data, datasets need to be combined.

Exercise 5.1a

Download the data from Brightspace > Skills > Datasets > Legumes. Remember to save the data files in an appropriate sub-folder of your raw data folder (for example: “data/raw/legumes/”).

Then, load the “tidyverse” package, set the working directory (to the root folder of your project), and import the 4 csv data files into tibbles (thus use the read_csv function and not read.csv), respectively called crop, site, fertilization and irrigation.

Then, filter the crop tibble to only keep the six crop species “Chickpea”, “Fababean”, “Garden pea”, “Lentil”, “Peanut”, and “Soybean” (see the column “Crop_Species_Common_Name”). Print the tibbles to the console to check them.

# Load libraries
library(tidyverse)

# Set working directory (e.g. via Session menu button, or via setwd() function)

# Load data
crop <- read_csv("data/raw/legumes/Crop.csv")
site <- read_csv("data/raw/legumes/Site.csv")
fertilization <- read_csv("data/raw/legumes/Fertilization.csv")
irrigation <- read_csv("data/raw/legumes/Irrigation.csv")

# Filter crop to only the 6 species:
crop <- crop %>%
  filter(Crop_Species_Common_Name %in% c("Chickpea","Fababean",
                                         "Garden pea","Lentil",
                                         "Peanut","Soybean"))
crop
## # A tibble: 3,415 × 14
##    IDCrop IDRotation Crop_Species_Scientific_Name Crop_Species_Common_Name Crop_Species_Legume Crop_Date_Seeding Crop_Date_Harvest
##     <dbl>      <dbl> <chr>                        <chr>                                  <dbl> <chr>             <chr>            
##  1    853        263 Cicer arietinum              Chickpea                                   1 NA Dec. 2000      NA NA 2001       
##  2    854        264 Vicia faba                   Fababean                                   1 NA Dec. 2000      NA NA 2001       
##  3    857        265 Lens culinaris               Lentil                                     1 NA Dec. 2000      NA NA 2001       
##  4    858        266 Pisum sativum                Garden pea                                 1 NA Dec. 2000      NA NA 2001       
##  5    859       4749 Cicer arietinum              Chickpea                                   1 NA Dec. 2001      NA NA 2002       
##  6    860       4750 Vicia faba                   Fababean                                   1 NA Dec. 2001      NA NA 2002       
##  7    861       4751 Lens culinaris               Lentil                                     1 NA Dec. 2001      NA NA 2002       
##  8    862       4752 Pisum sativum                Garden pea                                 1 NA Dec. 2001      NA NA 2002       
##  9    864       4757 Cicer arietinum              Chickpea                                   1 NA Dec. 2004      NA NA 2005       
## 10    865       4758 Vicia faba                   Fababean                                   1 NA Dec. 2004      NA NA 2005       
## # ℹ 3,405 more rows
## # ℹ 7 more variables: Crop_Date_From_Seeding_To_Harvest_Day_Number <dbl>, Crop_Yield_Grain <dbl>, Crop_Yield_Grain_Unit <chr>,
## #   Crop_Biomass_Aerial <dbl>, Crop_Biomass_Aerial_Unit <chr>, Crop_Harvest_Index <lgl>, IDSite <dbl>

2 Mutating and filtering joins

Exercise 5.2a
Combine the crop and the site tibbles in such a way that you keep all the rows in the crop tibble. Do first look at both tibbles to see what variable you can use as key to combine both tibbles.

You can use left_join()


left_join(crop, site, by = "IDSite")
## # A tibble: 3,415 × 20
##    IDCrop IDRotation Crop_Species_Scientific_Name Crop_Species_Common_Name Crop_Species_Legume Crop_Date_Seeding Crop_Date_Harvest
##     <dbl>      <dbl> <chr>                        <chr>                                  <dbl> <chr>             <chr>            
##  1    853        263 Cicer arietinum              Chickpea                                   1 NA Dec. 2000      NA NA 2001       
##  2    854        264 Vicia faba                   Fababean                                   1 NA Dec. 2000      NA NA 2001       
##  3    857        265 Lens culinaris               Lentil                                     1 NA Dec. 2000      NA NA 2001       
##  4    858        266 Pisum sativum                Garden pea                                 1 NA Dec. 2000      NA NA 2001       
##  5    859       4749 Cicer arietinum              Chickpea                                   1 NA Dec. 2001      NA NA 2002       
##  6    860       4750 Vicia faba                   Fababean                                   1 NA Dec. 2001      NA NA 2002       
##  7    861       4751 Lens culinaris               Lentil                                     1 NA Dec. 2001      NA NA 2002       
##  8    862       4752 Pisum sativum                Garden pea                                 1 NA Dec. 2001      NA NA 2002       
##  9    864       4757 Cicer arietinum              Chickpea                                   1 NA Dec. 2004      NA NA 2005       
## 10    865       4758 Vicia faba                   Fababean                                   1 NA Dec. 2004      NA NA 2005       
## # ℹ 3,405 more rows
## # ℹ 13 more variables: Crop_Date_From_Seeding_To_Harvest_Day_Number <dbl>, Crop_Yield_Grain <dbl>, Crop_Yield_Grain_Unit <chr>,
## #   Crop_Biomass_Aerial <dbl>, Crop_Biomass_Aerial_Unit <chr>, Crop_Harvest_Index <lgl>, IDSite <dbl>, Site_Name <chr>, Site_Country <chr>,
## #   Site_Latitude <dbl>, Site_Longitude <dbl>, Site_Precipitation_mm <dbl>, Site_Precipitation_Period <chr>

Or, using the pipe:

crop %>%
  left_join(site, by = "IDSite")
Exercise 5.2b
Make another combination of the crop and the site tibble, but now keep all rows from both datasets (thus, keep rows that occur in both datasets, or rows that only occur in one of them). Are there sites in the dataset at which none of the six selected crops is cultivated?

You can use the function full_join(). If there are sites in the dataset at which none of the six selected crops is cultivates, it means that the number of rows in the dataset resulting from a full join is larger than the number of rows in crop.


full_join(crop, site, by = "IDSite")
## # A tibble: 3,438 × 20
##    IDCrop IDRotation Crop_Species_Scientific_Name Crop_Species_Common_Name Crop_Species_Legume Crop_Date_Seeding Crop_Date_Harvest
##     <dbl>      <dbl> <chr>                        <chr>                                  <dbl> <chr>             <chr>            
##  1    853        263 Cicer arietinum              Chickpea                                   1 NA Dec. 2000      NA NA 2001       
##  2    854        264 Vicia faba                   Fababean                                   1 NA Dec. 2000      NA NA 2001       
##  3    857        265 Lens culinaris               Lentil                                     1 NA Dec. 2000      NA NA 2001       
##  4    858        266 Pisum sativum                Garden pea                                 1 NA Dec. 2000      NA NA 2001       
##  5    859       4749 Cicer arietinum              Chickpea                                   1 NA Dec. 2001      NA NA 2002       
##  6    860       4750 Vicia faba                   Fababean                                   1 NA Dec. 2001      NA NA 2002       
##  7    861       4751 Lens culinaris               Lentil                                     1 NA Dec. 2001      NA NA 2002       
##  8    862       4752 Pisum sativum                Garden pea                                 1 NA Dec. 2001      NA NA 2002       
##  9    864       4757 Cicer arietinum              Chickpea                                   1 NA Dec. 2004      NA NA 2005       
## 10    865       4758 Vicia faba                   Fababean                                   1 NA Dec. 2004      NA NA 2005       
## # ℹ 3,428 more rows
## # ℹ 13 more variables: Crop_Date_From_Seeding_To_Harvest_Day_Number <dbl>, Crop_Yield_Grain <dbl>, Crop_Yield_Grain_Unit <chr>,
## #   Crop_Biomass_Aerial <dbl>, Crop_Biomass_Aerial_Unit <chr>, Crop_Harvest_Index <lgl>, IDSite <dbl>, Site_Name <chr>, Site_Country <chr>,
## #   Site_Latitude <dbl>, Site_Longitude <dbl>, Site_Precipitation_mm <dbl>, Site_Precipitation_Period <chr>

The dataset resulting from a full join is indeed larger than the one where only rows in the crop tibble are retained.

Exercise 5.2c
Check how many sites (the column “IDsite”) in the site tibble are not included for the crops that you selected (thus in the crop tibble). Use a function that performs a filtering join.

You can use anti_join()


anti_join(site, crop, by = "IDSite")
## # A tibble: 23 × 7
##    IDSite Site_Name Site_Country Site_Latitude Site_Longitude Site_Precipitation_mm Site_Precipitation_Period
##     <dbl> <chr>     <chr>                <dbl>          <dbl>                 <dbl> <chr>                    
##  1    138 Tezpur    India                26.1            92.5                   315 Growing season           
##  2    150 Jodhpur   India                26.2            73.0                   196 Growing season           
##  3    179 Morogoro  Tanzania             -6.83           37.7                   545 Growing season           
##  4    296 Kiboko    Kenya                -2.1            37.4                   475 Growing season           
##  5    383 El Bab    Syria                36.1            37.2                   379 Growing season           
##  6    384 El Bab    Syria                36.1            37.2                   303 Growing season           
##  7    385 Khanneser Syria                35.9            37.5                   215 Growing season           
##  8    386 Khanneser Syria                35.9            37.5                   137 Growing season           
##  9    387 Tel Hadya Syria                35.6            36.6                   329 Growing season           
## 10    388 Tel Hadya Syria                35.6            36.6                   365 Growing season           
## # ℹ 13 more rows

There are 23 sites that are not included in the crop tibble.

Exercise 5.2d
Use the combined dataset from exercise 5.2a, and add the irrigation tibble in (based on the “IDCrop” key). Keep all rows in the combined dataset from exercise 5.2a. Remember that you can use the pipe (%>%) operator if you need to do multiple adjustments on a dataset! Assign the result to an object called crop2.

crop2 <- crop %>%
  left_join(site, by = "IDSite") %>%
  left_join(irrigation, by = "IDCrop")
Exercise 5.2e

Check whether the number of rows in crop2 is the same as the number of rows of crop. Namely, this gives an indication whether the join was performed correctly: if their number of rows is identical, the join was a correct one-to-many left-join. However, if their number of rows is not identical, it could be because the data that is being joined (here thus site and/or irrigation) does not have unique keys (i.e. some keys occur more than once). Therefore, it may be important to check whether the key(s) of each table used for joining contains unique values.

Therefore, check whether column “IDCrop” in the irrigation tibble has unique values, and if not exclude the non-unique values from the tibble before joining the datasets. Assign the result to an object called crop2.

Use count() and select only the counts that are greater than one.


nrow(crop)
## [1] 3415
nrow(crop2)
## [1] 3416

Indeed: the number of rows are not identical. Thus in one (or both) of the joined tables there probably are non-unique keys. Thus, let’s check in which table these non-unique keys are:

nrow(crop %>% left_join(site, by = "IDSite")) # 3415 thus this is okay
## [1] 3415
nrow(crop %>% left_join(irrigation, by = "IDCrop")) # 3416 thus irrigation has a duplicate key!
## [1] 3416

Therefore: let’s filter out the non-unique key from irrigation, by first counting the number of records for each key, filtering the duplicates, and using the anti_join to remove them from the irrigation dataset:

count_ir <- irrigation %>%
  count(IDCrop) %>%
  filter(n > 1)

irrigation2 <- anti_join(irrigation, count_ir, by = "IDCrop")
crop2 <- crop %>%
           left_join(site, by = "IDSite") %>%
           left_join(irrigation2, by = "IDCrop")
nrow(crop)
## [1] 3415
nrow(crop2)
## [1] 3415

Now, crop and crop2 have the same number of rows!

Note that the same can be accomplished in different ways! For example, the above can be written in a single pipe, e.g.:

crop2 <- crop %>%
  left_join(site, by = "IDSite") %>%
  left_join(anti_join(irrigation, 
                      irrigation %>%
                        count(IDCrop) %>%
                        filter(n > 1),
                      by = "IDCrop"),
            by = "IDCrop")

3 Grain yield and crop biomass

Now, the dataset is nearly ready to start exploring patterns. However, most of the sites are not irrigated, or information is not available (column “Irrigation_Presence_Irrigation”), thus make sure to include only those in the dataset, and exclude sites that are irrigated. For example, by adding a filter statement like this:

crop2 %>%
  filter(is.na(Irrigation_Presence_Irrigation) |
           Irrigation_Presence_Irrigation == 0)

Here, crop2 is the output of the previous exercise, and the function is.na() returns TRUE when values are NA, thus not available. The | operator is the “OR” operator: it checks whether the left hand side evaluates to TRUE, OR whether the right hand side evaluates to TRUE.

Exercise 5.3a
Make a scatterplot of grain yield (column “Crop_Yield_Grain”) against rainfall over the growing season (column “Site_Precipitation_mm”), add a straight line (linear model) in the graph, and divide the crop species over panels. Is crop grain yield related to rainfall?

crop2 %>%
  filter(is.na(Irrigation_Presence_Irrigation) |
           Irrigation_Presence_Irrigation == 0) %>% 
ggplot(mapping = aes(x = Site_Precipitation_mm,
                     y = Crop_Yield_Grain)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~ Crop_Species_Common_Name)

Exercise 5.3b
Make a similar graph for crop biomass (column “Crop_Biomass_Aerial”). Which of the yield variables seems to be more strongly related to rainfall?

crop2 %>%
  filter(is.na(Irrigation_Presence_Irrigation) |
           Irrigation_Presence_Irrigation == 0) %>% 
  ggplot(mapping = aes(x = Site_Precipitation_mm,
                       y = Crop_Biomass_Aerial)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~ Crop_Species_Common_Name)

4 Challenge

Challenge
If time allows, explore the grain yield dataset further. Add in the fertilization tibble as well, and assess whether you find an effect of fertilizer addition. Look at the “Fertilization_NPK” and “Fertilization_NPK_Dose” columns. Exclude rows for which a value for the dose of N (nitrogen), P (phosphorus), or K (potassium) is missing. Then, choose the crop species and fertilizer treatment (the “Fertilization_NPK” column) combination for which most data points are available. Make a graph to assess how grain yield responds to the fertilizer dose. Does the result align with what you would expect?

crop3 <- crop2 %>%
  left_join(distinct(fertilization), by="IDCrop") %>% # distinct to get rid of duplicates
  filter(!is.na(Fertilization_NPK_Dose))

crop3 %>%
  group_by(Crop_Species_Common_Name, Fertilization_NPK) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  slice_max(Count, n=6)
## # A tibble: 6 × 3
##   Crop_Species_Common_Name Fertilization_NPK Count
##   <chr>                    <chr>             <int>
## 1 Garden pea               P                   844
## 2 Garden pea               N                   382
## 3 Lentil                   P                   318
## 4 Chickpea                 P                   308
## 5 Soybean                  P                   305
## 6 Soybean                  N                   302
crop3 %>%
  filter(Crop_Species_Common_Name == "Garden pea",
         Fertilization_NPK == "P") %>%
  ggplot(mapping = aes(x = Fertilization_NPK_Dose,
                       y = Crop_Yield_Grain)) +
  geom_point() +
  geom_smooth(method="lm") +
  theme_classic() + 
  labs(title = "Relationship between yield and fertilizer dose",
       x = "P fertilizer dose",
       y = "Garden pea yield")

5 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 5.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

6 Recap

Today, we’ve joined tables from a relational database. This week, we have been using data that was already tidy. Next week, we will start with exploring methods to get untidy data into a tidy form, using the tidyr package.

An R script of today’s exercises can be downloaded here

7 Further reading