Use the function unite
in the tidyr package to
paste columns together. Use the str_replace
or
str_replace_all
function (form the stringr
package, part of core tidyverse) to replace specific patterns from a
string with a replacement pattern.
Today’s goal: to be able to transform datasets into a tidy format using the tidyr package
pivot_longer, pivot_wider, separate, unite, complete.
In this tutorial, we will practice putting data into a tidy format using the package tidyr, and also use some handy data cleaning and preparation tools from the stringr package.
Remember, properties of a tidy dataset are:
To practice data tidying, we will use the data from the Chronicles of Nature dataset, as made publicly available in Ovaskainen et al. 2020. We will be using this extensive, large-scale, long-term and multi-taxon database on phenological and climatic variation to reproduce some figures shown in Ovaskainen et al. 2013 about the community-level phenological response of several species to climate change.
The full dataset consist of more than half a million records from 471 localities in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan, in which nature reserves recorded the timing of specified climatic and phenological events. The data go back to as early as 1890, but most data is post 1960. We will look at some trends in phenological events for a single site, Kivach, during 40 years.
dat
. # Prelims
library(tidyverse)
# setwd()
# Load data
dat <- read_csv("data/raw/phenology/chroniclesofnature.csv")
The data are not yet in tidy format. Here, we are going to tidy the data by transforming it so that each row/record corresponds to a year in which data is gathered, each column corresponds to a feature (i.e., a climatic or phenological event), and each value in the feature columns correspond to the timing of the event in terms of day-of-year. The data consists of the following columns:
_
. Also
replace the spaces in the newly created column by a dot, .
.
Use the function unite
in the tidyr package to
paste columns together. Use the str_replace
or
str_replace_all
function (form the stringr
package, part of core tidyverse) to replace specific patterns from a
string with a replacement pattern.
dat <- dat %>%
unite(col = "event", taxon:event, sep = "_") %>%
mutate(event = str_replace_all(event,
pattern=" ",
replacement="."))
dat
## # A tibble: 246 × 3
## event year dayofyear
## <chr> <dbl> <dbl>
## 1 Temperature_transition.above.0 1966 112
## 2 Temperature_transition.above.0 1967 86
## 3 Temperature_transition.above.0 1968 82
## 4 Temperature_transition.above.0 1969 99
## 5 Temperature_transition.above.0 1970 92
## 6 Temperature_transition.above.0 1971 121
## 7 Temperature_transition.above.0 1972 98
## 8 Temperature_transition.above.0 1973 83
## 9 Temperature_transition.above.0 1974 117
## 10 Temperature_transition.above.0 1975 64
## # ℹ 236 more rows
dat
so that these years are also
included, with value NA
. First, to check for a sorted series of unique values, you can make
use of the sort
and unique
functions.
Functions like summary
, range
,
min
and max
can be used to retrieve basic
summaries of numeric data (here e.g. start and end year of sampling). To
get values that are not present in a vector you could make use of the
!
and %in%
operators, especially in
combination with square bracket indexing, e.g. the following code checks
which values in x
do not occur in y
:
x <- 1:6
y <- c(1, 2, 4, 5)
x[! x %in% y]
## [1] 3 6
Then, you can use the complete
function to complete a
tibble with missing combinations of data.
sort(unique(dat$year))
## [1] 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1994 1995
## [28] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
range(dat$year) # 1966 - 2013
## [1] 1966 2013
allyears = seq.int(from = 1966, to = 2013)
allyears[! allyears %in% dat$year]
## [1] 1991 1992 1993
Thus, the years 1991:1993 are missing: this was the time of the Soviet breakup. Now, we can compete the dataset using all years:
dat <- dat %>%
complete(year = allyears, event)
dat %>%
filter(year %in% c(1991:1993))
## # A tibble: 18 × 3
## year event dayofyear
## <dbl> <chr> <dbl>
## 1 1991 Bombus_1st.occurrence NA
## 2 1991 Salix.caprea_onset.of.blooming NA
## 3 1991 Scolopax.rusticola_start.of.display.flights NA
## 4 1991 Temperature_transition.above.0 NA
## 5 1991 Tussilago.farfara_onset.of.blooming NA
## 6 1991 Zootoca.vivipara_1st.occurrence NA
## 7 1992 Bombus_1st.occurrence NA
## 8 1992 Salix.caprea_onset.of.blooming NA
## 9 1992 Scolopax.rusticola_start.of.display.flights NA
## 10 1992 Temperature_transition.above.0 NA
## 11 1992 Tussilago.farfara_onset.of.blooming NA
## 12 1992 Zootoca.vivipara_1st.occurrence NA
## 13 1993 Bombus_1st.occurrence NA
## 14 1993 Salix.caprea_onset.of.blooming NA
## 15 1993 Scolopax.rusticola_start.of.display.flights NA
## 16 1993 Temperature_transition.above.0 NA
## 17 1993 Tussilago.farfara_onset.of.blooming NA
## 18 1993 Zootoca.vivipara_1st.occurrence NA
dat
contains.
Transform the tibble into wider format, by transforming each distinct
value in the event column to a new column, filled with values
found in dayofyear. Assign the result to a new object called
dat2
. Check the number of rows and columns in this tibble.
Use the pivot_wider
function.
nrow(dat); ncol(dat)
## [1] 288
## [1] 3
dat2 <- dat %>%
pivot_wider(names_from = event,
values_from = dayofyear)
nrow(dat2); ncol(dat2)
## [1] 48
## [1] 7
The tibble dat
has 288 rows and 3 columns (thus: a
‘long’ format), while the tibble dat2
has 48 rows and 7
columns (thus a ‘wide’ format - hence the function name
pivot_wider
).
dat2 %>%
ggplot() +
geom_point(aes(x=year,
y=Temperature_transition.above.0),
col="black") +
geom_smooth(aes(x=year,
y=Temperature_transition.above.0),
method="lm", se=FALSE, col="black") +
geom_point(aes(x=year,
y=Scolopax.rusticola_start.of.display.flights),
col="blue") +
geom_smooth(aes(x=year,
y=Scolopax.rusticola_start.of.display.flights),
method="lm", se=FALSE, col="blue") +
geom_point(aes(x=year,
y=Zootoca.vivipara_1st.occurrence),
col="cyan") +
geom_smooth(aes(x=year,
y=Zootoca.vivipara_1st.occurrence),
method="lm", se=FALSE, col="cyan") +
xlab("Year") +
ylab("Day of year") +
ggtitle("Phenological shift")
dat2 %>%
ggplot() +
geom_point(aes(x=year,
y=Bombus_1st.occurrence),
col="black") +
geom_smooth(aes(x=year,
y=Bombus_1st.occurrence),
method="lm", se=FALSE, col="black") +
geom_point(aes(x=year,
y=Salix.caprea_onset.of.blooming),
col="cyan") +
geom_smooth(aes(x=year,
y=Salix.caprea_onset.of.blooming),
method="lm", se=FALSE, col="cyan") +
geom_point(aes(x=year,
y=Tussilago.farfara_onset.of.blooming), col="blue") +
geom_smooth(aes(x=year,
y=Tussilago.farfara_onset.of.blooming),
method="lm", se=FALSE, col="blue") +
xlab("Year") +
ylab("Day of year") +
ggtitle("Phenological shift")
dat2 %>%
ggplot(aes(x=Temperature_transition.above.0)) +
geom_point(aes(y=Salix.caprea_onset.of.blooming),
col="red") +
geom_smooth(aes(y=Salix.caprea_onset.of.blooming),
method="lm", se=FALSE, col="red") +
geom_point(aes(y=Tussilago.farfara_onset.of.blooming),
col="blue") +
geom_smooth(aes(y=Tussilago.farfara_onset.of.blooming),
method="lm", se=FALSE, col="blue") +
xlab("Temperature transition above 0") +
ylab("Day of blooming") +
ggtitle("Phenological shift")
dat3
. Assign the names to the new column
feature
, and the values to new column with name
dayofyear
. How many rows do you have and how does this
compare to the original data? Use the pivot_longer
function.
dat3 <- dat2 %>%
pivot_longer(2:7,
names_to = "feature",
values_to = "dayofyear",
values_drop_na = FALSE)
nrow(dat3)
## [1] 288
nrow(dat)
## [1] 288
Now we have 288 rows, and this is the same as the number of rows in
the original tibble dat
.
Use the separate
function from the tidyr package. Note
that if there are strings with multiple underscore symbols, you can
instruct the separate
function how to deal with this
situation; see the function documentation, especially the function
argument extra
.
Use the str_replace
or str_replace_all
function to replace specific patterns from a string with a replacement
pattern. Note that the pattern
to be replaced can be a
regular expression, see the function help file; using
pattern = "."
might not work, in which case you could use
`pattern = “[.]” or one of the other methods.
dat3 <- dat3 %>%
separate(col = "feature",
into = c("taxon","event"),
sep = '_',
extra = "merge") %>%
mutate(taxon = str_replace_all(taxon,
pattern = "[.]",
replacement = " "),
event = str_replace_all(event,
pattern = "[.]",
replacement = " "))
dat3
## # A tibble: 288 × 4
## year taxon event dayofyear
## <dbl> <chr> <chr> <dbl>
## 1 1966 Bombus 1st occurrence 135
## 2 1966 Salix caprea onset of blooming NA
## 3 1966 Scolopax rusticola start of display flights 124
## 4 1966 Temperature transition above 0 112
## 5 1966 Tussilago farfara onset of blooming 134
## 6 1966 Zootoca vivipara 1st occurrence NA
## 7 1967 Bombus 1st occurrence 122
## 8 1967 Salix caprea onset of blooming NA
## 9 1967 Scolopax rusticola start of display flights NA
## 10 1967 Temperature transition above 0 86
## # ℹ 278 more rows
In this challenge, you are going to load and tidy an online dataset on COVID cases across countries. We can access the John Hopkins university global dataset of numbers of confirmed COVID cases directly from github: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv. Load this dataset into R using the appropriate tidyverse functions.
Then, remove the columns “Province/State”, “Lat” and “Long”, and rename the column “Country/Region” to “country”. Check the number of rows and columns that the dataset contains. In this dataset, each date has its own column, thus this dataset is not tidy yet. Therefore, make it a tidy dataset, where each record is a row, each feature a column. After tidying, how many rows and columns does the dataset have?The function matches
can find column names that match a
specific expression, where “\d” is the expression for a any numeric
character. Note that because this pattern contains a back-slash, it
needs to be ‘escaped’. In R, you can escape a special character by
putting a back-slack in front of it, here thus “\\d”.
# load data
confirmedraw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
# Inspect
confirmedraw
## # A tibble: 289 × 1,147
## `Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20` `1/25/20` `1/26/20` `1/27/20` `1/28/20` `1/29/20` `1/30/20`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> Afghanistan 33.9 67.7 0 0 0 0 0 0 0 0 0
## 2 <NA> Albania 41.2 20.2 0 0 0 0 0 0 0 0 0
## 3 <NA> Algeria 28.0 1.66 0 0 0 0 0 0 0 0 0
## 4 <NA> Andorra 42.5 1.52 0 0 0 0 0 0 0 0 0
## 5 <NA> Angola -11.2 17.9 0 0 0 0 0 0 0 0 0
## 6 <NA> Antarctica -71.9 23.3 0 0 0 0 0 0 0 0 0
## 7 <NA> Antigua and Bar… 17.1 -61.8 0 0 0 0 0 0 0 0 0
## 8 <NA> Argentina -38.4 -63.6 0 0 0 0 0 0 0 0 0
## 9 <NA> Armenia 40.1 45.0 0 0 0 0 0 0 0 0 0
## 10 Australian Capit… Australia -35.5 149. 0 0 0 0 0 0 0 0 0
## # ℹ 279 more rows
## # ℹ 1,134 more variables: `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>,
## # `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>,
## # `2/15/20` <dbl>, `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, `2/21/20` <dbl>, `2/22/20` <dbl>,
## # `2/23/20` <dbl>, `2/24/20` <dbl>, `2/25/20` <dbl>, `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>, `2/29/20` <dbl>, `3/1/20` <dbl>,
## # `3/2/20` <dbl>, `3/3/20` <dbl>, `3/4/20` <dbl>, `3/5/20` <dbl>, `3/6/20` <dbl>, `3/7/20` <dbl>, `3/8/20` <dbl>, `3/9/20` <dbl>,
## # `3/10/20` <dbl>, `3/11/20` <dbl>, `3/12/20` <dbl>, `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>, `3/16/20` <dbl>, …
# tidy
confirmed <- confirmedraw %>%
select(-"Province/State", -Lat, -Long) %>%
rename(country = "Country/Region") %>%
pivot_longer(cols = matches("\\d/."),
names_to = "date",
values_to = "confirmed")
# inspect
confirmed
## # A tibble: 330,327 × 3
## country date confirmed
## <chr> <chr> <dbl>
## 1 Afghanistan 1/22/20 0
## 2 Afghanistan 1/23/20 0
## 3 Afghanistan 1/24/20 0
## 4 Afghanistan 1/25/20 0
## 5 Afghanistan 1/26/20 0
## 6 Afghanistan 1/27/20 0
## 7 Afghanistan 1/28/20 0
## 8 Afghanistan 1/29/20 0
## 9 Afghanistan 1/30/20 0
## 10 Afghanistan 1/31/20 0
## # ℹ 330,317 more rows
dim(confirmedraw)
## [1] 289 1147
dim(confirmed)
## [1] 330327 3
Optionally, we can now plot it:
library(lubridate) # tidyverse package to work with date/time data
confirmed_dense <- confirmed %>%
mutate(confirmed = as.integer(confirmed),
date = mdy(date)) %>%
filter(confirmed > 0) %>%
group_by(country,date) %>%
summarize(total = sum(confirmed)) %>%
arrange(country,date, total)
confirmed_dense %>%
ggplot(aes(x=date, y=total, colour=country)) +
geom_line() +
theme(legend.position = "none") +
labs(title = "Coronavirus trajectories for 192 countries ",
subtitle = "Cumulative number of cases according to the John Hopkins database",
x = "Date",
y = NULL) +
scale_y_log10(name = "Confirmed cases",
limits = c(1, 100000000),
breaks = c(1, 10, 100, 1000, 10000, 100000,1000000,10000000),
labels = c("1", "10", "100", "1,000", "10,000", "100,000","1,000,000", "10,000,000"),
sec.axis = dup_axis()) +
annotation_logticks(sides = "lr") +
theme_bw() +
theme(plot.background = element_rect(fill = rgb(255,241,230,max=255)),
panel.background = element_rect(fill = rgb(255,241,230,max=255)),
legend.position="none",
panel.grid.major = element_line(colour = "grey"))
Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 6.
Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.
Today, we’ve explored methods to get untidy data into a tidy form, using the tidyr package. Tomorrow, we will explore a set of tools to efficiently work with character strings, timestamps, write our own functions, and explore methods for iteration.
An R script of today’s exercises can be downloaded here