Overview

Today’s goal: getting used to working with time series data in a tidy manner via the tsibble package, by preparing and exploring time series data.

Resources
Packages

tidyverts, which is a meta-package (like tidyverse is), focussed on time series data, with packages tsibble, feasts, fable and tsibbletalk.

Main functions

as_tsibble

1 Introduction

The last two weeks, we worked towards, and with, tidy data: a convenient format for the processing, visualisation and modelling of data. In all previous exercises, we assumed that all records in the tibble were independent: we could thus effectively shuffle the rows in the tables without affecting the subsequent visualisations and analyses.

This week, we will work with data along a temporal dimension: time series data. The temporal dimension of such data often comes in different formats, which complicates the analyses. For example:

  • time can have various resolutions (e.g. years, months, days, hours, minutes, seconds, …);
  • time can be associated with different time zones, with possible adjustments such as daylight saving time (see the short example after this list);
  • time intervals can be regular or irregular;
  • time series data often includes repeated measures on multiple individuals.
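
In R, the lubridate package handles many of these time zone subtleties. As a short illustration (a minimal sketch, not part of the tutorial data), the same instant is displayed differently depending on the time zone, including daylight saving time adjustments:

library(lubridate)

# one instant, two representations: GMT has no daylight saving time,
# whereas Europe/Amsterdam is on CEST (GMT+2) in late April
t_gmt <- ymd_hms("2018-04-21 09:00:00", tz = "GMT")
with_tz(t_gmt, tzone = "Europe/Amsterdam")
## [1] "2018-04-21 11:00:00 CEST"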

In today’s tutorial, we will be working with time series data that includes some of the above-mentioned challenges. For the sake of simplicity, we will make use of regular time series, with timestamps recorded in the same time zone and without daylight saving time (all timestamps will be in GMT). We will be using data from several cows, collected via neck collars with sensors (GPS and a tri-axial accelerometer) during an experiment on cow movement and behaviour in relation to pasture height in Wageningen.

Exercise 11.1a
Download the data “acc.csv” and “gps.csv” from Brightspace > Time series analyses > Datasets > Cattle. Save the data files in your raw data folder (for example: data/raw/cattle/). Import the data into tibbles called acc and gps, respectively. Load the “tidyverse” and “lubridate” packages. Have a look at the loaded data.

gps <- read_csv("data/raw/cattle/gps.csv")
acc <- read_csv("data/raw/cattle/acc.csv")
gps
## # A tibble: 2,191 × 6
##       ID burst     t label         x        y
##    <dbl> <dbl> <dbl> <chr>     <dbl>    <dbl>
##  1     1     1   300 idle    682170. 5762774.
##  2     1     1   303 grazing 682169. 5762773.
##  3     1     1   306 grazing 682169. 5762773.
##  4     1     1   309 grazing 682169. 5762773.
##  5     1     1   312 grazing 682169. 5762773.
##  6     1     1   315 grazing 682169. 5762773.
##  7     1     1   318 grazing 682169. 5762773.
##  8     1     1   321 grazing 682169. 5762773.
##  9     1     1   324 grazing 682169. 5762773.
## 10     1     1   327 grazing 682168. 5762773.
## # ℹ 2,181 more rows
acc
## # A tibble: 209,054 × 6
##       ID burst     t     x     y     z
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1     1  299    490   489   628
##  2     1     1  299.   490   489   627
##  3     1     1  299.   491   488   626
##  4     1     1  299.   489   491   627
##  5     1     1  299.   488   493   632
##  6     1     1  299.   495   496   630
##  7     1     1  299.   500   493   630
##  8     1     1  299.   505   494   629
##  9     1     1  299.   516   491   627
## 10     1     1  299.   511   488   632
## # ℹ 209,044 more rows

Both tibbles contain the columns ID, burst and t:

  • ID is the cow identifier;
  • burst is an identifier nested within ID for burst: a consecutive uninterrupted stretch of data;
  • t is the timestamp, measured in seconds from 09:00 21/4/2018 GMT.

Tibble gps contains 3 more columns:

  • label is a character vector with the annotated behaviour of the cow at any moment in time (annotated through visual observation in the field), with the behaviours grazing, walking and idle. The original data include more behavioural classes, but these labels were reclassified into these three broad classes.
  • x and y are the spatial coordinates, in meters, in the projected coordinate system UTM31N.

Tibble acc also contains 3 more columns:

  • x, y, z are the acceleration forces (not gravity corrected) in 3 dimensions.

The accelerometer data were sampled at 32 Hz.
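
If you want to check this sampling rate yourself, a quick sanity check (a sketch; it assumes that t is in seconds and that the sampling is regular) is to count the number of records per whole second:

acc %>%
  mutate(sec = floor(t)) %>%
  count(ID, burst, sec) %>%
  count(n)

Most seconds should contain (close to) 32 records.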

2 Prepare data

Exercise 11.2a
For gps: use column t and the information given above to compute a new feature called dttm that is an R datetime object, and convert the column label to a factor (with the levels “grazing”,“walking” and “idle”).

You can use the functions parse_date_time (see tutorial 7) and factor.


# save the reference time to object t0, which can be used here as well as below
t0 <- parse_date_time("2018-4-21 9:0:00",
                orders="%Y-%m-%d %H:%M:%S",
                tz = "GMT")

# mutate data in gps object
gps <- gps %>%
  mutate(dttm = t0 + t,
         label = factor(label, levels = c("grazing","walking","idle")))
Exercise 11.2b
For acc: use column t and the information given above to compute a new feature called dttm that is an R datetime object, similar to how you did it in the previous exercise.

acc <- acc %>%
  mutate(dttm = t0 + t)
Exercise 11.2c
Explore the data, e.g. by plotting the x, y and z values in 1 plot against time for ID == 1 and t < 600.

acc %>%
  filter(ID == 1, t < 600) %>%
  ggplot(aes(x=dttm)) +
  geom_line(aes(y=x), col="red") + 
  geom_line(aes(y=y), col="green") + 
  geom_line(aes(y=z), col="blue") + 
  xlab("time") + 
  ylab("accelerometer value")

As can be seen in the plot of the previous exercise, data (e.g. accelerometer data) can often contain outliers. It is therefore often wise to check for outliers, and to replace extreme values. For example, we could use an outlier removal algorithm, or do it ourselves by replacing values that are lower than the 0.1% quantile (or higher than the 99.9% quantile) with those quantile values. The code below first computes (per accelerometer axis) the 0.1% and 99.9% quantile values, then replaces values respectively lower or higher than these quantiles with the quantiles themselves, and finally removes the helper columns holding these quantiles:

acc <- acc %>%
  mutate(minx = quantile(x, probs=0.001),
         maxx = quantile(x, probs=0.999),
         x = pmax(x, minx),
         x = pmin(x, maxx),
         
         miny = quantile(y, probs=0.001),
         maxy = quantile(y, probs=0.999),
         y = pmax(y, miny),
         y = pmin(y, maxy),
         
         minz = quantile(z, probs=0.001),
         maxz = quantile(z, probs=0.999),
         z = pmax(z, minz),
         z = pmin(z, maxz)) %>%
  
  select(-starts_with("min")) %>%
  select(-starts_with("max"))

In the code above, we use the pmax and pmin functions to return the parallel maxima and minima of the input values, in order to truncate the signals to their 0.1–99.9% quantile range, using the quantile function, which produces sample quantiles corresponding to the given probabilities.
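
The same truncation can be written more compactly with the across function from dplyr, applying a single helper to all three axes at once. Below is a sketch of this equivalent alternative (clamp is a hypothetical helper name; run it instead of, not in addition to, the code above):

# truncate a vector to its 0.1% and 99.9% quantiles
clamp <- function(v) {
  pmin(pmax(v, quantile(v, probs = 0.001)),
       quantile(v, probs = 0.999))
}

acc <- acc %>%
  mutate(across(c(x, y, z), clamp))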

Besides using ggplot2, time series data can often be conveniently plotted with a dynamic plotting library such as dygraphs, plotly, or the package tsibbletalk, which allows interactive plotting of tsibbles.

For example, we could use dygraphs to plot the x, y and z axes of the accelerometer as a function of time (here via an xts time series object, but we could also plot it from a data.frame or tibble), as done statically above using ggplot:

library(dygraphs)
library(xts)
plotData <- acc %>%
  filter(ID == 1, t < 600) %>%
  select(dttm, x:z)
series <- xts(plotData %>% select(-dttm), order.by = plotData$dttm, tzone = "GMT", check = FALSE)
dygraph(series, main = "Accelerometer values") %>%
  dyRangeSelector() %>% 
  dyOptions(useDataTimezone = TRUE) # so that the plot does not use the local timezone
                                    # but rather the timezone of the used data
                                    # defaults to FALSE, so that the local (pc) timezone is shown


Here, we add a range selector to the bottom of the chart that allows us to pan and zoom to various time ranges (we can also select a range in the main plot, and zoom out to the full temporal range by double clicking).

As the acc data (x, y and z) are not gravity-corrected, it may be good to first center the data (i.e. subtract the mean) and then collapse the 3 dimensions into a single dimension by calculating the vectorial sum of the dynamic body acceleration (\(VeDBA\)):

\[VeDBA = \sqrt{x^2+y^2+z^2}\]

Exercise 11.2d
For the acc data: first center the x, y and z columns, and then calculate \(VeDBA\) as a new column VeDBA. Then, rename the columns x, y and z to xacc, yacc and zacc to avoid confusion later on with x and y from gps.

To center: subtract the mean from the vector of data. To compute the square root, use the function sqrt.


acc <- acc %>%
  mutate(x = x - mean(x, na.rm=TRUE),
         y = y - mean(y, na.rm=TRUE),
         z = z - mean(z, na.rm=TRUE),
         VeDBA = sqrt(x^2+y^2+z^2)) %>%
  rename(xacc = x,
         yacc = y,
         zacc = z)

3 Summarise data

Because the gps and acc data are recorded at different resolutions (respectively at \(\frac{1}{3}\)Hz and 32Hz), we first have to align both time series. We are going to do that by computing summary statistics of the high-resolution time series (acc) at the resolution of the low-resolution time series (gps). Thus, we are going to compute several single-value summary statistics of the 32Hz signal for every step of 3 seconds.

Exercise 11.3a
Truncate the column dttm in acc to values every 3 seconds and assign the result to a new column called dttm3sec.

To truncate date-time values, you can use the floor_date function in the lubridate package.


acc <- acc %>%
  mutate(dttm3sec = lubridate::floor_date(dttm, "3 secs"))
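
To see what this does: floor_date truncates each timestamp down to the start of its 3-second bin (bins are aligned to multiples of 3 seconds within the minute). A minimal check using the reference time t0 defined earlier:

lubridate::floor_date(t0 + 4, "3 secs")
## [1] "2018-04-21 09:00:03 GMT"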

After calculating the new column dttm3sec, we can continue by calculating summary statistics per unique combination of ID and dttm3sec.

Exercise 11.3b
Summarise the acc data per unique combination of ID and dttm3sec (use the group_by and summarise functions from dplyr). For the columns xacc, yacc, zacc, and VeDBA, compute both the mean and the standard deviation. Give short but meaningful names to the newly created features, for example by adding a suffix _sd or _mean (e.g. the standard deviation of xacc could become xacc_sd). Afterwards, rename dttm3sec to dttm.

acc <- acc %>%
  group_by(ID,dttm3sec) %>%
  summarise(xacc_mean = mean(xacc),
            xacc_sd = sd(xacc),
            yacc_mean = mean(yacc),
            yacc_sd = sd(yacc),
            zacc_mean = mean(zacc),
            zacc_sd = sd(zacc),
            VeDBA_mean = mean(VeDBA),
            VeDBA_sd = sd(VeDBA)) %>%
  rename(dttm = dttm3sec) %>%
  ungroup()
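
These eight summary statistics can also be computed more compactly using across with a named list of functions. A sketch of this equivalent alternative (run it instead of, not after, the code above, since that code overwrites acc; the default naming scheme of across already yields names such as xacc_mean and xacc_sd):

acc %>%
  group_by(ID, dttm3sec) %>%
  summarise(across(c(xacc, yacc, zacc, VeDBA),
                   list(mean = mean, sd = sd)),
            .groups = "drop") %>%
  rename(dttm = dttm3sec)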

Now that we’ve made sure that both time series have the same resolution, we can move on to aligning them via a join. Let’s use the tools that we learned about in tutorial 10 to join the data.

Exercise 11.3c
Join the data of acc to the data of gps, and assign the result to an object called dat.

Consider the use of left_join to join the data in acc to that in gps, by columns ID and dttm.


dat <- gps %>%
  left_join(acc, by=c("ID","dttm"))
Exercise 11.3d
Use the data from the previous exercise to calculate the step length, step, and turning angle, angle (from the columns x and y from the GPS sensor).

Step distances can be computed using Pythagoras’ theorem: \(distance = \sqrt{dx^2+dy^2}\), where \(dx\) and \(dy\) are the displacements in the \(x\) and \(y\) directions, respectively.

To compute these displacements there are a few options. The function difference from the tsibble package can be used to compute lagged differences between subsequent records (here thus dx from x and dy from y). Alternatively, we can use the base R function diff, but since this function returns \(n-1\) differences for a vector with \(n\) records, we will have to pad the differences with an NA value either at the start or at the end of the resultant vector. Finally, we can use the lag or lead functions to find, respectively, the “previous” or “next” values in a vector (thus, in that case dx = lead(x) - x returns the displacement from the current point to the next point along the x-axis direction).
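
The difference in padding is easiest to see with a toy vector; all three approaches below yield the same differences, but pad (or omit) the NA differently:

v <- c(0, 1, 3, 6)
diff(v)                 # 1 2 3: length n-1, so pad manually, e.g. c(diff(v), NA)
dplyr::lead(v) - v      # 1 2 3 NA: difference with the next value, length n
tsibble::difference(v)  # NA 1 2 3: lagged difference with the previous value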

The sqrt function can be used to compute the square root needed in Pythagoras’ theorem, and to raise some value to some power we can use the notation ^, e.g. 3^2 means the value 3 raised to the power 2.

To compute turning angles, first calculate movement headings, head, from dx and dy using the atan2(dx, dy) function. Then, from the headings, calculate the turning angles (angle) between subsequent headings using the function tangle that is shown in Appendix D (thus load it into memory before using it, or source it in your script using source("https://wec.wur.nl/dse/-scripts/tangle.r")). Calculate step and angle for each time series, thus group by ID and burst!


# source turning angle function (tangle)
source("https://wec.wur.nl/dse/-scripts/tangle.r")

# compute movement metrics
dat <- dat %>%
  group_by(ID, burst) %>%
  mutate(dx = lead(x) - x,
         dy = lead(y) - y,
         step = sqrt(dx^2 + dy^2),
         head = atan2(dx, dy),
         angle = tangle(head)) %>%
  ungroup()
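
The tangle function itself is provided in Appendix D. Purely as an illustration of what such a function computes, a hypothetical minimal version could look like the sketch below (the actual implementation may differ, e.g. in how it handles missing values):

# turning angle: change in heading between subsequent steps,
# wrapped to the interval [-pi, pi)
tangle_sketch <- function(head) {
  dhead <- head - dplyr::lag(head)  # difference with the previous heading
  ((dhead + pi) %% (2 * pi)) - pi   # wrap to [-pi, pi)
}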

4 Tsibble: tidy time series data

The tsibble package extends the tidyverse to temporal data, and builds on top of the tibble. The tsibble package is part of the tidyverts metapackage, a sort of tidyverse for time series data and analyses. For a short introduction to the tsibble package, see here.
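
As a first taste, a minimal toy example (not the tutorial data) of constructing a tsibble from a tibble, with one key column and a daily index:

library(tsibble)

tibble(day = as.Date("2018-04-21") + 0:2,
       id = "a",
       value = c(1, 3, 2)) %>%
  as_tsibble(key = id, index = day)
## # A tsibble: 3 x 3 [1D]
## # Key:       id [1]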

Exercise 11.4a
Convert the object dat into a tsibble called datTS, using columns ID and burst as key, and using column dttm as index. Check the output of datTS by printing it to the console. What do you see?

Use the function as_tsibble.


datTS <- dat %>%
  select(ID, dttm, everything()) %>% # to reorder columns
  as_tsibble(key = c(ID, burst),
             index = dttm)
datTS
## # A tsibble: 2,191 x 20 [3s] <GMT>
## # Key:       ID, burst [16]
##       ID dttm                burst     t label         x        y xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd
##    <dbl> <dttm>              <dbl> <dbl> <fct>     <dbl>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>      <dbl>    <dbl>
##  1     1 2018-04-21 09:05:00     1   300 idle    682170. 5762774.     -2.47    9.98     -11.8    26.3     -4.26    18.6       30.8     18.6
##  2     1 2018-04-21 09:05:03     1   303 grazing 682169. 5762773.     -6.06   13.5      -34.7    21.9    -13.0     19.8       46.2     17.8
##  3     1 2018-04-21 09:05:06     1   306 grazing 682169. 5762773.     -1.91   18.8      -30.4    27.7     -7.80    22.5       46.9     20.2
##  4     1 2018-04-21 09:05:09     1   309 grazing 682169. 5762773.     -7.59   14.7      -23.5    25.3     -3.15    21.5       40.5     16.9
##  5     1 2018-04-21 09:05:12     1   312 grazing 682169. 5762773.     -8.88   13.3      -18.8    12.6     -4.31    11.8       28.5     10.5
##  6     1 2018-04-21 09:05:15     1   315 grazing 682169. 5762773.     -1.75   22.4      -19.6    23.6     -5.09    21.1       40.3     16.7
##  7     1 2018-04-21 09:05:18     1   318 grazing 682169. 5762773.     -4.60   16.8      -23.2    22.8     -1.83    15.0       35.8     17.5
##  8     1 2018-04-21 09:05:21     1   321 grazing 682169. 5762773.     -6.55   15.1      -15.2    21.4     -2.14    16.2       30.8     16.6
##  9     1 2018-04-21 09:05:24     1   324 grazing 682169. 5762773.     -6.04   16.6      -27.0    28.1     -9.91    24.3       45.6     20.6
## 10     1 2018-04-21 09:05:27     1   327 grazing 682168. 5762773.     -7.10   13.4      -23.5    28.3     -6.14    21.4       41.1     19.3
## # ℹ 2,181 more rows
## # ℹ 5 more variables: dx <dbl>, dy <dbl>, step <dbl>, head <dbl>, angle <dbl>

The printed output of datTS is not that different from the printed output of a tibble. It shows a bit more information that is relevant to the time series data object: it shows that it is a time series with timestamps in the GMT timezone and with intervals of [3s]. Moreover, it shows that the key of the data is the combination of the columns ID and burst, and that there are [16] unique combinations of the keys. You can check the key and the index of the object using the functions key and index:

tsibble::key(datTS)
## [[1]]
## ID
## 
## [[2]]
## burst
tsibble::index(datTS)
## dttm
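
Related accessors exist as well, for example (assuming a recent version of tsibble):

tsibble::index_var(datTS)  # index column name, as a character string
tsibble::key_vars(datTS)   # key column names, as a character vector
tsibble::interval(datTS)   # the (3-second) interval of the series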

5 Challenge

Challenge
Explore the data stored in datTS, for example by showing the patterns in the different features as a function of the behavioural class indicated in label. Try to make some nice-looking, informative graphs. For example, you can try to plot some autocorrelation measure for the different behaviours (the autocorrelation function acf can be handy here, or, perhaps better, the partial autocorrelation function pacf).

dat %>%
  ggplot() +
  geom_boxplot(aes(x=label, y=step)) + 
  scale_y_sqrt()

dat %>%
  ggplot() +
  geom_boxplot(aes(x=label, y=angle)) + 
  scale_y_sqrt()

dat %>%
  ggplot() +
  geom_boxplot(aes(x=label, y=VeDBA_sd)) + 
  scale_y_sqrt()

dat %>%
  ggplot() +
  geom_boxplot(aes(x=label, y=yacc_sd)) + 
  scale_y_sqrt()

dat %>%
  as_tibble() %>%
  group_by(ID,burst,label) %>%
  summarise(acf = acf(na.omit(step), lag.max=1, plot=FALSE)$acf[2]) %>%
  ungroup() %>%
  ggplot() +
  geom_boxplot(aes(x=label, y=acf)) + 
  scale_y_sqrt()
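
Analogously, a sketch using the partial autocorrelation function pacf (its $acf[1] element holds the lag-1 partial autocorrelation):

dat %>%
  as_tibble() %>%
  group_by(ID, burst, label) %>%
  summarise(pacf1 = pacf(na.omit(step), lag.max = 1, plot = FALSE)$acf[1],
            .groups = "drop") %>%
  ggplot() +
  geom_boxplot(aes(x = label, y = pacf1))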

6 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 11.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

7 Recap

Today, we’ve explored the tsibble package to work with time series data. Tomorrow, we will use this data with unsupervised learning algorithms.

An R script of today’s exercises can be downloaded here

8 Further reading