You can use the functions parse_date_time
(see tutorial
7) and factor
.
Today’s goal: to get used to working with time series data in a tidy manner via the tsibble package; preparing and exploring time series data.
tidyverts, which is a meta-package (like tidyverse is), focussed on time series data, with packages tsibble, feasts, fable and tsibbletalk.
The last two weeks, we worked towards, and with, tidy data: a convenient for the processing, visualisation and modelling of data. In all previous exercises, we assumed that all records in the tibble were independent: we could thus effectively shuffle the rows in the tables without impacting the subsequent visualisations and analyses done.
This week, we will work with data along a temporal dimension: time series data. The temporal dimension of such data is often in different formats, which complicates the analyses. For example1:
In today’s tutorial, we will be working with time series data including some of the above-mentioned challenges For sake of simplicity, we will make use of regular time series, and with timestamps recorded in the same timezone and without daylight saving time (all timestamps will be in GMT). We will be using data of several cows collected via neck collars with sensors (GPS and tri-axial accelerometer) during an experiment on cow movement and behaviour in relation to pasture height in Wageningen.
acc
and gps
, respectively. Load the
“tidyverse” and “lubridate” packages. Have a look at the loaded data.
gps <- read_csv("data/raw/cattle/gps.csv")
acc <- read_csv("data/raw/cattle/acc.csv")
gps
## # A tibble: 2,191 × 6
## ID burst t label x y
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 1 300 idle 682170. 5762774.
## 2 1 1 303 grazing 682169. 5762773.
## 3 1 1 306 grazing 682169. 5762773.
## 4 1 1 309 grazing 682169. 5762773.
## 5 1 1 312 grazing 682169. 5762773.
## 6 1 1 315 grazing 682169. 5762773.
## 7 1 1 318 grazing 682169. 5762773.
## 8 1 1 321 grazing 682169. 5762773.
## 9 1 1 324 grazing 682169. 5762773.
## 10 1 1 327 grazing 682168. 5762773.
## # ℹ 2,181 more rows
acc
## # A tibble: 209,054 × 6
## ID burst t x y z
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 299 490 489 628
## 2 1 1 299. 490 489 627
## 3 1 1 299. 491 488 626
## 4 1 1 299. 489 491 627
## 5 1 1 299. 488 493 632
## 6 1 1 299. 495 496 630
## 7 1 1 299. 500 493 630
## 8 1 1 299. 505 494 629
## 9 1 1 299. 516 491 627
## 10 1 1 299. 511 488 632
## # ℹ 209,044 more rows
Both tibbles contain the columns ID
, burst
and t
:
ID
is the cow identifier;burst
is an identifier nested within ID
for burst: a consecutive uninterrupted stretch of data;t
is the timestamp, measured in seconds from 09:00
21/4/2018 GMT.Tibble gps
contains 3 more columns:
label
a character vector with the annotated behaviour
of the cow at any moment in time (annotated through visual observation
in the field), with the behaviours grazing, walking
and idle. The original data includes more behavioural classes,
but these labels were reclassified to these three broad classes.x
and y
are the spatial coordinates, in
units meters in the projected coordinate system UTM31N.Tibble acc
also contains 3 more columns:
x
, y
, z
are the acceleration
forces (not gravity corrected) in 3 dimensions.The accelerometer data were sampled at 32 Hz.
gps
: use column t
and the information
given above to compute a new feature called dttm
that is an
R datetime object, and convert the column label
to a factor
(with the levels “grazing”,“walking” and “idle”). You can use the functions parse_date_time
(see tutorial
7) and factor
.
# save the reference time to object t0, which can be used here as well as below
t0 <- parse_date_time("2018-4-21 9:0:00",
orders="%Y-%m-%d %H:%M:%S",
tz = "GMT")
# mutate data in gps object
gps <- gps %>%
mutate(dttm = t0 + t,
label = factor(label, levels = c("grazing","walking","idle")))
acc
: use column t
and the information
given above to compute a new feature called dttm
that is an
R datetime object, similar to how you did it in the previous exercise.
acc <- acc %>%
mutate(dttm = t0 + t)
x
, y
and z
values in 1 plot against time for
ID == 1
and t < 600
. acc %>%
filter(ID == 1, t < 600) %>%
ggplot(aes(x=dttm)) +
geom_line(aes(y=x), col="red") +
geom_line(aes(y=y), col="green") +
geom_line(aes(y=z), col="blue") +
xlab("time") +
ylab("accelerometer value")
As can be seen in the plot of the previous exercise, data (e.g. accelerometer data) can often contain outliers. It may therefore often be wise to check for outliers, and to replace extreme values. For example, we could use an outlier removal algorithm, or do it ourselves by replacing values that are higher then (lower than) some threshold value, e.g. the 0.1% quantile (or 99.9% quantile). The code below computes (per accelerometer axis) first the 0.1% and 99.9% quantile values, than replaces values respectively lower or higher than these quantiles with these quantiles, and finally removes the computed columns with these quantiles:
acc <- acc %>%
mutate(minx = quantile(x, probs=0.001),
maxx = quantile(x, probs=0.999),
x = pmax(x, minx),
x = pmin(x, maxx),
miny = quantile(y, probs=0.001),
maxy = quantile(y, probs=0.999),
y = pmax(y, miny),
y = pmin(y, maxy),
minz = quantile(z, probs=0.001),
maxz = quantile(z, probs=0.999),
z = pmax(z, minz),
z = pmin(z, maxz)) %>%
select(-starts_with("min")) %>%
select(-starts_with("max"))
In the code above, we use the pmax
function to return
the parallel maxima and minima of the input values in order to
truncate the signals to their 0.1 - 99.9% range, using the
quantile
function that produces sample quantiles
corresponding to the given probabilities.
Besides using ggplot2, time series data is often conveniently plotted by an dynamic plotting library such as dygraphs, plotly or the package tsibbletalk that allows dynamical plotting of tsibbles.
For example, we could use dygraphs to plot the x, y and z axes of the accelerometer as function of time (here via a xts time series object, but we could also plot it from a data.frame or tibble), like done above statically using ggplot:
library(dygraphs)
library(xts)
plotData = acc %>%
filter(ID == 1, t < 600) %>%
select(dttm, x:z)
series <- xts(plotData %>% select(-dttm), order.by = plotData$dttm, tzone = "GMT", check = FALSE)
dygraph(series, main = "Accelerometer values") %>%
dyRangeSelector() %>%
dyOptions(useDataTimezone = TRUE) # so that the plot does not use the local timezone
# but rather the timezone of the used data
# defaults to FALSE, so that the local (pc) timezone is shown
Here, we add a range selector to the bottom of the chart that
allows us to pan and zoom to various time ranges (we can also select a
range in the main plot, and zoom out to the full temporal range by
double clicking).
As the acc
data (x
, y
and
z
) are not gravity-corrected, it may be good to first
center the data (i.e. subtract the mean) and then collapse the 3
dimensions into a single dimension by calculating the vectorial sum of
the dynamic body acceleration (\(VeDBA\)):
\[VeDBA = \sqrt{x^2+y^2+z^2}\]
acc
data: first center the x
,
y
and z
columns, and then calculate \(VeDBA\) as a new column VeDBA
.
The, rename the columns x
, y
and
z
to xacc
, yacc
and
zacc
to avoid confusion later on with x
and
y
from gps
To center: remove the mean from the vector of data. To compute the
square root, use the function sqrt
.
acc <- acc %>%
mutate(x = x - mean(x, na.rm=TRUE),
y = y - mean(y, na.rm=TRUE),
z = z - mean(z, na.rm=TRUE),
VeDBA = sqrt(x^2+y^2+z^2)) %>%
rename(xacc = x,
yacc = y,
zacc = z)
Because the gps
and acc
data are recorded
at different resolutions (respectively at \(\frac{1}{3}\)Hz and 32Hz), we first have to
align both time series. We are going to do that by computing summary
statistics of the high-resolution time series (acc
) at the
resolution of the low-resolution time series (gps
). Thus,
we are going to compute several single-value summary statistics of the
32Hz signal for every step of 3 seconds.
dttm
in acc
to values
every 3 seconds and assign the result to a new column called
dttm3sec
To truncate date-time values, you can use the floor_date
function in the lubridate package.
acc <- acc %>%
mutate(dttm3sec = lubridate::floor_date(dttm, "3 secs"))
After calculating the new column dttm3sec
, we can
continue by calculating summary statistics per unique combination of
ID
and dttm3sec
.
acc
data per unique combination of
ID
and dttm3sec
(use the group_by
and summarise
functions from dplyr). For the columns
xacc
, yacc
, zacc
, and
VeDBA
, compute both the mean and the standard deviation.
Give short but meaningful names to the newly created features, for
example by adding a suffix _sd or _mean (e.g. the
standard deviation of xacc
could become
xacc_sd
). Afterwards, rename dttm3sec
to
dttm
. acc <- acc %>%
group_by(ID,dttm3sec) %>%
summarise(xacc_mean = mean(xacc),
xacc_sd = sd(xacc),
yacc_mean = mean(yacc),
yacc_sd = sd(yacc),
zacc_mean = mean(zacc),
zacc_sd = sd(zacc),
VeDBA_mean = mean(VeDBA),
VeDBA_sd = sd(VeDBA)) %>%
rename(dttm = dttm3sec) %>%
ungroup()
Now that we’ve made sure that both time series have the same resolution, we can move on to aligned them join. Let’s use the tools that we learned about in tutorial 10 to join the data.
acc
to the data of gps
, and
assign the result to an object called dat
Consider the use of left_join
to join the data in
acc
to that in gps
, by columns ID
and dttm
.
dat <- gps %>%
left_join(acc, by=c("ID","dttm"))
step
, and turning angle, angle
(from the
columns x
and y
from the GPS sensor). Step distances can be computed using Pythagoras’ theorem: \(distance = \sqrt{dx^2+dy^2}\), where \(dx\) and \(dy\) are the displacements in the \(x\) and \(y\) directions, respectively.
To compute these displacements there are a few options. The function
difference
from the tsibble package can be used to compute
lagged differences between subsequent records (here thus dx
from x
and dy
from y
).
Alternatively, we can also use the base R function diff
,
but since this function returns \(n-1\)
differences for a vector with \(n\)
records, we will have to pad the differences with a NA
value either at the start or at the end of the resultant vector.
Alternatively, we can use the lag
or lead
functions to find, respectively, the “previous” or “next” values in a
vector (thus, in that case dx = lead(x) - x
returns the
distance of the current point to the next point along the x-axis
direction).
The sqrt
function can be used to compute the square root
needed in Pythagoras’ theorem, and to raise some value to some power we
can use the notation ^
, e.g. 3^2
means the
value 3 raised to the power 2.
To compute turning angles, first calculate movement headings,
head
, which can be calculated from dx
and
dy
using the atan2(dx,dy)
function. Then, from
the headings calculate the turning angles (angle
) between
subsequent headings using the function tangle
that is shown
in Appendix D (thus load it in memory before using it, or source it in
your script using
source("https://wec.wur.nl/dse/-scripts/tangle.r")
).
Calculate the step
and angle
for each time
series, thus group by ID
and burst
!
# source turning angle function (tangle)
source("https://wec.wur.nl/dse/-scripts/tangle.r")
# compute movement metrics
dat <- dat %>%
group_by(ID, burst) %>%
mutate(dx = lead(x) - x,
dy = lead(y) - y,
step = sqrt(dx^2 + dy^2),
head = atan2(dx, dy),
angle = tangle(head)) %>%
ungroup()
The tsibble package extends the tidyverse to temporal data, and builds on top of the tibble. The tsibble package is part of the tidyverts metapackage, a sort of tidyverse for time series data and analyses. For a short introduction to the tsibble package, see here.
dat
into a tsibble called
datTS
, using columns ID
and burst
as key, and using column dttm
as index.
Check the output of datTS
by printing it to the console.
What do you see? Use the function as_tsibble
.
datTS <- dat %>%
select(ID, dttm, everything()) %>% # to reorder columns
as_tsibble(key = c(ID, burst),
index = dttm)
datTS
## # A tsibble: 2,191 x 20 [3s] <GMT>
## # Key: ID, burst [16]
## ID dttm burst t label x y xacc_mean xacc_sd yacc_mean yacc_sd zacc_mean zacc_sd VeDBA_mean VeDBA_sd
## <dbl> <dttm> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2018-04-21 09:05:00 1 300 idle 682170. 5762774. -2.47 9.98 -11.8 26.3 -4.26 18.6 30.8 18.6
## 2 1 2018-04-21 09:05:03 1 303 grazing 682169. 5762773. -6.06 13.5 -34.7 21.9 -13.0 19.8 46.2 17.8
## 3 1 2018-04-21 09:05:06 1 306 grazing 682169. 5762773. -1.91 18.8 -30.4 27.7 -7.80 22.5 46.9 20.2
## 4 1 2018-04-21 09:05:09 1 309 grazing 682169. 5762773. -7.59 14.7 -23.5 25.3 -3.15 21.5 40.5 16.9
## 5 1 2018-04-21 09:05:12 1 312 grazing 682169. 5762773. -8.88 13.3 -18.8 12.6 -4.31 11.8 28.5 10.5
## 6 1 2018-04-21 09:05:15 1 315 grazing 682169. 5762773. -1.75 22.4 -19.6 23.6 -5.09 21.1 40.3 16.7
## 7 1 2018-04-21 09:05:18 1 318 grazing 682169. 5762773. -4.60 16.8 -23.2 22.8 -1.83 15.0 35.8 17.5
## 8 1 2018-04-21 09:05:21 1 321 grazing 682169. 5762773. -6.55 15.1 -15.2 21.4 -2.14 16.2 30.8 16.6
## 9 1 2018-04-21 09:05:24 1 324 grazing 682169. 5762773. -6.04 16.6 -27.0 28.1 -9.91 24.3 45.6 20.6
## 10 1 2018-04-21 09:05:27 1 327 grazing 682168. 5762773. -7.10 13.4 -23.5 28.3 -6.14 21.4 41.1 19.3
## # ℹ 2,181 more rows
## # ℹ 5 more variables: dx <dbl>, dy <dbl>, step <dbl>, head <dbl>, angle <dbl>
The printed output of datTS
is not that different from
the printed output of a tibble. It shows a bit more information that is
relevant to the time series data object: it shows that it is a time
series with timestamps in ID
and burst
, as well as that there
are [16] unique combinations of the keys. You can check the key and the
index of the object using the functions key
and
index
:
tsibble::key(datTS)
## [[1]]
## ID
##
## [[2]]
## burst
tsibble::index(datTS)
## dttm
datTS
, for example by showing
the patterns in the different features as function of the behavioural
class as indicated in label
. Try to make some nice-looking
informative graphs. For example, you can try to plot some
autocorrelation measure for the different behaviours (the
autocorrelation function acf
can be handy here, or perhaps
better, the partial autocorrelation function pacf
).
dat %>%
ggplot() +
geom_boxplot(aes(x=label, y=step)) +
scale_y_sqrt()
dat %>%
ggplot() +
geom_boxplot(aes(x=label, y=angle)) +
scale_y_sqrt()
dat %>%
ggplot() +
geom_boxplot(aes(x=label, y=VeDBA_sd)) +
scale_y_sqrt()
dat %>%
ggplot() +
geom_boxplot(aes(x=label, y=yacc_sd)) +
scale_y_sqrt()
dat %>%
as_tibble() %>%
group_by(ID,burst,label) %>%
summarise(acf = acf(na.omit(step), lag.max=1, plot=FALSE)$acf[2]) %>%
ungroup() %>%
ggplot() +
geom_boxplot(aes(x=label, y=acf)) +
scale_y_sqrt()
Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 11.
Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.
Today, we’ve explored the tsibble packages to work with time series data. Tomorrow, we will use this data to with unsupervised learning algorithms.
An R script of today’s exercises can be downloaded here