Overview

Today’s goal: to do some wrangling with text, to create datetime objects, to write your own function, and to iterate this function over a column of data.

Resources

R4DS: chapters 14 Strings, 16 Dates and times, 19 Functions, 21 Iteration

Packages

stringr, lubridate, purrr. Stringr and purrr are loaded when you load tidyverse. Lubridate is also a tidyverse package, but in tidyverse versions before 2.0.0 it is not in the core set of packages, so load it separately (after loading tidyverse). Loading it explicitly does no harm in newer versions.

Main functions

parse_date_time, str_sub, str_c, function(), map, map_dbl, leaflet

Cheat sheets

stringr, lubridate, purrr, leaflet

1 Introduction

In this tutorial, we are going to practice some more skills related to working with text strings in R, datetime objects, writing our own functions, and iteration. We are going to work with a dataset of a GPS-tracked animal, but only at the end of the tutorial will we figure out which species this animal belongs to.

Exercise 7.1a

First, start a new script, load the tidyverse and lubridate libraries (install first if needed).

Then, download the file somespecies.csv from Brightspace > Skills > Datasets > Somespecies, store it on your computer, and load it into a tibble called dat.

library(tidyverse)
library(lubridate)
dat <- read_csv("data/raw/somespecies/somespecies.csv")

The loaded data contains data of a single individual, with 585 records and just 2 columns:

  • timestamp: the UNIX timestamp of the observation
  • payload: the payload of the sensor data.

The first rows of the data are shown here:

## # A tibble: 585 × 2
##     timestamp payload           
##         <dbl> <chr>             
##  1 1151064476 017202f979afdb4cfb
##  2 1151066276 016e02f9b55fdb508b
##  3 1151068076 016302f9bd1fdb512e
##  4 1151069876 015f02f9cc5fdb50c6
##  5 1151071675 015002f9c98fdb508e
##  6 1151073475 014102f9cc9fdb5065
##  7 1151075275 012b02f9ccbfdb509b
##  8 1151077075 011902f9b40fdb51e7
##  9 1151078875 010602f9a6afdb5360
## 10 1151080675 00f702f9a2ffdb53be
## # ℹ 575 more rows

The timestamp contains the number of seconds that have elapsed since the start of 1970 (in GMT), and the column payload contains all the data that the sensor collected at each timepoint. This website defines payload as:

Payload of a specific packet or other protocol data unit (PDU) is the actual transmitted data sent by communicating endpoints

The payload here is stored in hexadecimal representation. A hexadecimal representation of data is written using the symbols 0-9 and A-F (either lower-case or upper-case): 16 symbols in total (compared to 10 for decimal and 2 for binary; hexadecimal is thus base-16). For example, the first payload in the data loaded in dat is “017202f979afdb4cfb”.
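As a small illustration of base-16 notation, each hexadecimal digit is weighted by a power of 16. The sketch below uses the base R function strtoi (introduced later in this tutorial) on the first 4 characters of the payload shown above; the variable names are only for illustration.

```r
# Each hex digit is weighted by a power of 16, read from right to left:
# "0172" = 0 * 16^3 + 1 * 16^2 + 7 * 16^1 + 2 * 16^0
by_hand <- 0 * 16^3 + 1 * 16^2 + 7 * 16^1 + 2 * 16^0

# strtoi() with base = 16L performs the same conversion
by_strtoi <- strtoi("0172", base = 16L)

by_hand    # 370
by_strtoi  # 370

# Lower-case and upper-case digits are equivalent
strtoi(c("ff", "FF"), base = 16L)  # 255 255
```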

In this practical, we will convert the raw data from the sensor to information, by converting the timestamp to a datetime object that we can easily work with, and by converting the payload to 3 values: the ambient temperature and the animal’s longitude and latitude. When you have converted the data into these pieces of information and plotted them on a map, you will be able to tell which species this data comes from! Namely, the individual whose data we use was near a river that is named after the species. Thus, your task is to decode the payload, plot, and solve the mystery!

2 Datetimes with lubridate

Datetime objects are often stored in a UNIX timestamp format: a number that represents the number of seconds that have passed since midnight on 1 January 1970, GMT. Because of this, it is actually very straightforward to calculate with datetime objects. The package lubridate is a helpful package for working with date-time data in R. It is part of the tidyverse, but in tidyverse versions before 2.0.0 it is not loaded automatically when you load the tidyverse package, so load the library explicitly before using its functions (and install it once).

Lubridate has several user-friendly date-time parsing functions that convert date-time data into a UNIX-style datetime format (dttm). For example, the function parse_date_time can be used to parse a character string into a datetime object:

parse_date_time("1970-1-1 0:0:00",
                orders = "%Y-%m-%d %H:%M:%S",
                tz = "GMT")

Here, the first argument specifies a date and time as a character string. The argument orders specifies the format in which the character string denotes the date and time elements (e.g., %Y means the 4-digit year, %m means the month as a value 1-12, etc.; see the specification in the documentation of the strptime function of the base package, via ?base::strptime). The tz argument specifies the timezone, here GMT. Internally, R stores these datetime objects in an object of class POSIXct:

parse_date_time("1970-1-1 0:0:00",
                orders = "%Y-%m-%d %H:%M:%S",
                tz = "GMT") %>%
  class()
## [1] "POSIXct" "POSIXt"

Since R treats time as Unix time, we can thus calculate with time: for example, we can simply compute the time difference between two instances in time, here t1 and t2, using t2 - t1, or, we can add 10 seconds to an instance in time, here t1, using t1 + 10!
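The arithmetic described above can be sketched as follows. The two instants t1 and t2 are arbitrary example values chosen for illustration, not values taken from the dataset.

```r
library(lubridate)

# Two arbitrary example instants, 30 minutes apart
t1 <- parse_date_time("2006-06-23 12:00:00",
                      orders = "%Y-%m-%d %H:%M:%S",
                      tz = "GMT")
t2 <- parse_date_time("2006-06-23 12:30:00",
                      orders = "%Y-%m-%d %H:%M:%S",
                      tz = "GMT")

# Subtracting two datetimes gives a time difference (class difftime)
t2 - t1

# Adding a number shifts the datetime by that many seconds
t1 + 10

# as.numeric() reveals the underlying number of seconds since 1970-01-01 GMT
as.numeric(t1)
```

Because the stored value is just a count of seconds, adding a plain number to a datetime always means adding seconds.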

Exercise 7.2a
Convert the column timestamp into a new column called dttm, with values that are datetime objects.

The timestamp column contains numeric UNIX timestamps: seconds since the reference time 1970-01-01 00:00:00 GMT. Since you can calculate with POSIXct (dttm) objects, you can parse that reference time into a datetime object and add the values in the timestamp column to it.


After conversion, your data should look like this:

## # A tibble: 585 × 3
##     timestamp payload            dttm               
##         <dbl> <chr>              <dttm>             
##  1 1151064476 017202f979afdb4cfb 2006-06-23 12:07:56
##  2 1151066276 016e02f9b55fdb508b 2006-06-23 12:37:56
##  3 1151068076 016302f9bd1fdb512e 2006-06-23 13:07:56
##  4 1151069876 015f02f9cc5fdb50c6 2006-06-23 13:37:56
##  5 1151071675 015002f9c98fdb508e 2006-06-23 14:07:55
##  6 1151073475 014102f9cc9fdb5065 2006-06-23 14:37:55
##  7 1151075275 012b02f9ccbfdb509b 2006-06-23 15:07:55
##  8 1151077075 011902f9b40fdb51e7 2006-06-23 15:37:55
##  9 1151078875 010602f9a6afdb5360 2006-06-23 16:07:55
## 10 1151080675 00f702f9a2ffdb53be 2006-06-23 16:37:55
## # ℹ 575 more rows

dat <- dat %>%
  mutate(dttm = parse_date_time("1970-1-1 0:0:00", 
                                orders = "%Y-%m-%d %H:%M:%S",
                                tz = "GMT") + timestamp)
dat
## # A tibble: 585 × 3
##     timestamp payload            dttm               
##         <dbl> <chr>              <dttm>             
##  1 1151064476 017202f979afdb4cfb 2006-06-23 12:07:56
##  2 1151066276 016e02f9b55fdb508b 2006-06-23 12:37:56
##  3 1151068076 016302f9bd1fdb512e 2006-06-23 13:07:56
##  4 1151069876 015f02f9cc5fdb50c6 2006-06-23 13:37:56
##  5 1151071675 015002f9c98fdb508e 2006-06-23 14:07:55
##  6 1151073475 014102f9cc9fdb5065 2006-06-23 14:37:55
##  7 1151075275 012b02f9ccbfdb509b 2006-06-23 15:07:55
##  8 1151077075 011902f9b40fdb51e7 2006-06-23 15:37:55
##  9 1151078875 010602f9a6afdb5360 2006-06-23 16:07:55
## 10 1151080675 00f702f9a2ffdb53be 2006-06-23 16:37:55
## # ℹ 575 more rows

3 Wrangling with text strings

Now that we’ve converted the first column, timestamp, we continue with the payload. Notice that the payload is in character format: it is thus a text string. The stringr package, part of the core tidyverse, is the main package for string manipulation. Most functions in the stringr package start with the prefix str_, e.g. str_length("hello world") returns the value 11 as the text string is 11 characters long (including the space!). We are thus going to use some stringr functions to convert our payload data into meaningful information. In order to do that, we have to know how to decode the payload, i.e., which parts of the payload code for which piece of information, and how to decode/translate them.

For our payload, the following information is given:

  • the first message is 4 nibbles long, and codes for the ambient temperature (unsigned 16-bit integer) * 10
  • the following message is 7 nibbles long, and codes for the GPS longitude (signed 28-bit integer, two’s complement) * 1e5
  • the last message is similar to the second message, but codes for the GPS latitude.

A nibble is one hexadecimal digit, written using a symbol 0-9 or A-F (either lower-case or upper-case). Two nibbles make 1 byte (thus 8 bits).

The notation 1e5 is scientific notation, meaning 10^5, thus 100000.
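To make the term "signed integer, two’s complement" more concrete: in an n-bit two’s complement value, a set highest bit marks a negative number, whose value is obtained by subtracting 2^n from the unsigned reading. The following is only a minimal sketch of that idea in base R. It is not the hex2integer function used later in this tutorial, and the name decode_signed_hex and its n_bits argument are hypothetical.

```r
# Minimal sketch of two's complement decoding (hypothetical helper,
# not the course's hex2integer function)
decode_signed_hex <- function(hex, n_bits = 28) {
  value <- strtoi(hex, base = 16L)         # read the hex string as unsigned
  is_negative <- value >= 2^(n_bits - 1)   # highest bit set: negative number
  ifelse(is_negative, value - 2^n_bits, value)
}

decode_signed_hex("02f979a")        # 3119002
decode_signed_hex("fdb4cfb")        # -2405125
decode_signed_hex("fdb4cfb") / 1e5  # -24.05125
```

The two hex strings are the 7-nibble longitude and latitude parts of the first payload; dividing by 1e5 undoes the scaling mentioned in the list above.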

Exercise 7.3a
Convert the column payload into 3 new columns: respectively temp_hex, lon_hex and lat_hex, using the information on the payload build-up explained above. Display the result by printing the object dat to the console. When your code works, you can consider removing the columns timestamp and payload, so that your dataset is smaller and you have a better overview (without losing information!).

Check the str_sub function to retrieve a subset from a text string, indexed by the starting and ending position of the sub-string.
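As a quick illustration of how str_sub indexes a string (using an unrelated example string rather than the payload), positions are counted from 1 and both start and end are inclusive:

```r
library(stringr)

x <- "hello world"

str_sub(x, start = 1, end = 5)    # "hello"
str_sub(x, start = 7, end = 11)   # "world"

# Negative positions count from the end of the string
str_sub(x, start = -5, end = -1)  # "world"

# str_sub is vectorised: it works on every element of a character vector
str_sub(c("abcdef", "uvwxyz"), start = 2, end = 3)  # "bc" "vw"
```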


After conversion, your data should look like this:

## # A tibble: 585 × 4
##    dttm                temp_hex lon_hex lat_hex
##    <dttm>              <chr>    <chr>   <chr>  
##  1 2006-06-23 12:07:56 0172     02f979a fdb4cfb
##  2 2006-06-23 12:37:56 016e     02f9b55 fdb508b
##  3 2006-06-23 13:07:56 0163     02f9bd1 fdb512e
##  4 2006-06-23 13:37:56 015f     02f9cc5 fdb50c6
##  5 2006-06-23 14:07:55 0150     02f9c98 fdb508e
##  6 2006-06-23 14:37:55 0141     02f9cc9 fdb5065
##  7 2006-06-23 15:07:55 012b     02f9ccb fdb509b
##  8 2006-06-23 15:37:55 0119     02f9b40 fdb51e7
##  9 2006-06-23 16:07:55 0106     02f9a6a fdb5360
## 10 2006-06-23 16:37:55 00f7     02f9a2f fdb53be
## # ℹ 575 more rows

dat <- dat %>%
  mutate(temp_hex = str_sub(payload, start = 1,  end = 4),
         lon_hex  = str_sub(payload, start = 5,  end = 11),
         lat_hex  = str_sub(payload, start = 12, end = 18)) %>%
  select(-c(timestamp,payload))
dat
## # A tibble: 585 × 4
##    dttm                temp_hex lon_hex lat_hex
##    <dttm>              <chr>    <chr>   <chr>  
##  1 2006-06-23 12:07:56 0172     02f979a fdb4cfb
##  2 2006-06-23 12:37:56 016e     02f9b55 fdb508b
##  3 2006-06-23 13:07:56 0163     02f9bd1 fdb512e
##  4 2006-06-23 13:37:56 015f     02f9cc5 fdb50c6
##  5 2006-06-23 14:07:55 0150     02f9c98 fdb508e
##  6 2006-06-23 14:37:55 0141     02f9cc9 fdb5065
##  7 2006-06-23 15:07:55 012b     02f9ccb fdb509b
##  8 2006-06-23 15:37:55 0119     02f9b40 fdb51e7
##  9 2006-06-23 16:07:55 0106     02f9a6a fdb5360
## 10 2006-06-23 16:37:55 00f7     02f9a2f fdb53be
## # ℹ 575 more rows

Please pay attention to the values of the column temp_hex: although they resemble numerical values, they are in fact not: they are values in hexadecimal representation, even when the values a-f are not present. In many programming languages, data in hexadecimal representation often gets the prefix 0x, which makes it visually very clear that it is not numeric data.

Exercise 7.3b
Add the prefix “0x” to the 3 new columns created above.

Check the str_c function on how to join multiple strings into a single string.
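A short sketch of how str_c joins strings, again on unrelated example values: a single string is recycled along a vector, and sep controls what is placed between the joined pieces.

```r
library(stringr)

str_c("0x", "ff")                # "0xff"

# A single prefix is recycled over every element of a vector
str_c("id_", c("a", "b", "c"))   # "id_a" "id_b" "id_c"

# sep is inserted between the pieces (the default is "")
str_c("a", "b", sep = "-")       # "a-b"
```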


After conversion, your data should look like this:

## # A tibble: 585 × 4
##    dttm                temp_hex lon_hex   lat_hex  
##    <dttm>              <chr>    <chr>     <chr>    
##  1 2006-06-23 12:07:56 0x0172   0x02f979a 0xfdb4cfb
##  2 2006-06-23 12:37:56 0x016e   0x02f9b55 0xfdb508b
##  3 2006-06-23 13:07:56 0x0163   0x02f9bd1 0xfdb512e
##  4 2006-06-23 13:37:56 0x015f   0x02f9cc5 0xfdb50c6
##  5 2006-06-23 14:07:55 0x0150   0x02f9c98 0xfdb508e
##  6 2006-06-23 14:37:55 0x0141   0x02f9cc9 0xfdb5065
##  7 2006-06-23 15:07:55 0x012b   0x02f9ccb 0xfdb509b
##  8 2006-06-23 15:37:55 0x0119   0x02f9b40 0xfdb51e7
##  9 2006-06-23 16:07:55 0x0106   0x02f9a6a 0xfdb5360
## 10 2006-06-23 16:37:55 0x00f7   0x02f9a2f 0xfdb53be
## # ℹ 575 more rows

dat <- dat %>%
  mutate(temp_hex = str_c("0x", temp_hex, sep = ""),
         lon_hex  = str_c("0x", lon_hex,  sep = ""),
         lat_hex  = str_c("0x", lat_hex,  sep = ""))
dat
## # A tibble: 585 × 4
##    dttm                temp_hex lon_hex   lat_hex  
##    <dttm>              <chr>    <chr>     <chr>    
##  1 2006-06-23 12:07:56 0x0172   0x02f979a 0xfdb4cfb
##  2 2006-06-23 12:37:56 0x016e   0x02f9b55 0xfdb508b
##  3 2006-06-23 13:07:56 0x0163   0x02f9bd1 0xfdb512e
##  4 2006-06-23 13:37:56 0x015f   0x02f9cc5 0xfdb50c6
##  5 2006-06-23 14:07:55 0x0150   0x02f9c98 0xfdb508e
##  6 2006-06-23 14:37:55 0x0141   0x02f9cc9 0xfdb5065
##  7 2006-06-23 15:07:55 0x012b   0x02f9ccb 0xfdb509b
##  8 2006-06-23 15:37:55 0x0119   0x02f9b40 0xfdb51e7
##  9 2006-06-23 16:07:55 0x0106   0x02f9a6a 0xfdb5360
## 10 2006-06-23 16:37:55 0x00f7   0x02f9a2f 0xfdb53be
## # ℹ 575 more rows

The newly created column temp_hex represents the ambient temperature measured by the sensor. It holds a 16-bit unsigned integer value (16 bits is 4 nibbles), which can easily be converted to a numeric value. The integer value is not the actual temperature in degrees C, but 10 times the temperature in degrees C. We thus have to first decode the hexadecimal data in temp_hex to an integer, and then divide by 10.

Exercise 7.3c
Decode the ambient temperature as stored in temp_hex

Simple conversion from hex to integer can be done using the function strtoi. Note that the argument base should be 16L (the L here explicitly denotes that it is an integer) since hexadecimal is base-16. For example, this code converts “015f” into an integer value:

strtoi("015f", base = 16L)
## [1] 351

After decoding the hexadecimal character string to integer values, you only have to divide by 10 in order to get the temperature in degrees C!


dat <- dat %>%
  mutate(temp = strtoi(temp_hex, base = 16L) / 10)
dat
## # A tibble: 585 × 5
##    dttm                temp_hex lon_hex   lat_hex    temp
##    <dttm>              <chr>    <chr>     <chr>     <dbl>
##  1 2006-06-23 12:07:56 0x0172   0x02f979a 0xfdb4cfb  37  
##  2 2006-06-23 12:37:56 0x016e   0x02f9b55 0xfdb508b  36.6
##  3 2006-06-23 13:07:56 0x0163   0x02f9bd1 0xfdb512e  35.5
##  4 2006-06-23 13:37:56 0x015f   0x02f9cc5 0xfdb50c6  35.1
##  5 2006-06-23 14:07:55 0x0150   0x02f9c98 0xfdb508e  33.6
##  6 2006-06-23 14:37:55 0x0141   0x02f9cc9 0xfdb5065  32.1
##  7 2006-06-23 15:07:55 0x012b   0x02f9ccb 0xfdb509b  29.9
##  8 2006-06-23 15:37:55 0x0119   0x02f9b40 0xfdb51e7  28.1
##  9 2006-06-23 16:07:55 0x0106   0x02f9a6a 0xfdb5360  26.2
## 10 2006-06-23 16:37:55 0x00f7   0x02f9a2f 0xfdb53be  24.7
## # ℹ 575 more rows

Congratulations! You’ve just done some code crunching by converting a value stored in hex format to a value stored in decimal format! Do you recognize the difference between data and information?

4 Functions

To convert the other two pieces of data (lon_hex and lat_hex), it will help us to write a function ourselves. The R4DS book states:

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.

Indeed, although we did not cover this yet, being able to write your own functions is going to make your life as a (data) scientist much more convenient. Being able to write functions and use them cleverly is going to make your code more readable and much shorter. It will prevent you from being a lazy (and bad) copy-and-paste programmer, as Wikipedia writes:

Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence.

The chapter in R4DS on functions covers the writing and use of functions nicely, so have a look there. For now, we will start with creating a very simple function.

To define a function in R, we should use the following syntax:

functionName <- function(arguments) {
  computations on the arguments
  some other code
  return(output)
}

Here, we specify that the function gets the name functionName (and thus it will be used via the function call functionName(...)), where arguments specifies the inputs that the function takes. If there are multiple function inputs (arguments), they are separated by commas. You can assign them a default value using the = operator: e.g., if the function arguments are x1, x2 = 4.5, then you specify 2 function inputs, called x1 and x2. Here x1 needs to be supplied when calling the function, as it does not have a default value, whereas x2 is optional, as it has a default value: when no value is supplied to x2 in the function call, R will use the value 4.5. Everything in between the curly brackets {} is called the function body, which holds all the commands that process the inputs into the output. A function is best ended by an explicit call to return(): this specifies what output the function returns.
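To illustrate the syntax with the arguments x1, x2 = 4.5 mentioned above, here is a minimal sketch. The function name scale_value and the computation in its body are made up purely for illustration.

```r
# x1 is required, x2 is optional because it has a default value
scale_value <- function(x1, x2 = 4.5) {
  y <- x1 * x2   # function body: compute the output from the inputs
  return(y)      # explicit return of the output
}

scale_value(2)           # x2 falls back to its default: 2 * 4.5 = 9
scale_value(2, x2 = 10)  # x2 supplied explicitly: 2 * 10 = 20
```

Calling scale_value() without any input gives an error, because x1 has no default value.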

We’ve already created a more complex function that converts the values of lon_hex and lat_hex to signed integers. This function uses the R.utils package, so you will need to install and load that library before using the function. You can load the function into your workspace by running:

library(R.utils)
source("https://wec.wur.nl/dse/-scripts/hex2integer.r")

The function has the name hex2integer, and can be called with 1 input argument (x). Give it a try: inserting “0x02f9b40” into the function should return the value 3119936:

hex2integer("0x02f9b40")
## [1] 3119936

Exercise 7.4a
Create your own function, called myFun, that takes 1 input argument (x; which will be the hex code of lon or lat), calculates the equivalent longitude or latitude as an integer, and transforms it to an appropriate decimal-point value. Test your function for the value “0x02f9b40”.

Recall from above that the lon and lat values were multiplied by 1e5! When you test your function on the specified hex string, the output should be 31.19936!


myFun <- function(x) {
  y <- hex2integer(x) / 1e5
  return(y)
}
myFun("0x02f9b40")
## [1] 31.19936

5 Iteration

While writing functions is one way to reduce duplication and copying-and-pasting in your code, iteration is another. The package purrr, part of the core tidyverse, is a package that allows you to efficiently iterate a function over a vector of data. To apply a function to each element in a vector of data (e.g. a column in a tibble) we can use the map function. See the R4DS section on the map function, especially the difference between map(), map_dbl() etc., which specify the type of output that is returned.
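The difference in output type can be sketched on a small example vector. The helper add_one is a made-up function used only for illustration.

```r
library(purrr)

x <- c(1, 4, 9)

# map() always returns a list, with one element per input element
as_list <- map(x, sqrt)
class(as_list)   # "list"

# map_dbl() returns a numeric (double) vector of the same length
as_dbl <- map_dbl(x, sqrt)
as_dbl           # 1 2 3

# The same works for a function you wrote yourself
add_one <- function(v) {
  return(v + 1)
}
map_dbl(x, add_one)  # 2 5 10
```

A plain vector such as the one returned by map_dbl fits directly into a tibble column inside mutate, whereas the list returned by map would create a list-column.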

Exercise 7.5a
Use the appropriate function to iterate the function you created above over the vectors lon_hex and lat_hex, storing the data as a decimal value in the new columns lon and lat, respectively.

Since the function you created above returns a decimal value (a double, shown as <dbl> in a tibble), use map_dbl to iterate your function over the vectors lon_hex and lat_hex.


After conversion, your data should look like this:

## # A tibble: 585 × 7
##    dttm                temp_hex lon_hex   lat_hex    temp   lon   lat
##    <dttm>              <chr>    <chr>     <chr>     <dbl> <dbl> <dbl>
##  1 2006-06-23 12:07:56 0x0172   0x02f979a 0xfdb4cfb  37    31.2 -24.1
##  2 2006-06-23 12:37:56 0x016e   0x02f9b55 0xfdb508b  36.6  31.2 -24.0
##  3 2006-06-23 13:07:56 0x0163   0x02f9bd1 0xfdb512e  35.5  31.2 -24.0
##  4 2006-06-23 13:37:56 0x015f   0x02f9cc5 0xfdb50c6  35.1  31.2 -24.0
##  5 2006-06-23 14:07:55 0x0150   0x02f9c98 0xfdb508e  33.6  31.2 -24.0
##  6 2006-06-23 14:37:55 0x0141   0x02f9cc9 0xfdb5065  32.1  31.2 -24.0
##  7 2006-06-23 15:07:55 0x012b   0x02f9ccb 0xfdb509b  29.9  31.2 -24.0
##  8 2006-06-23 15:37:55 0x0119   0x02f9b40 0xfdb51e7  28.1  31.2 -24.0
##  9 2006-06-23 16:07:55 0x0106   0x02f9a6a 0xfdb5360  26.2  31.2 -24.0
## 10 2006-06-23 16:37:55 0x00f7   0x02f9a2f 0xfdb53be  24.7  31.2 -24.0
## # ℹ 575 more rows

dat <- dat %>%
  mutate(lon = map_dbl(lon_hex, myFun),
         lat = map_dbl(lat_hex, myFun))
dat
## # A tibble: 585 × 7
##    dttm                temp_hex lon_hex   lat_hex    temp   lon   lat
##    <dttm>              <chr>    <chr>     <chr>     <dbl> <dbl> <dbl>
##  1 2006-06-23 12:07:56 0x0172   0x02f979a 0xfdb4cfb  37    31.2 -24.1
##  2 2006-06-23 12:37:56 0x016e   0x02f9b55 0xfdb508b  36.6  31.2 -24.0
##  3 2006-06-23 13:07:56 0x0163   0x02f9bd1 0xfdb512e  35.5  31.2 -24.0
##  4 2006-06-23 13:37:56 0x015f   0x02f9cc5 0xfdb50c6  35.1  31.2 -24.0
##  5 2006-06-23 14:07:55 0x0150   0x02f9c98 0xfdb508e  33.6  31.2 -24.0
##  6 2006-06-23 14:37:55 0x0141   0x02f9cc9 0xfdb5065  32.1  31.2 -24.0
##  7 2006-06-23 15:07:55 0x012b   0x02f9ccb 0xfdb509b  29.9  31.2 -24.0
##  8 2006-06-23 15:37:55 0x0119   0x02f9b40 0xfdb51e7  28.1  31.2 -24.0
##  9 2006-06-23 16:07:55 0x0106   0x02f9a6a 0xfdb5360  26.2  31.2 -24.0
## 10 2006-06-23 16:37:55 0x00f7   0x02f9a2f 0xfdb53be  24.7  31.2 -24.0
## # ℹ 575 more rows

6 Plotting the data

Exercise 7.6a
Plot the decoded data (lon and lat) as lines using ggplot.

Use geom_path and not geom_line, because geom_line first orders the data (whereas geom_path does not)!
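The difference between the two geoms can be sketched with a tiny made-up track (the data frame track below is hypothetical, not part of the dataset) that first moves east and then back west:

```r
library(ggplot2)

# Hypothetical track of 4 positions, in order of observation
track <- data.frame(lon = c(1, 3, 2, 0),
                    lat = c(0, 1, 2, 3))

# geom_path connects the points in the order in which they appear in the data
p_path <- ggplot(track, aes(x = lon, y = lat)) + geom_path()

# geom_line connects the points in order of the x variable (here lon)
p_line <- ggplot(track, aes(x = lon, y = lat)) + geom_line()

# layer_data() shows the order in which each geom will connect the points
layer_data(p_path)$x  # row order of the data
layer_data(p_line)$x  # sorted by lon
```

For movement data the order of observation is what matters, which is why geom_path is the appropriate choice here.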


dat %>%
  ggplot(aes(x=lon, y=lat)) +
  geom_path()

7 Challenge

Apart from using ggplot, there is a variety of methods to plot data dynamically in R using JavaScript (see htmlwidgets.org). For example, using the leaflet package, you can easily draw interactive spatial maps. The following short piece of code plots a dynamic map with a marker at our current location:

library(leaflet)
tibble(lon = 5.657619318, lat = 51.982556587) %>% 
  leaflet() %>% 
  addTiles() %>%
  addCircleMarkers()


Plot the decoded data (lon and lat) on a leaflet map. Try to map the data nicely. Also, which species do you think we have been working with all the time today? Take a close look at the leaflet map and you will figure it out (tip: which river - the big one, not the small one - is it that we’re looking at?).

dat %>% 
  leaflet() %>% 
  addTiles() %>%
  addPolylines(lng = ~lon,
               lat = ~lat) %>%
  addCircleMarkers(lng = ~lon,
                   lat = ~lat,
                   radius = 2,
                   color = "#000",
                   opacity = 0.75)

On the leaflet map we see the Olifants River: the collared animal was an elephant!

If you want more of a challenge: try to replicate or improve upon the hex2integer function; see Appendix B.

8 Submit your last plot

Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 7.

Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.

9 Recap

Today, we’ve practised some handy tools for working with strings (package stringr) and date-times (package lubridate), written our own custom-made function, and used it with the iteration tools that the purrr package provides. Using these tools, we converted raw data (strings in hexadecimal representation that are meaningless unless you know how to decipher them) into information (the location in space and time of an individual of a certain species). Tomorrow, we are finally going to do some modelling and see another cool feature of the tidyverse ecosystem: nested tibbles and list-columns!

An R script of today’s exercises can be downloaded here

10 Further reading