Today’s goal: to do some wrangling with text, to create datetime objects, to write your own function, and to iterate this function over a column of data.
R4DS: chapters 14 Strings, 16 Dates and times, 19 Functions, 21 Iteration
stringr, lubridate, purrr. Stringr and purrr are loaded when you load tidyverse. Lubridate is also a tidyverse package, but in tidyverse versions before 2.0.0 it is not part of the core set of packages, so it should be loaded separately (after loading tidyverse).
parse_date_time, str_sub, str_c, function(), map, map_dbl, leaflet
In this tutorial, we are going to practice some more skills related to working with text strings in R, datetime objects, writing our own functions, and iteration. We are going to work with a dataset of a GPS-tracked animal, but only at the end of the tutorial will we figure out which species this animal belongs to.
First, start a new script and load the tidyverse and lubridate libraries (install them first if needed). Then load the data into an object called dat.
library(tidyverse)
library(lubridate)
dat <- read_csv("data/raw/somespecies/somespecies.csv")
The loaded data contains data of a single individual, with 585 records and just 2 columns: timestamp and payload.
The first rows of the data are shown here:
## # A tibble: 585 × 2
## timestamp payload
## <dbl> <chr>
## 1 1151064476 017202f979afdb4cfb
## 2 1151066276 016e02f9b55fdb508b
## 3 1151068076 016302f9bd1fdb512e
## 4 1151069876 015f02f9cc5fdb50c6
## 5 1151071675 015002f9c98fdb508e
## 6 1151073475 014102f9cc9fdb5065
## 7 1151075275 012b02f9ccbfdb509b
## 8 1151077075 011902f9b40fdb51e7
## 9 1151078875 010602f9a6afdb5360
## 10 1151080675 00f702f9a2ffdb53be
## # ℹ 575 more rows
The timestamp contains the number of seconds that have elapsed since the start of 1970 (in GMT), and the column payload contains all the data that the sensor collected at each timepoint. This website defines payload as:
Payload of a specific packet or other protocol data unit (PDU) is the actual transmitted data sent by communicating endpoints
The payload here is stored in hexadecimal representation. A hexadecimal representation of data is written using the symbols 0-9 and A-F (either lower-case or upper-case): thus 16 symbols in total (compared to 10 for decimal and 2 for binary; hexadecimal is thus base-16). For example, the first payload in the data loaded in dat is “017202f979afdb4cfb”.
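To make the base-16 idea concrete, the sketch below decodes the hexadecimal string "015f" by hand: each symbol is looked up as a value from 0 to 15 and multiplied by the matching power of 16. The object names (hex_digits, chars, values, powers, decimal) are illustrative only and are not part of the data or of any package.

```r
# Hand-rolled hex-to-decimal conversion (illustrative sketch; base R only)
hex_digits <- c(0:9, letters[1:6])          # "0".."9", "a".."f"
x <- "015f"

chars  <- strsplit(tolower(x), "")[[1]]     # "0" "1" "5" "f"
values <- match(chars, hex_digits) - 1      # 0 1 5 15
powers <- rev(seq_along(values)) - 1        # 3 2 1 0

# 0 * 16^3 + 1 * 16^2 + 5 * 16^1 + 15 * 16^0
decimal <- sum(values * 16^powers)
decimal
## [1] 351
```

In practice you would not write this out yourself, but it shows why every extra hexadecimal digit multiplies the range of values by 16.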
In this practical, we will convert the raw data from the sensor to information, by converting the timestamp to a datetime object that we can easily work with, and by converting the payload to 3 values: the ambient temperature and the animal's longitude and latitude. When you have converted the data into these pieces of information and plotted them on a map, you will be able to tell which species this data comes from! Namely, the individual whose data we use was near a river that is named after the species. Thus, your task is to decode the payload, plot, and solve the mystery!
Datetime objects are often stored in a UNIX timestamp format: a number which represents the number of seconds that have passed since midnight on the 1st of January 1970, GMT time. Because of this, it is actually very straightforward to calculate with datetime objects. The package lubridate, part of the tidyverse but (in tidyverse versions before 2.0.0) not automatically loaded when you load the tidyverse package, is a helpful package to work with date-time data in R. Before using its functions, you thus need to load the library (and install it once).
Lubridate has several user-friendly date-time parsing methods to convert date-time data into a UNIX-style format (dttm). For example, the function parse_date_time can be used to parse a character string into a date-time object:
parse_date_time("1970-1-1 0:0:00",
orders = "%Y-%m-%d %H:%M:%S",
tz = "GMT")
Here, the first argument specifies a date and time in a character string, and the argument orders specifies the format in which the character string denotes the date and time elements (e.g., %Y means the 4-digit year, %m means the month in values 1-12, etc.; see the specification listed in the documentation of the strptime function of the base package, thus ?base::strptime). The tz argument specifies the timezone, here GMT. Internally, R stores these datetime objects in an object of class POSIXct:
parse_date_time("1970-1-1 0:0:00",
orders = "%Y-%m-%d %H:%M:%S",
tz = "GMT") %>%
class()
## [1] "POSIXct" "POSIXt"
Since R treats time as Unix time, we can calculate with time: for example, we can simply compute the time difference between two instances in time, here t1 and t2, using t2 - t1, or we can add 10 seconds to an instance in time, here t1, using t1 + 10!
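As a small sketch of such calculations, the code below creates two instances in time and computes with them. The objects t1, t2 and elapsed are illustrative names, and the two date-times are simply chosen to lie 30 minutes apart.

```r
library(lubridate)

# Two instances in time, 30 minutes apart (illustrative values)
t1 <- parse_date_time("2006-06-23 12:07:56",
                      orders = "%Y-%m-%d %H:%M:%S",
                      tz = "GMT")
t2 <- parse_date_time("2006-06-23 12:37:56",
                      orders = "%Y-%m-%d %H:%M:%S",
                      tz = "GMT")

elapsed <- t2 - t1                    # a difftime object
as.numeric(elapsed, units = "secs")   # the difference expressed in seconds: 1800

t1 + 10                               # the instance in time 10 seconds after t1

as.numeric(t1)                        # seconds since 1970-01-01 00:00:00 GMT: 1151064476
```

Converting a POSIXct object with as.numeric exposes the underlying count of seconds, which is why adding a plain number to a datetime works.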
Convert the column timestamp into a new column called dttm, with values that are datetime objects. The timestamp column contains numeric timestamps in UNIX format (dttm, POSIXct). Since you can calculate with POSIXct objects, you can add the value in the timestamp column to the reference timestamp 1/1/1970 0:0:00 in the proper format.
After conversion, your data should look like this:
## # A tibble: 585 × 3
## timestamp payload dttm
## <dbl> <chr> <dttm>
## 1 1151064476 017202f979afdb4cfb 2006-06-23 12:07:56
## 2 1151066276 016e02f9b55fdb508b 2006-06-23 12:37:56
## 3 1151068076 016302f9bd1fdb512e 2006-06-23 13:07:56
## 4 1151069876 015f02f9cc5fdb50c6 2006-06-23 13:37:56
## 5 1151071675 015002f9c98fdb508e 2006-06-23 14:07:55
## 6 1151073475 014102f9cc9fdb5065 2006-06-23 14:37:55
## 7 1151075275 012b02f9ccbfdb509b 2006-06-23 15:07:55
## 8 1151077075 011902f9b40fdb51e7 2006-06-23 15:37:55
## 9 1151078875 010602f9a6afdb5360 2006-06-23 16:07:55
## 10 1151080675 00f702f9a2ffdb53be 2006-06-23 16:37:55
## # ℹ 575 more rows
dat <- dat %>%
mutate(dttm = parse_date_time("1970-1-1 0:0:00",
orders = "%Y-%m-%d %H:%M:%S",
tz = "GMT") + timestamp)
Now that we’ve converted the first column, timestamp, we continue with the payload. Notice that the payload is in character format: it is thus a text string. The stringr package, part of the core tidyverse, is the main package that deals with string manipulations. Most functions in the stringr package start with the prefix str_, e.g. str_length("hello world") returns the value 11, as the text string is 11 characters long (including the space!). We are thus going to use some stringr functions to convert our payload data into meaningful information. In order to do that, we have to know how to decode the payload, i.e., which parts of the payload code for which piece of information, and how to decode/translate them.
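A few stringr functions in action, as a minimal sketch on a made-up string (the object x is illustrative and unrelated to the payload data):

```r
library(stringr)

x <- "hello world"

str_length(x)                        # number of characters, including the space: 11
str_sub(x, start = 1, end = 5)       # characters 1 to 5: "hello"
str_sub(x, start = 7, end = 11)      # characters 7 to 11: "world"
str_c("hello", "world", sep = " ")   # join strings: "hello world"
```

All of these functions are vectorised, so they work on a whole column of strings at once.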
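For our payload, the following information is given:
* nibbles 1-4: the ambient temperature, stored as a 16-bit unsigned integer holding the temperature in degrees C * 10
* nibbles 5-11: the longitude, stored as a signed integer holding the longitude * 1e5
* nibbles 12-18: the latitude, stored as a signed integer holding the latitude * 1e5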
A nibble is one hexadecimal digit, written using a symbol 0-9 or A-F (either lower-case or upper-case). Two nibbles make 1 byte (thus 8 bits).
The notation 1e5 is scientific notation, meaning 10^5, thus 100000.
Split the payload into three new columns, temp_hex, lon_hex and lat_hex, and print dat to the console. When your code works, you can consider removing the columns timestamp and payload, so that your dataset is smaller and you have a better overview (without losing information!). Check the str_sub function to retrieve a subset from a text string, indexed by the starting and ending position of the sub-string.
After conversion, your data should look like this:
## # A tibble: 585 × 4
## dttm temp_hex lon_hex lat_hex
## <dttm> <chr> <chr> <chr>
## 1 2006-06-23 12:07:56 0172 02f979a fdb4cfb
## 2 2006-06-23 12:37:56 016e 02f9b55 fdb508b
## 3 2006-06-23 13:07:56 0163 02f9bd1 fdb512e
## 4 2006-06-23 13:37:56 015f 02f9cc5 fdb50c6
## 5 2006-06-23 14:07:55 0150 02f9c98 fdb508e
## 6 2006-06-23 14:37:55 0141 02f9cc9 fdb5065
## 7 2006-06-23 15:07:55 012b 02f9ccb fdb509b
## 8 2006-06-23 15:37:55 0119 02f9b40 fdb51e7
## 9 2006-06-23 16:07:55 0106 02f9a6a fdb5360
## 10 2006-06-23 16:37:55 00f7 02f9a2f fdb53be
## # ℹ 575 more rows
dat <- dat %>%
mutate(temp_hex = str_sub(payload, start = 1, end = 4),
lon_hex = str_sub(payload, start = 5, end = 11),
lat_hex = str_sub(payload, start = 12, end = 18)) %>%
select(-c(timestamp,payload))
Please pay attention to the values of the column temp_hex: although they resemble numerical values, they are in fact not: they are values in hexadecimal representation, even when the values a-f are not present. In many programming languages, data in hexadecimal representation often gets the prefix 0x, which makes it visually very clear that it is not numeric data.
Add the prefix 0x to the values in the columns temp_hex, lon_hex and lat_hex. Check the str_c function on how to join multiple strings into a single string.
After conversion, your data should look like this:
## # A tibble: 585 × 4
## dttm temp_hex lon_hex lat_hex
## <dttm> <chr> <chr> <chr>
## 1 2006-06-23 12:07:56 0x0172 0x02f979a 0xfdb4cfb
## 2 2006-06-23 12:37:56 0x016e 0x02f9b55 0xfdb508b
## 3 2006-06-23 13:07:56 0x0163 0x02f9bd1 0xfdb512e
## 4 2006-06-23 13:37:56 0x015f 0x02f9cc5 0xfdb50c6
## 5 2006-06-23 14:07:55 0x0150 0x02f9c98 0xfdb508e
## 6 2006-06-23 14:37:55 0x0141 0x02f9cc9 0xfdb5065
## 7 2006-06-23 15:07:55 0x012b 0x02f9ccb 0xfdb509b
## 8 2006-06-23 15:37:55 0x0119 0x02f9b40 0xfdb51e7
## 9 2006-06-23 16:07:55 0x0106 0x02f9a6a 0xfdb5360
## 10 2006-06-23 16:37:55 0x00f7 0x02f9a2f 0xfdb53be
## # ℹ 575 more rows
dat <- dat %>%
mutate(temp_hex = str_c("0x", temp_hex, sep = ""),
lon_hex = str_c("0x", lon_hex, sep = ""),
lat_hex = str_c("0x", lat_hex, sep = ""))
The newly created column temp_hex represents the ambient temperature measured by the sensor. It holds a 16-bit unsigned integer value (16 bits is 4 nibbles), which can easily be converted to a numeric value. The integer value is not the actual temperature in degrees C; it is 10 times the temperature in degrees C. We thus first have to decode the hexadecimal data in temp_hex to an integer, and then divide by 10.
Simple conversion from hex to integer can be done using the function strtoi. Note that the argument base should be 16L (the L here explicitly denotes that it is an integer), since hexadecimal is base-16. For example, this code converts “015f” into an integer value:
strtoi("015f", base = 16L)
## [1] 351
After decoding the hexadecimal character string to integer values, you only have to divide by 10 in order to get the temperature in degrees C!
dat <- dat %>%
mutate(temp = strtoi(temp_hex, base = 16L) / 10)
dat
## # A tibble: 585 × 5
## dttm temp_hex lon_hex lat_hex temp
## <dttm> <chr> <chr> <chr> <dbl>
## 1 2006-06-23 12:07:56 0x0172 0x02f979a 0xfdb4cfb 37
## 2 2006-06-23 12:37:56 0x016e 0x02f9b55 0xfdb508b 36.6
## 3 2006-06-23 13:07:56 0x0163 0x02f9bd1 0xfdb512e 35.5
## 4 2006-06-23 13:37:56 0x015f 0x02f9cc5 0xfdb50c6 35.1
## 5 2006-06-23 14:07:55 0x0150 0x02f9c98 0xfdb508e 33.6
## 6 2006-06-23 14:37:55 0x0141 0x02f9cc9 0xfdb5065 32.1
## 7 2006-06-23 15:07:55 0x012b 0x02f9ccb 0xfdb509b 29.9
## 8 2006-06-23 15:37:55 0x0119 0x02f9b40 0xfdb51e7 28.1
## 9 2006-06-23 16:07:55 0x0106 0x02f9a6a 0xfdb5360 26.2
## 10 2006-06-23 16:37:55 0x00f7 0x02f9a2f 0xfdb53be 24.7
## # ℹ 575 more rows
Congratulations! You’ve just done some code crunching by converting a value stored in hex format to a value stored in decimal format! Do you recognize the difference between data and information?
To convert the other two pieces of data (lon_hex and lat_hex), it will help us to write a function ourselves. The R4DS book states:
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Indeed, although we did not cover this yet, being able to write your own functions is going to make your life as a (data) scientist much more convenient. Writing functions and using them cleverly makes your code more readable and much shorter. It will prevent you from being a lazy (and bad) copy-and-paste programmer, as Wikipedia writes:
Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence.
The chapter in R4DS on functions covers the writing and use of functions nicely, so have a look there. For now, we will start with creating a very simple function.
To define a function in R, we should use the following syntax:
functionName <- function(arguments) {
  # computations on the arguments
  # some other code
  return(output)
}
Here, we specify that the function gets the name functionName (and thus it will be used via the function call functionName(...)), where arguments specifies the inputs that the function takes. If there are more function inputs (arguments), they are separated by a comma. You can assign them a default value using the = operator: e.g. if the function arguments are x1, x2=4.5, then you specify 2 function inputs, called x1 and x2, where x1 needs to be supplied to the function when calling it, as it does not have a default value. Input x2, however, is now technically optional, as it has a default value. When no value is supplied to x2 in the function call, R will assume that the value 4.5 is meant. Everything in between the curly brackets {} is called the function body. In this function body, all the commands that process the inputs into the output are placed. A function is best ended by an explicit call to return: this specifies what output the function returns.
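As a minimal sketch of these ideas, the hypothetical function multiplyBy below (the name and its body are made up for illustration) takes a required argument x1 and an optional argument x2 with default value 4.5:

```r
# Hypothetical example function: x1 is required, x2 defaults to 4.5
multiplyBy <- function(x1, x2 = 4.5) {
  y <- x1 * x2      # computation on the arguments
  return(y)         # explicit return of the output
}

multiplyBy(2)            # x2 not supplied, so the default 4.5 is used: 9
multiplyBy(2, x2 = 10)   # x2 supplied, overriding the default: 20
```

Calling multiplyBy() without any input fails as soon as x1 is needed, because x1 has no default value.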
We’ve already created a more complex function that converts the values of lon_hex and lat_hex to signed integers. This function uses the R.utils package, so you will need to install and load the library prior to the use of the function. You can load the function into your workspace by running:
library(R.utils)
source("https://wec.wur.nl/dse/-scripts/hex2integer.r")
The function has the name hex2integer, and can be called with 1 input argument (x). Give it a try: inserting “0x02f9b40” into the function should return the value 3119936:
hex2integer("0x02f9b40")
## [1] 3119936
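The contents of hex2integer.r are not shown here. To give a rough idea of what a conversion from hex to a signed integer can involve, the sketch below assumes that the 7 nibbles (28 bits) store the value in two's complement. This is an assumption, not a description of the actual hex2integer function; hexToSigned and its argument bits are hypothetical names.

```r
# Sketch only: ASSUMES a 28-bit two's complement encoding (not confirmed
# to be what hex2integer does). Uses base R only.
hexToSigned <- function(x, bits = 28L) {
  # strip an optional "0x" prefix and read the rest as an unsigned value
  unsigned <- strtoi(sub("^0[xX]", "", x), base = 16L)
  # values in the upper half of the range represent negative numbers
  ifelse(unsigned >= 2^(bits - 1), unsigned - 2^bits, unsigned)
}

hexToSigned("0x02f9b40")   # positive value: 3119936
hexToSigned("0xfdb4cfb")   # negative value: -2405125
```

Under this assumption the first call reproduces the value returned by hex2integer above, and the second yields a negative number, as you would expect for a latitude in the southern hemisphere.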
Write a function, called myFun, that takes 1 input argument (x; which will be the hex code of lon or lat), calculates the equivalent longitude or latitude as an integer, and transforms it to an appropriate decimal-point value. Test your function for the value “0x02f9b40”. Recall from above that the lon and lat values were multiplied by 1e5! When you test your function on the specified hex string, the output should be 31.19936!
myFun <- function(x) {
y <- hex2integer(x) / 1e5
return(y)
}
myFun("0x02f9b40")
## [1] 31.19936
While writing functions is a way to reduce duplication in your code and reduce copying-and-pasting, iteration is another. The package purrr, part of the core tidyverse, is a package that allows you to efficiently iterate a function over a vector of data. To apply a function to each element in a vector of data (e.g. a column in a tibble) we can use the map function. See the R4DS section on the map function, especially the difference between map(), map_dbl() etc., which specify the type of output that is returned.
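The difference in output type can be seen in the small sketch below, which applies the base function sqrt to a made-up vector (x, as_list and as_dbl are illustrative names):

```r
library(purrr)

x <- c(1, 4, 9)

as_list <- map(x, sqrt)       # map() always returns a list, here of length 3
as_dbl  <- map_dbl(x, sqrt)   # map_dbl() returns a numeric (double) vector

is.list(as_list)   # TRUE
as_dbl             # 1 2 3
```

Because a column in a tibble is usually a plain vector, map_dbl is the convenient choice whenever the function returns a single number per element.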
Apply your function to the columns lon_hex and lat_hex, storing the data as a decimal value in the new columns lon and lat, respectively. Since the function you created above returns a decimal value (double), use map_dbl to iterate your function over the vectors lon_hex and lat_hex.
After conversion, your data should look like this:
## # A tibble: 585 × 7
## dttm temp_hex lon_hex lat_hex temp lon lat
## <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2006-06-23 12:07:56 0x0172 0x02f979a 0xfdb4cfb 37 31.2 -24.1
## 2 2006-06-23 12:37:56 0x016e 0x02f9b55 0xfdb508b 36.6 31.2 -24.0
## 3 2006-06-23 13:07:56 0x0163 0x02f9bd1 0xfdb512e 35.5 31.2 -24.0
## 4 2006-06-23 13:37:56 0x015f 0x02f9cc5 0xfdb50c6 35.1 31.2 -24.0
## 5 2006-06-23 14:07:55 0x0150 0x02f9c98 0xfdb508e 33.6 31.2 -24.0
## 6 2006-06-23 14:37:55 0x0141 0x02f9cc9 0xfdb5065 32.1 31.2 -24.0
## 7 2006-06-23 15:07:55 0x012b 0x02f9ccb 0xfdb509b 29.9 31.2 -24.0
## 8 2006-06-23 15:37:55 0x0119 0x02f9b40 0xfdb51e7 28.1 31.2 -24.0
## 9 2006-06-23 16:07:55 0x0106 0x02f9a6a 0xfdb5360 26.2 31.2 -24.0
## 10 2006-06-23 16:37:55 0x00f7 0x02f9a2f 0xfdb53be 24.7 31.2 -24.0
## # ℹ 575 more rows
dat <- dat %>%
mutate(lon = map_dbl(lon_hex, myFun),
lat = map_dbl(lat_hex, myFun))
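Plot the track of the animal (lon on the x-axis, lat on the y-axis) using ggplot. Use geom_path and not geom_line, because geom_line first orders the data along the x-axis (whereas geom_path connects the observations in the order in which they appear in the data)!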
dat %>%
ggplot(aes(x=lon, y=lat)) +
geom_path()
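The difference between geom_path and geom_line is easiest to see on a tiny made-up track. In the sketch below, track, p_path and p_line are illustrative names and the coordinates are invented; print the two plot objects to compare them.

```r
library(ggplot2)

# Made-up track of four positions, listed in the order in which they were visited
track <- data.frame(lon = c(1, 3, 2, 4),
                    lat = c(1, 2, 3, 4))

# geom_path connects the points in row order: lon 1 -> 3 -> 2 -> 4
p_path <- ggplot(track, aes(x = lon, y = lat)) + geom_path()

# geom_line connects the points in order of the x variable: lon 1 -> 2 -> 3 -> 4
p_line <- ggplot(track, aes(x = lon, y = lat)) + geom_line()

# Row indices sorted by lon, i.e. the order in which geom_line visits the rows
order(track$lon)   # 1 3 2 4
```

For movement data the row order is the order in time, so only geom_path draws the route the animal actually took.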
Apart from using ggplot, there is a variety of methods to plot data dynamically in R, using JavaScript (see htmlwidgets.org). Using the leaflet package, for instance, you can easily draw interactive spatial maps. The following short piece of code plots a dynamic map with a marker at our current location:
library(leaflet)
tibble(lon = 5.657619318, lat = 51.982556587) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers()
Plot the decoded data (lon and lat) on a leaflet map. Try to map
the data nicely. Also, which species do you think we have been working
with all the time today? Take a close look at the leaflet map and you
will figure it out (tip: which river - the big one, not the small one -
is it that we’re looking at?).
dat %>%
leaflet() %>%
addTiles() %>%
addPolylines(lng = ~lon,
lat = ~lat) %>%
addCircleMarkers(lng = ~lon,
lat = ~lat,
radius = 2,
color = "#000",
opacity = 0.75)
On the leaflet map we see the Olifants River: the collared animal was an elephant!
If you want more of a challenge: try to replicate or improve upon the hex2integer function; see Appendix B.
Submit your script file as well as a plot: either your last created plot, or a plot that best captures your solution to the challenge. Submit the files on Brightspace via Assessment > Assignments > Skills day 7.
Note that your submission will not be graded or evaluated. It is used only to facilitate interaction and to get insight into progress.
Today, we’ve practiced some handy tools related to working with strings (package stringr) and date-times (package lubridate), written our own custom-made functions, and used them with the iteration tools that the purrr package provides. Using these tools, we converted raw data (strings in hexadecimal representation that are meaningless unless you know how to decipher them) into information (the location in space and time of an individual of a certain species). Tomorrow, we are finally going to do some modelling and see another cool feature of the tidyverse ecosystem: nested tibbles and list-columns!
An R script of today’s exercises can be downloaded here