By: Henjo de Knegt, Patrick Jansen, Jasper Eikelboom, Danaë Rozendaal, Janneke Troost & Alejandro Morales
Advancements in technology and information processing are rapidly changing many fields of science. The availability of unprecedented amounts of data is unlocking potential, but it also creates a major challenge: processing and analysing these data effectively. In the current data-centered digital era, driven by technological change, the volume of data will continue to sky-rocket due to decreasing costs of data collection, storage and processing.
This course covers the main elements of using a data science approach to
solving problems in (agro-) ecology. The students will be guided through
the main concepts and skills that are required to extract knowledge from
data, and will get hands-on experience with the different steps of the
data science workflow: from data management, cleaning and
pre-processing; data exploration; feature engineering; selecting and
training algorithms; optimizing hyperparameters; validating algorithms;
testing predictions; to visualization and communication of results. The
students will be trained to apply different machine learning techniques,
and critically evaluate their merits. During the course, students will
acquire and expand data science skills that will prepare them for a
quantitative MSc thesis, and that will benefit their future career in
academia or business.
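As a rough illustration of the workflow steps listed above (splitting data, training, tuning a hyperparameter, validating and testing), the following is a minimal sketch using only base R. The built-in `mtcars` data, the polynomial-regression model, the 4-fold cross-validation and all object names are assumptions chosen for illustration; they are not taken from the course material.

```r
# Minimal sketch of a train / validate / test workflow using only base R.
# Data set, model and object names are illustrative assumptions.
set.seed(42)                                  # make the random split reproducible

# Data management: split the built-in mtcars data into training and test sets
n        <- nrow(mtcars)
test_idx <- sample(n, size = 8)               # hold out 8 of the 32 cars for testing
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Root mean squared error, used to score predictions
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Hyperparameter tuning: choose the polynomial degree by 4-fold cross-validation
degrees <- 1:3
folds   <- sample(rep(1:4, length.out = nrow(train)))
cv_rmse <- sapply(degrees, function(d) {
  mean(sapply(1:4, function(k) {
    fit <- lm(mpg ~ poly(wt, d), data = train[folds != k, ])
    rmse(train$mpg[folds == k], predict(fit, newdata = train[folds == k, ]))
  }))
})
best_degree <- degrees[which.min(cv_rmse)]

# Train the final model on all training data, then test it on the held-out set
final_fit <- lm(mpg ~ poly(wt, best_degree), data = train)
test_rmse <- rmse(test$mpg, predict(final_fit, newdata = test))
```

The test set is only touched once, after the degree has been chosen, so `test_rmse` is not biased by the tuning step.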
Experience with programming in R is needed to follow and successfully complete this course. For example, students who followed a course in which R is heavily used, e.g. CSA-34306 Ecological Modelling and Data Analysis in R, will likely have sufficient background knowledge to participate in this course. We strongly urge students without prior experience with programming in R to learn programming in R before the start of the course, either by:
We advise students who are unsure about their level of R skills to go through the first two parts of the online book Hands-On Programming with R. If you understand most of the elements discussed in these two parts, your understanding of R programming is sufficient to participate in this course.
After successful completion of the course, students are expected to be able to:
The tutorials provided via this website aim to teach you the basic principles of reproducible data science, and provide practical training in the various parts of the data science workflow: data import, tidying, wrangling, exploration and visualization, modelling and communication. This is all done in the programming language R using the framework supplied by the set of tidyverse packages.
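To give a flavour of what these workflow parts look like with tidyverse packages, here is a small sketch. It assumes the `dplyr`, `tibble` and `ggplot2` packages are installed, and it uses the built-in `iris` data as a stand-in for a file you would normally import; the variable names (`flowers`, `petal_area`, `species_summary`) are hypothetical and not part of the course material.

```r
library(dplyr)
library(ggplot2)

# Import: the built-in iris data stand in for data read from a file
flowers <- tibble::as_tibble(iris)

# Wrangle: derive a new variable and summarise it per species
species_summary <- flowers %>%
  mutate(petal_area = Petal.Length * Petal.Width) %>%
  group_by(Species) %>%
  summarise(
    n               = n(),
    mean_petal_area = mean(petal_area),
    .groups         = "drop"
  )

# Explore and visualise: scatter plot of petal dimensions, coloured by species
p <- ggplot(flowers, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(x = "Petal length (cm)", y = "Petal width (cm)")

# Model: a simple linear regression of petal width on petal length
fit <- lm(Petal.Width ~ Petal.Length, data = flowers)
```

Printing `species_summary`, `p` or `summary(fit)` shows the table, the plot and the model output, respectively.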
The general outline of the course, in terms of tutorials and groupwork, is as follows:
The book R for Data Science by Hadley Wickham and Garrett Grolemund (available in print; ISBN: 978-1491910399, or freely available online at r4ds.had.co.nz) is used throughout the course, as well as a collection of supplied book chapters and journal articles on relevant course topics. Descriptions of the tutorials can be found on this website, but lecture slides and other course documents will be supplied via Brightspace.
In this course, we will heavily use the software packages R and RStudio, programmes that are widely used in the field of ecology. It is advised to use the latest versions of both software packages: these can be downloaded from the R and RStudio websites. In generating this website, we used R v4.4.3 and RStudio v2024.12.1-563; for a Windows system, you can download the corresponding versions of R and RStudio from here and here, respectively. On most computers, installing both R and RStudio with all default settings will work fine. Thus, if you use your own computer on which you have appropriate admin rights, you can install R first, and then RStudio, using the default settings. However, if you run into problems installing (or using) R when it is installed in the default way, follow the steps as outlined here.
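After installation, you may want to check which version of R you are running. The sketch below compares the installed version with the R version mentioned above (4.4.3) and checks whether the tidyverse package is available; the object names are hypothetical, and the check is a suggestion rather than a required course step.

```r
# Compare the installed R version with the version used to build this website
r_current  <- getRversion()
r_website  <- package_version("4.4.3")
up_to_date <- r_current >= r_website

message("Installed: ", R.version.string)
if (!up_to_date) {
  message("Consider updating R to version ", format(r_website), " or later.")
}

# Check whether the tidyverse package is available (this does not install anything)
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)
if (!has_tidyverse) {
  message("The tidyverse is not installed; run install.packages(\"tidyverse\").")
}
```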
Throughout these pages, both inline R code (e.g. x <- 3) and chunks of
R code with their output appear. An R code chunk is formatted in a box
with a light-blue background, like this:
print("Hello world!")
and its corresponding output is shown following 2 hash signs (##) in boxes with a light-green background like this:
## [1] "Hello world!"
Much of the content presented here is based on the contributions from Hadley Wickham and all other people who contributed to the tidyverse (and related) packages.
At the start of each tutorial, there are references to the book R for Data Science (R4DS). The numbers refer to the chapter numbers in the printed version of the book, whereas the URLs refer to the corresponding online versions.
We would like to thank Laurens Dijkhuis, Guillaume Blanchet and Sylvana Harmsen for helping us improve this course!