WUR | WEC | CSA | PPS | GRS

By: Henjo de Knegt, Patrick Jansen, Jasper Eikelboom, Danaë Rozendaal, Janneke Troost & Alejandro Morales

Advancements in technology and information processing are rapidly changing many fields of science. The availability of unprecedented amounts of data is unlocking new potential, but it also creates a major challenge: processing and analysing these data effectively. In the current data-centered digital era, which is driven by technological change, the volume of data will continue to skyrocket as the costs of data collection, storage and processing decrease.

This course covers the main elements of using a data science approach to solving problems in (agro-)ecology. The students will be guided through the main concepts and skills that are required to extract knowledge from data, and will get hands-on experience with the different steps of the data science workflow: from data management, cleaning and pre-processing; data exploration; feature engineering; selecting and training algorithms; optimizing hyperparameters; validating algorithms; testing predictions; to visualization and communication of results. The students will be trained to apply different machine learning techniques, and to critically evaluate their merits. During the course, students will acquire and expand data science skills that will prepare them for a quantitative MSc thesis, and that will benefit their future career in academia or business.

Assumed knowledge

Experience with programming in R is needed to follow and successfully complete this course. For example, students who followed a course in which R is heavily used, e.g. CSA-34306 Ecological Modelling and Data Analysis in R, will likely have sufficient background knowledge to participate in this course. We strongly urge students without prior experience with programming in R to learn programming in R before the start of the course, either by:

following the online Coursera course R programming: this course can be audited for free, and following the first 2 weeks of this course will suffice;
or studying the free online book Hands-On Programming with R, where parts 1 and 2 provide sufficient prerequisite knowledge.

We advise students who are unsure about their level of R skills to go through the first 2 parts of the online book Hands-On Programming with R. If most elements discussed in these first 2 parts are understood, then the understanding of R programming is sufficient to participate in this course.

Learning goals

After successful completion of the course, students are expected to be able to:

Explain important concepts in data science needed to solve typical ecological problems;
Explain how key features of ecological data influence the selection, training, validation and evaluation of algorithms;
Identify and select machine learning algorithms appropriate to specific ecological problems;
Create a reproducible workflow (loading raw data, data processing, feature engineering, and machine learning algorithms) to efficiently analyse ecological datasets;
Critically evaluate the reliability and adequacy of trained algorithms;
Create ecological insight from data using a data science approach;
Communicate the key elements and findings of a data science project clearly and concisely.

The tutorials provided via this website aim to teach you the basic principles of reproducible data science, and provide practical training in the various parts of the data science workflow: data import, tidying, wrangling, exploration and visualization, modelling and communication. This is all done in the programming language R using the framework supplied by the set of tidyverse packages.

Course outline

The general outline of the course, in terms of tutorials and group work, is as follows:

Week 1

Working with tidy data

First, we will practice using essential Tidyverse tools that enable powerful data wrangling and visualization within an efficient and reproducible workflow. Our focus will be on plotting, transforming, and joining data that is already in a tidy format.
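As a minimal sketch of what transforming, joining, and plotting tidy data can look like, the example below uses dplyr and ggplot2 on two small made-up tables. The tables `counts` and `sites` and all their column names are hypothetical and serve only as illustration; they are not part of the course data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tidy data: one row per observation, one column per variable
counts <- tibble(
  site    = c("A", "A", "B", "B", "C", "C"),
  species = c("deer", "boar", "deer", "boar", "deer", "boar"),
  n       = c(12, 3, 7, 9, 0, 4)
)
sites <- tibble(
  site    = c("A", "B", "C"),
  habitat = c("forest", "forest", "grassland")
)

# Join: add the habitat of each site to every observation
counts_habitat <- left_join(counts, sites, by = "site")

# Transform: total count per habitat and species
totals <- counts_habitat |>
  group_by(habitat, species) |>
  summarise(total = sum(n), .groups = "drop")

# Plot: total counts per habitat, with one bar per species
p <- ggplot(totals, aes(x = habitat, y = total, fill = species)) +
  geom_col(position = "dodge")
```

Printing `p` draws the bar chart. Because the data are tidy, each step reads as one verb applied to a whole table.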
Week 2

Tidying, modelling, and more

Next, we will reshape untidy data into a tidy format, practice additional programming tools, begin training algorithms, and focus on creating reproducible reports.
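To illustrate what reshaping untidy data into a tidy format can mean, here is a small sketch using `pivot_longer()` from tidyr. The table `heights_wide` and its values are made up for this example, and the chosen column names (`year`, `height_cm`) are assumptions rather than names used in the tutorials.

```r
library(tidyr)

# Hypothetical untidy data: one column per year, so the variable "year"
# is stored in the column names instead of in a column of its own
heights_wide <- tibble::tibble(
  plant  = c("p1", "p2"),
  `2022` = c(10.5, 8.0),
  `2023` = c(14.2, 11.3)
)

# Reshape to a tidy (long) format: one row per plant-year combination
heights_long <- pivot_longer(
  heights_wide,
  cols            = c(`2022`, `2023`),
  names_to        = "year",
  values_to       = "height_cm",
  names_transform = list(year = as.integer)
)
```

After reshaping, `heights_long` has one row per plant and year, with `year` stored as an integer column.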
Week 3

Time series analyses

After having practiced the essentials of a data science workflow (data wrangling and modelling), we will now shift our focus to a common form of ecological data: time series data with repeated measurements on individual plants or animals. We will thus engage in time series analyses to expand our skills further.
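A first step with repeated measurements on individuals is often to compute the change between consecutive observations per individual. The sketch below shows one way to do this with dplyr; the `heights` table, its columns, and its values are hypothetical, and this is not necessarily the approach taken in the tutorials.

```r
library(dplyr)

# Hypothetical repeated measurements: plant height on three survey days
heights <- tibble(
  plant     = c("p1", "p1", "p1", "p2", "p2", "p2"),
  day       = c(1, 8, 15, 1, 8, 15),
  height_cm = c(10, 13, 18, 8, 9, 12)
)

# Per plant: sort by time, then compute the change since the previous survey
growth <- heights |>
  group_by(plant) |>
  arrange(day, .by_group = TRUE) |>
  mutate(growth_cm = height_cm - lag(height_cm)) |>
  ungroup()
```

The first measurement of each plant has no previous observation, so its `growth_cm` is `NA`. Grouping by `plant` before calling `lag()` prevents values from one individual leaking into the next.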
Week 4

Image analyses

This week we will shift focus to another typical form of ecological data: image-based data, as used in, for example, wildlife monitoring, phenological studies or agronomic surveys. We will thus focus on the use of such data in (agro-)ecology and begin conducting image analyses.
Weeks 5 & 6

Data challenges

During the last part of the course, we will work in small groups on a selected data science challenge. Our main goal is to apply all the theoretical knowledge gained and skills practiced, and thus to generate value from data. Since data science is mostly a collaborative effort, we’ll be working in small teams.
17/4 2025

Final presentations

On the final day of the course, each team will present the value they generated in their data science challenge.

Course materials

The book R for Data Science by Hadley Wickham and Garrett Grolemund (available in print, ISBN: 978-1491910399, or freely available online at r4ds.had.co.nz) is used throughout the course, together with a collection of supplied book chapters and journal articles on relevant topics. Descriptions of the tutorials can be found on this website, but lecture slides and other course documents will be supplied via Brightspace.

Required software

In this course, we will make heavy use of the software packages R and RStudio, which are widely used in the field of ecology. We advise using the latest versions of both, which can be downloaded from the R and RStudio websites. In generating this website, we used R v4.4.3 and RStudio v2024.12.1-563; for a Windows system, you can download the corresponding versions of R and RStudio from here and here, respectively. On most computers, installing both R and RStudio with all default settings will work fine. Thus, if you use your own computer and have the appropriate admin rights, you can install R first, and then RStudio, using the default settings. However, if you run into problems installing (or using) R after a default installation, follow the steps as outlined here.

Code and output formatting

Throughout these pages, both inline R code (e.g. x <- 3) and chunks of R code with their output appear. R code chunks are shown in boxes with a light-blue background, like this:

print("Hello world!")

and its corresponding output is shown following 2 hash signs (##) in boxes with a light-green background like this:

## [1] "Hello world!"

Resources and acknowledgements

Much of the content presented here is based on the contributions from Hadley Wickham and all other people who contributed to the tidyverse (and related) packages.

At the start of each tutorial, there are references to the book R for Data Science (R4DS). The numbers refer to the chapter numbers in the printed version of the book, whereas the URLs point to the corresponding online versions.

We would like to thank Laurens Dijkhuis, Guillaume Blanchet and Sylvana Harmsen for helping us improve this course!

Data Science for Ecology
