By: Henjo de Knegt, Jasper Eikelboom, Patrick Jansen, Danaë Rozendaal & Janneke Troost

Advances in technology and information processing are rapidly changing many fields of science. The availability of unprecedented amounts of data unlocks new potential, but it also creates a major challenge: processing and analysing those data effectively. In the current data-centered digital era, the volume of data will continue to grow rapidly as the costs of data collection, storage and processing decrease.

This course covers the main elements of using a data science approach to solving problems in (agro-)ecology. Students will be guided through the main concepts and skills that are required to extract knowledge from data, and will get hands-on experience with the different steps of the data science workflow: data management, cleaning and pre-processing; data exploration; feature engineering; selecting and training algorithms; optimizing hyperparameters; validating algorithms; testing predictions; and visualization and communication of results. Students will be trained to apply different machine learning techniques and to critically evaluate their merits. During the course, students will acquire and expand data science skills that will prepare them for a quantitative MSc thesis and that will benefit their future career in academia or business.
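As a rough illustration of how several of these workflow steps fit together, the sketch below cleans a dataset, engineers one feature, trains a simple model and tests its predictions on held-out data. It is a minimal sketch and not part of the course materials: the built-in `airquality` dataset, the 70/30 split, the seed and the choice of a linear model are all assumptions made for illustration only.

```r
# Minimal workflow sketch in base R (illustrative assumptions throughout).
set.seed(42)  # fix the random split so the analysis is reproducible

# Cleaning: keep only records without missing values in the columns we use
air <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Feature engineering: log-transform the right-skewed response variable
air$log_ozone <- log(air$Ozone)

# Split the data into a training set (70%) and a test set (30%)
n <- nrow(air)
train_idx <- sample(n, size = round(0.7 * n))
train <- air[train_idx, ]
test <- air[-train_idx, ]

# Training: fit a linear regression on the training set only
fit <- lm(log_ozone ~ Solar.R + Wind + Temp, data = train)

# Testing: predict for the held-out records and compute the RMSE
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$log_ozone - pred)^2))
rmse
```

Keeping the test set out of the training step is what makes the final error estimate a fair indication of how the model performs on new data.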

Assumed knowledge

Experience with programming in R is needed to follow and successfully complete this course. Students who have followed a course in which R is heavily used, e.g. CSA-34306 Ecological Modelling and Data Analysis in R, will likely have sufficient background knowledge to participate. We strongly urge students without prior experience with programming in R to learn it before the start of the course.

We advise students who are unsure about their level of R skills to go through the first two parts of the online book Hands-On Programming with R. If you understand most of the elements discussed in these two parts, your understanding of R programming is sufficient to participate in this course.

Learning goals

After successful completion of the course, students are expected to be able to:

  1. Explain important concepts in data science needed to solve typical ecological problems;
  2. Explain how key features of ecological data influence the selection, training, validation and evaluation of algorithms;
  3. Identify and select machine learning algorithms appropriate to specific ecological problems;
  4. Create a reproducible workflow (loading raw data, data processing, feature engineering, and machine learning algorithms) to efficiently analyse ecological datasets;
  5. Critically evaluate the reliability and adequacy of trained algorithms;
  6. Create ecological insight from data using a data science approach;
  7. Communicate the key elements and findings of a data science project clearly and concisely.

The tutorials provided via this website aim to teach you the basic principles of reproducible data science, and provide practical training in the various parts of the data science workflow: data import, tidying, wrangling, exploration and visualization, modelling and communication. This is all done in the programming language R using the framework supplied by the set of tidyverse packages.
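To give a feel for what these parts of the workflow look like with tidyverse packages, here is a small sketch. It is not taken from the tutorials: the built-in `iris` dataset stands in for imported raw data, and the derived column `petal_area` and the object names are hypothetical choices made for this example.

```r
# Minimal tidyverse-style sketch (dataset and names are illustrative).
library(dplyr)
library(ggplot2)

# Import: a built-in dataset stands in for reading a raw data file
flowers <- as_tibble(iris)

# Tidy and wrangle: drop incomplete records and derive a new variable
flowers_clean <- flowers %>%
  filter(!is.na(Petal.Length), !is.na(Petal.Width)) %>%
  mutate(petal_area = Petal.Length * Petal.Width)  # crude size proxy

# Explore: summarise the data per species
species_summary <- flowers_clean %>%
  group_by(Species) %>%
  summarise(n = n(), mean_petal_area = mean(petal_area), .groups = "drop")

# Visualise: relate petal length to sepal length, coloured by species
petal_plot <- ggplot(flowers_clean,
                     aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# Model: a simple linear model as a starting point
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = flowers_clean)
```

Because every step is written as code, rerunning the script from the raw data reproduces the summary table, the plot and the model.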

Format

Throughout these pages, you will see both inline R code (e.g. x <- 3) and chunks of R code with their output. An R code chunk is formatted in a box with a light-blue background, like this:

print("Hello world!")

and its corresponding output is shown following 2 hash signs (##) in boxes with a light-green background like this:

## [1] "Hello world!"

Course materials

The book R for Data Science by Hadley Wickham and Garrett Grolemund (available in print, ISBN: 978-1491910399, or freely available online at r4ds.had.co.nz) is used throughout the course, together with a collection of supplied book chapters and journal articles on relevant topics. Descriptions of the tutorials can be found on this website; lecture slides and other course documents will be supplied via Brightspace.

Resources and acknowledgements

Much of the content presented here is based on the contributions from Hadley Wickham and all other people who contributed to the tidyverse (and related) packages.

At the start of each tutorial, there are references to the book R for Data Science (R4DS). The numbers refer to the chapter numbers in the printed version of the book, whereas the URLs point to the corresponding chapters of the online version.

We would like to thank Guillaume Blanchet and Sylvana Harmsen for their constructive feedback that helped to improve this course!