By: Henjo de Knegt, Jasper Eikelboom, Patrick Jansen, Danaë Rozendaal & Janneke Troost
Advancements in technology and information processing are rapidly changing many fields of science. The availability of unprecedented amounts of data is unlocking potential; however, it also creates a major challenge: processing and analysing these data effectively. In the current data-centered digital era, which is driven by technological change, the volume of data will continue to skyrocket as the costs of data collection, storage and processing decrease.
This course covers the main elements of using a data science approach to solving problems in (agro-)ecology. Students will be guided through the main concepts and skills that are required to extract knowledge from data, and will get hands-on experience with the different steps of the data science workflow: from data management, cleaning and pre-processing; data exploration; feature engineering; selecting and training algorithms; optimizing hyperparameters; validating algorithms; testing predictions; to visualization and communication of results. Students will be trained to apply different machine learning techniques and to critically evaluate their merits. During the course, students will acquire and expand data science skills that will prepare them for a quantitative MSc thesis and benefit their future career in academia or business.
Experience with programming in R is needed to follow and successfully complete this course. For example, students who followed a course in which R is heavily used, e.g. CSA-34306 Ecological Modelling and Data Analysis in R, will likely have sufficient background knowledge to participate in this course. We strongly urge students without prior experience with programming in R to learn programming in R before the start of the course, either by:
We advise students who are unsure about their level of R skills to go through the first two parts of the online book Hands-On Programming with R. If you understand most elements discussed in these first two parts, then your understanding of R programming is sufficient to participate in this course.
After successful completion of the course, students are expected to be able to:
The tutorials provided via this website aim to teach you the basic principles of reproducible data science, and provide practical training in the various parts of the data science workflow: data import, tidying, wrangling, exploration and visualization, modelling and communication. This is all done in the programming language R using the framework supplied by the set of tidyverse packages.
Throughout these pages, both inline R code (e.g. x <- 3) and chunks of R code with their output appear. An R code chunk is formatted in a box with a light-blue background, like this:
print("Hello world!")
and its corresponding output is shown following 2 hash signs (##) in boxes with a light-green background like this:
## [1] "Hello world!"
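As a slightly larger illustration of the tidyverse style used in the tutorials, the chunk below is a minimal sketch of a few workflow steps (tidying, wrangling, summarising and visualizing). It assumes that the dplyr and ggplot2 packages are installed, and it uses the built-in iris dataset purely as a stand-in; neither the dataset nor the names petal_area, petal_summary and petal_plot come from the course material.

```r
# Minimal sketch of a tidyverse-style workflow (illustrative only).
# Assumptions: dplyr and ggplot2 are installed; 'iris' is used as stand-in data.
library(dplyr)
library(ggplot2)

petal_summary <- iris %>%
  as_tibble() %>%                                         # convert to a tibble
  filter(!is.na(Petal.Length), !is.na(Petal.Width)) %>%   # tidy: keep complete records
  mutate(petal_area = Petal.Length * Petal.Width) %>%     # wrangle: derive a new feature
  group_by(Species) %>%                                   # explore: summarise per species
  summarise(
    n = n(),
    mean_petal_area = mean(petal_area),
    .groups = "drop"
  )

# Visualize: one bar per species showing the mean derived feature
petal_plot <- ggplot(petal_summary, aes(x = Species, y = mean_petal_area)) +
  geom_col() +
  labs(x = "Species", y = "Mean petal area (length x width)")

print(petal_summary)
```

Printing petal_summary shows one row per species; calling print(petal_plot) would draw the corresponding bar chart.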
The book R for Data Science by Hadley Wickham and Garrett Grolemund (available in print, ISBN: 978-1491910399, or freely available online at r4ds.had.co.nz) is used throughout the course, together with a collection of supplied book chapters and journal articles on relevant topics. Descriptions of the tutorials can be found on this website, but lecture slides and other course documents will be supplied via Brightspace.
Much of the content presented here is based on the contributions from Hadley Wickham and all other people who contributed to the tidyverse (and related) packages.
At the start of each tutorial, there are references to the book R for Data Science (R4DS). The numbers refer to the chapter numbers in the printed version of the book, whereas the URLs point to the corresponding online versions.
We would like to thank Guillaume Blanchet and Sylvana Harmsen for their constructive feedback that helped to improve this course!