Introduction to R and RStudio

Chapter 13 Project management

13.1 File management

The Guide to Reproducible Code in Ecology and Evolution published by the British Ecological Society (BES) states:

The fundamental idea behind a robust, reproducible analysis is a clean, repeatable script-based workflow (i.e. the sequence of tasks from the start to the end of a project) that links raw data through to clean data and to final analysis outputs. Most analyses will be re-run many times before they are finished (and perhaps a number of times more throughout the review process), so the smoother and more automated the workflow, the easier, faster and more robust the process of repeating it will be.

A project often consists of a multitude of files, from input data, documentation and scripts to output files, tables, figures and reports. It is thus best to think about a good file system organisation, and about informative, consistent naming of the materials associated with your analysis, before you start any project. The BES guide lists a few principles of a good analysis workflow:

  • Start your analysis from copies of your raw data.
  • Any cleaning, merging, transforming, etc. of data should be done in scripts, not manually.
  • Split your workflow (scripts) into logical thematic units. For example, you might separate your code into scripts that (i) load, merge and clean data, (ii) analyse data, and (iii) produce outputs like figures and tables.
  • Eliminate code duplication by packaging up useful code into custom functions. Make sure to comment your functions thoroughly, explaining their expected inputs and outputs, and what they are doing and why.
  • Document your code and data as comments in your scripts or by producing separate documentation.
  • Any intermediary outputs generated by your workflow should be kept separate from raw data.
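As a minimal sketch of the principle of packaging repeated code into a commented custom function: the function name cm_to_m and the example values below are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: convert lengths from centimetres to metres.
#
# Input:  x - numeric vector of lengths in cm (NA values are allowed)
# Output: numeric vector of the same length as x, in m
# Why:    the same conversion is needed in several scripts, so it is
#         defined once here instead of being copied around.
cm_to_m <- function(x) {
  stopifnot(is.numeric(x))  # fail early on unexpected input
  x / 100
}

# Made-up example values
lengths_cm <- c(150, 20, NA)
lengths_m <- cm_to_m(lengths_cm)
lengths_m
```

A function like this could live in its own script and be loaded with source() by the scripts that need it.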

It is best to keep all files associated with a particular project in a single root directory: one folder for one project! RStudio’s “R projects” offer a great way to keep everything together in a self-contained and portable manner (i.e., the project can be moved from computer to computer), so that internal paths to data and other scripts remain valid even when the project is shared or moved. There is no single best way to organise a file system. The key is to make sure that the structure of directories and the location of files are consistent and informative, and that they work for you. The BES gives a good example of a basic project directory structure:

In this project directory structure:

  • The data folder contains all input data (and metadata) used in the analysis.
  • The doc folder contains the manuscript.
  • The figs directory contains figures generated by the analysis.
  • The output folder contains any type of intermediate or output files (e.g., simulation outputs, models, processed datasets, etc.). You might separate this and also have a cleaned-data folder.
  • The scripts directory contains R scripts with function definitions.
  • The reports folder contains files (e.g. RMarkdown) that document the analysis or report on results.
  • The scripts that actually do things are stored in the root directory, but if your project has many scripts, you might want to organise them in a directory of their own.
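A directory structure like the one listed above could be set up from R itself. The following is only a sketch: the project name my_project is hypothetical, and the project is created inside a temporary directory so that running the example does not touch any real files.

```r
# Hypothetical project root inside a temporary directory (assumption for
# this sketch; in practice this would be your real project folder).
project_root <- file.path(tempdir(), "my_project")

# Folder names taken from the list above
folders <- c("data", "doc", "figs", "output", "scripts", "reports")

for (folder in folders) {
  # recursive = TRUE also creates project_root itself if it is missing
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Show the folders directly below the project root
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```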

Never ever touch (edit) raw data! Store raw data separately and permanently in a (sub-)folder, e.g. in data/raw/. Process (e.g., clean, filter, select, change) the raw data using scripts, and optionally save processed data in a separate sub-folder, e.g. in data/processed/.
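The raw-to-processed workflow might look roughly like this. The file names, column names and values are made up for illustration, and a temporary directory stands in for the project root.

```r
# Hypothetical project root in a temporary directory (assumption)
project_root <- file.path(tempdir(), "raw_data_demo")
raw_dir <- file.path(project_root, "data", "raw")
processed_dir <- file.path(project_root, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw data file (made-up values, for illustration only)
raw_file <- file.path(raw_dir, "observations.csv")
write.csv(data.frame(id = 1:4, mass = c(10.2, NA, 8.7, 11.5)),
          raw_file, row.names = FALSE)

# Read the raw data; the raw file itself is never edited
raw <- read.csv(raw_file)

# Clean in code: drop records with a missing mass
clean <- raw[!is.na(raw$mass), ]

# Save the processed data in a separate sub-folder
processed_file <- file.path(processed_dir, "observations_clean.csv")
write.csv(clean, processed_file, row.names = FALSE)
```

Because every cleaning step is recorded in the script, the processed file can be regenerated from the untouched raw file at any time.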

For more information on file management and related matters (e.g., on tips related to informative and consistent file naming), see the BES Guide to Reproducible Code in Ecology and Evolution.

13.2 RStudio projects

It is thus good to keep all files in a project (e.g. input data, scripts, analytical results, figures) together in a structured way. This is such a wise and common practice that RStudio has built-in support for it via projects (see RStudio's documentation on projects).

To make a project in RStudio, go to File > New Project and choose Existing Directory. Browse to the root of the file directory structure that you just created and click Create Project. A new R session starts, and the Files tab in the bottom-right panel now shows the contents of the newly created project folder. You can then create a new R script using File > New File > R Script, or by clicking the New File icon directly below the File menu and selecting R Script.

Once you have created your project, you only have to double-click the generated .Rproj file (or use File > Open Project) to open the project in RStudio and continue working. When you click on the .Rproj file in the bottom-right Files panel, a pop-up will appear with the settings specific to the project. By default, RStudio’s global settings are inherited, but you can change them for the specific project.

13.3 Working directory

An advantage of defining and using an R project is that RStudio will automatically set the root folder of your project directory structure as the working directory. Thus, when you load files from disk (or save files to disk), you can quickly and conveniently use paths relative to this root folder, for example:

Where                       Example
in the working directory    “observations.csv”
in a sub-folder             “data/raw/observations.csv”

Note the convention of using forward slashes, unlike the Windows-specific convention of using backslashes. Forward slashes make references to files platform-independent.
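One way to build such relative paths without typing the separators by hand is base R's file.path(), which joins its arguments with forward slashes. A small sketch, reusing the example file name from the table above:

```r
# Build a relative path from its components
obs_path <- file.path("data", "raw", "observations.csv")
obs_path
# "data/raw/observations.csv"

# The path is relative to the working directory, so it can be passed
# directly to functions such as read.csv() once the file exists there.
```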

You can get, and set, the working directory using the functions getwd() and setwd(), respectively.
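The short sketch below only demonstrates how the two functions behave. It uses tempdir() purely as a directory that is guaranteed to exist, and it restores the original working directory at the end.

```r
# getwd() returns the absolute path of the current working directory
original_wd <- getwd()

# setwd() changes the working directory and invisibly returns the
# previous one, which makes it easy to switch back later
previous_wd <- setwd(tempdir())

# Restore the original working directory
setwd(original_wd)
getwd()
```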

Set the working directory for a project at most once, and set it to the root directory of your project. Regularly changing the working directory (e.g., to load files) is bad practice.

If you have to set the working directory using the setwd() function, think carefully about whether you do it in a script or directly in the console. Consider that other people with whom you might be collaborating will not have the same directory tree (path) as you, and neither will you when you work on a different computer in the future.

Therefore, when manually setting the working directory, do so by using the Session > Set Working Directory pull-down option or by typing the appropriate setwd() command in the console, not by hard-coding setwd() in a script.