1  Editorial and technical notes

In this page we describe the philosophical approach, style, and specific editorial decisions made during the creation of this handbook.

1.1 Approach and style

The potential audience for this book is large. It will surely be used by people very new to R, and also by experienced R users looking for best practices and tips. So it must be both accessible and succinct. Therefore, our approach was to provide just enough text explanation that someone very new to R can apply the code and follow what the code is doing.

A few other points:

  • This is a code reference book accompanied by relatively brief examples - not a thorough textbook on R or data science
  • This is a R handbook for use within applied epidemiology - not a manual on the methods or science of applied epidemiology
  • This is intended to be a living document - optimal R packages for a given task change often and we welcome discussion about which to emphasize in this handbook

R packages

So many choices

One of the most challenging aspects of learning R is knowing which R package to use for a given task. It is a common occurrence to struggle through a task only later to realize - hey, there’s an R package that does all that in one command line!

In this handbook, we try to offer you at least two ways to complete each task: one tried-and-true method (probably in base R or tidyverse) and one special R package that is custom-built for that purpose. We want you to have a couple options in case you can’t download a given package or it otherwise does not work for you.

In choosing which packages to use, we prioritized R packages and approaches that have been tested and vetted by the community, minimize the number of packages used in a typical work session, that are stable (not changing very often), and that accomplish the task simply and cleanly

This handbook generally prioritizes R packages and functions from the tidyverse. Tidyverse is a collection of R packages designed for data science that share underlying grammar and data structures. All tidyverse packages can be installed or loaded via the tidyverse package. Read more at the tidyverse website.

When applicable, we also offer code options using base R - the packages and functions that come with R at installation. This is because we recognize that some of this book’s audience may not have reliable internet to download extra packages.

Linking functions to packages explicitly

It is often frustrating in R tutorials when a function is shown in code, but you don’t know which package it is from! We try to avoid this situation.

In the narrative text, package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). We strive to be explicit about which package a function comes from, either by referencing the package in nearby text or by specifying the package explicitly in the code like this: dplyr::mutate(). It may look redundant, but we are doing it on purpose.

See the page on R basics to learn more about packages and functions.

Code style

In the handbook, we frequently utilize “new lines”, making our code appear “long”. We do this for a few reasons:

  • We can write explanatory comments with # that are adjacent to each little part of the code
  • Generally, longer (vertical) code is easier to read
  • It is easier to read on a narrow screen (no sideways scrolling needed)
  • From the indentations, it can be easier to know which arguments belong to which function

As a result, code that could be written like this:

linelist %>% 
  group_by(hospital) %>%  # group rows by hospital
  slice_max(date, n = 1, with_ties = F) # if there's a tie (of date), take the first row

…is written like this:

linelist %>% 
  group_by(hospital) %>% # group rows by hospital
  slice_max(
    date,                # keep row per group with maximum date value 
    n = 1,               # keep only the single highest row 
    with_ties = F)       # if there's a tie (of date), take the first row

R code is generally not affected by new lines or indentations. When writing code, if you initiate a new line after a comma it will apply automatic indentation patterns.

We also use lots of spaces (e.g. n = 1 instead of n=1) because it is easier to read. Be kind to the people reading your code!

Nomenclature

In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.

Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). So these aspects can be more difficult to tangibly define.

In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.

Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.

Notes

Here are the types of notes you may encounter in the handbook:

NOTE: This is a note
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.

1.2 Editorial decisions

Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool for consideration, please join/start a conversation on our Github page.

Table of package, function, and other editorial decisions

Subject Considered Outcome Brief rationale
General coding approach tidyverse, data.table, base tidyverse, with a page on data.table, and mentions of base alternatives for readers with no internet tidyverse readability, universality, most-taught
Package loading library(),install.packages(), require(), pacman pacman Shortens and simplifies code for most multi-package install/load use-cases
Import and export rio, many other packages rio Ease for many file types
Grouping for summary statistics dplyr group_by(), stats aggregate() dplyr group_by() Consistent with tidyverse emphasis
Pivoting tidyr (pivot functions), reshape2 (melt/cast), tidyr (spread/gather) tidyr (pivot functions) reshape2 is retired, tidyr uses pivot functions as of v1.0.0
Clean column names linelist, janitor janitor Consolidation of packages emphasized
Epiweeks lubridate, aweek, tsibble, zoo lubridate generally, the others for specific cases lubridate’s flexibility, consistency, package maintenance prospects
ggplot labels labs(), ggtitle()/ylab()/xlab() labs() all labels in one place, simplicity
Convert to factor factor(), forcats forcats its various functions also convert to factor in same command
Epidemic curves incidence, ggplot2, EpiCurve incidence2 as quick, ggplot2 as detailed dependability
Concatenation paste(), paste0(), str_glue(), glue() str_glue() More simple syntax than paste functions; within stringr

1.3 Major revisions

Date Major changes
10 May 2021 Release of version 1.0.0
20 Nov 2022 Release of version 1.0.1

NEWS With version 1.0.1 the following changes have been implemented:

  • Update to R version 4.2
  • Data cleaning: switched {linelist} to {matchmaker}, removed unnecessary line from case_when() example
  • Dates: switched {linelist} guess_date() to {parsedate} parse_date()
  • Pivoting: slight update to pivot_wider() id_cols=
  • Survey analysis: switched plot_age_pyramid() to age_pyramid(), slight change to alluvial plot code
  • Heat plots: added ungroup() to agg_weeks chunk
  • Interactive plots: added ungroup() to chunk that makes agg_weeks so that expand() works as intended
  • Time series: added data.frame() around objects within all trending::fit() and predict() commands
  • Combinations analysis: Switch case_when() to ifelse() and added optional across() code for preparing the data
  • Transmission chains: Update to more recent version of {epicontacts}

1.4 Session info (R, RStudio, packages)

Below is the information on the versions of R, RStudio, and R packages used during this rendering of the Handbook.

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       Europe/Stockholm
 date     2024-06-19
 pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [2] CRAN (R 4.3.2)
 digest        0.6.35  2024-03-11 [1] CRAN (R 4.3.3)
 evaluate      0.23    2023-11-01 [2] CRAN (R 4.3.2)
 fastmap       1.1.1   2023-02-24 [2] CRAN (R 4.3.2)
 htmltools     0.5.8   2024-03-25 [1] CRAN (R 4.3.3)
 htmlwidgets   1.6.4   2023-12-06 [2] CRAN (R 4.3.2)
 jsonlite      1.8.8   2023-12-04 [2] CRAN (R 4.3.2)
 knitr         1.45    2023-10-30 [2] CRAN (R 4.3.2)
 rlang         1.1.3   2024-01-10 [2] CRAN (R 4.3.2)
 rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.3.3)
 rstudioapi    0.15.0  2023-07-07 [2] CRAN (R 4.3.2)
 sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.3.2)
 xfun          0.43    2024-03-25 [1] CRAN (R 4.3.3)

 [1] C:/Users/ngulu864/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.2/library

──────────────────────────────────────────────────────────────────────────────