1 Editorial and technical notes
In this page we describe the philosophical approach, style, and specific editorial decisions made during the creation of this handbook.
1.1 Approach and style
The potential audience for this book is large. It will surely be used by people very new to R, and also by experienced R users looking for best practices and tips. So it must be both accessible and succinct. Therefore, our approach was to provide just enough text explanation that someone very new to R can apply the code and follow what the code is doing.
A few other points:
- This is a code reference book accompanied by relatively brief examples - not a thorough textbook on R or data science
- This is an R handbook for use within applied epidemiology - not a manual on the methods or science of applied epidemiology
- This is intended to be a living document - optimal R packages for a given task change often and we welcome discussion about which to emphasize in this handbook
R packages
So many choices
One of the most challenging aspects of learning R is knowing which R package to use for a given task. It is a common occurrence to struggle through a task only later to realize - hey, there’s an R package that does all that in one command line!
In this handbook, we try to offer you at least two ways to complete each task: one tried-and-true method (probably in base R or tidyverse) and one special R package that is custom-built for that purpose. We want you to have a couple options in case you can’t download a given package or it otherwise does not work for you.
In choosing which packages to use, we prioritized R packages and approaches that have been tested and vetted by the community, that minimize the number of packages used in a typical work session, that are stable (not changing very often), and that accomplish the task simply and cleanly.
This handbook generally prioritizes R packages and functions from the tidyverse. Tidyverse is a collection of R packages designed for data science that share underlying grammar and data structures. All tidyverse packages can be installed or loaded via the tidyverse package. Read more at the tidyverse website.
When applicable, we also offer code options using base R - the packages and functions that come with R at installation. This is because we recognize that some of this book’s audience may not have reliable internet to download extra packages.
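As a minimal sketch of what "two ways" can look like, the example below performs the same small task (keep adult cases and add a derived column) once in base R and once with **dplyr**. The data frame `linelist_demo` and its columns are hypothetical and invented for illustration; the dplyr part runs only if that package happens to be installed.

```r
# Hypothetical toy data, invented for this illustration
linelist_demo <- data.frame(
  case_id  = c("a1", "b2", "c3", "d4"),
  hospital = c("Central", "Central", "Port", "Port"),
  age      = c(34, 7, 61, 15)
)

# Base R: works on any R installation, no extra packages needed
adults_base <- linelist_demo[linelist_demo$age >= 18, ]  # keep rows with age 18 or over
adults_base$age_months <- adults_base$age * 12           # add a derived column

# tidyverse (dplyr): same task, attempted only if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  adults_tidy <- dplyr::mutate(
    dplyr::filter(linelist_demo, age >= 18),  # keep rows with age 18 or over
    age_months = age * 12                     # add a derived column
  )
}
```

Both approaches keep the same two rows; which one reads better is largely a matter of preference and of which packages you can install.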
Linking functions to packages explicitly
It is often frustrating in R tutorials when a function is shown in code, but you don’t know which package it is from! We try to avoid this situation.
In the narrative text, package names are written in bold (e.g. **dplyr**) and functions are written like this: `mutate()`. We strive to be explicit about which package a function comes from, either by referencing the package in nearby text or by specifying the package explicitly in the code like this: `dplyr::mutate()`. It may look redundant, but we are doing it on purpose.
See the page on R basics to learn more about packages and functions.
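The `package::function()` convention described above can be sketched as follows. This assumes **dplyr** is installed; the data frame `toy` and its columns are hypothetical.

```r
# Hypothetical toy data, invented for this illustration
toy <- data.frame(weight_kg = c(60, 82), height_m = c(1.65, 1.80))

# Explicit namespace: it is unambiguous that mutate() comes from dplyr,
# and the call works even if library(dplyr) has not been run
toy_bmi <- dplyr::mutate(toy, bmi = weight_kg / height_m^2)

# The same syntax works for packages that ship with R.
# It also resolves name clashes: filter() exists in both stats and dplyr.
median_bmi <- stats::median(toy_bmi$bmi)
```

Writing the package name costs a few characters but spares the reader from searching for where a function is defined.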
Code style
In the handbook, we frequently utilize “new lines”, making our code appear “long”. We do this for a few reasons:
- We can write explanatory comments with # that are adjacent to each little part of the code
- Generally, longer (vertical) code is easier to read
- It is easier to read on a narrow screen (no sideways scrolling needed)
- From the indentations, it can be easier to know which arguments belong to which function
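As a result, code that could be written like this:
linelist %>%
  group_by(hospital) %>%                  # group rows by hospital
  slice_max(date, n = 1, with_ties = F)   # if there's a tie (of date), take the first row
…is written like this: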
linelist %>%
  group_by(hospital) %>%   # group rows by hospital
  slice_max(
    date,                  # keep row per group with maximum date value
    n = 1,                 # keep only the single highest row
    with_ties = F)         # if there's a tie (of date), take the first row
R code is generally not affected by new lines or indentations. When writing code, if you start a new line after a comma, your editor (for example RStudio) will typically apply automatic indentation patterns.
We also use lots of spaces (e.g. n = 1 instead of n=1) because it is easier to read. Be kind to the people reading your code!
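To check for yourself that the layout of the code does not change the result, here is a small runnable sketch. It assumes **dplyr** (version 1.0.0 or later, which introduced `slice_max()`) is installed, and it uses a hypothetical five-row `linelist` invented for illustration rather than the handbook's dataset.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0 is installed

# Hypothetical toy linelist, invented for this illustration
linelist <- data.frame(
  case_id  = 1:5,
  hospital = c("Central", "Central", "Port", "Port", "Port"),
  date     = as.Date(c("2021-03-01", "2021-03-09",
                       "2021-02-11", "2021-04-02", "2021-04-02"))
)

# Compact layout: everything on one line
compact <- linelist %>% group_by(hospital) %>% slice_max(date, n = 1, with_ties = FALSE)

# Vertical layout: one step or argument per line, with room for comments
vertical <- linelist %>%
  group_by(hospital) %>%   # group rows by hospital
  slice_max(
    date,                  # rank rows within each group by date
    n = 1,                 # keep only the single highest row
    with_ties = FALSE)     # return one row per group even if dates tie

# Compare the two results as plain data frames
same_result <- identical(as.data.frame(compact), as.data.frame(vertical))
```

Both objects hold one row per hospital; only the readability of the code differs.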
Nomenclature
In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.
Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). These aspects can therefore be harder to define tangibly.
In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However, some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.
Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.
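The difference between a "wide" dataset and a "tidy" one can be sketched with a small example. This assumes **tidyr** (version 1.0.0 or later) is installed; the data frame `wide` and its column names are hypothetical.

```r
# Hypothetical "wide" data: the variable "cases" is split across one column per day
wide <- data.frame(
  hospital = c("Central", "Port"),
  day_1    = c(3, 0),
  day_2    = c(5, 2)
)

# Reshape so that each column is a variable and each row is an observation
long <- tidyr::pivot_longer(
  wide,
  cols      = c(day_1, day_2),   # columns that hold the same variable
  names_to  = "day",             # former column names become values of "day"
  values_to = "cases")           # former cell values are gathered into "cases"
```

In `long`, each row is one hospital-day observation and the columns are `hospital`, `day`, and `cases`.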
Notes
Here are the types of notes you may encounter in the handbook:
NOTE: This is a note
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.
1.2 Editorial decisions
Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool for consideration, please join/start a conversation on our GitHub page.
Table of package, function, and other editorial decisions
Subject | Considered | Outcome | Brief rationale |
---|---|---|---|
General coding approach | tidyverse, data.table, base | tidyverse, with a page on data.table, and mentions of base alternatives for readers with no internet | tidyverse readability, universality, most-taught |
Package loading | library(), install.packages(), require(), pacman | pacman | Shortens and simplifies code for most multi-package install/load use-cases |
Import and export | rio, many other packages | rio | Ease for many file types |
Grouping for summary statistics | dplyr group_by(), stats aggregate() | dplyr group_by() | Consistent with tidyverse emphasis |
Pivoting | tidyr (pivot functions), reshape2 (melt/cast), tidyr (spread/gather) | tidyr (pivot functions) | reshape2 is retired, tidyr uses pivot functions as of v1.0.0 |
Clean column names | linelist, janitor | janitor | Consolidation of packages emphasized |
Epiweeks | lubridate, aweek, tsibble, zoo | lubridate generally, the others for specific cases | lubridate’s flexibility, consistency, package maintenance prospects |
ggplot labels | labs(), ggtitle()/ylab()/xlab() | labs() | all labels in one place, simplicity |
Convert to factor | factor(), forcats | forcats | its various functions also convert to factor in same command |
Epidemic curves | incidence, ggplot2, EpiCurve | incidence2 as quick, ggplot2 as detailed | dependability |
Concatenation | paste(), paste0(), str_glue(), glue() | str_glue() | More simple syntax than paste functions; within stringr |
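Two of the decisions in the table can be illustrated side by side with the alternatives that were considered. This is only a sketch: it assumes **dplyr** and **stringr** are installed, and the data frame `cases` is hypothetical.

```r
library(dplyr)  # assumption: dplyr and stringr are installed

# Hypothetical toy data, invented for this illustration
cases <- data.frame(
  hospital = c("Central", "Central", "Port"),
  age      = c(30, 50, 20)
)

# Grouping for summary statistics: the dplyr approach...
by_dplyr <- cases %>%
  group_by(hospital) %>%              # one group per hospital
  summarise(mean_age = mean(age))     # one summary row per group

# ...and the stats::aggregate() alternative
by_stats <- aggregate(age ~ hospital, data = cases, FUN = mean)

# Concatenation: str_glue() embeds values inside the string with {}...
n_cases  <- nrow(cases)
msg_glue <- stringr::str_glue("There are {n_cases} cases")

# ...while paste0() joins separate pieces
msg_paste <- paste0("There are ", n_cases, " cases")
```

Both pairs produce equivalent results; the chosen options mainly differ in how the code reads.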
1.3 Major revisions
Date | Major changes |
---|---|
10 May 2021 | Release of version 1.0.0 |
20 Nov 2022 | Release of version 1.0.1 |
NEWS With version 1.0.1 the following changes have been implemented:
- Update to R version 4.2
- Data cleaning: switched {linelist} to {matchmaker}, removed unnecessary line from case_when() example
- Dates: switched {linelist} guess_date() to {parsedate} parse_date()
- Pivoting: slight update to pivot_wider() id_cols=
- Survey analysis: switched plot_age_pyramid() to age_pyramid(), slight change to alluvial plot code
- Heat plots: added ungroup() to agg_weeks chunk
- Interactive plots: added ungroup() to chunk that makes agg_weeks so that expand() works as intended
- Time series: added data.frame() around objects within all trending::fit() and predict() commands
- Combinations analysis: Switch case_when() to ifelse() and added optional across() code for preparing the data
- Transmission chains: Update to more recent version of {epicontacts}
1.4 Session info (R, RStudio, packages)
Below is the information on the versions of R, RStudio, and R packages used during this rendering of the Handbook.
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31 ucrt)
os Windows 11 x64 (build 22621)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.utf8
ctype English_United States.utf8
tz Europe/Stockholm
date 2024-06-19
pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cli 3.6.2 2023-12-11 [2] CRAN (R 4.3.2)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.3)
evaluate 0.23 2023-11-01 [2] CRAN (R 4.3.2)
fastmap 1.1.1 2023-02-24 [2] CRAN (R 4.3.2)
htmltools 0.5.8 2024-03-25 [1] CRAN (R 4.3.3)
htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.3.2)
jsonlite 1.8.8 2023-12-04 [2] CRAN (R 4.3.2)
knitr 1.45 2023-10-30 [2] CRAN (R 4.3.2)
rlang 1.1.3 2024-01-10 [2] CRAN (R 4.3.2)
rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.3)
rstudioapi 0.15.0 2023-07-07 [2] CRAN (R 4.3.2)
sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.3.2)
xfun 0.43 2024-03-25 [1] CRAN (R 4.3.3)
[1] C:/Users/ngulu864/AppData/Local/R/win-library/4.3
[2] C:/Program Files/R/R-4.3.2/library
──────────────────────────────────────────────────────────────────────────────
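If the **sessioninfo** package is not available, similar information can be collected with functions that come with R. The following is a minimal sketch using base R and the bundled **utils** package; the object names are arbitrary.

```r
# R version as a single string
r_version <- R.version.string

# Version of one specific package (here "stats", which ships with R)
stats_version <- utils::packageVersion("stats")

# Full overview of the R version, platform, locale, and attached packages
session_overview <- utils::sessionInfo()
print(session_overview)
```

Including this kind of output when you share code or report a problem helps others reproduce your setup.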