35 Diagrams and charts
This page covers code to produce:
- Flow diagrams using DiagrammeR and the DOT language
- Alluvial/Sankey diagrams
- Event timelines
35.1 Preparation
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
pacman::p_load(
  rio,            # for importing data with import()
  DiagrammeR,     # for flow diagrams
  networkD3,      # for alluvial/Sankey diagrams
  tidyverse)      # data management and visualization
Import data
Most of the content in this page does not require a dataset. However, in the Sankey diagram section, we will use the case linelist from a simulated Ebola epidemic. If you want to follow along for this part, download the “clean” linelist (as an .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
# import the linelist
linelist <- import("linelist_cleaned.rds")
The first 50 rows of the linelist are displayed below.
35.2 Flow diagrams
One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.
Tools
The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.
Basic structure
- Open the instructions
grViz("
- Specify directionality and name of the graph, and open brackets, e.g.
digraph my_flow_chart {
- Graph statement (layout, rank direction)
- Nodes statements (create nodes)
- Edges statements (gives links between nodes)
- Close the instructions
}")
Simple examples
Below are two simple examples
A very minimal example:
# A minimal plot
DiagrammeR::grViz("digraph {
graph[layout = dot, rankdir = LR]
a
b
c
a -> b -> c
}")
An example with perhaps a bit more applied public health context:
grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes
#######
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
Primary # names of nodes
Secondary
Tertiary
# edges
#######
Primary -> Secondary [label = ' case transfer']
Secondary -> Tertiary [label = ' case transfer']
}
")
Syntax
Basic syntax
Node names, or edge statements, can be separated with spaces, semicolons, or newlines.
Rank direction
A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.
Node names
Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (' '). It may be easier to have a short node name, and assign a label, as shown below within brackets [ ]. If you want to have a newline within the node’s name, you must do it via a label - use \n in the node label within single quotes, as shown below.
Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
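The quoting, label, and subgroup rules above can be combined in one small sketch. The node names (clinic, lab, 'District office', moh) are hypothetical and chosen only for illustration. The DOT instructions are stored in an R string first, and the diagram is rendered only if DiagrammeR is installed.

```r
# DOT instructions held in an R string (all node names are hypothetical)
spec <- "
digraph syntax_demo {
  graph [layout = dot, rankdir = LR]

  # short node names, separated by a semicolon on one line
  clinic; lab

  # multi-word node name, in single quotes
  'District office'

  # short node name with a two-line label
  moh [label = 'Ministry\nof Health']

  # subgroup shorthand: both nodes in the braces link to the office
  {clinic lab} -> 'District office' -> moh
}
"

# render only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(spec)
}
```

Writing the edge as {clinic lab} -> 'District office' is equivalent to writing two separate edge statements, one from each node.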
Layouts
- dot (set rankdir to either TB, LR, RL, or BT)
- neato
- twopi
- circo
Nodes - editable attributes
- label (text, in single quotes if multi-word)
- fillcolor (many possible colors)
- fontcolor
- alpha (transparency 0-1)
- shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
- style
- sides
- peripheries
- fixedsize (h x w)
- height
- width
- distortion
- penwidth (width of shape border)
- x (displacement left/right)
- y (displacement up/down)
- fontname
- fontsize
- icon
Edges - editable attributes
- arrowsize
- arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
- arrowtail
- dir (direction)
- style (dashed, …)
- color
- alpha
- headport (where on the head node the edge attaches)
- tailport (where on the tail node the edge attaches)
- fontname
- fontsize
- fontcolor
- penwidth (width of arrow)
- minlen (minimum length)
Color names: hexadecimal values or ‘X11’ color names, see here for X11 details
Complex examples
The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling:
DiagrammeR::grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB, # layout top-to-bottom
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3]
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
# edges
#######
Primary -> Secondary [label = ' case transfer',
fontcolor = red,
color = red]
Secondary -> Tertiary [label = ' case transfer',
fontcolor = red,
color = red]
# grouped edge
{Primary Secondary Tertiary} -> SC [label = 'case reporting',
fontcolor = darkgreen,
color = darkgreen,
style = dashed]
}
")
Sub-graph clusters
To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have each subgraph identified within a bounding box, begin the name of the subgraph with “cluster”, as shown with the 4 boxes below.
DiagrammeR::grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
subgraph cluster_passive {
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
}
# nodes (boxes)
###############
node [shape = box, # node shape
fontname = Helvetica] # text font in node
subgraph cluster_active {
Active [label = 'Active\nSurveillance']
HCF_active [label = 'HCF\nActive Search']
}
subgraph cluster_EBD {
EBS [label = 'Event-Based\nSurveillance (EBS)']
'Social Media'
Radio
}
subgraph cluster_CBS {
CBS [label = 'Community-Based\nSurveillance (CBS)']
RECOs
}
# edges
#######
{Primary Secondary Tertiary} -> SC [label = 'case reporting']
Primary -> Secondary [label = 'case transfer',
fontcolor = red]
Secondary -> Tertiary [label = 'case transfer',
fontcolor = red]
HCF_active -> Active
{'Social Media' Radio} -> EBS
RECOs -> CBS
}
")
Node shapes
The example below, borrowed from this tutorial, shows applied node shapes and a shorthand for serial edge connections
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = LR]
# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]
data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label = 'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']
# edge definitions with the node IDs
{data1 data2} -> process -> statistical -> results
}")
Outputs
How to handle and save outputs
- Outputs will appear in RStudio’s Viewer pane, by default in the lower-right alongside Files, Plots, Packages, and Help.
- To export you can “Save as image” or “Copy to clipboard” from the Viewer. The graphic will adjust to the specified size.
Parameterized figures
Here is a quote from this tutorial: https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/
“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this we, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is the a unique numeric index.”
We encourage you to review this tutorial if parameterization is something you are interested in.
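As a minimal sketch of the @@X substitution described in the quote, the example below places two placeholders in node labels and defines them in footer lines that start with [1]: and [2]:. The counts n_raw and n_complete are hypothetical stand-ins for values you would compute from your own data, and the node names are illustrative.

```r
# hypothetical counts remaining after each filtering step
n_raw      <- 500
n_complete <- 430

# DOT instructions with @@ placeholders and matching footer definitions
spec <- "
digraph filtering {
  graph [layout = dot, rankdir = TB]
  node [shape = rectangle]

  raw      [label = '@@1']
  complete [label = '@@2']

  raw -> complete
}

[1]: paste0('All records (n = ', n_raw, ')')
[2]: paste0('Complete records (n = ', n_complete, ')')
"

# render only if DiagrammeR is available; the footer expressions
# are evaluated as R code when the diagram is built
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(spec)
}
```

Because the footer expressions are ordinary R code, re-running the chunk after the data change updates the labels without editing the diagram itself.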
35.3 Alluvial/Sankey Diagrams
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
We load the networkD3 package to produce the diagram, and also tidyverse for the data preparation steps.
pacman::p_load(
  networkD3,
  tidyverse)
Plotting from dataset
Plotting the connections in a dataset. Below we demonstrate using this package on the case linelist. Here is an online tutorial.
We begin by getting the case counts for each unique age category and hospital combination. We’ve removed values with missing age category for clarity. We also re-label the hospital and age_cat columns as source and target respectively. These will be the two sides of the alluvial diagram.
# counts by hospital and age category
links <- linelist %>%
  drop_na(age_cat) %>%
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>%
  rename(source = hospital,
         target = age_cat)
The dataset now looks like this:
Now we create a data frame of all the diagram nodes, in a single column called name. This consists of all the values for hospital and age_cat. Note that we ensure they are all class Character before combining them. In the next step, we adjust the ID columns to be numbers instead of labels.
# The unique node names
nodes <- data.frame(
  name = c(as.character(links$source), as.character(links$target)) %>%
    unique()
  )
nodes  # print
name
1 Central Hospital
2 Military Hospital
3 Missing
4 Other
5 Port Hospital
6 St. Mark's Maternity Hospital (SMMH)
7 0-4
8 5-9
9 10-14
10 15-19
11 20-29
12 30-49
13 50-69
14 70+
Then we edit the links data frame, which we created above with count(). We add two numeric columns, IDsource and IDtarget, which will actually reflect/create the links between the nodes. These columns will hold the row numbers (positions) of the source and target nodes. 1 is subtracted so that these position numbers begin at 0 (not 1).
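To see what the matching step produces, here is a toy sketch in base R. The data frames nodes_demo and links_demo and their values are hypothetical and are not taken from the linelist. match() returns the 1-based row position of each label in the node table, and subtracting 1 gives the zero-based indices that sankeyNetwork() expects.

```r
# toy node table and links (hypothetical values, not the linelist)
nodes_demo <- data.frame(name = c("Hospital A", "Hospital B", "0-4", "5-9"))

links_demo <- data.frame(
  source = c("Hospital A", "Hospital B", "Hospital B"),
  target = c("0-4", "0-4", "5-9"),
  n      = c(12, 7, 9)
)

# match() gives 1-based row positions in nodes_demo; subtract 1 for 0-based
links_demo$IDsource <- match(links_demo$source, nodes_demo$name) - 1
links_demo$IDtarget <- match(links_demo$target, nodes_demo$name) - 1

links_demo
```

In this toy case "Hospital A" is the first row of nodes_demo, so its IDsource is 0, and "5-9" is the fourth row, so its IDtarget is 3.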
# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
The links dataset now looks like this:
Now plot the Sankey diagram with sankeyNetwork(). You can read more about each argument by running ?sankeyNetwork in the console. Note that unless you set iterations = 0 the order of your nodes may not be as expected.
# plot
######
p <- sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "IDsource",
  Target = "IDtarget",
  Value = "n",
  NodeID = "name",
  units = "cases",       # unit label shown with the values (counts of cases)
  fontSize = 12,
  nodeWidth = 30,
  iterations = 0)        # ensure node order is as in data
p
Here is an example where the patient Outcome is included as well. Note in the data preparation step we have to calculate the counts of cases between age and hospital, and separately between hospital and outcome - and then bind all these counts together with bind_rows().
# counts by age category and hospital
age_hosp_links <- linelist %>%
  drop_na(age_cat) %>%
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>%
  rename(source = age_cat,          # re-name
         target = hospital)
# counts by hospital and outcome
hosp_out_links <- linelist %>%
  drop_na(age_cat) %>%
  select(hospital, outcome) %>%
  count(hospital, outcome) %>%
  rename(source = hospital,         # re-name
         target = outcome)
# combine links
links <- bind_rows(age_hosp_links, hosp_out_links)
# The unique node names
nodes <- data.frame(
  name = c(as.character(links$source), as.character(links$target)) %>%
    unique()
  )
# Create id numbers
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",   # unit label shown with the values (counts of cases)
                   fontSize = 12,
                   nodeWidth = 30,
                   iterations = 0)    # ensure node order is as in data
p
https://www.displayr.com/sankey-diagrams-r/
35.4 Event timelines
To make a timeline showing specific events, you can use the vistime package.
See this vignette
# load package
pacman::p_load(vistime,  # make the timeline
               plotly    # for interactive visualization
               )
Here is the events dataset we begin with:
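The events table itself is not reproduced here. As a hypothetical stand-in so the code that follows can be run, the sketch below builds a small data frame named data using vistime's default column names (event, group, start, end). The event names, groups, and dates are invented for illustration only.

```r
# hypothetical events; column names follow the vistime defaults
data <- data.frame(
  event = c("First case reported", "Outbreak declared", "Vaccination campaign"),
  group = c("Surveillance", "Response", "Response"),
  start = as.Date(c("2020-01-22", "2020-02-03", "2020-02-15")),
  end   = as.Date(c("2020-01-22", "2020-02-03", "2020-03-10"))
)

data
```

Rows where start and end are the same date represent single-day events, while rows with a later end date represent events that span a period.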
p <- vistime(data)    # apply vistime
library(plotly)
# step 1: transform into a list
pp <- plotly_build(p)
# step 2: Marker size
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "markers") pp$x$data[[i]]$marker$size <- 10
}
# step 3: text size
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textfont$size <- 10
}
# step 4: text position
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textposition <- "right"
}
#print
pp
35.5 DAGs
You can build a DAG manually using the DiagrammeR package and DOT language as described above.
Alternatively, there are packages like ggdag and dagitty.
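As a minimal sketch of the manual approach, a small DAG can be written in DOT using the same syntax as the flow diagrams earlier on this page. The three nodes (exposure, outcome, confounder) are hypothetical placeholders, and the diagram is rendered only if DiagrammeR is installed.

```r
# DOT instructions for a three-node DAG (node names are hypothetical)
dag_spec <- "
digraph dag {
  graph [layout = dot, rankdir = LR]
  node [shape = plaintext]

  exposure; outcome; confounder

  # the confounder affects both exposure and outcome
  confounder -> {exposure outcome}

  # causal path of interest
  exposure -> outcome
}
"

# render only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(dag_spec)
}
```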
35.6 Resources
Much of the above regarding the DOT language is adapted from the tutorial at this site
Another more in-depth tutorial on DiagrammeR
This page on Sankey diagrams