28 GIS basics
28.1 Overview
Spatial aspects of your data can provide a lot of insights into the situation of the outbreak, and to answer questions such as:
- Where are the current disease hotspots?
- How have the hotspots have changed over time?
- How is the access to health facilities? Are any improvements needed?
The current focus of this GIS page to address the needs of applied epidemiologists in outbreak response. We will explore basic spatial data visualization methods using tmap and ggplot2 packages. We will also walk through some of the basic spatial data management and querying methods with the sf package. Lastly, we will briefly touch upon concepts of spatial statistics such as spatial relationships, spatial autocorrelation, and spatial regression using the spdep package.
28.2 Key terms
Below we introduce some key terminology. For a thorough introduction to GIS and spatial analysis, we suggest that you review one of the longer tutorials or courses listed in the References section.
Geographic Information System (GIS) - A GIS is a framework or environment for gathering, managing, analyzing, and visualizing spatial data.
GIS software
Some popular GIS software allow point-and-click interaction for map development and spatial analysis. These tools comes with advantages such as not needing to learn code and the ease of manually selecting and placing icons and features on a map. Here are two popular ones:
ArcGIS - A commercial GIS software developed by the company ESRI, which is very popular but quite expensive
QGIS - A free open-source GIS software that can do almost anything that ArcGIS can do. You can download QGIS here
Using R as a GIS can seem more intimidating at first because instead of “point-and-click”, it has a “command-line interface” (you must code to acquire the desired outcome). However, this is a major advantage if you need to repetitively produce maps or create an analysis that is reproducible.
Spatial data
The two primary forms of spatial data used in GIS are vector and raster data:
Vector Data - The most common format of spatial data used in GIS, vector data are comprised of geometric features of vertices and paths. Vector spatial data can be further divided into three widely-used types:
Points - A point consists of a coordinate pair (x,y) representing a specific location in a coordinate system. Points are the most basic form of spatial data, and may be used to denote a case (i.e. patient home) or a location (i.e. hospital) on a map.
Lines - A line is composed of two connected points. Lines have a length, and may be used to denote things like roads or rivers.
Polygons - A polygon is composed of at least three line segments connected by points. Polygon features have a length (i.e. the perimeter of the area) as well as an area measurement. Polygons may be used to note an area (i.e. a village) or a structure (i.e. the actual area of a hospital).
Raster Data - An alternative format for spatial data, raster data is a matrix of cells (e.g. pixels) with each cell containing information such as height, temperature, slope, forest cover, etc. These are often aerial photographs, satellite imagery, etc. Rasters can also be used as “base maps” below vector data.
Visualizing spatial data
To visually represent spatial data on a map, GIS software requires you to provide sufficient information about where different features should be, in relation to one another. If you are using vector data, which will be true for most use cases, this information will typically be stored in a shapefile:
Shapefiles - A shapefile is a common data format for storing “vector” spatial data consisting or lines, points, or polygons. A single shapefile is actually a collection of at least three files - .shp, .shx, and .dbf. All of these sub-component files must be present in a given directory (folder) for the shapefile to be readable. These associated files can be compressed into a ZIP folder to be sent via email or download from a website.
The shapefile will contain information about the features themselves, as well as where to locate them on the Earth’s surface. This is important because while the Earth is a globe, maps are typically two-dimensional; choices about how to “flatten” spatial data can have a big impact on the look and interpretation of the resulting map.
Coordinate Reference Systems (CRS) - A CRS is a coordinate-based system used to locate geographical features on the Earth’s surface. It has a few key components:
Coordinate System - There are many many different coordinate systems, so make sure you know which system your coordinates are from. Degrees of latitude/longitude are common, but you could also see UTM coordinates.
Units - Know what the units are for your coordinate system (e.g. decimal degrees, meters)
Datum - A particular modeled version of the Earth. These have been revised over the years, so ensure that your map layers are using the same datum.
Projection - A reference to the mathematical equation that was used to project the truly round earth onto a flat surface (map).
Remember that you can summarise spatial data without using the mapping tools shown below. Sometimes a simple table by geography (e.g. district, country, etc.) is all that is needed!
28.3 Getting started with GIS
There are a couple of key items you will need to have and to think about to make a map. These include:
A dataset – this can be in a spatial data format (such as shapefiles, as noted above) or it may not be in a spatial format (for instance just as a csv).
If your dataset is not in a spatial format you will also need a reference dataset. Reference data consists of the spatial representation of the data and the related attributes, which would include material containing the location and address information of specific features.
If you are working with pre-defined geographic boundaries (for example, administrative regions), reference shapefiles are often freely available to download from a government agency or data sharing organization. When in doubt, a good place to start is to Google “[regions] shapefile”
If you have address information, but no latitude and longitude, you may need to use a geocoding engine to get the spatial reference data for your records.
An idea about how you want to present the information in your datasets to your target audience. There are many different types of maps, and it is important to think about which type of map best fits your needs.
Types of maps for visualizing your data
Choropleth map - a type of thematic map where colors, shading, or patterns are used to represent geographic regions in relation to their value of an attribute. For instance a larger value could be indicated by a darker colour than a smaller value. This type of map is particularly useful when visualizing a variable and how it changes across defined regions or geopolitical areas.
Case density heatmap - a type of thematic map where colours are used to represent intensity of a value, however, it does not use defined regions or geopolitical boundaries to group data. This type of map is typically used for showing ‘hot spots’ or areas with a high density or concentration of points.
Dot density map - a thematic map type that uses dots to represent attribute values in your data. This type of map is best used to visualize the scatter of your data and visually scan for clusters.
Proportional symbols map (graduated symbols map) - a thematic map similar to a choropleth map, but instead of using colour to indicate the value of an attribute it uses a symbol (usually a circle) in relation to the value. For instance a larger value could be indicated by a larger symbol than a smaller value. This type of map is best used when you want to visualize the size or quantity of your data across geographic regions.
You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large red circles show health facility catchment areas of a certain radius, and the bright red case-dots those that were outside any catchment range:
Note: The primary focus of this GIS page is based on the context of field outbreak response. Therefore the contents of the page will cover the basic spatial data manipulations, visualizations, and analyses.
28.4 Preparation
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load()
from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library()
from base R. See the page on R basics for more information on R packages.
::p_load(
pacman# to import data
rio, # to locate files
here, # to clean, handle, and plot the data (includes ggplot2 package)
tidyverse, # to manage spatial data using a Simple Feature format
sf, # to produce simple maps, works for both interactive and static maps
tmap, # to clean column names
janitor, # to add OSM basemap in ggplot map
OpenStreetMap, # spatial statistics
spdep )
You can see an overview of all the R packages that deal with spatial data at the CRAN “Spatial Task View”.
Sample case data
For demonstration purposes, we will work with a random sample of 1000 cases from the simulated Ebola epidemic linelist
dataframe (computationally, working with fewer cases is easier to display in this handbook). If you want to follow along, click to download the “clean” linelist (as .rds file).
Since we are taking a random sample of the cases, your results may look slightly different from what is demonstrated here when you run the codes on your own.
Import data with the import()
function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
# import clean case linelist
<- import("linelist_cleaned.rds") linelist
Next we select a random sample of 1000 rows using sample()
from base R.
# generate 1000 random row numbers, from the number of rows in linelist
<- sample(nrow(linelist), 1000)
sample_rows
# subset linelist to keep only the sample rows, and all columns
<- linelist[sample_rows,] linelist
Now we want to convert this linelist
which is class dataframe, to an object of class “sf” (spatial features). Given that the linelist has two columns “lon” and “lat” representing the longitude and latitude of each case’s residence, this will be easy.
We use the package sf (spatial features) and its function st_as_sf()
to create the new object we call linelist_sf
. This new object looks essentially the same as the linelist, but the columns lon
and lat
have been designated as coordinate columns, and a coordinate reference system (CRS) has been assigned for when the points are displayed. 4326 identifies our coordinates as based on the World Geodetic System 1984 (WGS84) - which is standard for GPS coordinates.
# Create sf object
<- linelist %>%
linelist_sf ::st_as_sf(coords = c("lon", "lat"), crs = 4326) sf
This is how the original linelist
dataframe looks like. In this demonstration, we will only use the column date_onset
and geometry
(which was constructed from the longitude and latitude fields above and is the last column in the data frame).
::datatable(head(linelist_sf, 10), rownames = FALSE, options = list(pageLength = 5, scrollX=T), class = 'white-space: nowrap' ) DT
Admin boundary shapefiles
Sierra Leone: Admin boundary shapefiles
In advance, we have downloaded all administrative boundaries for Sierra Leone from the Humanitarian Data Exchange (HDX) website here. Alternatively, you can download these and all other example data for this handbook via our R package, as explained in the Download handbook and data page.
Now we are going to do the following to save the Admin Level 3 shapefile in R:
- Import the shapefile
- Clean the column names
- Filter rows to keep only areas of interest
To import a shapefile we use the read_sf()
function from sf. It is provided the filepath via here()
. - in our case the file is within our R project in the “data”, “gis”, and “shp” subfolders, with filename “sle_adm3.shp” (see pages on Import and export and R projects for more information). You will need to provide your own file path.
Next we use clean_names()
from the janitor package to standardize the column names of the shapefile. We also use filter()
to keep only the rows with admin2name of “Western Area Urban” or “Western Area Rural”.
# ADM3 level clean
<- sle_adm3_raw %>%
sle_adm3 ::clean_names() %>% # standardize column names
janitorfilter(admin2name %in% c("Western Area Urban", "Western Area Rural")) # filter to keep certain areas
Below you can see the how the shapefile looks after import and cleaning. Scroll to the right to see how there are columns with admin level 0 (country), admin level 1, admin level 2, and finally admin level 3. Each level has a character name and a unique identifier “pcode”. The pcode expands with each increasing admin level e.g. SL (Sierra Leone) -> SL04 (Western) -> SL0410 (Western Area Rural) -> SL040101 (Koya Rural).
Population data
Sierra Leone: Population by ADM3
These data can again be downloaded from HDX (link here) or via our epirhandbook R package as explained in this page. We use import()
to load the .csv file. We also pass the imported file to clean_names()
to standardize the column name syntax.
# Population by ADM3
<- import(here("data", "gis", "population", "sle_admpop_adm3_2020.csv")) %>%
sle_adm3_pop ::clean_names() janitor
Here is what the population file looks like. Scroll to the right to see how each jurisdiction has columns with male
population, female
populaton, total
population, and the population break-down in columns by age group.
Health Facilities
Sierra Leone: Health facility data from OpenStreetMap
Again we have downloaded the locations of health facilities from HDX here or via instructions in the Download handbook and data page.
We import the facility points shapefile with read_sf()
, again clean the column names, and then filter to keep only the points tagged as either “hospital”, “clinic”, or “doctors”.
# OSM health facility shapefile
<- sf::read_sf(here("data", "gis", "shp", "sle_hf.shp")) %>%
sle_hf ::clean_names() %>%
janitorfilter(amenity %in% c("hospital", "clinic", "doctors"))
Here is the resulting dataframe - scroll right to see the facility name and geometry
coordinates.
28.5 Plotting coordinates
The easiest way to plot X-Y coordinates (longitude/latitude, points), in this case of cases, is to draw them as points directly from the linelist_sf
object which we created in the preparation section.
The package tmap offers simple mapping capabilities for both static (“plot” mode) and interactive (“view” mode) with just a few lines of code. The tmap syntax is similar to that of ggplot2, such that commands are added to each other with +
. Read more detail in this vignette.
- Set the tmap mode. In this case we will use “plot” mode, which produces static outputs.
tmap_mode("plot") # choose either "view" or "plot"
Below, the points are plotted alone.tm_shape()
is provided with the linelist_sf
objects. We then add points via tm_dots()
, specifying the size and color. Because linelist_sf
is an sf object, we have already designated the two columns that contain the lat/long coordinates and the coordinate reference system (CRS):
# Just the cases (points)
tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')
Alone, the points do not tell us much. So we should also map the administrative boundaries:
Again we use tm_shape()
(see documentation) but instead of providing the case points shapefile, we provide the administrative boundary shapefile (polygons).
With the bbox =
argument (bbox stands for “bounding box”) we can specify the coordinate boundaries. First we show the map display without bbox
, and then with it.
# Just the administrative boundaries (polygons)
tm_shape(sle_adm3) + # admin boundaries shapefile
tm_polygons(col = "#F7F7F7")+ # show polygons in light grey
tm_borders(col = "#000000", # show borders with color and line weight
lwd = 2) +
tm_text("admin3name") # column text to display for each polygon
# Same as above, but with zoom from bounding box
tm_shape(sle_adm3,
bbox = c(-13.3, 8.43, # corner
-13.2, 8.5)) + # corner
tm_polygons(col = "#F7F7F7") +
tm_borders(col = "#000000", lwd = 2) +
tm_text("admin3name")
And now both points and polygons together:
# All together
tm_shape(sle_adm3, bbox = c(-13.3, 8.43, -13.2, 8.5)) + #
tm_polygons(col = "#F7F7F7") +
tm_borders(col = "#000000", lwd = 2) +
tm_text("admin3name")+
tm_shape(linelist_sf) +
tm_dots(size=0.08, col='blue', alpha = 0.5) +
tm_layout(title = "Distribution of Ebola cases") # give title to map
To read a good comparison of mapping options in R, see this blog post.
28.6 Spatial joins
You may be familiar with joining data from one dataset to another one. Several methods are discussed in the Joining data page of this handbook. A spatial join serves a similar purpose but leverages spatial relationships. Instead of relying on common values in columns to correctly match observations, you can utilize their spatial relationships, such as one feature being within another, or the nearest neighbor to another, or within a buffer of a certain radius from another, etc.
The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types in this reference.
Points in polygon
Spatial assign administrative units to cases
Here is an interesting conundrum: the case linelist does not contain any information about the administrative units of the cases. Although it is ideal to collect such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).
Below, we will spatially intersect our case locations (points) with the ADM3 boundaries (polygons):
- Begin with the linelist (points)
- Spatial join to the boundaries, setting the type of join at “st_intersects”
- Use
select()
to keep only certain of the new administrative boundary columns
<- linelist_sf %>%
linelist_adm
# join the administrative boundary file to the linelist, based on spatial intersection
::st_join(sle_adm3, join = st_intersects) sf
All the columns from sle_adms
have been added to the linelist! Each case now has columns detailing the administrative levels that it falls within. In this example, we only want to keep two of the new columns (admin level 3), so we select()
the old column names and just the two additional of interest:
<- linelist_sf %>%
linelist_adm
# join the administrative boundary file to the linelist, based on spatial intersection
::st_join(sle_adm3, join = st_intersects) %>%
sf
# Keep the old column names and two new admin ones of interest
select(names(linelist_sf), admin3name, admin3pcod)
Below, just for display purposes you can see the first ten cases and that their admin level 3 (ADM3) jurisdictions that have been attached, based on where the point spatially intersected with the polygon shapes.
# Now you will see the ADM3 names attached to each case
%>% select(case_id, admin3name, admin3pcod) linelist_adm
Simple feature collection with 1000 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -13.27122 ymin: 8.448447 xmax: -13.20684 ymax: 8.490397
Geodetic CRS: WGS 84
First 10 features:
case_id admin3name admin3pcod geometry
376 cbce06 West II SL040207 POINT (-13.23143 8.466892)
4115 ba5769 Mountain Rural SL040102 POINT (-13.21752 8.451579)
5685 58982e West II SL040207 POINT (-13.25187 8.4793)
5426 b4bc41 East II SL040204 POINT (-13.22543 8.486554)
4951 95c347 Mountain Rural SL040102 POINT (-13.24198 8.454378)
4252 4669b3 West III SL040208 POINT (-13.26526 8.471718)
3728 135bec West III SL040208 POINT (-13.26403 8.475445)
1852 fd9c82 West III SL040208 POINT (-13.26097 8.457779)
4322 7525ac East II SL040204 POINT (-13.22541 8.485722)
5227 5c033b East II SL040204 POINT (-13.21212 8.482889)
Now we can describe our cases by administrative unit - something we were not able to do before the spatial join!
# Make new dataframe containing counts of cases by administrative unit
<- linelist_adm %>% # begin with linelist with new admin cols
case_adm3 as_tibble() %>% # convert to tibble for better display
group_by(admin3pcod, admin3name) %>% # group by admin unit, both by name and pcode
summarise(cases = n()) %>% # summarize and count rows
arrange(desc(cases)) # arrange in descending order
case_adm3
# A tibble: 10 × 3
# Groups: admin3pcod [10]
admin3pcod admin3name cases
<chr> <chr> <int>
1 SL040102 Mountain Rural 252
2 SL040208 West III 247
3 SL040207 West II 182
4 SL040204 East II 120
5 SL040201 Central I 66
6 SL040206 West I 44
7 SL040203 East I 43
8 SL040205 East III 23
9 SL040202 Central II 22
10 <NA> <NA> 1
We can also create a bar plot of case counts by administrative unit.
In this example, we begin the ggplot()
with the linelist_adm
, so that we can apply factor functions like fct_infreq()
which orders the bars by frequency (see page on Factors for tips).
ggplot(
data = linelist_adm, # begin with linelist containing admin unit info
mapping = aes(
x = fct_rev(fct_infreq(admin3name))))+ # x-axis is admin units, ordered by frequency (reversed)
geom_bar()+ # create bars, height is number of rows
coord_flip()+ # flip X and Y axes for easier reading of adm units
theme_classic()+ # simplify background
labs( # titles and labels
x = "Admin level 3",
y = "Number of cases",
title = "Number of cases, by adminstative unit",
caption = "As determined by a spatial join, from 1000 randomly sampled cases from linelist"
)
Nearest neighbor
Finding the nearest health facility / catchment area
It might be useful to know where the health facilities are located in relation to the disease hot spots.
We can use the st_nearest_feature join method from the st_join()
function (sf package) to visualize the closest health facility to individual cases.
- We begin with the shapefile linelist
linelist_sf
- We spatially join with
sle_hf
, which is the locations of health facilities and clinics (points)
# Closest health facility to each case
<- linelist_sf %>% # begin with linelist shapefile
linelist_sf_hf st_join(sle_hf, join = st_nearest_feature) %>% # data from nearest clinic joined to case data
select(case_id, osm_id, name, amenity) %>% # keep columns of interest, including id, name, type, and geometry of healthcare facility
rename("nearest_clinic" = "name") # re-name for clarity
We can see below (first 50 rows) that the each case now has data on the nearest clinic/hospital
We can see that “Den Clinic” is the closest health facility for about ~30% of the cases.
# Count cases by health facility
<- linelist_sf_hf %>% # begin with linelist including nearest clinic data
hf_catchment as.data.frame() %>% # convert from shapefile to dataframe
count(nearest_clinic, # count rows by "name" (of clinic)
name = "case_n") %>% # assign new counts column as "case_n"
arrange(desc(case_n)) # arrange in descending order
# print to console hf_catchment
nearest_clinic case_n
1 Den Clinic 386
2 Shriners Hospitals for Children 336
3 GINER HALL COMMUNITY HOSPITAL 151
4 panasonic 51
5 Princess Christian Maternity Hospital 32
6 ARAB EGYPT CLINIC 20
7 MABELL HEALTH CENTER 14
8 <NA> 10
To visualize the results, we can use tmap - this time interactive mode for easier viewing
tmap_mode("view") # set tmap mode to interactive
# plot the cases and clinic points
tm_shape(linelist_sf_hf) + # plot cases
tm_dots(size=0.08, # cases colored by nearest clinic
col='nearest_clinic') +
tm_shape(sle_hf) + # plot clinic facilities in large black dots
tm_dots(size=0.3, col='black', alpha = 0.4) +
tm_text("name") + # overlay with name of facility
tm_view(set.view = c(-13.2284, 8.4699, 13), # adjust zoom (center coords, zoom)
set.zoom.limits = c(13,14))+
tm_layout(title = "Cases, colored by nearest clinic")
Buffers
We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.
Note: For more accurate distance calculations, it is better to re-project your sf object to the respective local map projection system such as UTM (Earth projected onto a planar surface). In this example, for simplicity we will stick to the World Geodetic System (WGS84) Geograhpic coordinate system (Earth represented in a spherical / round surface, therefore the units are in decimal degrees). We will use a general conversion of: 1 decimal degree = ~111km.
See more information about map projections and coordinate systems at this esri article. This blog talks about different types of map projection and how one can choose a suitable projection depending on the area of interest and the context of your map / analysis.
First, create a circular buffer with a radius of ~2.5km around each health facility. This is done with the function st_buffer()
from tmap. Because the unit of the map is in lat/long decimal degrees, that is how “0.02” is interpreted. If your map coordinate system is in meters, the number must be provided in meters.
<- sle_hf %>%
sle_hf_2k st_buffer(dist=0.02) # decimal degrees translating to approximately 2.5km
Below we plot the buffer zones themselves, with the :
tmap_mode("plot")
# Create circular buffers
tm_shape(sle_hf_2k) +
tm_borders(col = "black", lwd = 2)+
tm_shape(sle_hf) + # plot clinic facilities in large red dots
tm_dots(size=0.3, col='black')
**Second, we intersect these buffers with the cases (points) using st_join()
and the join type of st_intersects*. That is, the data from the buffers are joined to the points that they intersect with.
# Intersect the cases with the buffers
<- linelist_sf_hf %>%
linelist_sf_hf_2k st_join(sle_hf_2k, join = st_intersects, left = TRUE) %>%
filter(osm_id.x==osm_id.y | is.na(osm_id.y)) %>%
select(case_id, osm_id.x, nearest_clinic, amenity.x, osm_id.y)
Now we can count the results: nrow(linelist_sf_hf_2k[is.na(linelist_sf_hf_2k$osm_id.y),])
out of 1000 cases did not intersect with any buffer (that value is missing), and so live more than 30 mins walk from the nearest health facility.
# Cases which did not get intersected with any of the health facility buffers
%>%
linelist_sf_hf_2k filter(is.na(osm_id.y)) %>%
nrow()
[1] 1000
We can visualize the results such that cases that did not intersect with any buffer appear in red.
tmap_mode("view")
# First display the cases in points
tm_shape(linelist_sf_hf) +
tm_dots(size=0.08, col='nearest_clinic') +
# plot clinic facilities in large black dots
tm_shape(sle_hf) +
tm_dots(size=0.3, col='black')+
# Then overlay the health facility buffers in polylines
tm_shape(sle_hf_2k) +
tm_borders(col = "black", lwd = 2) +
# Highlight cases that are not part of any health facility buffers
# in red dots
tm_shape(linelist_sf_hf_2k %>% filter(is.na(osm_id.y))) +
tm_dots(size=0.1, col='red') +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))+
# add title
tm_layout(title = "Cases by clinic catchment area")