28 GIS basics

28.1 Overview

Spatial aspects of your data can provide a lot of insights into the situation of the outbreak, and to answer questions such as:

  • Where are the current disease hotspots?
  • How have the hotspots have changed over time?
  • How is the access to health facilities? Are any improvements needed?

The current focus of this GIS page to address the needs of applied epidemiologists in outbreak response. We will explore basic spatial data visualization methods using tmap and ggplot2 packages. We will also walk through some of the basic spatial data management and querying methods with the sf package. Lastly, we will briefly touch upon concepts of spatial statistics such as spatial relationships, spatial autocorrelation, and spatial regression using the spdep package.

28.2 Key terms

Below we introduce some key terminology. For a thorough introduction to GIS and spatial analysis, we suggest that you review one of the longer tutorials or courses listed in the References section.

Geographic Information System (GIS) - A GIS is a framework or environment for gathering, managing, analyzing, and visualizing spatial data.

GIS software

Some popular GIS software allow point-and-click interaction for map development and spatial analysis. These tools comes with advantages such as not needing to learn code and the ease of manually selecting and placing icons and features on a map. Here are two popular ones:

ArcGIS - A commercial GIS software developed by the company ESRI, which is very popular but quite expensive

QGIS - A free open-source GIS software that can do almost anything that ArcGIS can do. You can download QGIS here

Using R as a GIS can seem more intimidating at first because instead of “point-and-click”, it has a “command-line interface” (you must code to acquire the desired outcome). However, this is a major advantage if you need to repetitively produce maps or create an analysis that is reproducible.

Spatial data

The two primary forms of spatial data used in GIS are vector and raster data:

Vector Data - The most common format of spatial data used in GIS, vector data are comprised of geometric features of vertices and paths. Vector spatial data can be further divided into three widely-used types:

  • Points - A point consists of a coordinate pair (x,y) representing a specific location in a coordinate system. Points are the most basic form of spatial data, and may be used to denote a case (i.e. patient home) or a location (i.e. hospital) on a map.

  • Lines - A line is composed of two connected points. Lines have a length, and may be used to denote things like roads or rivers.

  • Polygons - A polygon is composed of at least three line segments connected by points. Polygon features have a length (i.e. the perimeter of the area) as well as an area measurement. Polygons may be used to note an area (i.e. a village) or a structure (i.e. the actual area of a hospital).

Raster Data - An alternative format for spatial data, raster data is a matrix of cells (e.g. pixels) with each cell containing information such as height, temperature, slope, forest cover, etc. These are often aerial photographs, satellite imagery, etc. Rasters can also be used as “base maps” below vector data.

Visualizing spatial data

To visually represent spatial data on a map, GIS software requires you to provide sufficient information about where different features should be, in relation to one another. If you are using vector data, which will be true for most use cases, this information will typically be stored in a shapefile:

Shapefiles - A shapefile is a common data format for storing “vector” spatial data consisting or lines, points, or polygons. A single shapefile is actually a collection of at least three files - .shp, .shx, and .dbf. All of these sub-component files must be present in a given directory (folder) for the shapefile to be readable. These associated files can be compressed into a ZIP folder to be sent via email or download from a website.

The shapefile will contain information about the features themselves, as well as where to locate them on the Earth’s surface. This is important because while the Earth is a globe, maps are typically two-dimensional; choices about how to “flatten” spatial data can have a big impact on the look and interpretation of the resulting map.

Coordinate Reference Systems (CRS) - A CRS is a coordinate-based system used to locate geographical features on the Earth’s surface. It has a few key components:

  • Coordinate System - There are many many different coordinate systems, so make sure you know which system your coordinates are from. Degrees of latitude/longitude are common, but you could also see UTM coordinates.

  • Units - Know what the units are for your coordinate system (e.g. decimal degrees, meters)

  • Datum - A particular modeled version of the Earth. These have been revised over the years, so ensure that your map layers are using the same datum.

  • Projection - A reference to the mathematical equation that was used to project the truly round earth onto a flat surface (map).

Remember that you can summarise spatial data without using the mapping tools shown below. Sometimes a simple table by geography (e.g. district, country, etc.) is all that is needed!

28.3 Getting started with GIS

There are a couple of key items you will need to have and to think about to make a map. These include:

  • A dataset – this can be in a spatial data format (such as shapefiles, as noted above) or it may not be in a spatial format (for instance just as a csv).

  • If your dataset is not in a spatial format you will also need a reference dataset. Reference data consists of the spatial representation of the data and the related attributes, which would include material containing the location and address information of specific features.

    • If you are working with pre-defined geographic boundaries (for example, administrative regions), reference shapefiles are often freely available to download from a government agency or data sharing organization. When in doubt, a good place to start is to Google “[regions] shapefile”

    • If you have address information, but no latitude and longitude, you may need to use a geocoding engine to get the spatial reference data for your records.

  • An idea about how you want to present the information in your datasets to your target audience. There are many different types of maps, and it is important to think about which type of map best fits your needs.

Types of maps for visualizing your data

Choropleth map - a type of thematic map where colors, shading, or patterns are used to represent geographic regions in relation to their value of an attribute. For instance a larger value could be indicated by a darker colour than a smaller value. This type of map is particularly useful when visualizing a variable and how it changes across defined regions or geopolitical areas.

Case density heatmap - a type of thematic map where colours are used to represent intensity of a value, however, it does not use defined regions or geopolitical boundaries to group data. This type of map is typically used for showing ‘hot spots’ or areas with a high density or concentration of points.

Dot density map - a thematic map type that uses dots to represent attribute values in your data. This type of map is best used to visualize the scatter of your data and visually scan for clusters.

Proportional symbols map (graduated symbols map) - a thematic map similar to a choropleth map, but instead of using colour to indicate the value of an attribute it uses a symbol (usually a circle) in relation to the value. For instance a larger value could be indicated by a larger symbol than a smaller value. This type of map is best used when you want to visualize the size or quantity of your data across geographic regions.

You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large red circles show health facility catchment areas of a certain radius, and the bright red case-dots those that were outside any catchment range:

Note: The primary focus of this GIS page is based on the context of field outbreak response. Therefore the contents of the page will cover the basic spatial data manipulations, visualizations, and analyses.

28.4 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,           # to import data
  here,          # to locate files
  tidyverse,     # to clean, handle, and plot the data (includes ggplot2 package)
  sf,            # to manage spatial data using a Simple Feature format
  tmap,          # to produce simple maps, works for both interactive and static maps
  janitor,       # to clean column names
  OpenStreetMap, # to add OSM basemap in ggplot map
  spdep          # spatial statistics
  ) 

You can see an overview of all the R packages that deal with spatial data at the CRAN “Spatial Task View”.

Sample case data

For demonstration purposes, we will work with a random sample of 1000 cases from the simulated Ebola epidemic linelist dataframe (computationally, working with fewer cases is easier to display in this handbook). If you want to follow along, click to download the “clean” linelist (as .rds file).

Since we are taking a random sample of the cases, your results may look slightly different from what is demonstrated here when you run the codes on your own.

Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import clean case linelist
linelist <- import("linelist_cleaned.rds")  

Next we select a random sample of 1000 rows using sample() from base R.

# generate 1000 random row numbers, from the number of rows in linelist
sample_rows <- sample(nrow(linelist), 1000)

# subset linelist to keep only the sample rows, and all columns
linelist <- linelist[sample_rows,]

Now we want to convert this linelist which is class dataframe, to an object of class “sf” (spatial features). Given that the linelist has two columns “lon” and “lat” representing the longitude and latitude of each case’s residence, this will be easy.

We use the package sf (spatial features) and its function st_as_sf() to create the new object we call linelist_sf. This new object looks essentially the same as the linelist, but the columns lon and lat have been designated as coordinate columns, and a coordinate reference system (CRS) has been assigned for when the points are displayed. 4326 identifies our coordinates as based on the World Geodetic System 1984 (WGS84) - which is standard for GPS coordinates.

# Create sf object
linelist_sf <- linelist %>%
     sf::st_as_sf(coords = c("lon", "lat"), crs = 4326)

This is how the original linelist dataframe looks like. In this demonstration, we will only use the column date_onset and geometry (which was constructed from the longitude and latitude fields above and is the last column in the data frame).

DT::datatable(head(linelist_sf, 10), rownames = FALSE, options = list(pageLength = 5, scrollX=T), class = 'white-space: nowrap' )

Admin boundary shapefiles

Sierra Leone: Admin boundary shapefiles

In advance, we have downloaded all administrative boundaries for Sierra Leone from the Humanitarian Data Exchange (HDX) website here. Alternatively, you can download these and all other example data for this handbook via our R package, as explained in the Download handbook and data page.

Now we are going to do the following to save the Admin Level 3 shapefile in R:

  1. Import the shapefile
  2. Clean the column names
  3. Filter rows to keep only areas of interest

To import a shapefile we use the read_sf() function from sf. It is provided the filepath via here(). - in our case the file is within our R project in the “data”, “gis”, and “shp” subfolders, with filename “sle_adm3.shp” (see pages on Import and export and R projects for more information). You will need to provide your own file path.

Next we use clean_names() from the janitor package to standardize the column names of the shapefile. We also use filter() to keep only the rows with admin2name of “Western Area Urban” or “Western Area Rural”.

# ADM3 level clean
sle_adm3 <- sle_adm3_raw %>%
  clean_names() %>% # standardize column names
  filter(admin2name %in% c("Western Area Urban", "Western Area Rural")) # filter to keep certain areas

Below you can see the how the shapefile looks after import and cleaning. Scroll to the right to see how there are columns with admin level 0 (country), admin level 1, admin level 2, and finally admin level 3. Each level has a character name and a unique identifier “pcode”. The pcode expands with each increasing admin level e.g. SL (Sierra Leone) -> SL04 (Western) -> SL0410 (Western Area Rural) -> SL040101 (Koya Rural).

Population data

Sierra Leone: Population by ADM3

These data can again be downloaded from HDX (link here) or via our epirhandbook R package as explained in this page. We use import() to load the .csv file. We also pass the imported file to clean_names() to standardize the column name syntax.

# Population by ADM3
sle_adm3_pop <- import(here("data", "gis", "population", "sle_admpop_adm3_2020.csv")) %>%
  clean_names()

Here is what the population file looks like. Scroll to the right to see how each jurisdiction has columns with male population, female populaton, total population, and the population break-down in columns by age group.

Health Facilities

Sierra Leone: Health facility data from OpenStreetMap

Again we have downloaded the locations of health facilities from HDX here or via instructions in the Download handbook and data page.

We import the facility points shapefile with read_sf(), again clean the column names, and then filter to keep only the points tagged as either “hospital”, “clinic”, or “doctors”.

# OSM health facility shapefile
sle_hf <- sf::read_sf(here("data", "gis", "shp", "sle_hf.shp")) %>% 
  clean_names() %>%
  filter(amenity %in% c("hospital", "clinic", "doctors"))

Here is the resulting dataframe - scroll right to see the facility name and geometry coordinates.

28.5 Plotting coordinates

The easiest way to plot X-Y coordinates (longitude/latitude, points), in this case of cases, is to draw them as points directly from the linelist_sf object which we created in the preparation section.

The package tmap offers simple mapping capabilities for both static (“plot” mode) and interactive (“view” mode) with just a few lines of code. The tmap syntax is similar to that of ggplot2, such that commands are added to each other with +. Read more detail in this vignette.

  1. Set the tmap mode. In this case we will use “plot” mode, which produces static outputs.
tmap_mode("plot") # choose either "view" or "plot"

Below, the points are plotted alone.tm_shape() is provided with the linelist_sf objects. We then add points via tm_dots(), specifying the size and color. Because linelist_sf is an sf object, we have already designated the two columns that contain the lat/long coordinates and the coordinate reference system (CRS):

# Just the cases (points)
tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')

Alone, the points do not tell us much. So we should also map the administrative boundaries:

Again we use tm_shape() (see documentation) but instead of providing the case points shapefile, we provide the administrative boundary shapefile (polygons).

With the bbox = argument (bbox stands for “bounding box”) we can specify the coordinate boundaries. First we show the map display without bbox, and then with it.

# Just the administrative boundaries (polygons)
tm_shape(sle_adm3) +               # admin boundaries shapefile
  tm_polygons(col = "#F7F7F7")+    # show polygons in light grey
  tm_borders(col = "#000000",      # show borders with color and line weight
             lwd = 2) +
  tm_text("admin3name")            # column text to display for each polygon


# Same as above, but with zoom from bounding box
tm_shape(sle_adm3,
         bbox = c(-13.3, 8.43,    # corner
                  -13.2, 8.5)) +  # corner
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")

And now both points and polygons together:

# All together
tm_shape(sle_adm3, bbox = c(-13.3, 8.43, -13.2, 8.5)) +     #
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")+
tm_shape(linelist_sf) +
  tm_dots(size=0.08, col='blue', alpha = 0.5) +
  tm_layout(title = "Distribution of Ebola cases")   # give title to map

To read a good comparison of mapping options in R, see this blog post.

28.6 Spatial joins

You may be familiar with joining data from one dataset to another one. Several methods are discussed in the Joining data page of this handbook. A spatial join serves a similar purpose but leverages spatial relationships. Instead of relying on common values in columns to correctly match observations, you can utilize their spatial relationships, such as one feature being within another, or the nearest neighbor to another, or within a buffer of a certain radius from another, etc.

The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types in this reference.

Points in polygon

Spatial assign administrative units to cases

Here is an interesting conundrum: the case linelist does not contain any information about the administrative units of the cases. Although it is ideal to collect such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).

Below, we will spatially intersect our case locations (points) with the ADM3 boundaries (polygons):

  1. Begin with the linelist (points)
  2. Spatial join to the boundaries, setting the type of join at “st_intersects”
  3. Use select() to keep only certain of the new administrative boundary columns
linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3, join = st_intersects)

All the columns from sle_adms have been added to the linelist! Each case now has columns detailing the administrative levels that it falls within. In this example, we only want to keep two of the new columns (admin level 3), so we select() the old column names and just the two additional of interest:

linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3, join = st_intersects) %>% 
  
  # Keep the old column names and two new admin ones of interest
  select(names(linelist_sf), admin3name, admin3pcod)

Below, just for display purposes you can see the first ten cases and that their admin level 3 (ADM3) jurisdictions that have been attached, based on where the point spatially intersected with the polygon shapes.

# Now you will see the ADM3 names attached to each case
linelist_adm %>% select(case_id, admin3name, admin3pcod)
## Simple feature collection with 1000 features and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -13.27276 ymin: 8.446206 xmax: -13.20673 ymax: 8.490944
## Geodetic CRS:  WGS 84
## First 10 features:
##      case_id     admin3name admin3pcod                   geometry
## 2201  ac8a1d        East II   SL040204 POINT (-13.21038 8.478539)
## 1528  c9e824 Mountain Rural   SL040102 POINT (-13.22131 8.480025)
## 1226  979ba4 Mountain Rural   SL040102 POINT (-13.21291 8.469926)
## 143   f70e4c       West III   SL040208 POINT (-13.25384 8.458798)
## 3028  929bf3       East III   SL040205 POINT (-13.20907 8.464941)
## 904   e972f8        West II   SL040207 POINT (-13.23298 8.462652)
## 2406  a8be18        East II   SL040204 POINT (-13.21953 8.483104)
## 3800  26d2a7         East I   SL040203 POINT (-13.21511 8.487789)
## 472   075855     Central II   SL040202  POINT (-13.23716 8.48732)
## 375   90dd40        West II   SL040207 POINT (-13.23506 8.462155)

Now we can describe our cases by administrative unit - something we were not able to do before the spatial join!

# Make new dataframe containing counts of cases by administrative unit
case_adm3 <- linelist_adm %>%          # begin with linelist with new admin cols
  as_tibble() %>%                      # convert to tibble for better display
  group_by(admin3pcod, admin3name) %>% # group by admin unit, both by name and pcode 
  summarise(cases = n()) %>%           # summarize and count rows
  arrange(desc(cases))                     # arrange in descending order

case_adm3
## # A tibble: 10 x 3
## # Groups:   admin3pcod [10]
##    admin3pcod admin3name     cases
##    <chr>      <chr>          <int>
##  1 SL040102   Mountain Rural   287
##  2 SL040208   West III         234
##  3 SL040207   West II          183
##  4 SL040204   East II          111
##  5 SL040203   East I            53
##  6 SL040201   Central I         47
##  7 SL040206   West I            39
##  8 SL040202   Central II        24
##  9 SL040205   East III          19
## 10 <NA>       <NA>               3

We can also create a bar plot of case counts by administrative unit.

In this example, we begin the ggplot() with the linelist_adm, so that we can apply factor functions like fct_infreq() which orders the bars by frequency (see page on Factors for tips).

ggplot(
    data = linelist_adm,                       # begin with linelist containing admin unit info
    mapping = aes(
      x = fct_rev(fct_infreq(admin3name))))+ # x-axis is admin units, ordered by frequency (reversed)
  geom_bar()+                                # create bars, height is number of rows
  coord_flip()+                              # flip X and Y axes for easier reading of adm units
  theme_classic()+                           # simplify background
  labs(                                      # titles and labels
    x = "Admin level 3",
    y = "Number of cases",
    title = "Number of cases, by adminstative unit",
    caption = "As determined by a spatial join, from 1000 randomly sampled cases from linelist"
  )

Nearest neighbor

Finding the nearest health facility / catchment area

It might be useful to know where the health facilities are located in relation to the disease hot spots.

We can use the st_nearest_feature join method from the st_join() function (sf package) to visualize the closest health facility to individual cases.

  1. We begin with the shapefile linelist linelist_sf
  2. We spatially join with sle_hf, which is the locations of health facilities and clinics (points)
# Closest health facility to each case
linelist_sf_hf <- linelist_sf %>%                  # begin with linelist shapefile  
  st_join(sle_hf, join = st_nearest_feature) %>%   # data from nearest clinic joined to case data 
  select(case_id, osm_id, name, amenity) %>%       # keep columns of interest, including id, name, type, and geometry of healthcare facility
  rename("nearest_clinic" = "name")                # re-name for clarity

We can see below (first 50 rows) that the each case now has data on the nearest clinic/hospital

We can see that “Den Clinic” is the closest health facility for about ~30% of the cases.

# Count cases by health facility
hf_catchment <- linelist_sf_hf %>%   # begin with linelist including nearest clinic data
  as.data.frame() %>%                # convert from shapefile to dataframe
  count(nearest_clinic,              # count rows by "name" (of clinic)
        name = "case_n") %>%         # assign new counts column as "case_n"
  arrange(desc(case_n))              # arrange in descending order

hf_catchment                         # print to console
##                          nearest_clinic case_n
## 1       Shriners Hospitals for Children    358
## 2                            Den Clinic    355
## 3         GINER HALL COMMUNITY HOSPITAL    151
## 4                             panasonic     43
## 5 Princess Christian Maternity Hospital     38
## 6                     ARAB EGYPT CLINIC     24
## 7                  MABELL HEALTH CENTER     17
## 8                                  <NA>     14

To visualize the results, we can use tmap - this time interactive mode for easier viewing

tmap_mode("view")   # set tmap mode to interactive  

# plot the cases and clinic points 
tm_shape(linelist_sf_hf) +            # plot cases
  tm_dots(size=0.08,                  # cases colored by nearest clinic
          col='nearest_clinic') +    
tm_shape(sle_hf) +                    # plot clinic facilities in large black dots
  tm_dots(size=0.3, col='black', alpha = 0.4) +      
  tm_text("name") +                   # overlay with name of facility
tm_view(set.view = c(-13.2284, 8.4699, 13), # adjust zoom (center coords, zoom)
        set.zoom.limits = c(13,14))+
tm_layout(title = "Cases, colored by nearest clinic")

Buffers

We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.

Note: For more accurate distance calculations, it is better to re-project your sf object to the respective local map projection system such as UTM (Earth projected onto a planar surface). In this example, for simplicity we will stick to the World Geodetic System (WGS84) Geograhpic coordinate system (Earth represented in a spherical / round surface, therefore the units are in decimal degrees). We will use a general conversion of: 1 decimal degree = ~111km.

See more information about map projections and coordinate systems at this esri article. This blog talks about different types of map projection and how one can choose a suitable projection depending on the area of interest and the context of your map / analysis.

First, create a circular buffer with a radius of ~2.5km around each health facility. This is done with the function st_buffer() from tmap. Because the unit of the map is in lat/long decimal degrees, that is how “0.02” is interpreted. If your map coordinate system is in meters, the number must be provided in meters.

sle_hf_2k <- sle_hf %>%
  st_buffer(dist=0.02)       # decimal degrees translating to approximately 2.5km 

Below we plot the buffer zones themselves, with the :

tmap_mode("plot")
# Create circular buffers
tm_shape(sle_hf_2k) +
  tm_borders(col = "black", lwd = 2)+
tm_shape(sle_hf) +                    # plot clinic facilities in large red dots
  tm_dots(size=0.3, col='black')      

**Second, we intersect these buffers with the cases (points) using st_join() and the join type of st_intersects*. That is, the data from the buffers are joined to the points that they intersect with.

# Intersect the cases with the buffers
linelist_sf_hf_2k <- linelist_sf_hf %>%
  st_join(sle_hf_2k, join = st_intersects, left = TRUE) %>%
  filter(osm_id.x==osm_id.y | is.na(osm_id.y)) %>%
  select(case_id, osm_id.x, nearest_clinic, amenity.x, osm_id.y)

Now we can count the results: nrow(linelist_sf_hf_2k[is.na(linelist_sf_hf_2k$osm_id.y),]) out of 1000 cases did not intersect with any buffer (that value is missing), and so live more than 30 mins walk from the nearest health facility.

# Cases which did not get intersected with any of the health facility buffers
linelist_sf_hf_2k %>% 
  filter(is.na(osm_id.y)) %>%
  nrow()
## [1] 1000

We can visualize the results such that cases that did not intersect with any buffer appear in red.

tmap_mode("view")

# First display the cases in points
tm_shape(linelist_sf_hf) +
  tm_dots(size=0.08, col='nearest_clinic') +

# plot clinic facilities in large black dots
tm_shape(sle_hf) +                    
  tm_dots(size=0.3, col='black')+   

# Then overlay the health facility buffers in polylines
tm_shape(sle_hf_2k) +
  tm_borders(col = "black", lwd = 2) +

# Highlight cases that are not part of any health facility buffers
# in red dots  
tm_shape(linelist_sf_hf_2k %>%  filter(is.na(osm_id.y))) +
  tm_dots(size=0.1, col='red') +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))+

# add title  
tm_layout(title = "Cases by clinic catchment area")