14  Joining data

Above: an animated example of a left join (image source)

This page describes ways to “join”, “match”, “link” “bind”, and otherwise combine data frames.

It is uncommon that your epidemiological analysis or workflow does not involve multiple sources of data, and the linkage of multiple datasets. Perhaps you need to connect laboratory data to patient clinical outcomes, or Google mobility data to infectious disease trends, or even a dataset at one stage of analysis to a transformed version of itself.

In this page we demonstrate code to:

14.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,            # import and export
  here,           # locate files 
  tidyverse,      # data management and visualisation
  RecordLinkage,  # probabilistic matches
  fastLink        # probabilistic matches
)

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

Example datasets

In the joining section below, we will use the following datasets:

  1. A “miniature” version of the case linelist, containing only the columns case_id, date_onset, and hospital, and only the first 10 rows
  2. A separate data frame named hosp_info, which contains more details about each hospital

In the section on probabilistic matching, we will use two different small datasets. The code to create those datasets is given in that section.

“Miniature” case linelist

Below is the the miniature case linelist, which contains only 10 rows and only columns case_id, date_onset, and hospital.

linelist_mini <- linelist %>%                 # start with original linelist
  select(case_id, date_onset, hospital) %>%   # select columns
  head(10)                                    # only take the first 10 rows

Hospital information data frame

Below is the code to create a separate data frame with additional information about seven hospitals (the catchment population, and the level of care available). Note that the name “Military Hospital” belongs to two different hospitals - one a primary level serving 10000 residents and the other a secondary level serving 50280 residents.

# Make the hospital information data frame
hosp_info = data.frame(
  hosp_name     = c("central hospital", "military", "military", "port", "St. Mark's", "ignace", "sisters"),
  catchment_pop = c(1950280, 40500, 10000, 50280, 12000, 5000, 4200),
  level         = c("Tertiary", "Secondary", "Primary", "Secondary", "Secondary", "Primary", "Primary")
)

Here is this data frame:

Pre-cleaning

Traditional joins (non-probabilistic) are case-sensitive and require exact character matches between values in the two data frames. To demonstrate some of the cleaning steps you might need to do before initiating a join, we will clean and align the linelist_mini and hosp_info datasets now.

Identify differences

We need the values of the hosp_name column in the hosp_info data frame to match the values of the hospital column in the linelist_mini data frame.

Here are the values in the linelist_mini data frame, printed with the base R function unique():

unique(linelist_mini$hospital)
[1] "Other"                               
[2] "Missing"                             
[3] "St. Mark's Maternity Hospital (SMMH)"
[4] "Port Hospital"                       
[5] "Military Hospital"                   

and here are the values in the hosp_info data frame:

unique(hosp_info$hosp_name)
[1] "central hospital" "military"         "port"             "St. Mark's"      
[5] "ignace"           "sisters"         

You can see that while some of the hospitals exist in both data frames, there are many differences in spelling.

Align values

We begin by cleaning the values in the hosp_info data frame. As explained in the Cleaning data and core functions page, we can re-code values with logical criteria using dplyr’s case_when() function. For the four hospitals that exist in both data frames we change the values to align with the values in linelist_mini. The other hospitals we leave the values as they are (TRUE ~ hosp_name).

CAUTION: Typically when cleaning one should create a new column (e.g. hosp_name_clean), but for ease of demonstration we show modification of the old column

hosp_info <- hosp_info %>% 
  mutate(
    hosp_name = case_when(
      # criteria                         # new value
      hosp_name == "military"          ~ "Military Hospital",
      hosp_name == "port"              ~ "Port Hospital",
      hosp_name == "St. Mark's"        ~ "St. Mark's Maternity Hospital (SMMH)",
      hosp_name == "central hospital"  ~ "Central Hospital",
      TRUE                             ~ hosp_name
      )
    )

The hospital names that appear in both data frames are aligned. There are two hospitals in hosp_info that are not present in linelist_mini - we will deal with these later, in the join.

unique(hosp_info$hosp_name)
[1] "Central Hospital"                    
[2] "Military Hospital"                   
[3] "Port Hospital"                       
[4] "St. Mark's Maternity Hospital (SMMH)"
[5] "ignace"                              
[6] "sisters"                             

Prior to a join, it is often easiest to convert a column to all lowercase or all uppercase. If you need to convert all values in a column to UPPER or lower case, use mutate() and wrap the column with one of these functions from stringr, as shown in the page on Characters and strings.

str_to_upper()
str_to_upper()
str_to_title()

14.2 dplyr joins

The dplyr package offers several different join functions. dplyr is included in the tidyverse package. These join functions are described below, with simple use cases.

Many thanks to https://github.com/gadenbuie for the informative gifs!

General syntax

The join commands can be run as standalone commands to join two data frames into a new object, or they can be used within a pipe chain (%>%) to merge one data frame into another as it is being cleaned or otherwise modified.

In the example below, the function left_join() is used as a standalone command to create the a new joined_data data frame. The inputs are data frames 1 and 2 (df1 and df2). The first data frame listed is the baseline data frame, and the second one listed is joined to it.

The third argument by = is where you specify the columns in each data frame that will be used to aligns the rows in the two data frames. If the names of these columns are different, provide them within a c() vector as shown below, where the rows are matched on the basis of common values between the column ID in df1 and the column identifier in df2.

# Join based on common values between column "ID" (first data frame) and column "identifier" (second data frame)
joined_data <- left_join(df1, df2, by = c("ID" = "identifier"))

If the by columns in both data frames have the exact same name, you can just provide this one name, within quotes.

# Joint based on common values in column "ID" in both data frames
joined_data <- left_join(df1, df2, by = "ID")

If you are joining the data frames based on common values across multiple fields, list these fields within the c() vector. This example joins rows if the values in three columns in each dataset align exactly.

# join based on same first name, last name, and age
joined_data <- left_join(df1, df2, by = c("name" = "firstname", "surname" = "lastname", "Age" = "age"))

The join commands can also be run within a pipe chain. This will modify the data frame being piped.

In the example below, df1 is is passed through the pipes, df2 is joined to it, and df is thus modified and re-defined.

df1 <- df1 %>%
  filter(date_onset < as.Date("2020-03-05")) %>% # miscellaneous cleaning 
  left_join(df2, by = c("ID" = "identifier"))    # join df2 to df1

CAUTION: Joins are case-specific! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.

Left and right joins

A left or right join is commonly used to add information to a data frame - new information is added only to rows that already existed in the baseline data frame. These are common joins in epidemiological work as they are used to add information from one dataset into another.

In using these joins, the written order of the data frames in the command is important*.

  • In a left join, the first data frame written is the baseline
  • In a right join, the second data frame written is the baseline

All rows of the baseline data frame are kept. Information in the other (secondary) data frame is joined to the baseline data frame only if there is a match via the identifier column(s). In addition:

  • Rows in the secondary data frame that do not match are dropped.
  • If there are many baseline rows that match to one row in the secondary data frame (many-to-one), the secondary information is added to each matching baseline row.
  • If a baseline row matches to multiple rows in the secondary data frame (one-to-many), all combinations are given, meaning new rows may be added to your returned data frame!

Animated examples of left and right joins (image source)

Example

Below is the output of a left_join() of hosp_info (secondary data frame, view here) into linelist_mini (baseline data frame, view here). The original linelist_mini has nrow(linelist_mini) rows. The modified linelist_mini is displayed. Note the following:

  • Two new columns, catchment_pop and level have been added on the left side of linelist_mini
  • All original rows of the baseline data frame linelist_mini are kept
  • Any original rows of linelist_mini for “Military Hospital” are duplicated because it matched to two rows in the secondary data frame, so both combinations are returned
  • The join identifier column of the secondary dataset (hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (hospital)
  • When a baseline row did not match to any secondary row (e.g. when hospital is “Other” or “Missing”), NA (blank) fills in the columns from the secondary data frame
  • Rows in the secondary data frame with no match to the baseline data frame (“sisters” and “ignace” hospitals) were dropped
linelist_mini %>% 
  left_join(hosp_info, by = c("hospital" = "hosp_name"))
Warning in left_join(., hosp_info, by = c(hospital = "hosp_name")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 4 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

“Should I use a right join, or a left join?”

To answer the above question, ask yourself “which data frame should retain all of its rows?” - use this one as the baseline. A left join keep all the rows in the first data frame written in the command, whereas a right join keeps all the rows in the second data frame.

The two commands below achieve the same output - 10 rows of hosp_info joined into a linelist_mini baseline, but they use different joins. The result is that the column order will differ based on whether hosp_info arrives from the right (in the left join) or arrives from the left (in the right join). The order of the rows may also shift accordingly. But both of these consequences can be subsequently addressed, using select() to re-order columns or arrange() to sort rows.

# The two commands below achieve the same data, but with differently ordered rows and columns
left_join(linelist_mini, hosp_info, by = c("hospital" = "hosp_name"))
right_join(hosp_info, linelist_mini, by = c("hosp_name" = "hospital"))

Here is the result of hosp_info into linelist_mini via a left join (new columns incoming from the right)

Warning in left_join(linelist_mini, hosp_info, by = c(hospital = "hosp_name")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 4 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Here is the result of hosp_info into linelist_mini via a right join (new columns incoming from the left)

Warning in right_join(hosp_info, linelist_mini, by = c(hosp_name = "hospital")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 4 of `x` matches multiple rows in `y`.
ℹ Row 5 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Also consider whether your use-case is within a pipe chain (%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.

Full join

A full join is the most inclusive of the joins - it returns all rows from both data frames.

If there are any rows present in one and not the other (where no match was found), the data frame will include them and become longer. NA missing values are used to fill-in any gaps created. As you join, watch the number of columns and rows carefully to troubleshoot case-sensitivity and exact character matches.

The “baseline” data frame is the one written first in the command. Adjustment of this will not impact which records are returned by the join, but it can impact the resulting column order, row order, and which identifier columns are retained.

Animated example of a full join (image source)

Example

Below is the output of a full_join() of hosp_info (originally nrow(hosp_info), view here) into linelist_mini (originally nrow(linelist_mini), view here). Note the following:

  • All baseline rows are kept (linelist_mini)
  • Rows in the secondary that do not match to the baseline are kept (“ignace” and “sisters”), with values in the corresponding baseline columns case_id and onset filled in with missing values
  • Likewise, rows in the baseline data frame that do not match to the secondary (“Other” and “Missing”) are kept, with secondary columns catchment_pop and level filled-in with missing values
  • In the case of one-to-many or many-to-one matches (e.g. rows for “Military Hospital”), all possible combinations are returned (lengthening the final data frame)
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  full_join(hosp_info, by = c("hospital" = "hosp_name"))
Warning in full_join(., hosp_info, by = c(hospital = "hosp_name")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 4 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Inner join

An inner join is the most restrictive of the joins - it returns only rows with matches across both data frames.
This means that the number of rows in the baseline data frame may actually reduce. Adjustment of which data frame is the “baseline” (written first in the function) will not impact which rows are returned, but it will impact the column order, row order, and which identifier columns are retained.

Animated example of an inner join (image source)

Example

Below is the output of an inner_join() of linelist_mini (baseline) with hosp_info (secondary). Note the following:

  • Baseline rows with no match to the secondary data are removed (rows where hospital is “Missing” or “Other”)
  • Likewise, rows from the secondary data frame that had no match in the baseline are removed (rows where hosp_name is “sisters” or “ignace”)
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))
Warning in inner_join(., hosp_info, by = c(hospital = "hosp_name")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 4 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Semi join

A semi join is a “filtering join” which uses another dataset not to add rows or columns, but to perform filtering.

A semi-join keeps all observations in the baseline data frame that have a match in the secondary data frame (but does not add new columns nor duplicate any rows for multiple matches). Read more about these “filtering” joins here.

Animated example of a semi join (image source)

As an example, the below code returns rows from the hosp_info data frame that have matches in linelist_mini based on hospital name.

hosp_info %>% 
  semi_join(linelist_mini, by = c("hosp_name" = "hospital"))
                             hosp_name catchment_pop     level
1                    Military Hospital         40500 Secondary
2                    Military Hospital         10000   Primary
3                        Port Hospital         50280 Secondary
4 St. Mark's Maternity Hospital (SMMH)         12000 Secondary

Anti join

The anti join is another “filtering join” that returns rows in the baseline data frame that do not have a match in the secondary data frame.

Read more about filtering joins here.

Common scenarios for an anti-join include identifying records not present in another data frame, troubleshooting spelling in a join (reviewing records that should have matched), and examining records that were excluded after another join.

As with right_join() and left_join(), the baseline data frame (listed first) is important. The returned rows are from the baseline data frame only. Notice in the gif below that row in the secondary data frame (purple row 4) is not returned even though it does not match with the baseline.

Animated example of an anti join (image source)

Simple anti_join() example

For a simple example, let’s find the hosp_info hospitals that do not have any cases present in linelist_mini. We list hosp_info first, as the baseline data frame. The hospitals which are not present in linelist_mini are returned.

hosp_info %>% 
  anti_join(linelist_mini, by = c("hosp_name" = "hospital"))

Complex anti_join() example

For another example, let us say we ran an inner_join() between linelist_mini and hosp_info. This returns only a subset of the original linelist_mini records, as some are not present in hosp_info.

linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))
Warning in inner_join(., hosp_info, by = c(hospital = "hosp_name")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 4 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

To review the linelist_mini records that were excluded during the inner join, we can run an anti-join with the same settings (linelist_mini as the baseline).

linelist_mini %>% 
  anti_join(hosp_info, by = c("hospital" = "hosp_name"))

To see the hosp_info records that were excluded in the inner join, we could also run an anti-join with hosp_info as the baseline data frame.

14.3 Probabalistic matching

If you do not have a unique identifier common across datasets to join on, consider using a probabilistic matching algorithm. This would find matches between records based on similarity (e.g. Jaro–Winkler string distance, or numeric distance). Below is a simple example using the package fastLink .

Load packages

pacman::p_load(
  tidyverse,      # data manipulation and visualization
  fastLink        # record matching
  )

Here are two small example datasets that we will use to demonstrate the probabilistic matching (cases and test_results):

Here is the code used to make the datasets:

# make datasets

cases <- tribble(
  ~gender, ~first,      ~middle,     ~last,        ~yr,   ~mon, ~day, ~district,
  "M",     "Amir",      NA,          "Khan",       1989,  11,   22,   "River",
  "M",     "Anthony",   "B.",        "Smith",      1970, 09, 19,      "River", 
  "F",     "Marialisa", "Contreras", "Rodrigues",  1972, 04, 15,      "River",
  "F",     "Elizabeth", "Casteel",   "Chase",      1954, 03, 03,      "City",
  "M",     "Jose",      "Sanchez",   "Lopez",      1996, 01, 06,      "City",
  "F",     "Cassidy",   "Jones",      "Davis",     1980, 07, 20,      "City",
  "M",     "Michael",   "Murphy",     "O'Calaghan",1969, 04, 12,      "Rural", 
  "M",     "Oliver",    "Laurent",    "De Bordow" , 1971, 02, 04,     "River",
  "F",      "Blessing",  NA,          "Adebayo",   1955,  02, 14,     "Rural"
)

results <- tribble(
  ~gender,  ~first,     ~middle,     ~last,          ~yr, ~mon, ~day, ~district, ~result,
  "M",      "Amir",     NA,          "Khan",         1989, 11,   22,  "River", "positive",
  "M",      "Tony",   "B",         "Smith",          1970, 09,   19,  "River", "positive",
  "F",      "Maria",    "Contreras", "Rodriguez",    1972, 04,   15,  "Cty",   "negative",
  "F",      "Betty",    "Castel",   "Chase",        1954,  03,   30,  "City",  "positive",
  "F",      "Andrea",   NA,          "Kumaraswamy",  2001, 01,   05,  "Rural", "positive",      
  "F",      "Caroline", NA,          "Wang",         1988, 12,   11,  "Rural", "negative",
  "F",      "Trang",    NA,          "Nguyen",       1981, 06,   10,  "Rural", "positive",
  "M",      "Olivier" , "Laurent",   "De Bordeaux",  NA,   NA,   NA,  "River", "positive",
  "M",      "Mike",     "Murphy",    "O'Callaghan",  1969, 04,   12,  "Rural", "negative",
  "F",      "Cassidy",  "Jones",     "Davis",        1980, 07,   02,  "City",  "positive",
  "M",      "Mohammad", NA,          "Ali",          1942, 01,   17,  "City",  "negative",
  NA,       "Jose",     "Sanchez",   "Lopez",        1995, 01,   06,  "City",  "negative",
  "M",      "Abubakar", NA,          "Abullahi",     1960, 01,   01,  "River", "positive",
  "F",      "Maria",    "Salinas",   "Contreras",    1955, 03,   03,  "River", "positive"
  )

The cases dataset has 9 records of patients who are awaiting test results.

The test_results dataset has 14 records and contains the column result, which we want to add to the records in cases based on probabilistic matching of records.

Probabilistic matching

The fastLink() function from the fastLink package can be used to apply a matching algorithm. Here is the basic information. You can read more detail by entering ?fastLink in your console.

  • Define the two data frames for comparison to arguments dfA = and dfB =
  • In varnames = give all column names to be used for matching. They must all exist in both dfA and dfB.
  • In stringdist.match = give columns from those in varnames to be evaluated on string “distance”.
  • In numeric.match = give columns from those in varnames to be evaluated on numeric distance.
  • Missing values are ignored
  • By default, each row in either data frame is matched to at most one row in the other data frame. If you want to see all the evaluated matches, set dedupe.matches = FALSE. The deduplication is done using Winkler’s linear assignment solution.

Tip: split one date column into three separate numeric columns using day(), month(), and year() from lubridate package

The default threshold for matches is 0.94 (threshold.match =) but you can adjust it higher or lower. If you define the threshold, consider that higher thresholds could yield more false-negatives (rows that do not match which actually should match) and likewise a lower threshold could yield more false-positive matches.

Below, the data are matched on string distance across the name and district columns, and on numeric distance for year, month, and day of birth. A match threshold of 95% probability is set.

fl_output <- fastLink::fastLink(
  dfA = cases,
  dfB = results,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day"),
  threshold.match = 0.95)

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Running the EM algorithm.
Getting the indices of estimated matches.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Deduping the estimated matches.
Getting the match patterns for each estimated match.

Review matches

We defined the object returned from fastLink() as fl_output. It is of class list, and it actually contains several data frames within it, detailing the results of the matching. One of these data frames is matches, which contains the most likely matches across cases and results. You can access this “matches” data frame with fl_output$matches. Below, it is saved as my_matches for ease of accessing later.

When my_matches is printed, you see two column vectors: the pairs of row numbers/indices (also called “rownames”) in cases (“inds.a”) and in results (“inds.b”) representing the best matches. If a row number from a datafrane is missing, then no match was found in the other data frame at the specified match threshold.

# print matches
my_matches <- fl_output$matches
my_matches
  inds.a inds.b
1      1      1
2      2      2
3      3      3
4      4      4
5      8      8
6      7      9
7      6     10
8      5     12

Things to note:

  • Matches occurred despite slight differences in name spelling and dates of birth:
    • “Tony B. Smith” matched to “Anthony B Smith”
    • “Maria Rodriguez” matched to “Marialisa Rodrigues”
    • “Betty Chase” matched to “Elizabeth Chase”
    • “Olivier Laurent De Bordeaux” matched to “Oliver Laurent De Bordow” (missing date of birth ignored)
  • One row from cases (for “Blessing Adebayo”, row 9) had no good match in results, so it is not present in my_matches.

Join based on the probabilistic matches

To use these matches to join results to cases, one strategy is:

  1. Use left_join() to join my_matches to cases (matching rownames in cases to “inds.a” in my_matches)
  2. Then use another left_join() to join results to cases (matching the newly-acquired “inds.b” in cases to rownames in results)

Before the joins, we should clean the three data frames:

  • Both dfA and dfB should have their row numbers (“rowname”) converted to a proper column.
  • Both the columns in my_matches are converted to class character, so they can be joined to the character rownames
# Clean data prior to joining
#############################

# convert cases rownames to a column 
cases_clean <- cases %>% rownames_to_column()

# convert test_results rownames to a column
results_clean <- results %>% rownames_to_column()  

# convert all columns in matches dataset to character, so they can be joined to the rownames
matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))



# Join matches to dfA, then add dfB
###################################
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))

# column(s) from dfB are added 
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

As performed using the code above, the resulting data frame complete will contain all columns from both cases and results. Many will be appended with suffixes “.x” and “.y”, because the column names would otherwise be duplicated.

Alternatively, to achieve only the “original” 9 records in cases with the new column(s) from results, use select() on results before the joins, so that it contains only rownames and the columns that you want to add to cases (e.g. the column result).

cases_clean <- cases %>% rownames_to_column()

results_clean <- results %>%
  rownames_to_column() %>% 
  select(rowname, result)    # select only certain columns 

matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))

# joins
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

If you want to subset either dataset to only the rows that matched, you can use the codes below:

cases_matched <- cases[my_matches$inds.a,]  # Rows in cases that matched to a row in results
results_matched <- results[my_matches$inds.b,]  # Rows in results that matched to a row in cases

Or, to see only the rows that did not match:

cases_not_matched <- cases[!rownames(cases) %in% my_matches$inds.a,]  # Rows in cases that did NOT match to a row in results
results_not_matched <- results[!rownames(results) %in% my_matches$inds.b,]  # Rows in results that did NOT match to a row in cases

Probabilistic deduplication

Probabilistic matching can be used to deduplicate a dataset as well. See the page on deduplication for other methods of deduplication.

Here we began with the cases dataset, but are now calling it cases_dup, as it has 2 additional rows that could be duplicates of previous rows: See “Tony” with “Anthony”, and “Marialisa Rodrigues” with “Maria Rodriguez”.

Run fastLink() like before, but compare the cases_dup data frame to itself. When the two data frames provided are identical, the function assumes you want to de-duplicate. Note we do not specify stringdist.match = or numeric.match = as we did previously.

## Run fastLink on the same dataset
dedupe_output <- fastLink(
  dfA = cases_dup,
  dfB = cases_dup,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district")
)

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
dfA and dfB are identical, assuming deduplication of a single data set.
Setting return.all to FALSE.

Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Running the EM algorithm.
Getting the indices of estimated matches.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Calculating the posterior for each pair of matched observations.
Getting the match patterns for each estimated match.

Now, you can review the potential duplicates with getMatches(). Provide the data frame as both dfA = and dfB =, and provide the output of the fastLink() function as fl.out =. fl.out must be of class fastLink.dedupe, or in other words, the result of fastLink().

## Run getMatches()
cases_dedupe <- getMatches(
  dfA = cases_dup,
  dfB = cases_dup,
  fl.out = dedupe_output)

See the right-most column, which indicates the duplicate IDs - the final two rows are identified as being likely duplicates of rows 2 and 3.

To return the row numbers of rows which are likely duplicates, you can count the number of rows per unique value in the dedupe.ids column, and then filter to keep only those with more than one row. In this case this leaves rows 2 and 3.

cases_dedupe %>% 
  count(dedupe.ids) %>% 
  filter(n > 1)
  dedupe.ids n
1          2 2
2          3 2

To inspect the whole rows of the likely duplicates, put the row number in this command:

# displays row 2 and all likely duplicates of it
cases_dedupe[cases_dedupe$dedupe.ids == 2,]   
   gender   first middle  last   yr mon day district dedupe.ids
2       M Anthony     B. Smith 1970   9  19    River          2
10      M    Tony     B. Smith 1970   9  19    River          2

14.4 Binding and aligning

Another method of combining two data frames is “binding” them together. You can also think of this as “appending” or “adding” rows or columns.

This section will also discuss how to “align” the order of rows of one data frame to the order in another data frame. This topic is discussed below in the section on Binding columns.

Bind rows

To bind rows of one data frame to the bottom of another data frame, use bind_rows() from dplyr. It is very inclusive, so any column present in either data frame will be included in the output. A few notes:

  • Unlike the base R version row.bind(), dplyr’s bind_rows() does not require that the order of columns be the same in both data frames. As long as the column names are spelled identically, it will align them correctly.
  • You can optionally specify the argument .id =. Provide a character column name. This will produce a new column that serves to identify which data frame each row originally came from.
  • You can use bind_rows() on a list of similarly-structured data frames to combine them into one data frame. See an example in the Iteration, loops, and lists page involving the import of multiple linelists with purrr.

One common example of row binding is to bind a “total” row onto a descriptive table made with dplyr’s summarise() function. Below we create a table of case counts and median CT values by hospital with a total row.

The function summarise() is used on data grouped by hospital to return a summary data frame by hospital. But the function summarise() does not automatically produce a “totals” row, so we create it by summarising the data again, but with the data not grouped by hospital. This produces a second data frame of just one row. We can then bind these data frames together to achieve the final table.

See other worked examples like this in the Descriptive tables and Tables for presentation pages.

# Create core table
###################
hosp_summary <- linelist %>% 
  group_by(hospital) %>%                        # Group data by hospital
  summarise(                                    # Create new summary columns of indicators of interest
    cases = n(),                                  # Number of rows per hospital-outcome group     
    ct_value_med = median(ct_blood, na.rm=T))     # median CT value per group

Here is the hosp_summary data frame:

Create a data frame with the “total” statistics (not grouped by hospital). This will return just one row.

# create totals
###############
totals <- linelist %>% 
  summarise(
    cases = n(),                               # Number of rows for whole dataset     
    ct_value_med = median(ct_blood, na.rm=T))  # Median CT for whole dataset

And below is that totals data frame. Note how there are only two columns. These columns are also in hosp_summary, but there is one column in hosp_summary that is not in totals (hospital).

Now we can bind the rows together with bind_rows().

# Bind data frames together
combined <- bind_rows(hosp_summary, totals)

Now we can view the result. See how in the final row, an empty NA value fills in for the column hospital that was not in hosp_summary. As explained in the Tables for presentation page, you could “fill-in” this cell with “Total” using replace_na().

Bind columns

There is a similar dplyr function bind_cols() which you can use to combine two data frames sideways. Note that rows are matched to each other by position (not like a join above) - for example the 12th row in each data frame will be aligned.

For an example, we bind several summary tables together. In order to do this, we also demonstrate how to re-arrange the order of rows in one data frame to match the order in another data frame, with match().

Here we define case_info as a summary data frame of linelist cases, by hospital, with the number of cases and the number of deaths.

# Case information
case_info <- linelist %>% 
  group_by(hospital) %>% 
  summarise(
    cases = n(),
    deaths = sum(outcome == "Death", na.rm=T)
  )

And let’s say that here is a different data frame contact_fu containing information on the percent of exposed contacts investigated and “followed-up”, again by hospital.

contact_fu <- data.frame(
  hospital = c("St. Mark's Maternity Hospital (SMMH)", "Military Hospital", "Missing", "Central Hospital", "Port Hospital", "Other"),
  investigated = c("80%", "82%", NA, "78%", "64%", "55%"),
  per_fu = c("60%", "25%", NA, "20%", "75%", "80%")
)

Note that the hospitals are the same, but are in different orders in each data frame. The easiest solution would be to use a left_join() on the hospital column, but you could also use bind_cols() with one extra step.

Use match() to align ordering

Because the row orders are different, a simple bind_cols() command would result in a mis-match of data. To fix this we can use match() from base R to align the rows of a data frame in the same order as in another. We assume for this approach that there are no duplicate values in either data frame.

When we use match(), the syntax is match(TARGET ORDER VECTOR, DATA FRAME COLUMN TO CHANGE), where the first argument is the desired order (either a stand-alone vector, or in this case a column in a data frame), and the second argument is the data frame column in the data frame that will be re-ordered. The output of match() is a vector of numbers representing the correct position ordering. You can read more with ?match.

match(case_info$hospital, contact_fu$hospital)
[1] 4 2 3 6 5 1

You can use this numeric vector to re-order the data frame - place it within subset brackets [ ] before the comma. Read more about base R bracket subset syntax in the R basics page. The command below creates a new data frame, defined as the old one in which the rows are ordered in the numeric vector above.

contact_fu_aligned <- contact_fu[match(case_info$hospital, contact_fu$hospital),]

Now we can bind the data frame columns together, with the correct row order. Note that some columns are duplicated and will require cleaning with rename(). Read more aboout bind_rows() here.

bind_cols(case_info, contact_fu)
New names:
• `hospital` -> `hospital...1`
• `hospital` -> `hospital...4`
# A tibble: 6 × 6
  hospital...1                     cases deaths hospital...4 investigated per_fu
  <chr>                            <int>  <int> <chr>        <chr>        <chr> 
1 Central Hospital                   454    193 St. Mark's … 80%          60%   
2 Military Hospital                  896    399 Military Ho… 82%          25%   
3 Missing                           1469    611 Missing      <NA>         <NA>  
4 Other                              885    395 Central Hos… 78%          20%   
5 Port Hospital                     1762    785 Port Hospit… 64%          75%   
6 St. Mark's Maternity Hospital (…   422    199 Other        55%          80%   

A base R alternative to bind_cols is cbind(), which performs the same operation.

14.5 Resources

The tidyverse page on joins

The R for Data Science page on relational data

Th tidyverse page on dplyr on binding

A vignette on fastLink at the package’s Github page

Publication describing methodology of fastLink

Publication describing RecordLinkage package