Above: an animated example of a left join (image source)
This page describes ways to “join”, “match”, “link” “bind”, and otherwise combine data frames.
It is uncommon that your epidemiological analysis or workflow does not involve multiple sources of data, and the linkage of multiple datasets. Perhaps you need to connect laboratory data to patient clinical outcomes, or Google mobility data to infectious disease trends, or even a dataset at one stage of analysis to a transformed version of itself.
In this page we demonstrate code to:
- Conduct joins of two data frames such that rows are matched based on common values in identifier columns
- Join two data frames based on probabilistic (likely) matches between values
- Expand a data frame by directly binding or (“appending”) rows or columns from another data frame
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize
p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with
library() from base R. See the page on R basics for more information on R packages.
pacman::p_load( rio, # import and export here, # locate files tidyverse, # data management and visualisation RecordLinkage, # probabilistic matches fastLink # probabilistic matches )
To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the
import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
# import case linelist linelist <- import("linelist_cleaned.rds")
The first 50 rows of the linelist are displayed below.
In the joining section below, we will use the following datasets:
- A “miniature” version of the case
linelist, containing only the columns
hospital, and only the first 10 rows
- A separate data frame named
hosp_info, which contains more details about each hospital
In the section on probabilistic matching, we will use two different small datasets. The code to create those datasets is given in that section.
Below is the the miniature case linelist, which contains only 10 rows and only columns
Below is the code to create a separate data frame with additional information about seven hospitals (the catchment population, and the level of care available). Note that the name “Military Hospital” belongs to two different hospitals - one a primary level serving 10000 residents and the other a secondary level serving 50280 residents.
# Make the hospital information data frame hosp_info = data.frame( hosp_name = c("central hospital", "military", "military", "port", "St. Mark's", "ignace", "sisters"), catchment_pop = c(1950280, 40500, 10000, 50280, 12000, 5000, 4200), level = c("Tertiary", "Secondary", "Primary", "Secondary", "Secondary", "Primary", "Primary") )
Here is this data frame:
Traditional joins (non-probabilistic) are case-sensitive and require exact character matches between values in the two data frames. To demonstrate some of the cleaning steps you might need to do before initiating a join, we will clean and align the
hosp_info datasets now.
We need the values of the
hosp_name column in the
hosp_info data frame to match the values of the
hospital column in the
linelist_mini data frame.
Here are the values in the
linelist_mini data frame, printed with the base R function
##  "Other" "Missing" "St. Mark's Maternity Hospital (SMMH)" "Port Hospital" ##  "Military Hospital"
and here are the values in the
hosp_info data frame:
##  "central hospital" "military" "port" "St. Mark's" "ignace" "sisters"
You can see that while some of the hospitals exist in both data frames, there are many differences in spelling.
We begin by cleaning the values in the
hosp_info data frame. As explained in the Cleaning data and core functions page, we can re-code values with logical criteria using dplyr’s
case_when() function. For the four hospitals that exist in both data frames we change the values to align with the values in
linelist_mini. The other hospitals we leave the values as they are (
TRUE ~ hosp_name).
CAUTION: Typically when cleaning one should create a new column (e.g.
hosp_name_clean), but for ease of demonstration we show modification of the old column
hosp_info <- hosp_info %>% mutate( hosp_name = case_when( # criteria # new value hosp_name == "military" ~ "Military Hospital", hosp_name == "port" ~ "Port Hospital", hosp_name == "St. Mark's" ~ "St. Mark's Maternity Hospital (SMMH)", hosp_name == "central hospital" ~ "Central Hospital", TRUE ~ hosp_name ) )
The hospital names that appear in both data frames are aligned. There are two hospitals in
hosp_info that are not present in
linelist_mini - we will deal with these later, in the join.
##  "Central Hospital" "Military Hospital" "Port Hospital" "St. Mark's Maternity Hospital (SMMH)" ##  "ignace" "sisters"
Prior to a join, it is often easiest to convert a column to all lowercase or all uppercase. If you need to convert all values in a column to UPPER or lower case, use
mutate() and wrap the column with one of these functions from stringr, as shown in the page on Characters and strings.
The dplyr package offers several different join functions. dplyr is included in the tidyverse package. These join functions are described below, with simple use cases.
Many thanks to https://github.com/gadenbuie for the informative gifs!
The join commands can be run as standalone commands to join two data frames into a new object, or they can be used within a pipe chain (
%>%) to merge one data frame into another as it is being cleaned or otherwise modified.
In the example below, the function
left_join() is used as a standalone command to create the a new
joined_data data frame. The inputs are data frames 1 and 2 (
df2). The first data frame listed is the baseline data frame, and the second one listed is joined to it.
The third argument
by = is where you specify the columns in each data frame that will be used to aligns the rows in the two data frames. If the names of these columns are different, provide them within a
c() vector as shown below, where the rows are matched on the basis of common values between the column
df1 and the column
# Join based on common values between column "ID" (first data frame) and column "identifier" (second data frame) joined_data <- left_join(df1, df2, by = c("ID" = "identifier"))
by columns in both data frames have the exact same name, you can just provide this one name, within quotes.
# Joint based on common values in column "ID" in both data frames joined_data <- left_join(df1, df2, by = "ID")
If you are joining the data frames based on common values across multiple fields, list these fields within the
c() vector. This example joins rows if the values in three columns in each dataset align exactly.
# join based on same first name, last name, and age joined_data <- left_join(df1, df2, by = c("name" = "firstname", "surname" = "lastname", "Age" = "age"))
The join commands can also be run within a pipe chain. This will modify the data frame being piped.
In the example below,
df1 is is passed through the pipes,
df2 is joined to it, and
df is thus modified and re-defined.
CAUTION: Joins are case-specific! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.
A left or right join is commonly used to add information to a data frame - new information is added only to rows that already existed in the baseline data frame. These are common joins in epidemiological work as they are used to add information from one dataset into another.
In using these joins, the written order of the data frames in the command is important*.
- In a left join, the first data frame written is the baseline
- In a right join, the second data frame written is the baseline
All rows of the baseline data frame are kept. Information in the other (secondary) data frame is joined to the baseline data frame only if there is a match via the identifier column(s). In addition:
- Rows in the secondary data frame that do not match are dropped.
- If there are many baseline rows that match to one row in the secondary data frame (many-to-one), the secondary information is added to each matching baseline row.
- If a baseline row matches to multiple rows in the secondary data frame (one-to-many), all combinations are given, meaning new rows may be added to your returned data frame!
Animated examples of left and right joins (image source)
Below is the output of a
hosp_info (secondary data frame, view here) into
linelist_mini (baseline data frame, view here). The original
nrow(linelist_mini) rows. The modified
linelist_mini is displayed. Note the following:
- Two new columns,
levelhave been added on the left side of
- All original rows of the baseline data frame
- Any original rows of
linelist_minifor “Military Hospital” are duplicated because it matched to two rows in the secondary data frame, so both combinations are returned
- The join identifier column of the secondary dataset (
hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (
- When a baseline row did not match to any secondary row (e.g. when
hospitalis “Other” or “Missing”),
NA(blank) fills in the columns from the secondary data frame
- Rows in the secondary data frame with no match to the baseline data frame (“sisters” and “ignace” hospitals) were dropped
linelist_mini %>% left_join(hosp_info, by = c("hospital" = "hosp_name"))
To answer the above question, ask yourself “which data frame should retain all of its rows?” - use this one as the baseline. A left join keep all the rows in the first data frame written in the command, whereas a right join keeps all the rows in the second data frame.
The two commands below achieve the same output - 10 rows of
hosp_info joined into a
linelist_mini baseline, but they use different joins. The result is that the column order will differ based on whether
hosp_info arrives from the right (in the left join) or arrives from the left (in the right join). The order of the rows may also shift accordingly. But both of these consequences can be subsequently addressed, using
select() to re-order columns or
arrange() to sort rows.
Here is the result of
linelist_mini via a left join (new columns incoming from the right)
Here is the result of
linelist_mini via a right join (new columns incoming from the left)
Also consider whether your use-case is within a pipe chain (
%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.
A full join is the most inclusive of the joins - it returns all rows from both data frames.
If there are any rows present in one and not the other (where no match was found), the data frame will include them and become longer.
NA missing values are used to fill-in any gaps created. As you join, watch the number of columns and rows carefully to troubleshoot case-sensitivity and exact character matches.
The “baseline” data frame is the one written first in the command. Adjustment of this will not impact which records are returned by the join, but it can impact the resulting column order, row order, and which identifier columns are retained.
Animated example of a full join (image source)
- All baseline rows are kept (
- Rows in the secondary that do not match to the baseline are kept (“ignace” and “sisters”), with values in the corresponding baseline columns
onsetfilled in with missing values
- Likewise, rows in the baseline data frame that do not match to the secondary (“Other” and “Missing”) are kept, with secondary columns
levelfilled-in with missing values
- In the case of one-to-many or many-to-one matches (e.g. rows for “Military Hospital”), all possible combinations are returned (lengthening the final data frame)
- Only the identifier column from the baseline is kept (
linelist_mini %>% full_join(hosp_info, by = c("hospital" = "hosp_name"))
An inner join is the most restrictive of the joins - it returns only rows with matches across both data frames.
This means that the number of rows in the baseline data frame may actually reduce. Adjustment of which data frame is the “baseline” (written first in the function) will not impact which rows are returned, but it will impact the column order, row order, and which identifier columns are retained.
Animated example of an inner join (image source)
Below is the output of an
linelist_mini (baseline) with
hosp_info (secondary). Note the following:
- Baseline rows with no match to the secondary data are removed (rows where
hospitalis “Missing” or “Other”)
- Likewise, rows from the secondary data frame that had no match in the baseline are removed (rows where
hosp_nameis “sisters” or “ignace”)
- Only the identifier column from the baseline is kept (
linelist_mini %>% inner_join(hosp_info, by = c("hospital" = "hosp_name"))
A semi join is a “filtering join” which uses another dataset not to add rows or columns, but to perform filtering.
A semi-join keeps all observations in the baseline data frame that have a match in the secondary data frame (but does not add new columns nor duplicate any rows for multiple matches). Read more about these “filtering” joins here.
Animated example of a semi join (image source)
As an example, the below code returns rows from the
hosp_info data frame that have matches in
linelist_mini based on hospital name.
hosp_info %>% semi_join(linelist_mini, by = c("hosp_name" = "hospital"))
## hosp_name catchment_pop level ## 1 Military Hospital 40500 Secondary ## 2 Military Hospital 10000 Primary ## 3 Port Hospital 50280 Secondary ## 4 St. Mark's Maternity Hospital (SMMH) 12000 Secondary
The anti join is another “filtering join” that returns rows in the baseline data frame that do not have a match in the secondary data frame.
Read more about filtering joins here.
Common scenarios for an anti-join include identifying records not present in another data frame, troubleshooting spelling in a join (reviewing records that should have matched), and examining records that were excluded after another join.
left_join(), the baseline data frame (listed first) is important. The returned rows are from the baseline data frame only. Notice in the gif below that row in the secondary data frame (purple row 4) is not returned even though it does not match with the baseline.
Animated example of an anti join (image source)
For a simple example, let’s find the
hosp_info hospitals that do not have any cases present in
linelist_mini. We list
hosp_info first, as the baseline data frame. The hospitals which are not present in
linelist_mini are returned.
hosp_info %>% anti_join(linelist_mini, by = c("hosp_name" = "hospital"))
For another example, let us say we ran an
hosp_info. This returns only a subset of the original
linelist_mini records, as some are not present in
linelist_mini %>% inner_join(hosp_info, by = c("hospital" = "hosp_name"))
To review the
linelist_mini records that were excluded during the inner join, we can run an anti-join with the same settings (
linelist_mini as the baseline).
linelist_mini %>% anti_join(hosp_info, by = c("hospital" = "hosp_name"))
To see the
hosp_info records that were excluded in the inner join, we could also run an anti-join with
hosp_info as the baseline data frame.
If you do not have a unique identifier common across datasets to join on, consider using a probabilistic matching algorithm. This would find matches between records based on similarity (e.g. Jaro–Winkler string distance, or numeric distance). Below is a simple example using the package fastLink .
pacman::p_load( tidyverse, # data manipulation and visualization fastLink # record matching )
Here are two small example datasets that we will use to demonstrate the probabilistic matching (
Here is the code used to make the datasets:
# make datasets cases <- tribble( ~gender, ~first, ~middle, ~last, ~yr, ~mon, ~day, ~district, "M", "Amir", NA, "Khan", 1989, 11, 22, "River", "M", "Anthony", "B.", "Smith", 1970, 09, 19, "River", "F", "Marialisa", "Contreras", "Rodrigues", 1972, 04, 15, "River", "F", "Elizabeth", "Casteel", "Chase", 1954, 03, 03, "City", "M", "Jose", "Sanchez", "Lopez", 1996, 01, 06, "City", "F", "Cassidy", "Jones", "Davis", 1980, 07, 20, "City", "M", "Michael", "Murphy", "O'Calaghan",1969, 04, 12, "Rural", "M", "Oliver", "Laurent", "De Bordow" , 1971, 02, 04, "River", "F", "Blessing", NA, "Adebayo", 1955, 02, 14, "Rural" ) results <- tribble( ~gender, ~first, ~middle, ~last, ~yr, ~mon, ~day, ~district, ~result, "M", "Amir", NA, "Khan", 1989, 11, 22, "River", "positive", "M", "Tony", "B", "Smith", 1970, 09, 19, "River", "positive", "F", "Maria", "Contreras", "Rodriguez", 1972, 04, 15, "Cty", "negative", "F", "Betty", "Castel", "Chase", 1954, 03, 30, "City", "positive", "F", "Andrea", NA, "Kumaraswamy", 2001, 01, 05, "Rural", "positive", "F", "Caroline", NA, "Wang", 1988, 12, 11, "Rural", "negative", "F", "Trang", NA, "Nguyen", 1981, 06, 10, "Rural", "positive", "M", "Olivier" , "Laurent", "De Bordeaux", NA, NA, NA, "River", "positive", "M", "Mike", "Murphy", "O'Callaghan", 1969, 04, 12, "Rural", "negative", "F", "Cassidy", "Jones", "Davis", 1980, 07, 02, "City", "positive", "M", "Mohammad", NA, "Ali", 1942, 01, 17, "City", "negative", NA, "Jose", "Sanchez", "Lopez", 1995, 01, 06, "City", "negative", "M", "Abubakar", NA, "Abullahi", 1960, 01, 01, "River", "positive", "F", "Maria", "Salinas", "Contreras", 1955, 03, 03, "River", "positive" )
cases dataset has 9 records of patients who are awaiting test results.
test_results dataset has 14 records and contains the column
result, which we want to add to the records in
cases based on probabilistic matching of records.
fastLink() function from the fastLink package can be used to apply a matching algorithm. Here is the basic information. You can read more detail by entering
?fastLink in your console.
- Define the two data frames for comparison to arguments
varnames =give all column names to be used for matching. They must all exist in both
stringdist.match =give columns from those in
varnamesto be evaluated on string “distance”.
numeric.match =give columns from those in
varnamesto be evaluated on numeric distance.
- Missing values are ignored
- By default, each row in either data frame is matched to at most one row in the other data frame. If you want to see all the evaluated matches, set
dedupe.matches = FALSE. The deduplication is done using Winkler’s linear assignment solution.
Tip: split one date column into three separate numeric columns using
year() from lubridate package
The default threshold for matches is 0.94 (
threshold.match =) but you can adjust it higher or lower. If you define the threshold, consider that higher thresholds could yield more false-negatives (rows that do not match which actually should match) and likewise a lower threshold could yield more false-positive matches.
Below, the data are matched on string distance across the name and district columns, and on numeric distance for year, month, and day of birth. A match threshold of 95% probability is set.
## ## ==================== ## fastLink(): Fast Probabilistic Record Linkage ## ==================== ## ## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic. ## Calculating matches for each variable. ## Getting counts for parameter estimation. ## Parallelizing calculation using OpenMP. 1 threads out of 12 are used. ## Running the EM algorithm. ## Getting the indices of estimated matches. ## Parallelizing calculation using OpenMP. 1 threads out of 12 are used. ## Deduping the estimated matches. ## Getting the match patterns for each estimated match.
We defined the object returned from
fl_output. It is of class
list, and it actually contains several data frames within it, detailing the results of the matching. One of these data frames is
matches, which contains the most likely matches across
results. You can access this “matches” data frame with
fl_output$matches. Below, it is saved as
my_matches for ease of accessing later.
my_matches is printed, you see two column vectors: the pairs of row numbers/indices (also called “rownames”) in
cases (“inds.a”) and in
results (“inds.b”) representing the best matches. If a row number from a datafrane is missing, then no match was found in the other data frame at the specified match threshold.
# print matches my_matches <- fl_output$matches my_matches
## inds.a inds.b ## 1 1 1 ## 2 2 2 ## 3 3 3 ## 4 4 4 ## 5 8 8 ## 6 7 9 ## 7 6 10 ## 8 5 12
Things to note:
- Matches occurred despite slight differences in name spelling and dates of birth:
- “Tony B. Smith” matched to “Anthony B Smith”
- “Maria Rodriguez” matched to “Marialisa Rodrigues”
- “Betty Chase” matched to “Elizabeth Chase”
- “Olivier Laurent De Bordeaux” matched to “Oliver Laurent De Bordow” (missing date of birth ignored)
- “Tony B. Smith” matched to “Anthony B Smith”
- One row from
cases(for “Blessing Adebayo”, row 9) had no good match in
results, so it is not present in
Join based on the probabilistic matches
To use these matches to join
cases, one strategy is:
cases(matching rownames in
casesto “inds.a” in
- Then use another
cases(matching the newly-acquired “inds.b” in
casesto rownames in
Before the joins, we should clean the three data frames:
dfBshould have their row numbers (“rowname”) converted to a proper column.
- Both the columns in
my_matchesare converted to class character, so they can be joined to the character rownames
# Clean data prior to joining ############################# # convert cases rownames to a column cases_clean <- cases %>% rownames_to_column() # convert test_results rownames to a column results_clean <- results %>% rownames_to_column() # convert all columns in matches dataset to character, so they can be joined to the rownames matches_clean <- my_matches %>% mutate(across(everything(), as.character)) # Join matches to dfA, then add dfB ################################### # column "inds.b" is added to dfA complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a")) # column(s) from dfB are added complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
As performed using the code above, the resulting data frame
complete will contain all columns from both
results. Many will be appended with suffixes “.x” and “.y”, because the column names would otherwise be duplicated.
Alternatively, to achieve only the “original” 9 records in
cases with the new column(s) from
results before the joins, so that it contains only rownames and the columns that you want to add to
cases (e.g. the column
cases_clean <- cases %>% rownames_to_column() results_clean <- results %>% rownames_to_column() %>% select(rowname, result) # select only certain columns matches_clean <- my_matches %>% mutate(across(everything(), as.character)) # joins complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a")) complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
If you want to subset either dataset to only the rows that matched, you can use the codes below:
cases_matched <- cases[my_matches$inds.a,] # Rows in cases that matched to a row in results results_matched <- results[my_matches$inds.b,] # Rows in results that matched to a row in cases
Or, to see only the rows that did not match:
Probabilistic matching can be used to deduplicate a dataset as well. See the page on deduplication for other methods of deduplication.
Here we began with the
cases dataset, but are now calling it
cases_dup, as it has 2 additional rows that could be duplicates of previous rows:
See “Tony” with “Anthony”, and “Marialisa Rodrigues” with “Maria Rodriguez”.
fastLink() like before, but compare the
cases_dup data frame to itself. When the two data frames provided are identical, the function assumes you want to de-duplicate. Note we do not specify
stringdist.match = or
numeric.match = as we did previously.
## Run fastLink on the same dataset dedupe_output <- fastLink( dfA = cases_dup, dfB = cases_dup, varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district") )
## ## ==================== ## fastLink(): Fast Probabilistic Record Linkage ## ==================== ## ## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic. ## dfA and dfB are identical, assuming deduplication of a single data set. ## Setting return.all to FALSE. ## ## Calculating matches for each variable. ## Getting counts for parameter estimation. ## Parallelizing calculation using OpenMP. 1 threads out of 12 are used. ## Running the EM algorithm. ## Getting the indices of estimated matches. ## Parallelizing calculation using OpenMP. 1 threads out of 12 are used. ## Calculating the posterior for each pair of matched observations. ## Getting the match patterns for each estimated match.
Now, you can review the potential duplicates with
getMatches(). Provide the data frame as both
dfA = and
dfB =, and provide the output of the
fastLink() function as
fl.out must be of class
fastLink.dedupe, or in other words, the result of
## Run getMatches() cases_dedupe <- getMatches( dfA = cases_dup, dfB = cases_dup, fl.out = dedupe_output)
See the right-most column, which indicates the duplicate IDs - the final two rows are identified as being likely duplicates of rows 2 and 3.
To return the row numbers of rows which are likely duplicates, you can count the number of rows per unique value in the
dedupe.ids column, and then filter to keep only those with more than one row. In this case this leaves rows 2 and 3.
cases_dedupe %>% count(dedupe.ids) %>% filter(n > 1)
## dedupe.ids n ## 1 2 2 ## 2 3 2
To inspect the whole rows of the likely duplicates, put the row number in this command:
# displays row 2 and all likely duplicates of it cases_dedupe[cases_dedupe$dedupe.ids == 2,]
## gender first middle last yr mon day district dedupe.ids ## 2 M Anthony B. Smith 1970 9 19 River 2 ## 10 M Tony B. Smith 1970 9 19 River 2
Another method of combining two data frames is “binding” them together. You can also think of this as “appending” or “adding” rows or columns.
This section will also discuss how to “align” the order of rows of one data frame to the order in another data frame. This topic is discussed below in the section on Binding columns.
To bind rows of one data frame to the bottom of another data frame, use
bind_rows() from dplyr. It is very inclusive, so any column present in either data frame will be included in the output. A few notes:
- Unlike the base R version
bind_rows()does not require that the order of columns be the same in both data frames. As long as the column names are spelled identically, it will align them correctly.
- You can optionally specify the argument
.id =. Provide a character column name. This will produce a new column that serves to identify which data frame each row originally came from.
- You can use
listof similarly-structured data frames to combine them into one data frame. See an example in the Iteration, loops, and lists page involving the import of multiple linelists with purrr.
One common example of row binding is to bind a “total” row onto a descriptive table made with dplyr’s
summarise() function. Below we create a table of case counts and median CT values by hospital with a total row.
summarise() is used on data grouped by hospital to return a summary data frame by hospital. But the function
summarise() does not automatically produce a “totals” row, so we create it by summarising the data again, but with the data not grouped by hospital. This produces a second data frame of just one row. We can then bind these data frames together to achieve the final table.
# Create core table ################### hosp_summary <- linelist %>% group_by(hospital) %>% # Group data by hospital summarise( # Create new summary columns of indicators of interest cases = n(), # Number of rows per hospital-outcome group ct_value_med = median(ct_blood, na.rm=T)) # median CT value per group
Here is the
hosp_summary data frame:
Create a data frame with the “total” statistics (not grouped by hospital). This will return just one row.
And below is that
totals data frame. Note how there are only two columns. These columns are also in
hosp_summary, but there is one column in
hosp_summary that is not in
Now we can bind the rows together with
# Bind data frames together combined <- bind_rows(hosp_summary, totals)
Now we can view the result. See how in the final row, an empty
NA value fills in for the column
hospital that was not in
hosp_summary. As explained in the Tables for presentation page, you could “fill-in” this cell with “Total” using
There is a similar dplyr function
bind_cols() which you can use to combine two data frames sideways. Note that rows are matched to each other by position (not like a join above) - for example the 12th row in each data frame will be aligned.
For an example, we bind several summary tables together. In order to do this, we also demonstrate how to re-arrange the order of rows in one data frame to match the order in another data frame, with
Here we define
case_info as a summary data frame of linelist cases, by hospital, with the number of cases and the number of deaths.
And let’s say that here is a different data frame
contact_fu containing information on the percent of exposed contacts investigated and “followed-up”, again by hospital.
contact_fu <- data.frame( hospital = c("St. Mark's Maternity Hospital (SMMH)", "Military Hospital", "Missing", "Central Hospital", "Port Hospital", "Other"), investigated = c("80%", "82%", NA, "78%", "64%", "55%"), per_fu = c("60%", "25%", NA, "20%", "75%", "80%") )
Note that the hospitals are the same, but are in different orders in each data frame. The easiest solution would be to use a
left_join() on the
hospital column, but you could also use
bind_cols() with one extra step.
Because the row orders are different, a simple
bind_cols() command would result in a mis-match of data. To fix this we can use
match() from base R to align the rows of a data frame in the same order as in another. We assume for this approach that there are no duplicate values in either data frame.
When we use
match(), the syntax is
match(TARGET ORDER VECTOR, DATA FRAME COLUMN TO CHANGE), where the first argument is the desired order (either a stand-alone vector, or in this case a column in a data frame), and the second argument is the data frame column in the data frame that will be re-ordered. The output of
match() is a vector of numbers representing the correct position ordering. You can read more with
##  4 2 3 6 5 1
You can use this numeric vector to re-order the data frame - place it within subset brackets
[ ] before the comma. Read more about base R bracket subset syntax in the R basics page. The command below creates a new data frame, defined as the old one in which the rows are ordered in the numeric vector above.
contact_fu_aligned <- contact_fu[match(case_info$hospital, contact_fu$hospital),]
## New names: ## * hospital -> hospital...1 ## * hospital -> hospital...4
## # A tibble: 6 x 6 ## hospital...1 cases deaths hospital...4 investigated per_fu ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Central Hospital 454 193 St. Mark's Maternity Hospital (SMMH) 80% 60% ## 2 Military Hospital 896 399 Military Hospital 82% 25% ## 3 Missing 1469 611 Missing <NA> <NA> ## 4 Other 885 395 Central Hospital 78% 20% ## 5 Port Hospital 1762 785 Port Hospital 64% 75% ## 6 St. Mark's Maternity Hospital (SMMH) 422 199 Other 55% 80%
A base R alternative to
cbind(), which performs the same operation.