10  Characters and strings

This page demonstrates use of the stringr package to evaluate and handle character values (“strings”).

  1. Combine, order, split, arrange - str_c(), str_glue(), str_order(), str_split()
  2. Clean and standardise
    • Adjust length - str_pad(), str_trunc(), str_wrap()
    • Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
  3. Evaluate and extract by position - str_length(), str_sub(), word()
  4. Patterns
    • Detect and locate - str_detect(), str_subset(), str_match(), str_extract()
    • Modify and replace - str_sub(), str_replace_all()
  5. Regular expressions (“regex”)

For ease of display, most examples act on a short, defined character vector; however, they can easily be adapted to a column within a data frame.

This stringr vignette provided much of the inspiration for this page.

10.1 Preparation

Load packages

Install or load the stringr and other tidyverse packages.

# install/load packages
pacman::p_load(
  rio,        # for importing data with import()
  stringr,    # many functions for handling strings
  tidyverse,  # for optional data manipulation
  tools)      # alternative for converting to title case

Import data

In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, download the “clean” linelist (as an .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

10.2 Unite, split, and arrange

This section covers:

  • Using str_c(), str_glue(), and unite() to combine strings
  • Using str_order() to arrange strings
  • Using str_split() and separate() to split strings

Combine strings

To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr. If you have distinct character values to combine, simply provide them as separate arguments, separated by commas.

str_c("String1", "String2", "String3")
[1] "String1String2String3"

The argument sep = inserts a character value between each of the arguments you provided (e.g. inserting a comma, space, or newline "\n").

str_c("String1", "String2", "String3", sep = ", ")
[1] "String1, String2, String3"

The argument collapse = is relevant if you are inputting multiple vectors as arguments to str_c(). It is used to separate the elements of what would be an output vector, such that the output vector only has one long character element.

The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts. In this example:

  • The sep = value appears between each first and last name
  • The collapse = value appears between each person

first_names <- c("abdul", "fahruk", "janice")
last_names  <- c("hussein", "akinleye", "okeke")

# sep displays between the respective input strings, while collapse displays between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")
[1] "abdul hussein;  fahruk akinleye;  janice okeke"

Note: Depending on your desired display context, when printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:

# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
abdul hussein;
fahruk akinleye;
janice okeke

Dynamic strings

Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.

  • All content goes between double quotation marks str_glue("")
  • Any dynamic code or references to pre-defined values are placed within curly brackets {} within the double quotation marks. There can be many curly brackets in the same str_glue() command.
  • To display character quotes ’’, use single quotes within the surrounding double quotes (e.g. when providing date format - see example below)
  • Tip: You can use \n to force a new line
  • Tip: You can use format() to adjust date display, and Sys.Date() to display the current date

A simple example of a dynamic plot caption:

str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
Data include 5888 cases and are current to 19 Jun 2024.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.

str_glue("Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
Linelist as of 19 Jun 2024.
Last case hospitalized on 30 Apr 2015.
256 cases are missing date of onset and not shown

Pulling from a data frame

Sometimes, it is useful to pull data from a data frame and have it pasted together in sequence. Below is an example data frame. We will use it to make a summary statement about the jurisdictions and the new and total case counts.

# make case data frame
case_table <- data.frame(
  zone        = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
  )

Use str_glue_data(), which is specially made for taking data from data frame rows:

case_table %>% 
  str_glue_data("{zone}: {new_cases} ({total_cases} total cases)")
Zone 1: 3 (40 total cases)
Zone 2: 0 (4 total cases)
Zone 3: 7 (25 total cases)
Zone 4: 0 (10 total cases)
Zone 5: 15 (103 total cases)

Combine strings across rows

If you are trying to “roll-up” values in a data frame column, e.g. combine values from multiple rows into just one row by pasting them together with a separator, see the section of the De-duplication page on “rolling-up” values.
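As a minimal sketch of what such a “roll-up” can look like, values can be collapsed within groups by using str_c() with collapse = inside summarise() from dplyr. The data frame visits and its columns are hypothetical, invented only for this illustration.

```r
library(dplyr)
library(stringr)

# hypothetical data: one row per symptom reported by each case
visits <- data.frame(
  case_ID = c(1, 1, 2, 3, 3, 3),
  symptom = c("fever", "chills", "cough", "fever", "aches", "vomiting"))

# roll up to one row per case, pasting the symptoms together with "; "
rolled <- visits %>% 
  group_by(case_ID) %>% 
  summarise(symptoms = str_c(symptom, collapse = "; "))

rolled
```

Each case now occupies a single row, with its symptoms pasted together in their original order.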

Data frame to one line

You can make the statement appear in one line using str_c() (specifying the data frame and column names), and providing sep = and collapse = arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")
[1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

You could add the prefix text “New Cases:” to the beginning of the statement by wrapping with a separate str_c() (if “New Cases:” were within the original str_c() it would appear multiple times).

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))
[1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

Unite columns

Within a data frame, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().

Provide the name of the new united column. Then provide the names of the columns you wish to unite.

  • By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
  • remove = removes the input columns from the data frame (TRUE by default)
  • na.rm = removes missing values while uniting (FALSE by default)

Below, we define a mini-data frame to demonstrate with:

df <- data.frame(
  case_ID = c(1:6),
  symptoms  = c("jaundice, fever, chills",     # patient 1
                "chills, aches, pains",        # patient 2 
                "fever",                       # patient 3
                "vomiting, diarrhoea",         # patient 4
                "bleeding from gums, fever",   # patient 5
                "rapid pulse, headache"),      # patient 6
  outcome = c("Recover", "Death", "Death", "Recover", "Recover", "Recover"))
df_split <- separate(df, symptoms, into = c("sym_1", "sym_2", "sym_3"), extra = "merge")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].

Here is the example data frame:

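Below, we unite the three symptom columns. Note that separate() was called above without a sep = argument, so it split the text at every non-alphanumeric character, spaces as well as commas. This is why multi-word symptoms such as “bleeding from gums” appear as “bleeding, from, gums” once united: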

df_split %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
  case_ID                all_symptoms outcome
1       1     jaundice, fever, chills Recover
2       2        chills, aches, pains   Death
3       3                       fever   Death
4       4         vomiting, diarrhoea Recover
5       5 bleeding, from, gums, fever Recover
6       6      rapid, pulse, headache Recover

Split

To split a string based on a pattern, use str_split(). It evaluates the string(s) and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string and splits it into three. By default it returns an object of class list with one element (a character vector) for each string initially provided. If simplify = TRUE it returns a character matrix.

In this example, one string is provided, and the function returns a list with one element - a character vector with three values.

str_split(string = "jaundice, fever, chills",
          pattern = ",")
[[1]]
[1] "jaundice" " fever"   " chills" 

If the output is saved, you can then access the nth split value with bracket syntax. To access a specific value you can use syntax like this: the_returned_object[[1]][2], which would access the second value from the first evaluated string (“fever”). See the R basics page for more detail on accessing elements.

pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list
[1] " fever"
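Note the leading space in " fever": the split was made on the comma alone. A small sketch of two ways to avoid this, either by including the space in the pattern or by trimming the pieces afterwards with str_trim(). The object names pieces_1 and pieces_2 are only illustrative.

```r
library(stringr)

# option 1: include the space in the pattern
pieces_1 <- str_split("jaundice, fever, chills", ", ")[[1]]

# option 2: split on the comma, then trim whitespace from each piece
pieces_2 <- str_trim(str_split("jaundice, fever, chills", ",")[[1]])

pieces_1   # "jaundice" "fever" "chills"
pieces_2   # "jaundice" "fever" "chills"
```

The second option is more forgiving if the spacing after commas is inconsistent in your data.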

If multiple strings are provided to str_split(), there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
[[1]]
[1] "jaundice" " fever"   " chills" 

[[2]]
[1] "chills" " aches" " pains"

[[3]]
[1] "fever"

[[4]]
[1] "vomiting"   " diarrhoea"

[[5]]
[1] "bleeding from gums" " fever"            

[[6]]
[1] "rapid pulse" " headache"  

To return a “character matrix” instead, which may be useful if creating data frame columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = TRUE)
     [,1]                 [,2]         [,3]     
[1,] "jaundice"           " fever"     " chills"
[2,] "chills"             " aches"     " pains" 
[3,] "fever"              ""           ""       
[4,] "vomiting"           " diarrhoea" ""       
[5,] "bleeding from gums" " fever"     ""       
[6,] "rapid pulse"        " headache"  ""       

You can also limit the number of pieces returned with the n = argument. For example, n = 2 splits each string at the first comma only, returning at most two pieces. Any further commas remain within the second piece.

str_split(symptoms, ",", simplify = TRUE, n = 2)
     [,1]                 [,2]            
[1,] "jaundice"           " fever, chills"
[2,] "chills"             " aches, pains" 
[3,] "fever"              ""              
[4,] "vomiting"           " diarrhoea"    
[5,] "bleeding from gums" " fever"        
[6,] "rapid pulse"        " headache"     

Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).

str_split_fixed(symptoms, ",", n = 2)

Split columns

If you are trying to split a data frame column, it is best to use the separate() function from tidyr. It is used to split one character column into other columns.

Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.

Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.

  • sep = the separator, can be a character, or a number (interpreted as the character position to split at)
  • remove = TRUE by default, removes the input column
  • convert = FALSE by default; if TRUE, the new columns are converted to an appropriate type (e.g. numeric) and string “NA”s become NA
  • extra = this controls what happens if there are more values created by the separation than new columns named.
    • extra = "warn" means you will see a warning but it will drop excess values (the default)
    • extra = "drop" means the excess values will be dropped with no warning
    • extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data

An example with extra = "merge" is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
  case_ID              sym_1          sym_2 outcome
1       1           jaundice  fever, chills Recover
2       2             chills   aches, pains   Death
3       3              fever           <NA>   Death
4       4           vomiting      diarrhoea Recover
5       5 bleeding from gums          fever Recover
6       6        rapid pulse       headache Recover

When the default extra = "warn" is used below, a warning is given but the third symptoms are lost:

# third symptoms are lost
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",")
Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
  case_ID              sym_1      sym_2 outcome
1       1           jaundice      fever Recover
2       2             chills      aches   Death
3       3              fever       <NA>   Death
4       4           vomiting  diarrhoea Recover
5       5 bleeding from gums      fever Recover
6       6        rapid pulse   headache Recover

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.

Arrange alphabetically

Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)
[1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
[1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
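A brief sketch of some sorting options, reusing the health_zones vector from above. The "Zone" vector is made up for illustration, and "en" (English) is used as the locale only because it is the stringr default; substitute the code of the locale you need.

```r
library(stringr)

health_zones <- c("Alba", "Takota", "Delta")

# reverse alphabetical order
str_sort(health_zones, decreasing = TRUE)   # "Takota" "Delta" "Alba"

# state the locale explicitly
str_sort(health_zones, locale = "en")       # "Alba" "Delta" "Takota"

# numeric = TRUE sorts embedded numbers by value, not character by character
str_sort(c("Zone 10", "Zone 2", "Zone 1"), numeric = TRUE)  # "Zone 1" "Zone 2" "Zone 10"
```

Without numeric = TRUE, "Zone 10" would sort before "Zone 2", because the comparison is made one character at a time.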

base R functions

It is common to see base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax is arguably more complicated - in the parentheses each part is separated by a comma. The parts are either character text (in quotes) or pre-defined code objects (no quotes). For example:

n_beds <- 10
n_masks <- 20

paste0("Regional hospital needs ", n_beds, " beds and ", n_masks, " masks.")
[1] "Regional hospital needs 10 beds and 20 masks."

paste() accepts both sep = and collapse = arguments, with a default of sep = " " (one space). paste0() is simply paste() with sep = "" (no separator); it accepts collapse = but has no sep = argument.
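A short sketch comparing these functions, reusing the name vectors from earlier on this page. The object names spaced, one_line, and joined are only illustrative.

```r
library(stringr)

first_names <- c("abdul", "fahruk", "janice")
last_names  <- c("hussein", "akinleye", "okeke")

# paste() inserts one space between the pieces by default
spaced <- paste(first_names, last_names)

# sep = and collapse = work in paste() much as they do in str_c()
one_line <- paste(first_names, last_names, sep = "_", collapse = "; ")

# paste0() inserts nothing between the pieces
joined <- paste0(first_names, last_names)

# one behavioural difference: paste() turns NA into the text "NA",
# whereas str_c() returns NA if any piece is missing
paste("Zone", NA)   # "Zone NA"
str_c("Zone", NA)   # NA
```

The handling of missing values is worth keeping in mind when pasting together columns that may contain NA.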

10.3 Clean and standardise

Change case

Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), from stringr, as shown below:

str_to_upper("California")
[1] "CALIFORNIA"
str_to_lower("California")
[1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case

Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")
[1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).

tools::toTitleCase("This is the US state of california")
[1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")
[1] "The patient must be transported"

Pad length

Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
[1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
[1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 
[1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")

Truncate

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
[1] "Symp...ing"
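A sketch of the other side = options and of the ellipsis = argument, applied to the same string:

```r
library(stringr)

original <- "Symptom onset on 4/3/2020 with vomiting"

# ellipsis on the right (the default side)
str_trunc(original, 10, "right")          # "Symptom..."

# ellipsis on the left
str_trunc(original, 10, "left")           # "...omiting"

# no ellipsis at all: simply keep the first 10 characters
str_trunc(original, 10, ellipsis = "")    # "Symptom on"
```

In each case the result is exactly 10 characters long, including any ellipsis.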

Standardize length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then one very short value is padded to achieve length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
[1] "R10.13" "R10..." "R17"   
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
[1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string input. Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")). The default is "both".

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed on both sides (the default); here only the right side had excess spaces
str_trim(IDs)
[1] "provA_1852" "provA_2345" "provA_9460"
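To make the effect of the side argument visible, here is a small sketch using a made-up ID that has two spaces on each side:

```r
library(stringr)

ID <- "  provA_1852  "   # two excess spaces on each side

str_trim(ID, side = "left")    # "provA_1852  "
str_trim(ID, side = "right")   # "  provA_1852"
str_trim(ID, side = "both")    # "provA_1852"
```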

Remove repeated whitespace within

Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 
[1] "Pt requires IV saline"

Enter ?str_trim, ?str_pad in your R console to see further details.

Wrap into paragraphs

Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)
[1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))
Symptom onset 1/4/2020 vomiting chills
fever. Pt saw traditional healer in
home village on 2/4/2020. On 5/4/2020
pt symptoms worsened and was admitted
to Lumta clinic. Sample was taken and pt
was transported to regional hospital on
6/4/2020. Pt died at regional hospital
on 7/4/2020.

10.4 Handle by position

Extract by character position

Use str_sub() to return only a part of a string. The function takes three main arguments:

  1. the character vector(s)
  2. start position
  3. end position

A few notes on position numbers:

  • If a position number is positive, the position is counted starting from the left end of the string.
  • If a position number is negative, it is counted starting from the right end of the string.
  • Position numbers are inclusive.
  • Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string “pneumonia”:

# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)
[1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
[1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
[1] "onia"
# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)
[1] "moni"
# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)
[1] "umonia"

Extract by word position

To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
[1] "I just got"       "My stomach hurts" "Severe ear pain" 
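Two further variations, sketched with made-up strings: a different separator, and a negative position, which counts from the last word.

```r
library(stringr)

# 'words' separated by underscores: extract the 2nd one
word("provA_1852_2020", start = 2, sep = "_")   # "1852"

# negative positions count from the end: extract the last word
word("Severe ear pain", start = -1)             # "pain"
```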

Replace by character position

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

# print
word
[1] "pnXXmonia"

An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.

words <- c("pneumonia", "tuberculosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words
[1] "pnXXmonia"    "tuXXrculosis" "HIXX"        

Evaluate length

str_length("abc")
[1] 3

Alternatively, use nchar() from base R
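str_length() returns the number of characters in each string. A brief sketch on a vector that includes an empty string and a missing value (the vector x is made up for illustration):

```r
library(stringr)

x <- c("abc", "", NA, "Zone 1")

str_length(x)   # 3 0 NA 6
nchar(x)        # 3 0 NA 6
```

Spaces count as characters, an empty string has length 0, and a missing value in a character vector returns NA.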

10.5 Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect a pattern

Use str_detect() as below to detect presence/absence of a pattern within a string. First provide the string or vector to search in (string =), and then the pattern to look for (pattern =). Note that by default the search is case sensitive!

str_detect(string = "primary school teacher", pattern = "teach")
[1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect(string = "primary school teacher", pattern = "teach", negate = TRUE)
[1] FALSE

To ignore case/capitalization, wrap the pattern within regex(), and within regex() add the argument ignore_case = TRUE (or T as shorthand).

str_detect(string = "Teacher", pattern = regex("teach", ignore_case = T))
[1] TRUE

When str_detect() is applied to a character vector or a data frame column, it will return TRUE or FALSE for each of the values.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count the TRUEs, simply sum() the output. This counts the number of TRUE values.

sum(str_detect(occupations, "teach"))
[1] 1

To search inclusive of multiple terms, include them separated by OR bars (|) within the pattern = argument, as shown below:

sum(str_detect(string = occupations, pattern = "teach|professor|tutor"))
[1] 3

If you need to build a long list of search terms, you can combine them using str_c() and sep = "|", save the result as a character object, and then reference that object later more succinctly. The example below includes possible occupation search terms for front-line medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                  "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                                  "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                                  "cna", "pa", "physician assistant", "mental health",
                                  "emergency department technician", "resp therapist", "respiratory",
                                  "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                                  "rehab", "activity", "elderly", "subacute", "sub acute",
                                  "clinic", "post acute", "therapist", "extended care",
                                  "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
[1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(string = occupations, pattern = occupation_med_frontline))
[1] 2

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
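A minimal sketch of these base R functions on made-up strings (the vector jobs is only illustrative):

```r
jobs <- c("Teacher", "tutor", "nurse")

# logical vector: does each string contain "teach", ignoring case?
grepl("teach", jobs, ignore.case = TRUE)   # TRUE FALSE FALSE

# sub() replaces only the first match in each string
sub("a", "X", "banana")                    # "bXnana"

# gsub() replaces every match
gsub("a", "X", "banana")                   # "bXnXnX"
```

Note the argument order: in these base functions the pattern comes first and the strings to search come later, the reverse of stringr.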

Convert commas to periods

Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain.

The inner gsub(), which acts first on lengths, replaces any periods with an empty string "". The period character "." has to be “escaped” with two backslashes to actually signify a period, because "." in regex means “any character”. Then, the result (with only commas) is passed to the outer gsub(), in which commas are replaced by periods.

lengths <- c("2.454,56", "1,2", "6.096,5")

as.numeric(gsub(pattern = ",",                # find commas     
                replacement = ".",            # replace with periods
                x = gsub("\\.", "", lengths)  # vector with other periods removed (periods escaped)
                )
           )                                  # convert outcome to numeric
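For comparison, a sketch of the same conversion written with stringr. Here fixed() is used so that the period is matched literally rather than as a regex wildcard. The object names no_thousands and lengths_num are only illustrative.

```r
library(stringr)

lengths <- c("2.454,56", "1,2", "6.096,5")

# remove the periods used as thousands separators (fixed() matches "." literally)
no_thousands <- str_replace_all(lengths, fixed("."), "")

# convert decimal commas to periods, then convert to numeric
lengths_num <- as.numeric(str_replace_all(no_thousands, ",", "."))

lengths_num   # 2454.56    1.20 6096.50
```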

Replace all

Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated to string =, then the pattern to be replaced to pattern =, and then the replacement value to replacement =. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(string = outcome, pattern = "dead", replacement = "deceased")
[1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

Notes:

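  • To convert missing values (NA) into the character string "NA", use str_replace_na(). To replace a matched string with NA instead, supply replacement = NA_character_ to str_replace().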
  • The function str_replace() replaces only the first instance of the pattern within each evaluated string.
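A short sketch contrasting these functions on made-up values. Note that str_replace_na() acts on missing values, turning them into the text "NA".

```r
library(stringr)

status <- "dead or not dead"

# only the first instance is replaced
str_replace(status, "dead", "deceased")       # "deceased or not dead"

# every instance is replaced
str_replace_all(status, "dead", "deceased")   # "deceased or not deceased"

# missing values become the character string "NA"
str_replace_na(c("Karl", NA))                 # "Karl" "NA"
```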

Detect within logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let’s say occupations is a column in the linelist. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic (negate = TRUE):

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    
    # AND must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = TRUE)                       ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                            ~ "Not an educator"))
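A self-contained sketch of the same inclusion-and-exclusion logic that can be run as is. The data frame jobs and its occupation column are hypothetical, invented only for this illustration.

```r
library(dplyr)
library(stringr)

# hypothetical mini data frame with an occupation column
jobs <- data.frame(
  occupation = c("primary school teacher",
                 "University administrator",
                 "nurse",
                 "Math tutor"))

jobs <- jobs %>% 
  mutate(is_educator = case_when(
    # must contain a search term AND must NOT contain an exclusion term
    str_detect(occupation, regex("teach|prof|tutor|university", ignore_case = TRUE)) &
      str_detect(occupation, regex("admin", ignore_case = TRUE), negate = TRUE) ~ "Educator",
    # all other rows
    TRUE ~ "Not an educator"))

jobs
```

The “University administrator” row matches a search term but is excluded by the “admin” criterion, so it is classified as “Not an educator”.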

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")
     start end
[1,]     5   6

Like other str functions, there is an “_all” version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern
     start end
[1,]     6   6
[2,]     3   3
[3,]     1   1
[4,]     4   4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
[[1]]
     start end
[1,]     6   6

[[2]]
     start end
[1,]     3   3

[[3]]
     start end
[1,]     1   1
[2,]     4   4

[[4]]
     start end
[1,]     4   4
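
The located positions can be passed on to other functions. As a minimal sketch, the start and end columns returned by str_locate() are given to str_sub() to pull out the matched text. The search term "hope" is chosen only for illustration.

```r
library(stringr)

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

# locate the first "hope" in each phrase (NA where there is no match)
loc <- str_locate(phrases, "hope")

# use the start and end columns to extract the matched text
found <- str_sub(phrases, start = loc[, "start"], end = loc[, "end"])
found
# NA "hope" "hope" "hope"
```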

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (defined earlier on this page) for either “teach”, “prof”, or “tutor”.

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")
[[1]]
character(0)

[[2]]
[1] "prof"

[[3]]
[1] "teach" "tutor"

[[4]]
[1] "tutor"

[[5]]
character(0)

[[6]]
character(0)

[[7]]
character(0)

[[8]]
character(0)

[[9]]
character(0)

[[10]]
character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.

str_extract(occupations, "teach|prof|tutor")
 [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA     
[10] NA     
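
The sketch below shows the NA removal described above on a short hypothetical vector (`jobs` is invented for illustration). Wrapping the result in as.character() is an optional extra step that drops the attributes which na.exclude() attaches to its output.

```r
library(stringr)

# hypothetical occupations, invented for illustration
jobs <- c("nurse", "university professor", "primary school teacher & tutor")

# first match only; NA where nothing matched
first_match <- str_extract(jobs, "teach|prof|tutor")
first_match                                      # NA "prof" "teach"

# drop the NAs and keep a plain character vector
matched_only <- as.character(na.exclude(first_match))
matched_only                                     # "prof" "teach"
```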

Subset and count

Related functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")
[1] "university professor"           "primary school teacher & tutor"
[3] "tutor"                         

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
 [1] 0 1 2 1 0 0 0 0 0 0

Regex groups

UNDER CONSTRUCTION

10.6 Special characters

Backslash \ as escape

The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.
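
As a brief illustration using base R only, the strings below (hypothetical values) contain an escaped quote mark and an escaped backslash. writeLines() displays the string as it is stored, while nchar() confirms that each escape sequence counts as one character.

```r
# a quote mark escaped inside a quoted string
quoted <- "She said \"hello\""

# two backslashes in the code store a single backslash in the string
path <- "C:\\data"

writeLines(quoted)   # displays: She said "hello"
writeLines(path)     # displays: C:\data
nchar(path)          # 7, because the escaped backslash is one character
```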

Special characters

Special character Represents
"\\" backslash
"\n" a new line (newline)
"\"" double-quote within double quotes
'\'' single-quote within single quotes
"\`" grave accent
"\r" carriage return
"\t" tab
"\v" vertical tab
"\b" backspace

Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).

10.7 Regular expressions (regex)

10.8 Regex and special characters

Regular expressions, or “regex”, are a concise language for describing patterns in strings. If you are not familiar with them, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.

Much of this section is adapted from this tutorial and this cheatsheet. We selectively adapt here knowing that this handbook might be viewed by people without internet access to view the other tutorials.

A regular expression is often applied to extract specific patterns from “unstructured” text - for example medical notes, chief complaints, patient history, or other free text columns in a data frame.

There are four basic tools one can use to create a basic regular expression:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Groups

Character sets

Character sets are a way of listing the options for a single-character match, written within brackets. A match is triggered if any one of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

Character set Matches for
"[A-Z]" any single capital letter
"[a-z]" any single lowercase letter
"[0-9]" any digit
[:alnum:] any alphanumeric character
[:digit:] any numeric digit
[:alpha:] any letter (upper or lowercase)
[:upper:] any uppercase letter
[:lower:] any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
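
A short sketch of character sets in use is shown below. The `ids` vector is hypothetical and invented for illustration.

```r
library(stringr)

ids <- c("case_A12", "b7", "XYZ")   # hypothetical values

str_detect(ids, "[A-Z]")      # contains a capital letter?   TRUE FALSE TRUE
str_extract(ids, "[0-9]")     # first digit in each value:   "1" "7" NA
str_count(ids, "[A-Za-z]")    # number of letters:           5 1 3
str_extract_all(ids, "[aeiou]")   # a list of the lowercase vowels in each value
```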

Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character Represents
"\\s" any single whitespace character (a space, tab, or newline)
"\\w" any single "word" character (A-Z, a-z, 0-9, or underscore)
"\\d" any single numeric digit (0-9)

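As a quick check of what each meta character matches, the sketch below counts matches in a hypothetical string that contains a space, a tab (written "\t"), and an underscore.

```r
library(stringr)

x <- "Room 12,\tbed_3"   # hypothetical string; \t is a tab

str_count(x, "\\d")   # 3  (the digits 1, 2 and 3)
str_count(x, "\\s")   # 2  (the space and the tab)
str_count(x, "\\w")   # 11 (letters, digits and the underscore; not the comma)
```
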
Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to specify how many times in a row a character (or character set) must appear for the match.

Most quantifiers are numbers written within curly brackets { } directly after the character they are quantifying; the symbols + and * are shorthand quantifiers. For example,

  • "A{2}" will return instances of two capital A letters.
  • "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
  • "A{2,}" will return instances of two or more capital A letters.
  • "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
  • "A*" will return instances of zero or more capital A letters. The * asterisk follows the character it quantifies (useful if you are not sure the character is present).

Using the + plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (runs of alpha characters): "[A-Za-z]+"

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")
[[1]]
[1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.

str_extract_all(test, "A{2,4}")
[[1]]
[1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")
[[1]]
[1] "A"    "AA"   "AAA"  "AAAA"
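
The difference between the * and + quantifiers is easiest to see when the quantified character may be absent. In this sketch (the `codes` vector is hypothetical), "A*b" accepts a "b" with no preceding A, while "A+b" requires at least one.

```r
library(stringr)

codes <- c("b", "Ab", "AAb")   # hypothetical values

str_extract(codes, "A*b")   # zero or more A, then b:  "b" "Ab" "AAb"
str_extract(codes, "A+b")   # one or more A, then b:   NA  "Ab" "AAb"
```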

Relative position

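These express requirements for what precedes or follows a pattern, without including that surrounding text in the match. For example, "(?<=\\.)\\s(?=[A-Z])" matches a whitespace character that is preceded by a period and followed by a capital letter, which is one way to find the boundaries between sentences. Below, the expression "A(?=-)" returns each A in the test string that is immediately followed by a hyphen.

str_extract_all(test, "A(?=-)")
[[1]]
[1] "A" "A" "A"

Position statement Matches to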
"(?<=b)a" “a” that is preceded by a “b”
"(?<!b)a" “a” that is NOT preceded by a “b”
"a(?=b)" “a” that is followed by a “b”
"a(?!b)" “a” that is NOT followed by a “b”

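The position statements in the table can be tried on short strings. The sketch below uses hypothetical strings: it extracts a number only when it is followed by " kg" or preceded by "height ", and then splits text at sentence boundaries with the whitespace pattern described in this section.

```r
library(stringr)

measures <- "height 172 cm, weight 68 kg"   # hypothetical string

str_extract(measures, "\\d+(?= kg)")        # digits followed by " kg":       "68"
str_extract(measures, "(?<=height )\\d+")   # digits preceded by "height ":   "172"

# split where whitespace sits between a period and a capital letter
sentences <- str_split("First one. Second one.", "(?<=\\.)\\s(?=[A-Z])")
sentences[[1]]                              # "First one." "Second one."
```
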
Groups

Capturing groups in your regular expression is a way to have a more organized output upon extraction.
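
As a minimal sketch of this idea, parentheses mark the groups and str_match() returns a character matrix: the first column holds the full match and each following column holds one group. The `visit` string is hypothetical.

```r
library(stringr)

visit <- "Seen at 18:00 on 6/12/2005"   # hypothetical string

# three groups: two short numbers and a four-digit year, separated by slashes
m <- str_match(visit, "(\\d{1,2})/(\\d{1,2})/(\\d{4})")

m[1, 1]   # full match:     "6/12/2005"
m[1, 2]   # first group:    "6"
m[1, 3]   # second group:   "12"
m[1, 4]   # third group:    "2005"
```

Having each piece in its own column makes it straightforward to convert the parts to numbers or to assign them to separate columns of a data frame.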

Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches to all words (each run of letters, ending at the first non-letter such as a space, digit, or punctuation mark):

str_extract_all(pt_note, "[A-Za-z]+")
[[1]]
 [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"   
 [6] "emergency"   "ward"        "at"          "on"          "Patient"    
[11] "presented"   "with"        "radiating"   "abdominal"   "pain"       
[16] "from"        "LR"          "quadrant"    "Patient"     "skin"       
[21] "was"         "pale"        "cool"        "and"         "clammy"     
[26] "Patient"     "temperature" "was"         "degrees"     "farinheit"  
[31] "Patient"     "pulse"       "rate"        "was"         "bpm"        
[36] "and"         "thready"     "Respiratory" "rate"        "was"        
[41] "per"         "minute"     

The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")
[[1]]
 [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
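
Combining character sets, quantifiers, and literal separators gives more targeted results than digits alone. The sketch below uses a shortened, hypothetical version of the note, and the patterns are one possible way to write these searches rather than the only one.

```r
library(stringr)

# shortened hypothetical note, invented for illustration
note <- "Patient arrived at 18:00 on 6/12/2005. Temperature was 99.8 degrees."

str_extract(note, "\\d{1,2}:\\d{2}")            # a clock time:        "18:00"
str_extract(note, "\\d{1,2}/\\d{1,2}/\\d{4}")   # a date with slashes: "6/12/2005"
str_extract(note, "\\d+\\.\\d+")                # a decimal number:    "99.8"
```

The period is written as "\\." because an unescaped period matches any character in a regular expression.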

You can view a useful list of regular expressions and tips on page 2 of this cheatsheet.

Also see this tutorial.

10.9 Resources

A reference sheet for stringr functions can be found here

A vignette on stringr can be found here