10 Characters and strings
This page demonstrates use of the stringr package to evaluate and handle character values (“strings”).
- Combine, order, split, arrange - str_c(), str_glue(), str_order(), str_split()
- Clean and standardise
  - Adjust length - str_pad(), str_trunc(), str_wrap()
  - Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
- Evaluate and extract by position - str_length(), str_sub(), word()
- Patterns
  - Detect and locate - str_detect(), str_subset(), str_match(), str_extract()
  - Modify and replace - str_sub(), str_replace_all()
- Regular expressions (“regex”)
For ease of display, most examples are shown acting on a short defined character vector; however, they can easily be adapted to a column within a data frame.
This stringr vignette provided much of the inspiration for this page.
10.1 Preparation
Load packages
Install or load the stringr and other tidyverse packages.
# install/load packages
pacman::p_load(
  rio,        # import() of data files
  stringr,    # many functions for handling strings
  tidyverse,  # for optional data manipulation
  tools)      # alternative for converting to title case
Import data
In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
# import case linelist
linelist <- import("linelist_cleaned.rds")
The first 50 rows of the linelist are displayed below.
10.2 Unite, split, and arrange
This section covers:
- Using str_c(), str_glue(), and unite() to combine strings
- Using str_order() to arrange strings
- Using str_split() and separate() to split strings
Combine strings
To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr. If you have distinct character values to combine, simply provide them as separate arguments, separated by commas.
str_c("String1", "String2", "String3")
[1] "String1String2String3"
The argument sep =
inserts a character value between each of the arguments you provided (e.g. inserting a comma, space, or newline "\n"
)
str_c("String1", "String2", "String3", sep = ", ")
[1] "String1, String2, String3"
The argument collapse =
is relevant if you are inputting multiple vectors as arguments to str_c()
. It is used to separate the elements of what would be an output vector, such that the output vector only has one long character element.
The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts. In this example:
- The
sep =
value appears between each first and last name
- The
collapse =
value appears between each person
first_names <- c("abdul", "fahruk", "janice")
last_names  <- c("hussein", "akinleye", "okeke")
# sep displays between the respective input strings, while collapse displays between the elements produced
str_c(first_names, last_names, sep = " ", collapse = "; ")
[1] "abdul hussein; fahruk akinleye; janice okeke"
Note: Depending on your desired display context, when printing such a combined string with newlines, you may need to wrap the whole phrase in cat()
for the newlines to print properly:
# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
abdul hussein;
fahruk akinleye;
janice okeke
Dynamic strings
Use str_glue()
to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.
- All content goes between double quotation marks: str_glue("")
- Any dynamic code or references to pre-defined values are placed within curly brackets {} inside the double quotation marks. There can be many curly brackets in the same str_glue() command.
- To display character quotes, use single quotes within the surrounding double quotes (e.g. when providing a date format - see example below)
- Tip: You can use \n to force a new line
- Tip: You can use format() to adjust date display, and Sys.Date() to display the current date
A simple example of a dynamic plot caption:
str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
Data include 5888 cases and are current to 19 Jun 2024.
An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue()
function, as below. This can improve code readability if the text is long.
str_glue("Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
current_date = format(Sys.Date(), '%d %b %Y'),
last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
)
Linelist as of 19 Jun 2024.
Last case hospitalized on 30 Apr 2015.
256 cases are missing date of onset and not shown
Pulling from a data frame
Sometimes, it is useful to pull data from a data frame and have it pasted together in sequence. Below is an example data frame. We will use it to make a summary statement about the jurisdictions and the new and total case counts.
# make case data frame
case_table <- data.frame(
  zone        = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
  )
Use str_glue_data()
, which is specially made for taking data from data frame rows:
case_table %>%
  str_glue_data("{zone}: {new_cases} ({total_cases} total cases)")
Zone 1: 3 (40 total cases)
Zone 2: 0 (4 total cases)
Zone 3: 7 (25 total cases)
Zone 4: 0 (10 total cases)
Zone 5: 15 (103 total cases)
Combine strings across rows
If you are trying to “roll-up” values in a data frame column, e.g. combine values from multiple rows into just one row by pasting them together with a separator, see the section of the De-duplication page on “rolling-up” values.
Data frame to one line
You can make the statement appear in one line using str_c()
(specifying the data frame and column names), and providing sep =
and collapse =
arguments.
str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = "; ")
[1] "Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
You could add the pre-fix text “New Cases:” to the beginning of the statement by wrapping with a separate str_c()
(if “New Cases:” was within the original str_c()
it would appear multiple times).
str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = "; "))
[1] "New Cases: Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
Unite columns
Within a data frame, bringing together character values from multiple columns can be achieved with unite()
from tidyr. This is the opposite of separate()
.
Provide the name of the new united column. Then provide the names of the columns you wish to unite.
- By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
- remove = removes the input columns from the data frame (TRUE by default)
- na.rm = removes missing values while uniting (FALSE by default)
Below, we define a mini-data frame to demonstrate with:
df <- data.frame(
  case_ID  = c(1:6),
  symptoms = c("jaundice, fever, chills",     # patient 1
               "chills, aches, pains",        # patient 2
               "fever",                       # patient 3
               "vomiting, diarrhoea",         # patient 4
               "bleeding from gums, fever",   # patient 5
               "rapid pulse, headache"),      # patient 6
  outcome  = c("Recover", "Death", "Death", "Recover", "Recover", "Recover"))

df_split <- separate(df, symptoms, into = c("sym_1", "sym_2", "sym_3"), extra = "merge")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].
Here is the example data frame:
Below, we unite the three symptom columns:
df_split %>%
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
case_ID all_symptoms outcome
1 1 jaundice, fever, chills Recover
2 2 chills, aches, pains Death
3 3 fever Death
4 4 vomiting, diarrhoea Recover
5 5 bleeding, from, gums, fever Recover
6 6 rapid, pulse, headache Recover
Split
To split a string based on a pattern, use str_split()
. It evaluates the string(s) and returns a list
of character vectors consisting of the newly-split values.
The simple example below evaluates one string and splits it into three. By default it returns an object of class list
with one element (a character vector) for each string initially provided. If simplify = TRUE
it returns a character matrix.
In this example, one string is provided, and the function returns a list with one element - a character vector with three values.
str_split(string = "jaundice, fever, chills",
pattern = ",")
[[1]]
[1] "jaundice" " fever" " chills"
If the output is saved, you can then access the nth split value with bracket syntax. To access a specific value you can use syntax like this: the_returned_object[[1]][2]
, which would access the second value from the first evaluated string (“fever”). See the R basics page for more detail on accessing elements.
pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list
[1] " fever"
If multiple strings are provided to str_split(), there will be more than one element in the returned list.
symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
[[1]]
[1] "jaundice" " fever" " chills"
[[2]]
[1] "chills" " aches" " pains"
[[3]]
[1] "fever"
[[4]]
[1] "vomiting" " diarrhoea"
[[5]]
[1] "bleeding from gums" " fever"
[[6]]
[1] "rapid pulse" " headache"
To return a “character matrix” instead, which may be useful if creating data frame columns, set the argument simplify = TRUE
as shown below:
str_split(symptoms, ",", simplify = TRUE)
[,1] [,2] [,3]
[1,] "jaundice" " fever" " chills"
[2,] "chills" " aches" " pains"
[3,] "fever" "" ""
[4,] "vomiting" " diarrhoea" ""
[5,] "bleeding from gums" " fever" ""
[6,] "rapid pulse" " headache" ""
You can also limit the number of pieces each string is split into with the n = argument. For example, this restricts each string to at most 2 pieces. Any further commas remain within the second value.
str_split(symptoms, ",", simplify = TRUE, n = 2)
[,1] [,2]
[1,] "jaundice" " fever, chills"
[2,] "chills" " aches, pains"
[3,] "fever" ""
[4,] "vomiting" " diarrhoea"
[5,] "bleeding from gums" " fever"
[6,] "rapid pulse" " headache"
Note - the same outputs can be achieved with str_split_fixed()
, in which you do not give the simplify
argument, but must instead designate the number of columns (n
).
str_split_fixed(symptoms, ",", n = 2)
Split columns
If you are trying to split a data frame column, it is best to use the separate() function from tidyr. It is used to split one character column into other columns.
Let’s say we have a simple data frame df
(defined and united in the unite section) containing a case_ID
column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms
column into many columns - each one containing one symptom.
Assuming the data are piped into separate()
, first provide the column to be separated. Then provide into =
as a vector c( )
containing the new columns names, as shown below.
- sep = the separator; can be a character, or a number (interpreted as the character position to split at)
- remove = TRUE by default, removes the input column
- convert = FALSE by default; if TRUE, string “NA”s become NA
- extra = this controls what happens if there are more values created by the separation than new columns named.
  - extra = "warn" means you will see a warning but it will drop excess values (the default)
  - extra = "drop" means the excess values will be dropped with no warning
  - extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data
An example with extra = "merge"
is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:
# third symptoms combined into second new column
df %>%
  separate(symptoms, into = c("sym_1", "sym_2"), sep = ",", extra = "merge")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
case_ID sym_1 sym_2 outcome
1 1 jaundice fever, chills Recover
2 2 chills aches, pains Death
3 3 fever <NA> Death
4 4 vomiting diarrhoea Recover
5 5 bleeding from gums fever Recover
6 6 rapid pulse headache Recover
When the default extra = "warn" is used below, a warning is given but the third symptoms are lost:
# third symptoms are lost
df %>%
  separate(symptoms, into = c("sym_1", "sym_2"), sep = ",")
Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
case_ID sym_1 sym_2 outcome
1 1 jaundice fever Recover
2 2 chills aches Death
3 3 fever <NA> Death
4 4 vomiting diarrhoea Recover
5 5 bleeding from gums fever Recover
6 6 rapid pulse headache Recover
CAUTION: If you do not provide enough into
values for the new columns, your data may be truncated.
Arrange alphabetically
Several strings can be sorted by alphabetical order. str_order()
returns the order, while str_sort()
returns the strings in that order.
# strings
health_zones <- c("Alba", "Takota", "Delta")
# return the alphabetical order
str_order(health_zones)
[1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
[1] "Alba" "Delta" "Takota"
To use a different alphabet, add the argument locale =
. See the full list of locales by entering stringi::stri_locale_list()
in the R console.
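As a rough sketch of how locale = can change the ordering, the example below sorts the same vector under English and Czech rules (in the Czech alphabet, "ch" is treated as its own letter). The vector is hypothetical, and the Czech result depends on the locale data available on your system, so no output is shown for it.

```r
library(stringr)

# hypothetical vector, not taken from the linelist
x <- c("apple", "car", "happy", "char")

# English collation rules
sorted_en <- str_sort(x, locale = "en")
sorted_en

# Czech collation rules; compare where "char" is placed
sorted_cs <- str_sort(x, locale = "cs")
sorted_cs
```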
base R functions
It is common to see base R functions paste()
and paste0()
, which concatenate vectors after converting all parts to character. They act similarly to str_c()
but the syntax is arguably more complicated - in the parentheses each part is separated by a comma. The parts are either character text (in quotes) or pre-defined code objects (no quotes). For example:
n_beds  <- 10
n_masks <- 20
paste0("Regional hospital needs ", n_beds, " beds and ", n_masks, " masks.")
[1] "Regional hospital needs 10 beds and 20 masks."
sep =
and collapse =
arguments can be specified. paste()
is simply paste0()
with a default sep = " "
(one space).
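A minimal sketch of how sep = and collapse = behave in paste() and paste0(), using two small hypothetical vectors (zones and cases are made up for illustration):

```r
# hypothetical vectors for illustration
zones <- c("Zone 1", "Zone 2")
cases <- c(3, 0)

with_space <- paste(zones, cases)                              # default sep = " "
no_space   <- paste0(zones, cases)                             # no separator
with_sep   <- paste(zones, cases, sep = ": ")                  # custom separator
one_string <- paste(zones, cases, sep = ": ", collapse = "; ") # single string

with_space   # "Zone 1 3" "Zone 2 0"
no_space     # "Zone 13" "Zone 20"
with_sep     # "Zone 1: 3" "Zone 2: 0"
one_string   # "Zone 1: 3; Zone 2: 0"
```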
10.3 Clean and standardise
Change case
Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper()
, str_to_lower()
, and str_to_title()
, from stringr, as shown below:
str_to_upper("California")
[1] "CALIFORNIA"
str_to_lower("California")
[1] "california"
Using base R, the above can also be achieved with toupper()
, tolower()
.
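For comparison, the base R equivalents look like this (a minimal sketch; the object names are only for illustration):

```r
# base R equivalents of str_to_upper() and str_to_lower()
upper_base <- toupper("California")
lower_base <- tolower("California")

upper_base   # "CALIFORNIA"
lower_base   # "california"
```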
Title case
Transforming the string so each word is capitalized can be achieved with str_to_title()
:
str_to_title("go to the US state of california ")
[1] "Go To The Us State Of California "
Use toTitleCase()
from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).
tools::toTitleCase("This is the US state of california")
[1] "This is the US State of California"
You can also use str_to_sentence()
, which capitalizes only the first letter of the string.
str_to_sentence("the patient must be transported")
[1] "The patient must be transported"
Pad length
Use str_pad()
to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad =
argument.
# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
[1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
[1] "R10.13." "R10.819" "R17...."
For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0"
.
# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0")
[1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")
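The commented line above can be fleshed out as follows. This is only a sketch: the data frame times and its columns are hypothetical, and the numeric values are converted to character explicitly before padding.

```r
library(stringr)

# hypothetical data frame with numeric hours and minutes
times <- data.frame(hours   = c(4, 11, 9),
                    minutes = c(5, 30, 0))

# pad each value on the left (the default side) to a minimum width of 2
times$hours_chr   <- str_pad(as.character(times$hours),   width = 2, pad = "0")
times$minutes_chr <- str_pad(as.character(times$minutes), width = 2, pad = "0")

# combine into a clock-style string
times$clock <- str_c(times$hours_chr, ":", times$minutes_chr)
times$clock   # "04:05" "11:30" "09:00"
```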
Truncate
str_trunc()
sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).
original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
[1] "Symp...ing"
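To see the effect of the other options, here is a short sketch applying side = and ellipsis = to the same sentence (the object names are only for illustration):

```r
library(stringr)

original <- "Symptom onset on 4/3/2020 with vomiting"

trunc_right <- str_trunc(original, 10, side = "right")  # ellipsis at the end
trunc_left  <- str_trunc(original, 10, side = "left")   # ellipsis at the start
trunc_plain <- str_trunc(original, 10, ellipsis = "")   # no ellipsis at all

trunc_right   # "Symptom..."
trunc_left    # "...omiting"
trunc_plain   # "Symptom on"
```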
Standardize length
Use str_trunc()
to set a maximum length, and then use str_pad()
to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then one very short value is padded to achieve length of 6.
# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
[1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
[1] "R10.13" "R10..." "R17   "
Remove leading/trailing whitespace
Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string input. Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")).
# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces (both sides are trimmed by default)
str_trim(IDs)
[1] "provA_1852" "provA_2345" "provA_9460"
Remove repeated whitespace within
Use str_squish()
to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim()
.
# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n")
[1] "Pt requires IV saline"
Enter ?str_trim
, ?str_pad
in your R console to see further details.
Wrap into paragraphs
Use str_wrap()
to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n
) within the paragraph, as seen in the example below.
pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."
str_wrap(pt_course, 40)
[1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."
The base function cat()
can be wrapped around the above command in order to print the output, displaying the new lines added.
cat(str_wrap(pt_course, 40))
Symptom onset 1/4/2020 vomiting chills
fever. Pt saw traditional healer in
home village on 2/4/2020. On 5/4/2020
pt symptoms worsened and was admitted
to Lumta clinic. Sample was taken and pt
was transported to regional hospital on
6/4/2020. Pt died at regional hospital
on 7/4/2020.
10.4 Handle by position
Extract by character position
Use str_sub()
to return only a part of a string. The function takes three main arguments:
- the character vector(s)
- start position
- end position
A few notes on position numbers:
- If a position number is positive, the position is counted starting from the left end of the string.
- If a position number is negative, it is counted starting from the right end of the string.
- Position numbers are inclusive.
- Positions extending beyond the string will be truncated (removed).
Below are some examples applied to the string “pneumonia”:
# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)
[1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
[1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
[1] "onia"
# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)
[1] "moni"
# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)
[1] "umonia"
Extract by word position
To extract the nth ‘word’, use word()
, also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.
By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).
# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")
# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
[1] "I just got" "My stomach hurts" "Severe ear pain"
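A short sketch of word() with a non-default separator and with a negative position (counted from the last word). The ID values are hypothetical:

```r
library(stringr)

# hypothetical IDs whose parts are separated by underscores
ids <- c("provA_1852_east", "provB_2345_west")

second_part <- word(ids, 2, sep = "_")    # 2nd 'word' of each ID
last_part   <- word(ids, -1, sep = "_")   # last 'word' of each ID

second_part   # "1852" "2345"
last_part     # "east" "west"
```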
Replace by character position
str_sub()
paired with the assignment operator (<-
) can be used to modify a part of a string:
word <- "pneumonia"
# convert the third and fourth characters to X
str_sub(word, 3, 4) <- "XX"
# print
word
[1] "pnXXmonia"
An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.
words <- c("pneumonia", "tubercolosis", "HIV")
# convert the third and fourth characters to X
str_sub(words, 3, 4) <- "XX"
words
[1] "pnXXmonia" "tuXXrcolosis" "HIXX"
Evaluate length
str_length("abc")
[1] 3
Alternatively, use nchar()
from base R
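Both functions are vectorised, so they return one length per value. A minimal sketch on a hypothetical vector:

```r
library(stringr)

# hypothetical vector, including an empty string
symptoms_short <- c("fever", "vomiting", "")

len_stringr <- str_length(symptoms_short)
len_base    <- nchar(symptoms_short)

len_stringr   # 5 8 0
len_base      # 5 8 0
```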
10.5 Patterns
Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.
Detect a pattern
Use str_detect()
as below to detect presence/absence of a pattern within a string. First provide the string or vector to search in (string =
), and then the pattern to look for (pattern =
). Note that by default the search is case sensitive!
str_detect(string = "primary school teacher", pattern = "teach")
[1] TRUE
The argument negate =
can be included and set to TRUE
if you want to know if the pattern is NOT present.
str_detect(string = "primary school teacher", pattern = "teach", negate = TRUE)
[1] FALSE
To ignore case/capitalization, wrap the pattern within regex()
, and within regex()
add the argument ignore_case = TRUE
(or T
as shorthand).
str_detect(string = "Teacher", pattern = regex("teach", ignore_case = T))
[1] TRUE
When str_detect()
is applied to a character vector or a data frame column, it will return TRUE or FALSE for each of the values.
# a vector/column of occupations
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")
# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
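Because the result is a logical vector, it can be used directly to filter rows of a data frame. The sketch below assumes a small hypothetical data frame called staff with an occupation column, and uses filter() from dplyr:

```r
library(dplyr)
library(stringr)

# hypothetical data frame for illustration
staff <- data.frame(
  id         = 1:4,
  occupation = c("field laborer", "primary school teacher & tutor", "tutor", "Teacher"))

# keep rows where occupation contains "teach" (case sensitive)
teachers <- staff %>%
  filter(str_detect(occupation, "teach"))

# the same search, ignoring case
teachers_any_case <- staff %>%
  filter(str_detect(occupation, regex("teach", ignore_case = TRUE)))

teachers$id            # 2
teachers_any_case$id   # 2 4
```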
If you need to count the TRUEs, simply sum() the output. This counts the number of TRUE values.
sum(str_detect(occupations, "teach"))
[1] 1
To search inclusive of multiple terms, include them separated by OR bars (|
) within the pattern =
argument, as shown below:
sum(str_detect(string = occupations, pattern = "teach|professor|tutor"))
[1] 3
If you need to build a long list of search terms, you can combine them using str_c() and sep = "|", define the result as a character object, and then reference that object later more succinctly. The example below includes possible occupation search terms for front-line medical providers.
# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                  "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                                  "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                                  "cna", "pa", "physician assistant", "mental health",
                                  "emergency department technician", "resp therapist", "respiratory",
                                  "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                                  "rehab", "activity", "elderly", "subacute", "sub acute",
                                  "clinic", "post acute", "therapist", "extended care",
                                  "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
[1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"
This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline
):
sum(str_detect(string = occupations, pattern = occupation_med_frontline))
[1] 2
Base R string search functions
The base function grepl()
works similarly to str_detect()
, in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...)
. One advantage is that the ignore.case
argument is easier to write (there is no need to involve the regex()
function).
Likewise, the base functions sub()
and gsub()
act similarly to str_replace()
. Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE)
. sub()
will replace the first instance of the pattern, whereas gsub()
will replace all instances of the pattern.
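A minimal sketch of these base R functions, using a hypothetical vector jobs:

```r
# hypothetical vector for illustration
jobs <- c("Teacher", "nurse", "teaching assistant")

# grepl() returns TRUE/FALSE for each value
match_case   <- grepl("teach", jobs)                      # case sensitive
match_nocase <- grepl("teach", jobs, ignore.case = TRUE)  # ignores case

# sub() replaces the first match only, gsub() replaces every match
first_only <- sub("a", "A", "banana")
all_found  <- gsub("a", "A", "banana")

match_case     # FALSE FALSE TRUE
match_nocase   # TRUE FALSE TRUE
first_only     # "bAnana"
all_found      # "bAnAnA"
```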
Convert commas to periods
Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain.
The inner gsub(), which acts first on lengths, removes any periods by replacing them with an empty string "". The period character "." has to be “escaped” with two backslashes to actually signify a period, because "." in regex means “any character”. Then, the result (with only commas) is passed to the outer gsub(), in which commas are replaced by periods.
lengths <- c("2.454,56", "1,2", "6.096,5")

as.numeric(gsub(pattern = ",",                # find commas
                replacement = ".",            # replace with periods
                x = gsub("\\.", "", lengths)  # vector with other periods removed (periods escaped)
                )
           )                                  # convert outcome to numeric
Replace all
Use str_replace_all()
as a “find and replace” tool. First, provide the strings to be evaluated to string =
, then the pattern to be replaced to pattern =
, and then the replacement value to replacement =
. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.
outcome <- c("Karl: dead",
             "Samantha: dead",
             "Marco: not dead")
str_replace_all(string = outcome, pattern = "dead", replacement = "deceased")
[1] "Karl: deceased" "Samantha: deceased" "Marco: not deceased"
Notes:
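- To replace the complete string with NA wherever the pattern matches, use replacement = NA_character_. Note that str_replace_na() does something different: it turns missing values (NA) into the string "NA".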
- The function
str_replace()
replaces only the first instance of the pattern within each evaluated string.
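A small sketch comparing these functions on a hypothetical vector that contains a repeated pattern and a missing value. Note that str_replace_na() acts on missing values, turning NA into the string "NA":

```r
library(stringr)

# hypothetical notes: one with the pattern twice, one missing
notes <- c("dead on arrival, confirmed dead", NA)

first_replaced <- str_replace(notes, "dead", "deceased")      # first match only
all_replaced   <- str_replace_all(notes, "dead", "deceased")  # every match
na_as_text     <- str_replace_na(notes)                       # NA becomes "NA"

first_replaced   # "deceased on arrival, confirmed dead"      NA
all_replaced     # "deceased on arrival, confirmed deceased"  NA
na_as_text       # "dead on arrival, confirmed dead"          "NA"
```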
Detect within logic
Within case_when()
str_detect()
is often used within case_when()
(from dplyr). Let’s say occupations
is a column in the linelist. The mutate()
below creates a new column called is_educator
by using conditional logic via case_when()
. See the page on data cleaning to learn more about case_when()
.
df <- df %>%
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE)) ~ "Educator",
    # all others
    TRUE ~ "Not an educator"))
As a reminder, it may be important to add exclusion criteria to the conditional logic (negate = TRUE):
df <- df %>%
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term

    # Must have a search term
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &

    # AND must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = TRUE) ~ "Educator",

    # All rows not meeting above criteria
    TRUE ~ "Not an educator"))
Locate pattern position
To locate the first position of a pattern, use str_locate()
. It outputs a start and end position.
str_locate("I wish", "sh")
start end
[1,] 5 6
Like other str
functions, there is an “_all” version (str_locate_all()
) which will return the positions of all instances of the pattern within each string. This outputs as a list
.
phrases <- c("I wish", "I hope", "he hopes", "He hopes")
str_locate(phrases, "h" ) # position of *first* instance of the pattern
start end
[1,] 6 6
[2,] 3 3
[3,] 1 1
[4,] 4 4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
[[1]]
start end
[1,] 6 6
[[2]]
start end
[1,] 3 3
[[3]]
start end
[1,] 1 1
[2,] 4 4
[[4]]
start end
[1,] 4 4
Extract a match
str_extract_all()
returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (defined above) for either “teach”, “prof”, or “tutor”.
str_extract_all()
returns a list
which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.
str_extract_all(occupations, "teach|prof|tutor")
[[1]]
character(0)
[[2]]
[1] "prof"
[[3]]
[1] "teach" "tutor"
[[4]]
[1] "tutor"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
character(0)
[[8]]
character(0)
[[9]]
character(0)
[[10]]
character(0)
str_extract()
extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA
where there was no match. The NA
s can be removed by wrapping the returned vector with na.exclude()
. Note how the second of occupation 3’s matches is not shown.
str_extract(occupations, "teach|prof|tutor")
[1] NA "prof" "teach" "tutor" NA NA NA NA NA
[10] NA
Subset and count
Aligned functions include str_subset()
and str_count()
.
str_subset()
returns the actual values which contained the pattern:
str_subset(occupations, "teach|prof|tutor")
[1] "university professor" "primary school teacher & tutor"
[3] "tutor"
str_count()
returns a vector of numbers: the number of times a search term appears in each evaluated value.
str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
[1] 0 1 2 1 0 0 0 0 0 0
Regex groups
UNDER CONSTRUCTION
10.6 Special characters
Backslash \ as escape
The backslash \
is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\"
) - the middle quote mark will not “break” the surrounding quote marks.
Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\
to display one.
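A brief sketch of escaping in practice. The file path is hypothetical; cat() is used so that the strings are displayed as they would appear to a reader rather than in their escaped form:

```r
# a double quote inside double quotes must be escaped
quote_text <- "The nurse said \"rest\"."
cat(quote_text, "\n")    # displays: The nurse said "rest".

# each backslash to display must be written as two (hypothetical path)
path_text <- "C:\\data\\linelist.rds"
cat(path_text, "\n")     # displays: C:\data\linelist.rds

# the escaped backslash counts as a single character
n_backslash <- nchar("\\")
n_backslash              # 1
```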
Special characters
Special character | Represents |
---|---|
"\\" | backslash |
"\n" | a new line (newline) |
"\"" | double-quote within double quotes |
'\'' | single-quote within single quotes |
"\`" | grave accent |
"\r" | carriage return |
"\t" | tab |
"\v" | vertical tab |
"\b" | backspace |
Run ?"'"
in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).
10.7 Regular expressions (regex)
10.8 Regex and special characters
Regular expressions, or “regex”, are a concise language for describing patterns in strings. If you are not familiar with it, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.
Much of this section is adapted from this tutorial and this cheatsheet. We selectively adapt here knowing that this handbook might be viewed by people without internet access to view the other tutorials.
A regular expression is often applied to extract specific patterns from “unstructured” text, for example medical notes, chief complaints, patient history, or other free text columns in a data frame.
There are four basic tools one can use to create a basic regular expression:
- Character sets
- Meta characters
- Quantifiers
- Groups
Character sets
Character sets are a way of listing the options for a single-character match, written within brackets. A match is triggered if any one of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:
Character set | Matches for |
---|---|
"[A-Z]" | any single capital letter |
"[a-z]" | any single lowercase letter |
"[0-9]" | any digit |
[:alnum:] | any alphanumeric character |
[:digit:] | any numeric digit |
[:alpha:] | any letter (upper or lowercase) |
[:upper:] | any uppercase letter |
[:lower:] | any lowercase letter |
Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]"
(any upper or lowercase letter), or another example "[t-z0-5]"
(lowercase t through z OR number 0 through 5).
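The sketch below applies a few of these character sets to a made-up string (`id` is a hypothetical example value). Because each set matches a single character, `str_extract_all()` returns the matching characters one at a time.

```r
library(stringr)

# hypothetical example string (an assumption for illustration)
id <- "Case AB-1234 (ward x7)"

# every capital letter
caps <- str_extract_all(id, "[A-Z]")[[1]]
caps

# every digit
digits <- str_extract_all(id, "[0-9]")[[1]]
digits

# lowercase t through z OR digits 0 through 5
combined <- str_extract_all(id, "[t-z0-5]")[[1]]
combined

# number of letters of either case
n_letters <- str_count(id, "[A-Za-z]")
n_letters
```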
Meta characters
Meta characters are shorthand for character sets. Some of the important ones are listed below:
Meta character | Represents |
---|---|
"\\s" | a single whitespace character (a space, tab, or newline) |
"\\w" | any single "word" character (A-Z, a-z, 0-9, or underscore) |
"\\d" | any single numeric digit (0-9) |
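A brief sketch of these meta characters follows, using a made-up string `note`. Note the doubled backslash: inside an R string the backslash must itself be escaped, so the regex `\d` is typed as `"\\d"`. In the same way, a literal period is matched with `"\\."`, because an unescaped period matches any character in regex.

```r
library(stringr)

# hypothetical example string (an assumption for illustration)
note <- "Temp 38.5 on day 3"

# each individual digit
digits <- str_extract_all(note, "\\d")[[1]]
digits

# number of whitespace characters
n_spaces <- str_count(note, "\\s")
n_spaces

# runs of word characters (the period is not a word character, so 38.5 is split)
words <- str_extract_all(note, "\\w+")[[1]]
words

# a literal period must be escaped
has_period <- str_detect(note, "\\.")
has_period
```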
Quantifiers
Typically you do not want to search for a match on only one character. Quantifiers allow you to specify how many consecutive letters/numbers are required for the match.
Quantifiers are written after the character they are quantifying, either as numbers within curly brackets { } or as the symbols + and *. For example:
- "A{2}" will return instances of two capital A letters.
- "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
- "A{2,}" will return instances of two or more capital A letters.
- "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
- "A*" will return instances of zero or more capital A letters (useful if you are not sure the pattern is present).
Using the +
plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (runs of alphabetic characters): "[A-Za-z]+"
# test string for quantifiers
test <- "A-AA-AAA-AAAA"
When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA
.
str_extract_all(test, "A{2}")
[[1]]
[1] "AA" "AA" "AA" "AA"
When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.
str_extract_all(test, "A{2,4}")
[[1]]
[1] "AA" "AAA" "AAAA"
With the quantifier +
, groups of one or more are returned:
str_extract_all(test, "A+")
[[1]]
[1] "A" "AA" "AAA" "AAAA"
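The remaining quantifiers can be sketched in the same way. The open-ended form `{2,}` is applied to the same test string, and a small made-up vector `codes` (an assumption for illustration) shows how `*` differs from `+` when the quantified character is absent.

```r
library(stringr)

# same test string as above
test <- "A-AA-AAA-AAAA"

# two or more consecutive A's
two_or_more <- str_extract_all(test, "A{2,}")[[1]]
two_or_more

# hypothetical vector to contrast * (zero or more) with + (one or more)
codes <- c("XY", "XAY", "XAAY")

# with *, "XY" still matches because zero A's are allowed
star <- str_extract(codes, "XA*Y")
star

# with +, at least one A is required, so "XY" returns NA
plus <- str_extract(codes, "XA+Y")
plus
```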
Relative position
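These express requirements for what precedes or follows a pattern, without including those surrounding characters in the match. For example, to find the boundaries between sentences, the pattern "(?<=\\.)\\s(?=[A-Z])" matches a whitespace character that is preceded by a period and followed by a capital letter. Note that an empty pattern ("") matches at every character boundary, so the command below returns each character of the test string separately.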
str_extract_all(test, "")
[[1]]
[1] "A" "-" "A" "A" "-" "A" "A" "A" "-" "A" "A" "A" "A"
Position statement | Matches to |
---|---|
"(?<=b)a" | “a” that is preceded by a “b” |
"(?<!b)a" | “a” that is NOT preceded by a “b” |
"a(?=b)" | “a” that is followed by a “b” |
"a(?!b)" | “a” that is NOT followed by a “b” |
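The four position statements in the table can be checked against a small made-up vector, as sketched below. The last example uses the sentence-boundary pattern `"(?<=\\.)\\s(?=[A-Z])"` with `str_split()`; the `notes` string is a hypothetical example, not data from this page.

```r
library(stringr)

# made-up strings covering each case in the table
x <- c("ba", "ca", "ab", "ac")

preceded     <- str_detect(x, "(?<=b)a")   # "a" preceded by "b"
not_preceded <- str_detect(x, "(?<!b)a")   # "a" NOT preceded by "b"
followed     <- str_detect(x, "a(?=b)")    # "a" followed by "b"
not_followed <- str_detect(x, "a(?!b)")    # "a" NOT followed by "b"

# split text into sentences: break at whitespace that follows a period
# and precedes a capital letter (the period in 99.8 is left alone)
notes <- "Patient is pale. Pulse is 100 bpm. Temp is 99.8 degrees."
sentences <- str_split(notes, "(?<=\\.)\\s(?=[A-Z])")[[1]]
sentences
```

Because the lookbehind and lookahead are not part of the match itself, only the whitespace is consumed by the split and each sentence keeps its closing period.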
Groups
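Capturing groups, written by enclosing part of a regular expression in parentheses ( ), are a way to obtain more organized output upon extraction, because each group's match can be returned separately.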
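One possible illustration uses `str_match()` from stringr, which returns a character matrix: the first column holds the full match and each following column holds one group. The vector `visits` is a hypothetical example created for this sketch.

```r
library(stringr)

# hypothetical example vector (an assumption for illustration)
visits <- c("arrived 18:00", "discharged 07:45", "no time recorded")

# two groups: the hour (1 or 2 digits) and the minutes (2 digits)
m <- str_match(visits, "(\\d{1,2}):(\\d{2})")
m

# column 1 is the full match, column 2 the first group, column 3 the second group
hours   <- as.integer(m[, 2])
minutes <- as.integer(m[, 3])
hours
minutes
```

Rows with no match are returned as `NA` in every column, so the result always has one row per input string.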
Regex examples
Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.
pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."
This expression matches all words (runs of letters that continue until a non-letter, such as a space or digit, is reached):
str_extract_all(pt_note, "[A-Za-z]+")
[[1]]
[1] "Patient" "arrived" "at" "Broward" "Hospital"
[6] "emergency" "ward" "at" "on" "Patient"
[11] "presented" "with" "radiating" "abdominal" "pain"
[16] "from" "LR" "quadrant" "Patient" "skin"
[21] "was" "pale" "cool" "and" "clammy"
[26] "Patient" "temperature" "was" "degrees" "farinheit"
[31] "Patient" "pulse" "rate" "was" "bpm"
[36] "and" "thready" "Respiratory" "rate" "was"
[41] "per" "minute"
The expression "[0-9]{1,2}"
matches runs of one or two consecutive digits, so a longer number such as 2005 is returned in pieces ("20" and "05"). It could also be written "\\d{1,2}"
, or "[:digit:]{1,2}"
.
str_extract_all(pt_note, "[0-9]{1,2}")
[[1]]
[1] "18" "00" "6" "12" "20" "05" "99" "8" "10" "0" "29"
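Combining character sets, quantifiers, and position statements makes more targeted extraction possible. The sketch below uses a shortened, hypothetical excerpt of the note (`note_excerpt`) so that it runs on its own; the patterns are examples of one possible approach, not the only way to write them.

```r
library(stringr)

# shortened, hypothetical excerpt of the patient note
note_excerpt <- "Patient arrived at 18:00 on 6/12/2005. Temperature was 99.8 degrees. Pulse rate was 100 bpm."

# time: 1-2 digits, a colon, then 2 digits
arrival_time <- str_extract(note_excerpt, "\\d{1,2}:\\d{2}")

# date: 1-2 digits / 1-2 digits / 4 digits
arrival_date <- str_extract(note_excerpt, "\\d{1,2}/\\d{1,2}/\\d{4}")

# decimal number: digits, a literal period, then more digits
temperature <- str_extract(note_excerpt, "\\d+\\.\\d+")

# pulse: digits that are followed by " bpm" (the unit is not part of the match)
pulse <- str_extract(note_excerpt, "\\d+(?= bpm)")

c(arrival_time, arrival_date, temperature, pulse)
```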
You can view a useful list of regular expressions and tips on page 2 of this cheatsheet.
Also see this tutorial.
10.9 Resources
A reference sheet for stringr functions can be found here
A vignette on stringr can be found here