This page will show you two ways to standardize an outcome, such as hospitalizations or mortality, by characteristics such as age and sex.
- Using dsr package
- Using PHEindicatormethods package
We begin by extensively demonstrating the processes of data preparation/cleaning/joining, as this is common when combining population data from multiple countries, standard population data, deaths, etc.
There are two main ways to standardize: direct and indirect standardization. Let’s say we would like to the standardize mortality rate by age and sex for country A and country B, and compare the standardized rates between these countries.
- For direct standardization, you will have to know the number of the at-risk population and the number of deaths for each stratum of age and sex, for country A and country B. One stratum in our example could be females between ages 15-44.
- For indirect standardization, you only need to know the total number of deaths and the age- and sex structure of each country. This option is therefore feasible if age- and sex-specific mortality rates or population numbers are not available. Indirect standardization is furthermore preferable in case of small numbers per stratum, as estimates in direct standardization would be influenced by substantial sampling variation.
To show how standardization is done, we will use fictitious population counts and death counts from country A and country B, by age (in 5 year categories) and sex (female, male). To make the datasets ready for use, we will perform the following preparation steps:
- Load packages
- Load datasets
- Join the population and death data from the two countries
- Pivot longer so there is one row per age-sex stratum
- Clean the reference population (world standard population) and join it to the country data
In your scenario, your data may come in a different format. Perhaps your data are by province, city, or other catchment area. You may have one row for each death and information on age and sex for each (or a significant proportion) of these deaths. In this case, see the pages on Grouping data, Pivoting data, and Descriptive tables to create a dataset with event and population counts per age-sex stratum.
We also need a reference population, the standard population. For the purposes of this exercise we will use the
world_standard_population_by_sex. The World standard population is based on the populations of 46 countries and was developed in 1960. There are many “standard” populations - as one example, the website of NHS Scotland is quite informative on the European Standard Population, World Standard Population and Scotland Standard Population.
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize
p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with
library() from base R. See the page on R basics for more information on R packages.
pacman::p_load( rio, # import/export data here, # locate files tidyverse, # data management and visualization stringr, # cleaning characters and strings frailtypack, # needed for dsr, for frailty models dsr, # standardise rates PHEindicatormethods) # alternative for rate standardisation
CAUTION: If you have a newer version of R, the dsr package cannot be directly downloaded from CRAN. However, it is still available from the CRAN archive. You can install and use this one.
For non-Mac users:
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz" install.packages(packageurl, repos=NULL, type="source")
# Other solution that may work require(devtools) devtools::install_version("dsr", version="0.2.2", repos="http:/cran.us.r.project.org")
For Mac users:
See the Download handbook and data page for instructions on how to download all the example data in the handbook. You can import the Standardisation page data directly into R from our Github repository by running the following
# import demographics for country A directly from Github A_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics.csv") # import deaths for country A directly from Github A_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryA.csv") # import demographics for country B directly from Github B_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics_2.csv") # import deaths for country B directly from Github B_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryB.csv") # import demographics for country B directly from Github standard_pop_data <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/world_standard_population_by_sex.csv")
First we load the demographic data (counts of males and females by 5-year age category) for the two countries that we will be comparing, “Country A” and “Country B”.
# Country A A_demo <- import("country_demographics.csv")
# Country B B_demo <- import("country_demographics_2.csv")
Conveniently, we also have the counts of deaths during the time period of interest, by age and sex. Each country’s counts are in a separate file, shown below.
Deaths in Country A
Deaths in Country B
We need to join and transform these data in the following ways:
- Combine country populations into one dataset and pivot “long” so that each age-sex stratum is one row
- Combine country death counts into one dataset and pivot “long” so each age-sex stratum is one row
- Join the deaths to the populations
First, we combine the country populations datasets, pivot longer, and do minor cleaning. See the page on Pivoting data for more detail.
pop_countries <- A_demo %>% # begin with country A dataset bind_rows(B_demo) %>% # bind rows, because cols are identically named pivot_longer( # pivot longer cols = c(m, f), # columns to combine into one names_to = "Sex", # name for new column containing the category ("m" or "f") values_to = "Population") %>% # name for new column containing the numeric values pivoted mutate(Sex = recode(Sex, # re-code values for clarity "m" = "Male", "f" = "Female"))
The combined population data now look like this (click through to see countries A and B):
And now we perform similar operations on the two deaths datasets.
deaths_countries <- A_deaths %>% # begin with country A deaths dataset bind_rows(B_deaths) %>% # bind rows with B dataset, because cols are identically named pivot_longer( # pivot longer cols = c(Male, Female), # column to transform into one names_to = "Sex", # name for new column containing the category ("m" or "f") values_to = "Deaths") %>% # name for new column containing the numeric values pivoted rename(age_cat5 = AgeCat) # rename for clarity
The deaths data now look like this, and contain data from both countries:
We now join the deaths and population data based on common columns
Sex. This adds the column
country_data <- pop_countries %>% left_join(deaths_countries, by = c("Country", "age_cat5", "Sex"))
We can now classify
Country as factors and set the level order using
fct_relevel() function from the forcats package, as described in the page on Factors. Note, classifying the factor levels doesn’t visibly change the data, but the
arrange() command does sort it by Country, age category, and sex.
country_data <- country_data %>% mutate( Country = fct_relevel(Country, "A", "B"), Sex = fct_relevel(Sex, "Male", "Female"), age_cat5 = fct_relevel( age_cat5, "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85")) %>% arrange(Country, age_cat5, Sex)
CAUTION: If you have few deaths per stratum, consider using 10-, or 15-year categories, instead of 5-year categories for age.
Lastly, for the direct standardisation, we import the reference population (world “standard population” by sex)
# Reference population standard_pop_data <- import("world_standard_population_by_sex.csv")
The age category values in the
standard_pop_data data frames will need to be aligned.
Currently, the values of the column
age_cat5 from the
standard_pop_data data frame contain the word “years” and “plus”, while those of the
country_data data frame do not. We will have to make the age category values match. We use
str_replace_all() from the stringr package, as described in the page on Characters and strings, to replace these patterns with no space
Furthermore, the package dsr expects that in the standard population, the column containing counts will be called
"pop". So we rename that column accordingly.
# Remove specific string from column values standard_pop_clean <- standard_pop_data %>% mutate( age_cat5 = str_replace_all(age_cat5, "years", ""), # remove "year" age_cat5 = str_replace_all(age_cat5, "plus", ""), # remove "plus" age_cat5 = str_replace_all(age_cat5, " ", "")) %>% # remove " " space rename(pop = WorldStandardPopulation) # change col name to "pop", as this is expected by dsr package
CAUTION: If you try to use
str_replace_all() to remove a plus symbol, it won’t work because it is a special symbol. “Escape” the specialnes by putting two back slashes in front, as in
str_replace_call(column, "\\+", "").
Finally, the package PHEindicatormethods, detailed below, expects the standard populations joined to the country event and population counts. So, we will create a dataset
all_data for that purpose.
all_data <- left_join(country_data, standard_pop_clean, by=c("age_cat5", "Sex"))
This complete dataset looks like this:
Below we demonstrate calculating and comparing directly standardized rates using the dsr package. The dsr package allows you to calculate and compare directly standardized rates (no indirectly standardized rates!).
In the data Preparation section, we made separate datasets for country counts and standard population:
country_dataobject, which is a population table with the number of population and number of deaths per stratum per country
standard_pop_cleanobject, containing the number of population per stratum for our reference population, the World Standard Population
We will use these separate datasets for the dsr approach.
Below, we calculate rates per country directly standardized for age and sex. We use the
Of note -
dsr() expects one data frame for the country populations and event counts (deaths), and a separate data frame with the reference population. It also expects that in this reference population dataset the unit-time column name is “pop” (we assured this in the data Preparation section).
There are many arguments, as annotated in the code below. Notably,
event = is set to the column
Deaths, and the
fu = (“follow-up”) is set to the
Population column. We set the subgroups of comparison as the column
Country and we standardize based on
Sex. These last two columns are not assigned a particular named argument. See
?dsr for details.
# Calculate rates per country directly standardized for age and sex mortality_rate <- dsr::dsr( data = country_data, # specify object containing number of deaths per stratum event = Deaths, # column containing number of deaths per stratum fu = Population, # column containing number of population per stratum subgroup = Country, # units we would like to compare age_cat5, # other columns - rates will be standardized by these Sex, refdata = standard_pop_clean, # reference population data frame, with column called pop method = "gamma", # method to calculate 95% CI sig = 0.95, # significance level mp = 100000, # we want rates per 100.000 population decimals = 2) # number of decimals) # Print output as nice-looking HTML table knitr::kable(mortality_rate) # show mortality rate before and after direct standardization
|Subgroup||Numerator||Denominator||Crude Rate (per 100000)||95% LCL (Crude)||95% UCL (Crude)||Std Rate (per 100000)||95% LCL (Std)||95% UCL (Std)|
Above, we see that while country A had a lower crude mortality rate than country B, it has a higher standardized rate after direct age and sex standardization.
# Calculate RR mortality_rr <- dsr::dsrr( data = country_data, # specify object containing number of deaths per stratum event = Deaths, # column containing number of deaths per stratum fu = Population, # column containing number of population per stratum subgroup = Country, # units we would like to compare age_cat5, Sex, # characteristics to which we would like to standardize refdata = standard_pop_clean, # reference population, with numbers in column called pop refgroup = "B", # reference for comparison estimate = "ratio", # type of estimate sig = 0.95, # significance level mp = 100000, # we want rates per 100.000 population decimals = 2) # number of decimals # Print table knitr::kable(mortality_rr)
|Comparator||Reference||Std Rate (per 100000)||Rate Ratio (RR)||95% LCL (RR)||95% UCL (RR)|
The standardized mortality rate is 1.22 times higher in country A compared to country B (95% CI 1.17-1.27).
# Calculate RD mortality_rd <- dsr::dsrr( data = country_data, # specify object containing number of deaths per stratum event = Deaths, # column containing number of deaths per stratum fu = Population, # column containing number of population per stratum subgroup = Country, # units we would like to compare age_cat5, # characteristics to which we would like to standardize Sex, refdata = standard_pop_clean, # reference population, with numbers in column called pop refgroup = "B", # reference for comparison estimate = "difference", # type of estimate sig = 0.95, # significance level mp = 100000, # we want rates per 100.000 population decimals = 2) # number of decimals # Print table knitr::kable(mortality_rd)
|Comparator||Reference||Std Rate (per 100000)||Rate Difference (RD)||95% LCL (RD)||95% UCL (RD)|
Country A has 4.24 additional deaths per 100.000 population (95% CI 3.24-5.24) compared to country A.
Another way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates. We will show both.
This section will use the
all_data data frame created at the end of the Preparation section. This data frame includes the country populations, death events, and the world standard reference population. You can view it here.
Below, we first group the data by Country and then pass it to the function
phe_dsr() to get directly standardized rates per country.
Of note - the reference (standard) population can be provided as a column within the country-specific data frame or as a separate vector. If provided within the country-specific data frame, you have to set
stdpoptype = "field". If provided as a vector, set
stdpoptype = "vector". In the latter case, you have to make sure the ordering of rows by strata is similar in both the country-specific data frame and the reference population, as records will be matched by position. In our example below, we provided the reference population as a column within the country-specific data frame.
See the help with
?phr_dsr or the links in the References section for more information.
# Calculate rates per country directly standardized for age and sex mortality_ds_rate_phe <- all_data %>% group_by(Country) %>% PHEindicatormethods::phe_dsr( x = Deaths, # column with observed number of events n = Population, # column with non-standard pops for each stratum stdpop = pop, # standard populations for each stratum stdpoptype = "field") # either "vector" for a standalone vector or "field" meaning std populations are in the data # Print table knitr::kable(mortality_ds_rate_phe)
|A||11344||86790567||23.56686||23.08107||24.05944||95%||dsr per 100000||Dobson|
|B||9955||52898281||19.32549||18.45516||20.20882||95%||dsr per 100000||Dobson|
For indirect standardization, you need a reference population with the number of deaths and number of population per stratum. In this example, we will be calculating rates for country A using country B as the reference population, as the
standard_pop_clean reference population does not include number of deaths per stratum.
Below, we first create the reference population from country B. Then, we pass mortality and population data for country A, combine it with the reference population, and pass it to the function
phe_isr(), to get indirectly standardized rates. Of course, you can do it also vice versa.
Of note - in our example below, the reference population is provided as a separate data frame. In this case, we make sure that
x_ref = and
n_ref = vectors are all ordered by the same standardization category (stratum) values as that in our country-specific data frame, as records will be matched by position.
See the help with
?phr_isr or the links in the References section for more information.
# Create reference population refpopCountryB <- country_data %>% filter(Country == "B") # Calculate rates for country A indirectly standardized by age and sex mortality_is_rate_phe_A <- country_data %>% filter(Country == "A") %>% PHEindicatormethods::phe_isr( x = Deaths, # column with observed number of events n = Population, # column with non-standard pops for each stratum x_ref = refpopCountryB$Deaths, # reference number of deaths for each stratum n_ref = refpopCountryB$Population) # reference population for each stratum # Print table knitr::kable(mortality_is_rate_phe_A)
|11344||15847.42||18.81914||13.47123||13.22446||13.72145||95%||isr per 100000||Byars|