# A Case Study of Cluster-level Models

#### 2019/10/22

In this vignette, we will discuss the use of cluster-level models using the SUMMER package. We will use data from Kenya 2014 DHS survey as a case study. Due to the size limit of vignettes on CRAN, here we illustrate data preparation and how to use fitINLA2 function to perform space-time smoothing of a generic binary indicator using neonatal mortality rate (NMR) as an example).

For a more complete analysis of U5MR using cluster-level model, we refer readers to a more detailed vignette at http://faculty.washington.edu/jonno/space-station.html (Lecture 6, Case Study - Kenya 2014 DHS, in the Analysis file), where we use data from Kenya 2014 DHS survey, to illustrate spatial-temporal smoothing of U5MR at yearly scales for the 47 counties in Kenya.

## Prepare data

First, we load the package and the necessary data. INLA is not in a standard repository, so we check if it is available and install it if it is not installed. For this vignette, we used INLA version 19.09.03.

library(SUMMER)
library(ggplot2)
library(gridExtra)

The DHS data can be obtained from the DHS program website at https://dhsprogram.com/data/dataset/Kenya_Standard-DHS_2014. For the analysis of U5MR, we will use the Births Recode in .dta format. Notice that registration with the DHS program is required in order to access this dataset. The map files for this DHS can be freely downloaded from http://spatialdata.dhsprogram.com/boundaries/.

With both the DHS birth record data and the corresponding shapefiles saved in the local directory. We can load them into with packages readstata13 and rgdal. We also automatically generates the spatial adjacency matrix Amat using the function getAmat().

library(readstata13)
filename <- "../Demo/KEBR71DT/KEBR71FL.DTA"
births <- read.dta13(filename, generate.factors = TRUE)

library(rgdal)
mapfilename0 <- "../Demo/shps/sdr_subnational_boundaries.shp"
geo0 <- readOGR(mapfilename0, verbose = FALSE)
mapfilename <- "../Demo/shps/sdr_subnational_boundaries2.shp"
geo <- readOGR(mapfilename, verbose = FALSE)
Amat <- getAmat(geo, geo$REGNAME) The Amat matrix encodes the spatial adjacency matrix of the 47 counties, with column and row names matching the regions used in the map. This adjacency matrix will be used for the spatial smoothing model. It can also be created by hand if necessary. ### Prepare person-month data We first demonstrate the method that smooths the direct estimates of subnational-level U5MR. For this analysis, we consider the $$8$$ Admin-1 region groups. In order to calculate the direct estimates of U5MR, we need the full birth history data in the format so that every row corresponds to a birth and columns that contain: • Indicators corresponding to survey design, e.g., strata (v023), cluster (v001), and household (v002) • Survey weight (v025) • Date of interview in century month codes (CMC) format, i.e., the number of the month since the beginning of 1990 (v008) • Date of child’s birth in CMC format (b3) • Indicator for death of child (b5) • Age of death of child in months (b7) Since county labels are usually not in the DHS dataset, we now load the GPS location of each clusters and map them to the corresponding counties. loc <- readOGR("../Demo/KEGE71FL/KEGE71FL.shp", verbose = FALSE) loc.dat <- data.frame(cluster = loc$DHSCLUST, long = loc$LONGNUM, lat = loc$LATNUM)
gps <- mapPoints(loc.dat, geo = geo, long = "long", lat = "lat", names = c("REGNAME"))
colnames(gps)[4] <- "region"
gps0 <- mapPoints(loc.dat, geo = geo0, long = "long", lat = "lat", names = c("REGNAME"))
colnames(gps0)[4] <- "province"
gps <- merge(gps, gps0[, c("cluster", "province")])
sum(is.na(gps$region)) ## [1] 9 Notice that there are $$9$$ clusters that fall on the county boundaries and we need to manually assign them to a county based on best guess. In this example as an illustration, we remove these clusters without GPS coordinates. unknown_cluster <- gps$cluster[which(is.na(gps$region))] gps <- gps[gps$cluster %in% unknown_cluster == FALSE, ]
births <- births[births$v001 %in% unknown_cluster == FALSE, ] births <- merge(births, gps[, c("cluster", "region", "province")], by.x = "v001", by.y = "cluster", all.x = TRUE) births$v024 <- births$region The birth history data from DHS is already in this form and the getBirths function default to use the current recode manual column names (as indicated above). The name of these fields can be defined explicitly in the function arguments too. We reorganize the data into the ‘person-month’ format with getBirths function and reorder the columns for better readability. dat <- getBirths(data = births, strata = c("v023"), year.cut = seq(1985, 2020, by = 1)) dat <- dat[, c("v001", "v002", "v024", "time", "age", "v005", "strata", "died")] colnames(dat) <- c("clustid", "id", "region", "time", "age", "weights", "strata", "died") years <- levels(dat$time)
head(dat)
##   clustid id  region time  age weights strata died
## 1       1  6 nairobi 2009    0 5476381      1    0
## 2       1  6 nairobi 2009 1-11 5476381      1    0
## 3       1  6 nairobi 2009 1-11 5476381      1    0
## 4       1  6 nairobi 2009 1-11 5476381      1    0
## 5       1  6 nairobi 2009 1-11 5476381      1    0
## 6       1  6 nairobi 2009 1-11 5476381      1    0

Notice that we also need to specify the time intervals of interest. In this example, we with to calculate and predict U5MR in 5-year intervals from 1985-1990 to 2015-2019. For U5MR, we will use the discrete survival model to calculate direct estimates for each region and time. This step involves breaking down the age of each death into discrete intervals. The default option assumes a discrete survival model with six discrete hazards (probabilities of dying in a particular interval, given survival to the start of the interval) for each of the age bands: $$[0,1), [1,12), [12,24), [24,36), [36,48)$$, and $$[48,60]$$.

## Model-based Estimates

Assume there are $$N$$ regions and $$T$$ time periods. In the $$i$$-th region, there are $$n_i$$ clusters. We consider the following model. In stratum $$k$$, cluster $$c$$, and time period $$t$$, let $$p_{ktc}$$ be the prevalence of interest. We can aggregate over all clusters within each region to obtain the region-time specific prevalence to be $p_{it} = \sum_{k}\sum_{c} p_{ktc} q_{ik} \mathbf{1}_{i[c] = i},$ where $$q_{ik}$$ is the proportion of clusters that are in stratum $$k$$ within region $$i$$. Let $$Y_{ktc}^{(g)}$$ and $$n_{ktc}^{(g)}$$ to denote the number of deaths and the total number of child-months in stratum $$k$$, cluster $$c$$, and time period $$t$$. We model the cluster-level stratum-specific prevalence $$p_{ktc}$$ with a hierarchical space-time smoothing model described below.

### Beta-binomial model

We assume the following marginal model, $Y_{ktc} | p_{ktc} \sim \mbox{BetaBinomial}(n_{ktc}, p_{ktc}, d).$ We model the mean probability $$p_{ktc}$$ with a logit link and a linear model that contains strata and age group fixed effects, and space, time, and space-time random effects $\mbox{logit} p_{ktc} = \log(\mbox{BIAS}_{tc}) + \mu_g + \beta_k + \alpha_t + \gamma_t + \phi_{i[c]} + \delta_{i[c],t} ,$ where the bias term $$\mbox{BIAS}_{tc}$$ denotes the ratio of the reported prevalence to the ‘true’ prevalence. The log transformation on the logit transformed hazards approximately leads to a multiplicative bias correction on the U5MR. In countries with high prevalence of HIV, we may adjust for the proportion of missing women due to HIV prevalence. And the random effects are defined similarly as in the previous smoothing model.

We now transform the full birth history data to person-month format again. In order to fit the binomial model, we need to calculate the number of person-months and number of deaths for each age group, stratum (urban or rural), cluster, and time period. Notice that we do not need to impute $$0$$ observations for future time periods. We also rename the columns to prepare for the input in the smoothing model. In order to correctly adjust for bias due to HIV, we keep the information of province in the column ‘province’ as well.

dat <- getBirths(data = births, strata = c("v023"), year.cut = seq(1985, 2020,
by = 1))
dat <- dat[, c("v001", "time", "age", "died", "v025", "v024")]
colnames(dat) <- c("cluster", "years", "age", "Y", "strata", "province")
outcome <- getCounts(data = dat, variables = "Y", by = c("age", "strata", "cluster",
"years"), ignore = list(years = c(2015:2019)))
head(outcome)
##     age strata cluster years Y total
## 1     0  urban       1  1985 0     0
## 2  1-11  urban       1  1985 0     0
## 3 12-23  urban       1  1985 0     0
## 4 24-35  urban       1  1985 0     0
## 5 36-47  urban       1  1985 0     0
## 6 48-59  urban       1  1985 0     0

We then merge the county information to this dataset. In order to fit the model, the data file should contain the following columns: cluster ID (‘cluster’), observation period (‘years’), observation location (‘region’), strata level (‘strata’), age group corresponding to the hazards (‘age’), total number of person-months (‘total’), and the number of deaths (‘Y’).

outcome <- merge(outcome, gps[, c("cluster", "region", "province")], by = "cluster",
all.x = TRUE)

## Mapping prevalence using cluster-level model

Before discussing the hazard modeling of U5MR, we first demonstrate how to use fitINLA2 function to map a generic prevalence of interest. To illustrate this, we take only the deaths within the first months to calculate the neonatal mortality rates. For binomial prevalence mapping for an generic outcome, we need to specify the age.group and age.n to be NULL. These two arguments will be specified in the case of mapping U5MR, where we may allow hazards for different age groups to have distinct model components.

outcome.1month <- subset(outcome, age == 0)
fit2 <- fitINLA2(data = outcome.1month, family = "betabinomial", Amat = Amat,
geo = geo, year_label = 1985:2019, type.st = 1, verbose = FALSE, age.groups = NULL)
out2 <- getSmoothed(inla_mod = fit2, year_range = c(1985, 2019), year_label = years,
Amat = Amat, nsim = 1000, weight.strata = KenData$UrbanProp) The posterior medians of neonatal mortality rates in each counties can be plotted using the plot method, mapPlot function and hatchPlot function, as shown in other vignettes. For example, plot(out2$overall, year_label = years, year_med = 1985:2019, is.subnational=TRUE, plot.CI = TRUE, per1000 = TRUE) +
ggtitle("Subnational Binomial Model: Neonatal Mortality Rates") +
facet_wrap(~region, ncol = 7) + theme(legend.position = "none") +
scale_color_manual(values = rep("gray20", 47))

## Space-time smoothing of U5MR using the cluster-level models

Due to the limit of file size on CRAN, this part of analysis have been moved to the website http://faculty.washington.edu/jonno/space-station.html (Lecture 6, Case Study - Kenya 2014 DHS, in the Analysis file, at the bottom of the page). we will also use data from Kenya 2014 DHS survey, to illustrate spatial-temporal smoothing of U5MR at yearly scales for the 47 counties in Kenya.