# 1. Overview

This vignette provides a comprehensive guide to modelling population dynamics using the R package ADMUR, and accompanies the publication ‘Directly modelling population dynamics in the South American Arid Diagonal using 14C dates’, Philosophical Transactions B, 2020, A. Timpson et al. https://doi.org/10.1098/rstb.2019.0723

Throughout this vignette, R code blocks often use objects created earlier in the vignette in previous code blocks. However, the manual for each function provides examples with self sufficient R code blocks.

The motivation for creating the ADMUR package is to provide a robust framework to infer population dynamics from radiocarbon datasets, given the uncontroversial assumption that (to a first order of approximation) the archaeological record contains more dateable anthropogenic material from prehistoric periods when population levels were greater. Unfortunately, the spatiotemporal sparsity of radiocarbon data conspires with the wiggly nature of the calibration curve to encourage the overinterpretation of such datasets, often leading to colourful but statistically unjustified interpretations of population dynamics. No statistical method can (or ever will) be able to perfectly reconstruct the true population dynamics from such a dataset. ADMUR is no exception to this, but provides tools to infer a plausible yet conservative reconstruction of population dynamics.

## Installation

The ADMUR package can be installed directly from the CRAN in the usual way:

install.packages('ADMUR')

Alternatively it can be installed from GitHub, after installing and loading the ‘devtools’ package on the CRAN:

install.packages('devtools')
library(devtools)
install_github('UCL/ADMUR')

library(ADMUR)

## 14C datasets

A summary of the available help files and data sets included in the package can be browsed, which include a terrestrial anthropogenic 14C dataset from the South American Arid Diagonal:

help(ADMUR)
help(SAAD)

Datasets must be structured as a data frame that include columns ‘age’ and ‘sd’, which represent the uncalibrated 14C age and its error, respectively.

SAAD[1:5,1:8]
##   UniqID          site       lat      long  age sd      LabNo Material_D
## 1    237 Villa del Mar -17.62295 -71.34017 6360 60 Beta-71133       bone
## 2    240          Yara -17.51998 -71.36716 5020 60 Beta-80970       bone
## 3    505    El Ahogado -17.96667 -70.88333 3515 40    Pa-1769   charcoal
## 4    506    El Ahogado -17.96667 -70.88333 3535 60    Pa-1768   charcoal
## 5    507    El Ahogado -17.96667 -70.88333 3660 40    Pa-1789   charcoal

## Citations

Citations are available as follows:

citation('ADMUR')
##
## To cite the ADMUR package in publications please use both the
## following:
##
##   A. Timpson, R. Barberena, M. G. Thomas, C. Méndez, K. Manning.
##   Directly modelling population dynamics in the South American Arid
##   Diagonal using 14C dates. 2020. Philosophical Transactions of the
##   Royal Society B. 10.1098/rstb.2019.0723
##
##   Timpson. University College London. Research Department of Genetics,
##   Environment and Evolution (GEE), Darwin Building, Gower Street,
##   London, WC1E 6BT. 2020 https://github.com/UCL/ADMUR
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

# 2. Date calibration and SPDs

The algorithm used by ADMUR to calculate model likelihoods of a 14C dataset uses several functions to first calibrate 14C dates. These functions are also intrinsically useful for 14C date calibration or for generating a Summed Probability Distribution (SPD).

## Calibrated 14C date probability distributions

Generating a single calibrated date distribution or SPD requires either a two-step process to give the user full control of the date range and temporal resolution, or a simpler one step process using a wrapper function that automatically estimates a sensible date range and resolution from the dataset, performs the two step process internally, and plots the SPD.

### With the wrapper

1. Use the function summedCalibratorWrapper()
data <- data.frame( age=c(6562,7144), sd=c(44,51) )
x <- summedCalibratorWrapper(data)

Notice the function assumes the data provided were all 14C dates. However, if you have other kinds of date such as thermoluminescence you can specify this. Non-14C types are assumed to be in calendar time, BP. You can also specify a particular calibration curve:

data <- data.frame( age=c(6562,7144), sd=c(44,51), datingType=c('14C','TL') )
x <- summedCalibratorWrapper(data=data, calcurve=shcal20)

### Without the wrapper

Generating the SPD without the wrapper gives you more control, and requires a two-step process:

1. Convert a calibration curve to a CalArray using the function makeCalArray()
2. Calibrate the 14C dates through the CalArray using the function summedCalibrator().

This is useful for improving computational times if generating many SPDs, for example in a simulation framework, since the CalArray needs generating only once.

data <- data.frame( age = c(9144), sd=c(151) )
CalArray <- makeCalArray( calcurve=intcal20, calrange=c(8000,13000) )
cal <- summedCalibrator(data, CalArray)
plotPD(cal)

The CalArray is essentially a two-dimensional probability array of the calibration curve, and can be viewed using the plotCalArray() function. Calibration curves vary in their temporal resolution, and the preferred resolution can be specified using the parameter inc which interpolates the calibration curve. It would become prohibitively time and memory costly if analysing the entire 50,000 year range of the calibration curve at a 1 year resolution (requiring a 50,000 by 50,000 array) and in practice the default 5 year resolution provides equivalent results to 1 year resolution for study periods wider than c.1000 years.

x <- makeCalArray( calcurve=shcal20, calrange=c(5500,6000), inc=1 )
plotCalArray(x)

## Comparison with other calibration software

It is worth noting that the algorithm used by this package to calibrate 14C dates gives practically equivalent results to those from OxCal generated using oxcAAR and Bchron

However, there are two fringe circumstances where these software programs differ substantially: at the border of the calibration curve; and if a date has a large error.

### Edge effects

Consider the real 14C date [MAMS-13035] https://doi.org/10.1016/j.aeae.2015.11.003 age: 50524 +/- 833 BP calibrated through intcal13, which only extends to 46401BP. Bchron throws an error, whilst OxCal applies a one-to-one mapping between Conventional Radiocarbon (CRA) time and calendar time for any date (mean) beyond the range of the calibration curve. The latter is in theory a reasonable way to mitigate the problem, however OxCal applies this in a binary manner that can create peculiarities. Instead ADMUR gradually fades the calibration curve to a one-to-one mapping between the end of the curve and 60,000 BP.

### Large errors

A 14C date is typically reported as a mean date with an error, which is often interpreted as representing a symmetric Gaussian distribution before calibration. However, a Gaussian has a non-zero probability at all possible years (between -$$\infty$$ and +$$\infty$$), and therefore cannot fairly represent the date uncertainty which must be skewed towards the past. Specifically, if we consider the date in CRA time, it must have a zero probability of occurring in the future. Alternatively, if we consider the date as a 14C/12C ratio, it cannot be smaller than 1 (the present). Therefore ADMUR assumes a 14C date error is lognormally distributed with a mean equal to the CRA date, and a variance equal to the CRA error squared. This naturally skews the distribution away from the present. In practice, this difference is undetectably trivial for typical radiocarbon errors since the lognormal distribution approximates a normal distribution away from zero. However, theoretically the differences can be large if considering dates with large errors that are close to the present.

## Phased data: adjusting for ascertainment bias

A naive approach to generating an SPD as a proxy for population dynamics would be to sum all dates in the dataset, but a more sensible approach is to sum the SPDs of each phase. The need to bin dates into phases is an important step in modelling population dynamics to adjust for the data ascertainment bias of some archaeological finds having more dates by virtue of a larger research interest or budget.

Therefore phaseCalibrator() generates an SPD for each phase in a dataset, and includes a binning algorithm which provides a useful solution to handling large datasets that have not been phased. For example, consider the following 8 dates from 2 sites:

data <- subset( SAAD, site %in% c('Carrizal','Pacopampa') )
data[,2:7]
##           site       lat      long  age  sd      LabNo
## 1192  Carrizal -6.063056 -79.49806 3640 100 Beta-31075
## 1193  Carrizal -6.063056 -79.49806 4390 110 Beta-18920
## 1194  Carrizal -6.063056 -79.49806 4450 100 Beta-31073
## 1195  Carrizal -6.063056 -79.49806 4620 100 Beta-31074
## 1196  Carrizal -6.063056 -79.49806 4690 120 Beta-27417
## 1205 Pacopampa -6.200000 -79.01000 2385 155     SI-794
## 1206 Pacopampa -6.200000 -79.01000 2765 135     SI-792
## 1207 Pacopampa -6.200000 -79.01000 2855  95     SI-793

The data have not already been phased (do not include a column ‘phase’) therefore the default binning algorithm calibrates these dates into four phases. this is achieved by binning dates that have a mean 14C date within 200 14C years of any other date in that respective bin. Therefore Pacopampa.1 comprises samples 1207 and 1206, Pacopampa.2 comprises sample 1205, Carrizal.1 comprises samples 1196 and 1195 and 1194 and 1193, and Carrizal.2 comprises sample 1192:

CalArray <- makeCalArray( calcurve=shcal20, calrange=c(2000,6000) )
x <- phaseCalibrator(data=data, CalArray=CalArray)
plotPD(x)

Finally, the distributions in each phase can be summed and normalised to unity. It is straight forward to achieve this directly from the dataframe created above:

SPD <- as.data.frame( rowSums(x) )

# normalise
SPD <- SPD/( sum(SPD) * CalArray$inc ) Alternatively, the wrapper function summedPhaseCalibrator() will perform this entire workflow internally: SPD <- summedPhaseCalibrator( data=data, calcurve=shcal20, calrange=c(2000,6000) ) plotPD(SPD) # 3. Continuous Piecewise Linear (CPL) Modelling A CPL model lends itself well to the objectives of identifying specific demographic events. Its parameters are the (x,y) coordinates of the hinge points, which are the relative population size (y) and timing (x) of these events. Crucially, this package calculates model likelihoods (the probability of the data given some proposed parameter combination). This likelihood is used in a search algorithm to find the maximum likelihood parameters; to compare models with different numbers parameters to find the best fit without overfitting; in Monte-Carlo Markov Chain (MCMC) analysis to estimate credible intervals of those parameters; and in a goodness-of-fit test to check that the data is a typical realisation of the maximum likelihood model and its parameters. ## Calculating likelihoods Theoretically a calibrated date should be a continuous Probability Density Function (PDF), however in practice a date is represented as a discrete vector of probabilities corresponding to each calendar year, and therefore is a Probability Mass Function (PMF). This discretisation provides the advantage that numerical methods can be used to easily calculate relative likelihoods, provided the model is also discretised to the same time points. A toy() model is provided to demonstrate how this achieved. First, we simulate a plausible 14C dataset and calibrate it. The function simulateCalendarDates() automatically covers a slightly wider date range to ensure simulated 14C dates are well represented around the edges: set.seed(12345) N <- 350 # randomly sample calendar dates from the toy model cal <- simulateCalendarDates(toy, N) # Convert to 14C dates. age <- uncalibrateCalendarDates(cal, shcal20) data <- data.frame(age = age, sd = 50, phase = 1:N, datingType = '14C') # Calibrate each phase, taking care to restrict to the modelled date range with 'remove.external' CalArray <- makeCalArray(shcal20, calrange = range(toy$year))
PD <- phaseCalibrator(data, CalArray, remove.external = TRUE)

The argument ‘remove.external = TRUE’ ensures any calibrated phases with less than 50% of their probability mass within the modelled date range are excluded, reducing the effective sample size from 350 to 303. This is a crucial step to avoid mischievous edge effects of dates outside the date range. Similarly, notice we constrained the CalArray to the modelled date range. These are important to ensure that we only model the population across a range that is well represented by data. To extend the model beyond the range of available data would be to assume the absence of evidence means evidence of absence. No doubt there may be occasions when this is reasonable (for example if modelling the first colonisation of an island that has been well excavated, and the period before arrival is evidenced by the absence of datable material), but more often the range of representative data is due to research interest, and therefore the logic of only including dates with at least 50% of their probability within the date range is that their true dates are more likely to be internal (within the date range) than external.

print( ncol(PD) )
## [1] 303

Finally we calculate the overall relative log likelihood of the model using function loglik()

loglik(PD=PD, model=toy)
## [1] -2266.526

For comparison, we can calculate the overall relative likelihood of a uniform model given exactly the same data. Intuitively this should have a lower relative likelihood, since our dataset was randomly generated from the non-uniform toy population history:

uniform.model <- convertPars(pars=NULL, years=5500:7500, type='uniform')
loglik(PD=PD, model=uniform.model)
## [1] -2311.632

And indeed the toy model is thirty nine million trillion times more likely than the uniform model:

exp( loglik(PD=PD, model=toy) - loglik(PD=PD, model=uniform.model) )
## [1] 3.882277e+19

Crucially, loglik() calculates the relative likelihoods for each effective sample separately (each phase containing a few dates). The overall model likelihood is the overall product of these individual likelihoods. This means that even in the case where there is no ascertainment bias, each date should still be assigned to its own phase, to ensure phaseCalibrator() calibrates each date separately. In contrast, attempting to calculate a likelihood for a single SPD constructed from the entire dataset would be incorrect, as this would be treating the entire dataset as a single ‘average’ sample.

## The anatomy of a CPL model

Having established how to calculate the relative likelihood of a proposed model given a dataset, we can use any out-of-the-box search algorithm to find the maximum likelihood model. This first requires us to describe the PD of any population model in terms of a small number of parameters, rather than a vector of probabilities for each year.

We achieve this using the Continuous Piecewise Linear (CPL) model, which is defined by the (x,y) coordinates of its hinge points.

When performing a search for the best 3-CPL model coordinates (given a dataset), only five of these eight values are free parameters. The x-coordinates of the start and end (5500 BP and 7500 BP) are fixed by the choice of date range. Additionally, one of the y-coordinates must be constrained by the other parameters, since the total probability (area) must equal 1. As a result, an n-CPL model will have 2n-1 free parameters.

## Parameter space: The Area Breaking Process

We use the function convertPars() to map our search parameters to their corresponding PD coordinates. This allows us to propose independent parameter values from a uniform distribution between 0 and 1, and convert them into coordinates that describe a corresponding CPL model PD. This parameter-to-coordinate mapping is achieved using a modified stick breaking Dirichlet process.

The Dirichlet Process (not to be confused with the Dirichlet distribution) is an algorithm that can break a stick (the x-axis date range) into a desired number of pieces, ensuring all lengths are sampled evenly. The length (proportion) of remaining stick to break is chosen by sampling from the Beta distribution, such that we use the Beta CDF (with $$\alpha$$ = 1 and $$\beta$$ = the number of pieces still to be broken) to convert an x-parameter into its equivalent x-coordinate value. We extend this algorithm for use with the CPL model by also converting y-parameters to y-coordinates as follows:

1. Fix the y-value of the first hinge (H1, x = 5500 BP) to any constant (y = 3 is arbitrarily chosen since the mapping function below gives 3 for an average y-parameter of 0.5).
2. Use the mapping function $$f(y) = (1/(1-y))^2)-1$$ to convert all remaining y-parameters (between 0 and 1) to y-values (between 0 and +$$\infty$$).
3. Calculate the total area, given the y-values and previously calculated x-coordinates.
4. Divide y-values by the total area, to give the y-coordinates of the final PDF.

The parameters must be provided as a single vector with an odd length, each between 0 and 1 (y,x,y,x,…y). For example, a randomly generated 6-CPL model will have 11 parameters and 7 hinges:

set.seed(12345)
CPLparsToHinges(pars=runif(11), years=5500:7500)
##       year          pdf
## 1 5500.000 3.261064e-06
## 2 5950.539 4.771822e-07
## 3 6580.113 1.299434e-06
## 4 6929.120 3.426049e-06
## 5 7307.354 1.357384e-05
## 6 7395.293 1.031902e-02
## 7 7500.000 7.915814e-08

Note: The Area Breaking Algorithm is a heuristic that ensures all parameter space is explored and therefore the maximum likelihood parameters are always found. However, unlike the one-dimensional stick-breaking process, its mapping of random parameters to PD coordinates is not perfectly even, and we welcome ideas for a more elegant algorithm.

## Credible interval parameter search using MCMC

The ADMUR function mcmc() uses the Metropolis-Hastings algorithm to search joint parameter values of an n-CPL model, given a the calibrated probability distributions of phases in a 14C dataset (PDarray). In principle the starting parameters do not matter if burn is of an appropriate length, but in practice it is more efficient to start in a sensible place such as the maximum likelihood parameters:

chain <- mcmc(PDarray=PD, startPars=best$par, type='CPL', N=100000, burn=2000, thin=5, jumps =0.025) The acceptance ratio (AR) and raw chain (before burn-in and thinning) can be sanity checked. Ideally we want the AR somewhere in the range 0.3 to 0.5 (this can be tuned with the ‘jumps’ argument), and the raw chain to resemble ‘hairy caterpillars’: print(chain$acceptance.ratio)
par(mfrow=c(3,2), mar=c(4,3,3,1))
col <- 'steelblue'
for(n in 1:5){
plot(chain$all.pars[,n], type='l', ylim=c(0,1), col=col, xlab='', ylab='', main=paste('par',n)) } These parameters can then be converted to the hinge coordinates using the convertPars() function, and their marginal distributions plotted. Note, the MLE parameters (red lines) may not exactly match the peaks of these distributions because they are only marginals. Note also the dates of hinge 1 and 2 are fixed at 5500 and 7500: hinges <- convertPars(pars=chain$res, years=5500:7500, type='CPL')
par(mfrow=c(3,2), mar=c(4,3,3,1))
c1 <- 'steelblue'
c2 <- 'firebrick'
lwd <- 3
pdf.brk <- seq(0,0.0015, length.out=40)
yr.brk <- seq(5500,7500,length.out=40)
names <- c('Date of H2','Date of H3','PD of H1','PD of H2','PD of H3','PD of H4')
hist(hinges$yr2,border=c1,breaks=yr.brk, main=names[1], xlab='');abline(v=CPL$year[2],col=c2,lwd=lwd)
hist(hinges$yr3, border=c1,breaks=yr.brk, main=names[2], xlab='');abline(v=CPL$year[3],col=c2,lwd=lwd)
hist(hinges$pdf1, border=c1,breaks=pdf.brk, main=names[3], xlab='');abline(v=CPL$pdf[1],col=c2,lwd=lwd)
hist(hinges$pdf2, border=c1,breaks=pdf.brk, main=names[4], xlab='');abline(v=CPL$pdf[2],col=c2,lwd=lwd)
hist(hinges$pdf3, border=c1,breaks=pdf.brk, main=names[5], xlab='');abline(v=CPL$pdf[3],col=c2,lwd=lwd)
hist(hinges$pdf4, border=c1,breaks=pdf.brk, main=names[6], xlab='');abline(v=CPL$pdf[4],col=c2,lwd=lwd)

Some two-dimensional combinations of joint parameters may be preferred, but still these are 2D marginal representations of 5D parameters, again with MLE in red:

require(scales)
par( mfrow=c(1,2) , mar=c(4,4,1.5,2), cex=0.7 )
plot(hinges$yr2, hinges$pdf2, pch=16, col=alpha(1,0.02), ylim=c(0,0.0005))
points(CPL$year[2], CPL$pdf[2], col='red', pch=16, cex=1.2)
plot(hinges$yr3, hinges$pdf3, pch=16, col=alpha(1,0.02), ylim=c(0,0.0015))
points(CPL$year[3], CPL$pdf[3], col='red', pch=16, cex=1.2) 

Alternatively, the joint distributions can be visualised by plotting the CPL model for each iteration of the chain, with the MLE in red:

plot(NULL, xlim=c(7500,5500),ylim=c(0,0.0011), xlab='calBP', ylab='PD', cex=0.7)
for(n in 1:nrow(hinges)){
x <- c(hinges$yr1[n], hinges$yr2[n], hinges$yr3[n], hinges$yr4[n])
y <- c(hinges$pdf1[n], hinges$pdf2[n], hinges$pdf3[n], hinges$pdf4[n])
lines( x, y, col=alpha(1,0.005) )
}
lines(x=CPL$year, y=CPL$pdf, lwd=2, col=c2)