Validating cluster models

Niek Den Teuling

2021-04-14

In this vignette we demonstrate different ways in which longitudinal cluster models can be internally validated.

library(latrend)

Demonstration

We explore the latrendData dataset. This is a synthetic dataset for which the reference group of each trajectory is available, indicated by the Class column. However, in this vignette we will assume that the true group specification and number of groups are unknown. Instead, we have a candidate model which we wish to validate internally.

data(latrendData)
head(latrendData)
#>   Id      Time           Y   Class
#> 1  1 0.0000000 -1.08049205 Class 1
#> 2  1 0.2222222 -0.68024151 Class 1
#> 3  1 0.4444444 -0.65148373 Class 1
#> 4  1 0.6666667 -0.39115398 Class 1
#> 5  1 0.8888889 -0.19407876 Class 1
#> 6  1 1.1111111 -0.02991783 Class 1

For convenience, we set the package options to the respective column names of the loaded dataset.

options(latrend.id = "Id", latrend.time = "Time")

The candidate model

We consider a KML model with 3 clusters to be our candidate model for validation. We define the method with a reduced number of repeated random starts (the nbRedrawing argument) in order to reduce the computation time of the repeated evaluations below. This is done for demonstration purposes only.

kml <- lcMethodKML("Y", nClusters = 3, nbRedrawing = 5)
kml
#> lcMethodKML as "longitudinal k-means (KML)"
#>  nbRedrawing:    5
#>  maxIt:          200
#>  imputationMethod:"copyMean"
#>  distanceName:   "euclidean"
#>  power:          2
#>  distance:       function() {}
#>  centerMethod:   meanNA
#>  startingCond:   "nearlyAll"
#>  nbCriterion:    1000
#>  scale:          TRUE
#>  response:       "Y"
#>  time:           getOption("latrend.time")
#>  id:             getOption("latrend.id")
#>  nClusters:      3
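
Before starting the validation, it may help to fit the candidate method once and inspect the result. A minimal sketch using the package's latrend() and plot() functions (this fitted model is not used in the remainder of the vignette):

# Fit the candidate method once on the full dataset
candidateModel <- latrend(kml, data = latrendData)
# Visually inspect the fitted cluster trajectories
plot(candidateModel)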

Evaluating the stability of the estimation through repeated fitting

The purpose of model validation is essentially to establish that the model is robust and generalizes well to unseen data. A necessary condition for both aspects is that the model is reproducible on the original training data. This evaluation helps to ensure that the model estimation procedure is robust, i.e., that it does not yield spurious model solutions.

We can fit the method repeatedly on the same data using the latrendRep function.

repModels <- latrendRep(kml, data = latrendData, .rep = 10)
print(repModels, excludeShared = FALSE)
#> List of 10 lcModels with
#>    .name .method        data       seed nbRedrawing maxIt imputationMethod
#> 1      1     kml latrendData 1140350788           5   200         copyMean
#> 2      2     kml latrendData  312928385           5   200         copyMean
#> 3      3     kml latrendData  866248189           5   200         copyMean
#> 4      4     kml latrendData 1909893419           5   200         copyMean
#> 5      5     kml latrendData  554504146           5   200         copyMean
#> 6      6     kml latrendData  884616499           5   200         copyMean
#> 7      7     kml latrendData  803234389           5   200         copyMean
#> 8      8     kml latrendData 1158971242           5   200         copyMean
#> 9      9     kml latrendData  934673902           5   200         copyMean
#> 10    10     kml latrendData 1632225031           5   200         copyMean
#>    distanceName power       distance                             centerMethod
#> 1     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 2     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 3     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 4     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 5     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 6     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 7     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 8     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 9     euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#> 10    euclidean     2 function () {} function (x) {    mean(x, na.rm = TRUE)}
#>    startingCond nbCriterion scale response time id nClusters
#> 1     nearlyAll        1000  TRUE        Y Time Id         3
#> 2     nearlyAll        1000  TRUE        Y Time Id         3
#> 3     nearlyAll        1000  TRUE        Y Time Id         3
#> 4     nearlyAll        1000  TRUE        Y Time Id         3
#> 5     nearlyAll        1000  TRUE        Y Time Id         3
#> 6     nearlyAll        1000  TRUE        Y Time Id         3
#> 7     nearlyAll        1000  TRUE        Y Time Id         3
#> 8     nearlyAll        1000  TRUE        Y Time Id         3
#> 9     nearlyAll        1000  TRUE        Y Time Id         3
#> 10    nearlyAll        1000  TRUE        Y Time Id         3

A convenient way to assess the stability across repeated runs is to compare the models on one or more internal model metrics. Similar solutions should yield similar metric scores.

repSelfMetrics <- metric(repModels, name = c("BIC", "WMAE", "APPA"))
head(repSelfMetrics)
#>        BIC       WMAE      APPA
#> 1 655.5420 0.06418772 0.9865047
#> 2 655.5420 0.06418772 0.9865047
#> 3 655.5420 0.06418772 0.9865047
#> 4 655.0419 0.06416027 0.9865672
#> 5 655.5420 0.06418772 0.9865047
#> 6 655.0419 0.06416027 0.9865672
summary(repSelfMetrics[, "WMAE"])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.06416 0.06416 0.06416 0.06417 0.06419 0.06419

As can be seen from the numbers, the models are practically identical in terms of model fit (BIC), residual error (WMAE), and cluster separation (APPA).
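
The spread of each metric across the repeated fits can also be quantified directly; a minimal sketch using base R:

# Range of each metric across the repeated fits
apply(repSelfMetrics, 2, range)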

Comparison between fits

Alternatively, we can select the model with the best fit, and compare it against the other fitted models.

bestRepModel <- min(repModels, "BIC")
externalMetric(repModels, bestRepModel, name = "adjustedRand")
#>  [1] 1 1 1 1 1 1 1 1 1 1

As indicated by the adjusted Rand index, the repeated runs are in perfect agreement with the best model in terms of cluster assignments (a score of 1 indicates an identical partitioning).

Similarly, we can compute the pairwise adjusted Rand indices across all repeated runs, resulting in a similarity matrix:

simMat <- externalMetric(repModels, name = "adjustedRand")
round(simMat, 2)
#>    1 2 3 4 5 6 7 8 9
#> 2  1                
#> 3  1 1              
#> 4  1 1 1            
#> 5  1 1 1 1          
#> 6  1 1 1 1 1        
#> 7  1 1 1 1 1 1      
#> 8  1 1 1 1 1 1 1    
#> 9  1 1 1 1 1 1 1 1  
#> 10 1 1 1 1 1 1 1 1 1
summary(simMat)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1       1       1       1       1       1
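
In line with the earlier comparison, the pairwise indices confirm that all repeated runs arrive at the same partitioning.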

Evaluating replicability and stability through bootstrapping
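
We estimate the method on bootstrap samples of the trajectories using latrendBoot, and then assess the spread of the internal metrics across the bootstrap fits.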

bootModels <- latrendBoot(kml, data = latrendData, samples = 10)
bootModels
#> List of 10 lcModels with
#>    .name .method                                       data
#> 1      1     kml   bootSample(latrendData, "Id", 64231719L)
#> 2      2     kml  bootSample(latrendData, "Id", 893996438L)
#> 3      3     kml bootSample(latrendData, "Id", 1434113967L)
#> 4      4     kml  bootSample(latrendData, "Id", 958577579L)
#> 5      5     kml bootSample(latrendData, "Id", 2079738042L)
#> 6      6     kml bootSample(latrendData, "Id", 2012583691L)
#> 7      7     kml  bootSample(latrendData, "Id", 520205446L)
#> 8      8     kml  bootSample(latrendData, "Id", 648143680L)
#> 9      9     kml bootSample(latrendData, "Id", 2127214623L)
#> 10    10     kml  bootSample(latrendData, "Id", 882537923L)
bootMetrics <- metric(bootModels, name = c("BIC", "WMAE", "APPA"))
bootMetrics
#>         BIC       WMAE      APPA
#> 1  386.5510 0.06249879 0.9918266
#> 2  430.5063 0.06337630 0.9904221
#> 3  389.1990 0.06258588 0.9847272
#> 4  544.3614 0.06614531 0.9891894
#> 5  480.6107 0.06621823 0.9952108
#> 6  461.5246 0.06367826 0.9888881
#> 7  410.5406 0.06151990 0.9874232
#> 8  372.2148 0.06216186 0.9888215
#> 9  466.8833 0.06500594 0.9884566
#> 10 410.7789 0.06232585 0.9897330
colMeans(bootMetrics)
#>          BIC         WMAE         APPA 
#> 435.31706882   0.06355163   0.98946986
apply(bootMetrics, 2, sd)
#>          BIC         WMAE         APPA 
#> 53.073813235  0.001686640  0.002749425
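
With a sufficiently large number of bootstrap samples, the sampling distribution of each metric can also be summarized through percentile intervals; a minimal sketch using base R (many more than 10 samples would be needed for reliable intervals):

# 95% percentile interval of each metric across the bootstrap fits
apply(bootMetrics, 2, quantile, probs = c(0.025, 0.975))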

Ten-fold cross-validation

Lastly, we can fit models using k-fold cross-validation in order to validate them on previously unseen data from the test folds.

trainModels <- latrendCV(kml, data = latrendData, folds = 10, seed = 1)
trainModels
#> List of 10 lcModels with
#>    .name .method                                           data
#> 1      1     kml  trainFold(latrendData, fold = 1, "Id", 10, 1)
#> 2      2     kml  trainFold(latrendData, fold = 2, "Id", 10, 1)
#> 3      3     kml  trainFold(latrendData, fold = 3, "Id", 10, 1)
#> 4      4     kml  trainFold(latrendData, fold = 4, "Id", 10, 1)
#> 5      5     kml  trainFold(latrendData, fold = 5, "Id", 10, 1)
#> 6      6     kml  trainFold(latrendData, fold = 6, "Id", 10, 1)
#> 7      7     kml  trainFold(latrendData, fold = 7, "Id", 10, 1)
#> 8      8     kml  trainFold(latrendData, fold = 8, "Id", 10, 1)
#> 9      9     kml  trainFold(latrendData, fold = 9, "Id", 10, 1)
#> 10    10     kml trainFold(latrendData, fold = 10, "Id", 10, 1)

Manual cross-validation

Alternatively, we can generate the training data folds ourselves and fit models on them using latrendBatch.

dataFolds <- createTrainDataFolds(latrendData, folds = 10)
foldModels <- latrendBatch(kml, data = dataFolds)
foldModels
#> List of 10 lcModels with
#>    .name .method            data
#> 1      1     kml  dataFolds[[1]]
#> 2      2     kml  dataFolds[[2]]
#> 3      3     kml  dataFolds[[3]]
#> 4      4     kml  dataFolds[[4]]
#> 5      5     kml  dataFolds[[5]]
#> 6      6     kml  dataFolds[[6]]
#> 7      7     kml  dataFolds[[7]]
#> 8      8     kml  dataFolds[[8]]
#> 9      9     kml  dataFolds[[9]]
#> 10    10     kml dataFolds[[10]]
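
As with the repeated fits, the fold models can be compared on internal metrics; a sketch reusing the metric() call from earlier:

# Compare the fold models on the same internal metrics as before
metric(foldModels, name = c("BIC", "WMAE", "APPA"))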

The corresponding list of test data folds is obtained using:

testDataFolds <- createTestDataFolds(latrendData, dataFolds)
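
Since the test folds complement the training folds, each trajectory should appear in exactly one test fold. A minimal sanity check using base R:

# Number of trajectories in each test fold
sapply(testDataFolds, function(fold) length(unique(fold$Id)))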