PRECAST: simulation

Wei Liu

2024-01-24

This vignette introduces the PRECAST workflow for the analysis of integrating multiple spatial transcriptomics dataset. The workflow consists of three steps

We demonstrate the use of PRECAST to three simulated Visium data that are here, which can be downloaded to the current working path by the following command:

githubURL <- "https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/data_simu.rda?raw=true"
download.file(githubURL,"data_simu.rda",mode='wb')

Then load to R

load("data_simu.rda")

The package can be loaded with the command:

library(PRECAST)
library(Seurat)

Load the simulated data

First, we view the the three simulated spatial transcriptomics data with Visium platform.

data_simu ## a list including three Seurat object with default assay: RNA

Check the content in data_simu.

head(data_simu[[1]])
row.names(data_simu[[1]])[1:10]

Create a PRECASTObject object

We show how to create a PRECASTObject object step by step. First, we create a Seurat list object using the count matrix and meta data of each data batch. Although data_simu is a prepared Seurat list object, we re-create a same objcet seuList to show the details.

## Get the gene-by-spot read count matrices
countList <- lapply(data_simu, function(x){
  assay <- DefaultAssay(x)
  GetAssayData(x, assay = assay, slot='counts')
  
} )

## Check the spatial coordinates: Yes, they are named as "row" and "col"!
head(data_simu[[1]]@meta.data)

## Get the meta data of each spot for each data batch
metadataList <- lapply(data_simu, function(x) x@meta.data)


## ensure the row.names of metadata in metaList are the same as that of colnames count matrix in countList
M <- length(countList)
for(r in 1:M){
  row.names(metadataList[[r]]) <- colnames(countList[[r]])
}


## Create the Seurat list  object

seuList <- list()
for(r in 1:M){
  seuList[[r]] <- CreateSeuratObject(counts = countList[[r]], meta.data=metadataList[[r]], project = "PRECASTsimu")
}

Prepare the PRECASTObject with preprocessing step.

Next, we use CreatePRECASTObject() to create a PRECASTObject based on the Seurat list object seuList. This function will do three things:

If the argument customGenelist is not NULL, then this function only does (3) based on customGenelist gene list.

In this simulated dataset, we don’t require to select genes, thus, we set customGenelist=row.names(seuList[[1]]), representing the user-defined gene list. User can retain the raw seurat list object by setting rawData.preserve = TRUE.

Fit PRECAST using simulated data

Add the model setting

Add adjacency matrix list and parameter setting of PRECAST. More model setting parameters can be found in model_set().

Fit PRECAST

For function PRECAST, users can specify the number of clusters \(K\) or set K to be an integer vector by using modified BIC(MBIC) to determine \(K\). For convenience, we give a single K here.

Other options Run for multiple K. Here, we set K=6:9.

  • Note: For parallel compuation based on Rcpp on Linux, users require to use the following system command to set the C_stack unlimited in case of R Error: C stack usage is too close to the limit.

ulimit -s unlimited

Besides, user can also use different initialization method by setting int.model, for example, set int.model=NULL; see the functions AddParSetting() and model_set() for more details.

Select a best model and re-organize the results by useing SelectModel(). Even though K is not a vector, it is also necessary to run SelectModel() to re-organize the results in PRECASTObj. The selected best K is 7 by using command str(PRECASTObj@resList).

Use ARI to check the performance of clustering:

We provide two methods to correct the batch effects in gene expression level. Method (1) is using only PRECAST results to obtain the batch corrected gene expressions if the species of data is unknown or the number of overlapped housekeeping genes between the variable genes in PRECASTObj@seulist and the genes in database is less than five. Method (2) is using bouth housekeeping gene and PRECAST results to obtain the batch corrected gene expressions.

Integrate the two samples by the function IntegrateSpaData. Because this is a simulated data, we use Method (1) by setting species='unknown'.

Visualization

First, user can choose a beautiful color schema using chooseColors().

cols_cluster <- chooseColors(palettes_name = 'Nature 10', n_colors = 7, plot_colors = TRUE)

Show the spatial scatter plot for clusters

p12 <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=TRUE)
p12
# users can plot each sample by setting combine=FALSE

Users can re-plot the above figures for specific need by returning a ggplot list object. For example, we only plot the spatial heatmap of first two data batches.

pList <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=FALSE, title_name=NULL)
drawFigs(pList[1:2], layout.dim = c(1,2), common.legend = TRUE, legend.position = 'right', align='hv')

Show the spatial UMAP/tNSE RGB plot

seuInt <- AddUMAP(seuInt) 
SpaPlot(seuInt, batch=NULL,item='RGB_UMAP',point_size=1, combine=TRUE, text_size=15)

## Plot tSNE RGB plot
#seuInt <- AddTSNE(seuInt) 
#SpaPlot(seuInt, batch=NULL,item='RGB_TSNE',point_size=2, combine=T, text_size=15)

Show the tSNE plot based on the extracted features from PRECAST to check the performance of integration.

seuInt <- AddTSNE(seuInt, n_comp = 2) 

p1 <- dimPlot(seuInt, item='cluster', font_family='serif', cols=cols_cluster) # Times New Roman
p2 <- dimPlot(seuInt, item='batch', point_size = 1,  font_family='serif')
drawFigs(list(p1, p2), common.legend=FALSE, align='hv') 
# It is noted that only sample batch 1 has cluster 4, and only sample batch 2 has cluster 7. 

Show the UMAP plot based on the extracted features from PRECAST.

dimPlot(seuInt, reduction = 'UMAP3', item='cluster', cols=cols_cluster, font_family='serif')

Users can also use the visualization functions in Seurat package:

library(Seurat)
p1 <- DimPlot(seuInt[,1: 4226], reduction = 'position', cols=cols_cluster, pt.size =1) # plot the first data batch: first 4226 spots.
p2 <- DimPlot(seuInt, reduction = 'tSNE',cols=cols_cluster, pt.size=1)
drawFigs(list(p1, p2), layout.dim = c(1,2), common.legend = TRUE)

Combined differential expression analysis

dat_deg <- FindAllMarkers(seuInt)
library(dplyr)
n <- 2
dat_deg %>%
  group_by(cluster) %>%
  top_n(n = n, wt = avg_log2FC) -> top10

head(top10)

Session Info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                              
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] PRECAST_1.6.4  gtools_3.9.2.2
#> 
#> loaded via a namespace (and not attached):
#>   [1] backports_1.4.1             GiRaF_1.0.1                
#>   [3] plyr_1.8.7                  igraph_1.3.5               
#>   [5] lazyeval_0.2.2              sp_1.5-0                   
#>   [7] splines_4.1.2               BiocParallel_1.28.3        
#>   [9] listenv_0.8.0               scattermore_0.8            
#>  [11] GenomeInfoDb_1.30.1         ggplot2_3.4.1              
#>  [13] scater_1.25.1               digest_0.6.29              
#>  [15] htmltools_0.5.2             viridis_0.6.2              
#>  [17] fansi_1.0.4                 magrittr_2.0.3             
#>  [19] ScaledMatrix_1.2.0          tensor_1.5                 
#>  [21] cluster_2.1.2               ROCR_1.0-11                
#>  [23] globals_0.15.0              matrixStats_0.62.0         
#>  [25] spatstat.sparse_2.1-1       colorspace_2.1-0           
#>  [27] ggrepel_0.9.1               xfun_0.29                  
#>  [29] dplyr_1.0.9                 RCurl_1.98-1.6             
#>  [31] jsonlite_1.8.0              progressr_0.10.1           
#>  [33] spatstat.data_3.0-0         survival_3.2-13            
#>  [35] zoo_1.8-10                  glue_1.6.2                 
#>  [37] polyclip_1.10-0             gtable_0.3.3               
#>  [39] zlibbioc_1.40.0             XVector_0.34.0             
#>  [41] leiden_0.4.2                DelayedArray_0.20.0        
#>  [43] car_3.0-12                  BiocSingular_1.10.0        
#>  [45] future.apply_1.9.0          SingleCellExperiment_1.16.0
#>  [47] BiocGenerics_0.40.0         abind_1.4-5                
#>  [49] scales_1.2.1                DBI_1.1.2                  
#>  [51] rstatix_0.7.0               spatstat.random_2.2-0      
#>  [53] ggthemes_4.2.4              miniUI_0.1.1.1             
#>  [55] Rcpp_1.0.10                 viridisLite_0.4.1          
#>  [57] xtable_1.8-4                reticulate_1.25            
#>  [59] spatstat.core_2.4-4         rsvd_1.0.5                 
#>  [61] mclust_5.4.10               stats4_4.1.2               
#>  [63] htmlwidgets_1.5.4           httr_1.4.3                 
#>  [65] RColorBrewer_1.1-3          ellipsis_0.3.2             
#>  [67] Seurat_4.1.1                ica_1.0-2                  
#>  [69] scuttle_1.4.0               pkgconfig_2.0.3            
#>  [71] sass_0.4.1                  uwot_0.1.11                
#>  [73] deldir_1.0-6                utf8_1.2.3                 
#>  [75] tidyselect_1.1.2            rlang_1.1.0                
#>  [77] reshape2_1.4.4              later_1.3.0                
#>  [79] munsell_0.5.0               tools_4.1.2                
#>  [81] cli_3.2.0                   generics_0.1.2             
#>  [83] broom_0.7.12                ggridges_0.5.3             
#>  [85] evaluate_0.15               stringr_1.4.0              
#>  [87] fastmap_1.1.0               yaml_2.3.6                 
#>  [89] goftest_1.2-3               knitr_1.37                 
#>  [91] fitdistrplus_1.1-8          purrr_0.3.4                
#>  [93] RANN_2.6.1                  sparseMatrixStats_1.6.0    
#>  [95] pbapply_1.5-0               future_1.26.1              
#>  [97] nlme_3.1-155                mime_0.12                  
#>  [99] formatR_1.11                compiler_4.1.2             
#> [101] rstudioapi_0.13             beeswarm_0.4.0             
#> [103] plotly_4.10.0               png_0.1-7                  
#> [105] ggsignif_0.6.3              spatstat.utils_3.0-1       
#> [107] tibble_3.2.1                bslib_0.3.1                
#> [109] stringi_1.7.6               rgeos_0.5-9                
#> [111] lattice_0.20-45             Matrix_1.4-0               
#> [113] vctrs_0.6.1                 CompQuadForm_1.4.3         
#> [115] pillar_1.9.0                lifecycle_1.0.3            
#> [117] spatstat.geom_2.4-0         lmtest_0.9-40              
#> [119] jquerylib_0.1.4             BiocNeighbors_1.12.0       
#> [121] RcppAnnoy_0.0.19            bitops_1.0-7               
#> [123] data.table_1.14.2           cowplot_1.1.1              
#> [125] irlba_2.3.5                 GenomicRanges_1.46.1       
#> [127] httpuv_1.6.5                patchwork_1.1.1            
#> [129] R6_2.5.1                    promises_1.2.0.1           
#> [131] KernSmooth_2.23-20          gridExtra_2.3              
#> [133] vipor_0.4.5                 IRanges_2.28.0             
#> [135] parallelly_1.32.0           codetools_0.2-18           
#> [137] MASS_7.3-55                 assertthat_0.2.1           
#> [139] DR.SC_3.4                   SummarizedExperiment_1.24.0
#> [141] SeuratObject_4.1.0          sctransform_0.3.3          
#> [143] GenomeInfoDbData_1.2.7      S4Vectors_0.32.3           
#> [145] mgcv_1.8-39                 beachmat_2.10.0            
#> [147] grid_4.1.2                  rpart_4.1.16               
#> [149] tidyr_1.2.0                 DelayedMatrixStats_1.16.0  
#> [151] rmarkdown_2.11              carData_3.0-5              
#> [153] MatrixGenerics_1.6.0        Rtsne_0.16                 
#> [155] ggpubr_0.4.0                Biobase_2.54.0             
#> [157] shiny_1.7.1                 ggbeeswarm_0.6.0