This tutorial describes the preprocessing of single-cell CRISPR screen (CROP-seq) dataset in Lalli et al., and how to run GSFA on the processed data.
Due to the size of data, we recommend running the code in this tutorial in an R script instead of in R markdown.
library(data.table)
library(tidyverse)
library(Matrix)
library(Seurat)
library(GSFA)
library(ggplot2)
Original CROP-seq study:
High-throughput single-cell functional elucidation of neurodevelopmental
disease-associated genes reveals convergent mechanisms altering neuronal
differentiation. Genome Res. (2020).
Data source:
GEO accession: GSE142078,
GSE142078_RAW.tar
file.
The study targeted 14 neurodevelopmental genes, including 13 autism risk genes, with CRISPR knock-down in Lund human mesencephalic (LUHMES) neural progenitor cells. After CRISPR targeting, the cells were differentiated into postmitotic neurons and sequenced. The overall goal is to understand the effects of these target genes on the transcriptome and on the process of neuronal differentiation.
The original data came in three batches, each in standard
cellranger
single-cell RNA-seq output format. Below is the
code to merge all batches of cells together into one dataset.
Meanwhile, each cell is also assigned its CRISPR perturbation target
based on the gRNA readout. Although each gene was targeted by 3 gRNAs,
and 5 non-targeting gRNAs were designed as negative control, only
gene-level perturbations are assigned to cells, resulting in 15 (14
genes + negative control) perturbation groups in total.
(Each cell contains a unique gRNA, so the assignment is
straight-forward.)
## Change the following directory to where the downloaded data is:
data_dir <- "/project2/xinhe/yifan/Factor_analysis/LUHMES/GSE142078_raw/"
filename_tb <- data.frame(experiment = c("Run1", "Run2", "Run3"),
prefix = c("GSM4219575_Run1", "GSM4219576_Run2", "GSM4219577_Run3"),
stringsAsFactors = F)
seurat_lst <- list()
guide_lst <- list()
for (i in 1:3){
experiment <- filename_tb$experiment[i]
prefix <- filename_tb$prefix[i]
cat(paste0("Loading data of ", experiment, " ..."))
cat("\n\n")
feature.names <- data.frame(fread(paste0(data_dir, prefix, "_genes.tsv.gz"),
header = FALSE), stringsAsFactors = FALSE)
barcode.names <- data.frame(fread(paste0(data_dir, prefix, "_barcodes.tsv.gz"),
header = FALSE), stringsAsFactors = FALSE)
barcode.names$V2 <- sapply(strsplit(barcode.names$V1, split = "-"),
function(x){x[1]})
# Load the gene count matrix (gene x cell) and annotate the dimension names:
dm <- readMM(file = paste0(data_dir, prefix, "_matrix.mtx.gz"))
rownames(dm) <- feature.names$V1
colnames(dm) <- barcode.names$V2
# Load the meta data of cells:
metadata <- data.frame(fread(paste0(data_dir, prefix, "_Cell_Guide_Lookup.csv.gz"),
header = T, sep = ','), check.names = F)
metadata$target <- sapply(strsplit(metadata$sgRNA, split = "_"),
function(x){x[1]})
metadata <- metadata %>% filter(CellBarcode %in% barcode.names$V2)
targets <- unique(metadata$target)
targets <- targets[order(targets)]
# Make a cell by perturbation matrix:
guide_mat <- data.frame(matrix(nrow = nrow(metadata),
ncol = length(targets)))
rownames(guide_mat) <- metadata$CellBarcode
colnames(guide_mat) <- targets
for (i in targets){
guide_mat[[i]] <- (metadata$target == i) * 1
}
guide_lst[[experiment]] <- guide_mat
# Only keep cells with gRNA info:
dm.cells_w_gRNA <- dm[, metadata$CellBarcode]
real_gene_bool <- startsWith(rownames(dm.cells_w_gRNA), "ENSG")
dm.cells_w_gRNA <- dm.cells_w_gRNA[real_gene_bool, ]
cat("Dimensions of the final gene expression matrix: ")
cat(dim(dm.cells_w_gRNA))
cat("\n\n")
dm.seurat <- CreateSeuratObject(dm.cells_w_gRNA, project = paste0("LUHMES_", experiment))
dm.seurat <- AddMetaData(dm.seurat, metadata = guide_mat)
seurat_lst[[experiment]] <- dm.seurat
}
Loading data of Run1 …
Dimensions of the final gene expression matrix: 33694 4336
Loading data of Run2 …
Dimensions of the final gene expression matrix: 33694 2436
Loading data of Run3 …
Dimensions of the final gene expression matrix: 33694 2071
combined_obj <- merge(seurat_lst[[1]], c(seurat_lst[[2]], seurat_lst[[3]]),
add.cell.ids = c("Run1", "Run2", "Run3"),
project = "LUHMES")
Dimensions of the merged gene expression matrix:
paste0("Genes: ", dim(combined_obj)[1])
paste0("Cells: ", dim(combined_obj)[2])
[1] "Genes: 33694"
[1] "Cells: 8843"
Next, Seurat is used to filter cells with a library size > 20000 or more than 10% of total read counts from mitochondria genes.
The violin plots show the distributions of unique UMI count, library size, and mitochondria gene percentage in cells from each batch.
MT_genes <- feature.names %>% filter(startsWith(V2, "MT-")) %>% pull(V1)
combined_obj[['percent_mt']] <- PercentageFeatureSet(combined_obj,
features = MT_genes)
combined_obj <- subset(combined_obj,
subset = percent_mt < 10 & nCount_RNA < 20000)
VlnPlot(combined_obj,
features = c('nFeature_RNA', 'nCount_RNA', 'percent_mt'),
pt.size = 0)
Dimensions of the merged gene expression matrix after QC:
paste0("Genes: ", dim(combined_obj)[1])
paste0("Cells: ", dim(combined_obj)[2])
[1] "Genes: 33694"
[1] "Cells: 8708"
To accommodate the application of GSFA, we adopt the transformation proposed in Townes et al., and convert the scRNA-seq count data into deviance residuals, a continuous quantity analogous to z-scores that approximately follows a normal distribution.
The deviance residual transformation overcomes some problems of the commonly used log transformation of read counts, and has been shown to improve downstream analyses, such as feature selection and clustering.
In the following, we used the
deviance_residual_transform
function in GSFA
to perform the tranformation.
Due to the size of the data, this process can take up to 30
minutes.
dev_res <- deviance_residual_transform(t(as.matrix(combined_obj@assays$RNA@counts)))
Genes with constant expression across cells are not informative and will have a deviance of 0, while genes that vary across cells in expression will have a larger deviance.
Therefore, one can pick the genes with high deviance as an alternative to selecting highly variable genes, with the advantage that the selection is not sensitive to normalization.
Here we select the top 6000 genes ranked by deviance statistics, and downsize the gene expression matrix.
top_gene_index <- select_top_devres_genes(dev_res, num_top_genes = 6000)
dev_res_filtered <- dev_res[, top_gene_index]
We further regressed out the differences in experimental batch, unique UMI count, library size, and percentage of mitochondrial gene expression.
covariate_df <- data.frame(batch = factor(combined_obj$orig.ident),
lib_size = combined_obj$nCount_RNA,
umi_count = combined_obj$nFeature_RNA,
percent_mt = combined_obj$percent_mt)
dev_res_corrected <- covariate_removal(dev_res_filtered, covariate_df)
scaled.gene_exp <- scale(dev_res_corrected)
The corrected and scaled gene expression matrix (8708 cells by 6000
genes) will be used as input for GSFA input argument Y
. We
annonate this matrix with cell and gene names so we can keep track.
sample_names <- colnames(combined_obj@assays$RNA@counts)
gene_names <- rownames(combined_obj@assays$RNA@counts)
rownames(scaled.gene_exp) <- sample_names
colnames(scaled.gene_exp) <- gene_names[top_gene_index]
In addition, we have a cell by perturbation matrix containing
gene-level perturbation conditions across cells for GSFA input argument
G
:
G_mat <- combined_obj@meta.data[, 4:18]
G_mat <- as.matrix(G_mat)
Here is the number of cells under each gene-level perturbation:
num_cells <- colSums(G_mat)
num_cells_df <- data.frame(locus = names(num_cells),
count = num_cells)
ggplot(data = num_cells_df, aes(x=locus, y=count)) +
geom_bar(stat = "identity", width = 0.6) +
labs(x = "Perturbation",
y = "Number of cells") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size =11),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14),
panel.grid.minor = element_blank())
Now that we have the inputs we need (scaled.gene_exp
,
G_mat
), we can perform GSFA (guided sparse factor analysis)
on the data.
In fit_gsfa_multivar()
, we specify 20 factors
initialized from truncated SVD and run Gibbs sampling for 3000
iterations, with the posterior mean estimates computed over the last
1000 iterations of samples.
This process is both time and memory intensive, and should be run in an R script separately, preferably on a high performance computing cluster that can guarantee 30GB memory and 3.5 (typically ~3) hours of runtime without interruption.
set.seed(14314)
fit <- fit_gsfa_multivar(Y = scaled.gene_exp, G = G_mat,
K = 20,
prior_type = "mixture_normal",
init.method = "svd",
niter = 3000, used_niter = 1000,
verbose = T, return_samples = T)
More options for running GSFA are available in the
R/run_gsfa_LUHMES.R
script. For example, one can run Gibbs
sampling in segments and restart from a previous run if limited
computational resources are available.
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
[4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GSFA_0.2.8 SeuratObject_4.1.3 Seurat_4.3.0 Matrix_1.5-3
[5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1
[9] readr_2.1.4 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.1
[13] tidyverse_1.3.2 data.table_1.14.6
loaded via a namespace (and not attached):
[1] readxl_1.4.2 backports_1.4.1 plyr_1.8.7
[4] igraph_1.4.0 lazyeval_0.2.2 sp_1.6-0
[7] splines_4.2.0 listenv_0.8.0 scattermore_0.8
[10] digest_0.6.31 htmltools_0.5.4 fansi_1.0.4
[13] magrittr_2.0.3 tensor_1.5 googlesheets4_1.0.1
[16] cluster_2.1.3 ROCR_1.0-11 tzdb_0.3.0
[19] globals_0.15.0 modelr_0.1.10 matrixStats_0.63.0
[22] R.utils_2.12.2 timechange_0.2.0 spatstat.sparse_3.0-0
[25] colorspace_2.1-0 rvest_1.0.3 ggrepel_0.9.3
[28] haven_2.5.1 xfun_0.37 crayon_1.5.2
[31] jsonlite_1.8.4 progressr_0.10.0 spatstat.data_3.0-0
[34] survival_3.3-1 zoo_1.8-10 glue_1.6.2
[37] polyclip_1.10-0 gtable_0.3.1 gargle_1.3.0
[40] leiden_0.4.2 future.apply_1.9.0 abind_1.4-5
[43] scales_1.2.1 DBI_1.1.3 spatstat.random_3.1-3
[46] miniUI_0.1.1.1 Rcpp_1.0.10 viridisLite_0.4.1
[49] xtable_1.8-4 reticulate_1.24 htmlwidgets_1.6.1
[52] httr_1.4.4 RColorBrewer_1.1-3 ellipsis_0.3.2
[55] ica_1.0-2 farver_2.1.1 R.methodsS3_1.8.2
[58] pkgconfig_2.0.3 sass_0.4.5 uwot_0.1.14
[61] dbplyr_2.3.0 deldir_1.0-6 utf8_1.2.3
[64] labeling_0.4.2 tidyselect_1.2.0 rlang_1.0.6
[67] reshape2_1.4.4 later_1.3.0 munsell_0.5.0
[70] cellranger_1.1.0 tools_4.2.0 cachem_1.0.6
[73] cli_3.6.0 generics_0.1.3 broom_1.0.3
[76] ggridges_0.5.3 evaluate_0.20 fastmap_1.1.0
[79] yaml_2.3.7 goftest_1.2-3 knitr_1.42
[82] fs_1.6.1 fitdistrplus_1.1-8 pander_0.6.5
[85] RANN_2.6.1 pbapply_1.5-0 future_1.25.0
[88] nlme_3.1-157 mime_0.12 ggrastr_1.0.1
[91] R.oo_1.25.0 xml2_1.3.3 compiler_4.2.0
[94] rstudioapi_0.14 beeswarm_0.4.0 plotly_4.10.0
[97] png_0.1-8 spatstat.utils_3.0-1 reprex_2.0.2
[100] bslib_0.4.2 stringi_1.7.12 highr_0.10
[103] lattice_0.20-45 vctrs_0.5.2 pillar_1.8.1
[106] lifecycle_1.0.3 spatstat.geom_3.0-6 lmtest_0.9-40
[109] jquerylib_0.1.4 RcppAnnoy_0.0.19 cowplot_1.1.1
[112] irlba_2.3.5 httpuv_1.6.5 patchwork_1.1.1
[115] R6_2.5.1 promises_1.2.0.1 KernSmooth_2.23-20
[118] gridExtra_2.3 vipor_0.4.5 parallelly_1.34.0
[121] codetools_0.2-18 MASS_7.3-56 assertthat_0.2.1
[124] withr_2.5.0 sctransform_0.3.5 parallel_4.2.0
[127] hms_1.1.2 grid_4.2.0 rmarkdown_2.20
[130] googledrive_2.0.0 Rtsne_0.16 spatstat.explore_3.0-6
[133] shiny_1.7.1 lubridate_1.9.2 ggbeeswarm_0.7.1