Applying GSFA in a simulated dataset

Environment set-up

Load the GSFA package:

library(GSFA)

set.seed(46568)

Simulate a data set

We generate a normal gene expression matrix $Y$ with 400 samples and 600 genes and a binary perturbation matrix $G$ with 2 types of perturbations according to:

\[G_{im} \overset{i.i.d.}{\sim} \text{Bern}(0.2), \phi_{ik} \overset{i.i.d.}{\sim} N(0,0.5) \Rightarrow Z = G \beta + \Phi,\] \[F_{jk} \overset{i.i.d.}{\sim} \text{Bern}(0.1), U_{jk} \overset{i.i.d.}{\sim} N(0, 0.5) \Rightarrow W_{jk}=F_{jk}\cdot U_{jk},\] \[E_{ij} \overset{i.i.d.}{\sim} N(0,1) \Rightarrow Y = ZW^T+E.\] Gene expression $Y$ was generated from 5 factors, with each factor has ~0.1 of all genes with non-zero loading in it.

The true association effects between factors and perturbations, $\beta$, are set to: \[\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.8 & 0 & 0 & 0 \end{pmatrix}\]

That is, the 1st factor is associated with perturbation 1, and the 2nd factor associated with perturbation 2.

N: Number of samples, P: Number of genes, K: Number of factors, M: Number of perturbations in the simulated data.

beta_true <- rbind(c(1, 0, 0, 0, 0), 
                   c(0, 0.8, 0, 0, 0))
sim_data <- normal_data_sim(N = 400, P = 600, K = 5, M = 2,
                            beta_true = beta_true,
                            pi_true = rep(0.1, 5),
                            psi_true = 0.5, G_prob = 0.2)

Fit GSFA

Now we perform GSFA on the given normal expression data and binary perturbation matrix using fit_gsfa_multivar().

Below, 5 factors are specified in the model, Gibbs sampling is initialized with truncated SVD for 1000 iterations, with the posterior means computed using the last 500 iterations.

prior_w_s, prior_w_r are prior parameters of the gene loading on the factors, prior_beta_s, and prior_beta_r are prior parameters of the effects of perturbations on the factors. For details about the GSFA model and prior specification, please see the GSFA paper and Supplementary Notes (Section 1).

fit0 <- fit_gsfa_multivar(Y = sim_data$Y, G = sim_data$G,
                          K = 5, init.method = "svd",
                          prior_w_s = 10, prior_w_r = 0.2,
                          prior_beta_s = 5, prior_beta_r = 0.2,
                          niter = 1000, used_niter = 500,
                          verbose = T, return_samples = T)

Initializing Z and W with SVD.
Iteration [50] finished.
Iteration [100] finished.
Iteration [150] finished.
Iteration [200] finished.
Iteration [250] finished.
Iteration [300] finished.
Iteration [350] finished.
Iteration [400] finished.
Iteration [450] finished.
Iteration [500] finished.
Iteration [550] finished.
Iteration [600] finished.
Iteration [650] finished.
Iteration [700] finished.
Iteration [750] finished.
Iteration [800] finished.
Iteration [850] finished.
Iteration [900] finished.
Iteration [950] finished.
Iteration [1000] finished.
Computing total effects for each perturbation-gene pair.
Computing posterior means of parameters.

If the option return_samples = T, one can inspect the traces of samples throughout the iterations stored in fit0$*_samples slots.

If the sampling chain does not seem to have converged, we can continue the Gibbs sampling from the previous run by passing the previous fit object fit0 to fit_gsfa_multivar(). Below, the sampling is resumed for another 1000 iterations, with the posterior means recomputed using the last 500 iterations.

(Note that we no longer need to specify k, init.method and prior values this time.)

fit <- fit_gsfa_multivar(Y = sim_data$Y, G = sim_data$G,
                         fit0 = fit0,
                         niter = 1000, used_niter = 500,
                         verbose = F, return_samples = T)

Computing total effects for each perturbation-gene pair.
Computing posterior means of parameters.

Interpretation

Perturbation effects on factors

Note that factors are interchangeable, so their orders won’t necessarily match the original.

The estimated associations between factors and perturbations are:

signif(fit$posterior_means$beta_pm[-nrow(fit$posterior_means$beta_pm), ],
       digits = 3)

  Factor_1 Factor_2 Factor_3 Factor_4 Factor_5
1 -0.00112 -0.00274  -1.0200 -0.00343 -0.01090
2 -0.00768  0.77100  -0.0148  0.00630 -0.00176

Factor 3 is associated with perturbation 1 with an absolute effect size of ~1.
Factor 2 is associated with perturbation 2 with an absolute effect size of ~0.8.

The PIPs (posterior inclusion probability, a measurement of certainty) of associations between factors and perturbations are:

signif(fit$posterior_means$Gamma_pm[-nrow(fit$posterior_means$Gamma_pm), ],
       digits = 3)

  Factor_1 Factor_2 Factor_3 Factor_4 Factor_5
1     0.04    0.042    1.000    0.032    0.080
2     0.07    1.000    0.096    0.060    0.048

Associations with high certainty are:
Factor 3 ~ Perturbation 1 and Factor 2 ~ Perturbation 2.

Visualization of perturbation effects on factors:

dotplot_beta_PIP(fit, target_names = c("Perturbation 1", "Perturbation 2"))

Factor interpretation

Genes with non-zero loading on factors can be obtained by thresholding the gene PIP.

For example, non-zero genes in factor 2 are:

est_genes_factor2 <- which(fit$posterior_means$F_pm[, 2] > 0.95)

Compare with genes truly in factor 2:

true_genes_factor2 <- which(sim_data$F[, 2] > 0)
num_olap <- length(intersect(est_genes_factor2, true_genes_factor2))
# Sensitivity:
sens <- num_olap / length(true_genes_factor2)
print(paste0("Sensitivity: ", signif(sens, digits = 3)))
# Specificity:
fpr <- (length(est_genes_factor2) - num_olap) / (sum(sim_data$F[, 2] == 0))
print(paste0("Specificity: ", signif(1 - fpr, digits = 3)))

[1] "Sensitivity: 0.8"
[1] "Specificity: 1"

Perturbation effects on genes

Differentially expressed genes (DEGs) can be detected by thresholding LFSR.

DEGs detected under Perturbation 1 and the sensitivity and specificity of discovery:

genes_detected1 <- which(fit$lfsr[, 1] < 0.05)
print(paste0(length(genes_detected1), " genes passed LFSR < 0.05."))

true_genes_factor1 <- which(sim_data$F[, 1] > 0)
num_olap1 <- length(intersect(genes_detected1, true_genes_factor1))
# Sensitivity:
sens1 <- num_olap1 / length(true_genes_factor1)
print(paste0("Sensitivity: ", signif(sens1, digits = 3)))
# Specificity:
fpr1 <- (length(genes_detected1) - num_olap1) / (sum(sim_data$F[, 1] == 0))
print(paste0("Specificity: ", signif(1 - fpr1, digits = 3)))

[1] "46 genes passed LFSR < 0.05."
[1] "Sensitivity: 0.852"
[1] "Specificity: 1"

DEGs detected under Perturbation 2 and the sensitivity and specificity of discovery:

genes_detected2 <- which(fit$lfsr[, 2] < 0.05)
print(paste0(length(genes_detected2), " genes passed LFSR < 0.05."))

true_genes_factor2 <- which(sim_data$F[, 2] > 0)
num_olap2 <- length(intersect(genes_detected2, true_genes_factor2))
# Sensitivity:
sens2 <- num_olap2 / length(true_genes_factor2)
print(paste0("Sensitivity: ", signif(sens2, digits = 3)))
# Specificity:
fpr2 <- (length(genes_detected2) - num_olap2) / (sum(sim_data$F[, 2] == 0))
print(paste0("Specificity: ", signif(1 - fpr2, digits = 3)))

[1] "44 genes passed LFSR < 0.05."
[1] "Sensitivity: 0.8"
[1] "Specificity: 1"

Visualization of the total effects each perturbation has on selected genes:

dotplot_total_effect(fit,
                     gene_indices = c(1, 12, 14, 88, 89, 91, 123),
                     target_names = c("Perturbation 1", "Perturbation 2"))

Session information

sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.3 dplyr_1.1.3   GSFA_0.2.8   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11       plyr_1.8.8        bslib_0.5.1       compiler_4.2.1   
 [5] pillar_1.9.0      jquerylib_0.1.4   tools_4.2.1       digest_0.6.33    
 [9] gtable_0.3.4      jsonlite_1.8.7    evaluate_0.21     memoise_2.0.1    
[13] lifecycle_1.0.3   tibble_3.2.1      pkgconfig_2.0.3   rlang_1.1.1      
[17] cli_3.6.1         rstudioapi_0.15.0 yaml_2.3.7        pkgdown_2.0.7    
[21] xfun_0.40         fastmap_1.1.1     withr_2.5.1       stringr_1.5.0    
[25] knitr_1.44        desc_1.4.2        generics_0.1.3    fs_1.6.3         
[29] vctrs_0.6.3       sass_0.4.7        systemfonts_1.0.4 grid_4.2.1       
[33] rprojroot_2.0.3   tidyselect_1.2.0  glue_1.6.2        R6_2.5.1         
[37] textshaping_0.3.6 fansi_1.0.4       rmarkdown_2.25    farver_2.1.1     
[41] reshape2_1.4.4    purrr_1.0.2       magrittr_2.0.3    scales_1.2.1     
[45] htmltools_0.5.6   colorspace_2.1-0  labeling_0.4.3    ragg_1.2.5       
[49] utf8_1.2.3        stringi_1.7.12    munsell_0.5.0     cachem_1.0.8

Yifan Zhou

2023-10-03