Simulate a Continuous Gene Expression Matrix and an Accompanying Perturbation Matrix

Generate a binary perturbation matrix and a continuous gene expression matrix in a bottom-up fashion according to a hierarchical factor model with normal noise terms.
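
One way to read this description, using the argument names from the Usage section as symbols, is the generative model below. It is a sketch inferred from the argument and return-value descriptions, not a statement of the exact implementation. In particular, the unit variance of the factor noise \(\Phi\) and the roles of \(\sigma_{w,k}^2\) (sigma_w2_true) and \(\psi\) (psi_true) are assumptions.

\[
\begin{aligned}
G_{im} &\sim \mathrm{Bernoulli}(\text{G\_prob}), \\
Z &= G\beta + \Phi, & \Phi_{ik} &\sim N(0, 1), \\
F_{jk} &\sim \mathrm{Bernoulli}(\pi_k), & U_{jk} &\sim N(0, \sigma_{w,k}^2), \\
W &= F \odot U, \\
Y &= Z W^{\top} + E, & E_{ij} &\sim N(0, \psi).
\end{aligned}
\]

Here \(i\) indexes samples, \(j\) genes, \(m\) perturbations and \(k\) factors, and \(\odot\) denotes element-wise multiplication.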

Usage

normal_data_sim(
  N = 400,
  P = 600,
  beta_true,
  K = ncol(beta_true),
  M = nrow(beta_true),
  pi_true = rep(0.1, K),
  sigma_w2_true = rep(0.5, K),
  psi_true = 1,
  G_prob = 0.2,
  offset = FALSE
)

Arguments

N: Number of samples to simulate
P: Number of genes to simulate
beta_true: An \(M\) by \(K\) numeric matrix of true effect sizes for the perturbation-factor associations; when offset=TRUE, \(M+1\) rows should be provided instead.
K: Number of factors to simulate
M: Number of perturbations to simulate
pi_true: The true density (proportion of nonzero gene loadings) of each factor
G_prob: The Bernoulli probability used to generate the binary perturbation matrix G; determines the frequency of each perturbation in the sample population
offset: Default is FALSE. If TRUE, beta_true should have \(M+1\) rows, with the last row storing the intercept values \(\beta_0\)
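
As a small illustration of the offset convention described above, the following sketch builds a beta_true matrix with an extra intercept row. The values and the names beta_effects, beta_0 and beta_with_offset are made up for illustration; only the shape requirement (\(M+1\) rows, intercept last) comes from the argument descriptions.

```r
# Hypothetical effect sizes for M = 2 perturbations and K = 5 factors.
beta_effects <- rbind(c(1, 0,   0, 0, 0),
                      c(0, 0.8, 0, 0, 0))

# Hypothetical intercepts, one per factor (the beta_0 row).
beta_0 <- c(0.5, 0.5, 0, 0, 0)

# With offset = TRUE the intercept row goes last, giving M + 1 = 3 rows.
beta_with_offset <- rbind(beta_effects, beta_0)
dim(beta_with_offset)
```

Note that the default M = nrow(beta_true) would count the intercept row as well, so it may be necessary to pass M explicitly (here M = 2) when offset = TRUE. This is an inference from the function signature rather than documented behavior, so it is worth checking against the returned G.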

Value

A list object with the following elements:

Y: a sample-by-gene matrix with continuous gene expression values;
G: a binary sample-by-perturbation matrix;
Z: a sample-by-factor matrix;
F: a binary gene-by-factor matrix that indicates whether a gene has a non-zero loading in each factor;
U: a gene-by-factor matrix of normally distributed effect sizes; F*U (element-wise multiplication) gives the loading matrix W.
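
To make the relationship between these elements concrete, here is a minimal base-R sketch that generates matrices with the same shapes and roles. It is an illustration under stated assumptions, not the function's actual implementation: the standard normal noise on Z, the use of sigma_w2_true as the per-factor variance of U, and the use of psi_true as the residual variance of Y are all inferred from the argument names. The name F_ind is used for the indicator matrix to avoid masking R's built-in F.

```r
set.seed(1)

# Small, made-up dimensions so the sketch runs instantly.
N <- 50; P <- 30
beta_true <- rbind(c(1, 0, 0), c(0, 0.8, 0))  # M = 2 perturbations, K = 3 factors
M <- nrow(beta_true); K <- ncol(beta_true)
pi_true <- rep(0.1, K); sigma_w2_true <- rep(0.5, K)
psi_true <- 1; G_prob <- 0.2

# G: binary sample-by-perturbation matrix, each entry Bernoulli(G_prob).
G <- matrix(rbinom(N * M, size = 1, prob = G_prob), nrow = N, ncol = M)

# Z: sample-by-factor matrix; perturbation effects plus
# standard normal noise (assumed).
Z <- G %*% beta_true + matrix(rnorm(N * K), nrow = N, ncol = K)

# F: binary gene-by-factor indicators; column k has density pi_true[k].
F_ind <- sapply(pi_true, function(p) rbinom(P, size = 1, prob = p))

# U: gene-by-factor normal effect sizes; column k has
# variance sigma_w2_true[k] (assumed).
U <- sapply(sigma_w2_true, function(s2) rnorm(P, mean = 0, sd = sqrt(s2)))

# W: loading matrix, the element-wise product of F and U.
W <- F_ind * U

# Y: sample-by-gene expression; factors times loadings plus
# normal noise with variance psi_true (assumed).
Y <- Z %*% t(W) + matrix(rnorm(N * P, sd = sqrt(psi_true)), nrow = N, ncol = P)

sim_sketch <- list(Y = Y, G = G, Z = Z, F = F_ind, U = U)
sapply(sim_sketch, dim)  # rows and columns of each element
```

Because W is zero wherever F is zero, the proportion of nonzero entries in each column of W should be close to the corresponding entry of pi_true when P is large.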

Examples

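# Fix the random seed so the simulated data are reproducible.
set.seed(12345)
# Two perturbations (rows) and five factors (columns):
# perturbation 1 affects factor 1, perturbation 2 affects factor 2.
beta_true <- rbind(c(1, 0, 0, 0, 0), c(0, 0.8, 0, 0, 0))
# K and M default to ncol(beta_true) and nrow(beta_true).
sim_data <- normal_data_sim(N = 4000, P = 6000, beta_true = beta_true)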