Causal inference for TWAS

ctwas(
  pgenfs,
  exprfs,
  Y,
  ld_regions = c("EUR", "ASN", "AFR"),
  ld_regions_version = c("b37", "b38"),
  ld_regions_custom = NULL,
  thin = 1,
  prob_single = 0.8,
  max_snp_region = Inf,
  rerun_gene_PIP = 0.8,
  niter1 = 3,
  niter2 = 30,
  L = 5,
  group_prior = NULL,
  group_prior_var = NULL,
  estimate_group_prior = T,
  estimate_group_prior_var = T,
  use_null_weight = T,
  coverage = 0.95,
  standardize = T,
  ncore = 1,
  outputdir = getwd(),
  outname = NULL,
  logfile = NULL
)

Arguments

pgenfs

A character vector of .pgen or .bed files, one file per chromosome, in order from chromosome 1 to 22; the vector must therefore have length 22. If .pgen files are given, the corresponding .pvar and .psam files are assumed to be present in the same directory. If .bed files are given, the corresponding .bim and .fam files are assumed to be present in the same directory.
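As a minimal sketch of building such a vector, the snippet below assembles 22 paths in chromosome order and derives the companion file names. The directory name `genotypes` and the `chr<N>.pgen` naming pattern are assumptions for illustration only; they are not required by the function.

```r
# Hypothetical directory and file-name pattern; adjust to your own data.
geno_dir <- "genotypes"
pgenfs <- file.path(geno_dir, paste0("chr", 1:22, ".pgen"))

# Companion .pvar and .psam paths expected alongside each .pgen file.
pvarfs <- sub("\\.pgen$", ".pvar", pgenfs)
psamfs <- sub("\\.pgen$", ".psam", pgenfs)

# TRUE only when every genotype file and its companions exist on disk.
all_present <- all(file.exists(c(pgenfs, pvarfs, psamfs)))
```

Checking `all_present` before a long run is a cheap way to catch a missing chromosome file early.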

exprfs

A character vector of `.expr` or `.expr.gz` files, one file per chromosome, in order from chromosome 1 to 22; the vector must therefore have length 22. A `.expr.gz` file is a gzip-compressed `.expr` file. A `.expr` file is a matrix of imputed expression values, with one row per sample and one column per gene. Its sample order must be the same as in the files provided by `pgenfs`. Corresponding `.exprvar` files are assumed to be present in the same directory. `.exprvar` files are tab-delimited text files with the columns:

chrom: chromosome number, numeric
p0: gene boundary position, the smaller value
p1: gene boundary position, the larger value
id: gene id

The rows of each `.exprvar` file should be in the same order as the columns of the corresponding `.expr` file.
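The layout described above can be sketched with a toy pair of files for one chromosome. All values and gene ids are made up. The snippet assumes that `.expr` is a delimited numeric matrix without row or column names and that `.exprvar` carries a header row with the four column names; neither detail is stated above, so check both against your own pipeline.

```r
# Toy data: 4 samples, 3 genes on chromosome 1 (values and ids are made up).
set.seed(1)
expr <- matrix(rnorm(12), nrow = 4, ncol = 3)  # rows = samples, columns = genes
exprvar <- data.frame(
  chrom = c(1, 1, 1),
  p0    = c(1000, 5000, 9000),   # smaller gene boundary position
  p1    = c(2000, 6500, 12000),  # larger gene boundary position
  id    = c("geneA", "geneB", "geneC")
)

out_dir      <- tempdir()
expr_file    <- file.path(out_dir, "chr1.expr")
exprvar_file <- file.path(out_dir, "chr1.exprvar")

# Assumption: no row or column names in .expr; header row in .exprvar.
write.table(expr, expr_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
write.table(exprvar, exprvar_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Read both files back and check that they agree with each other.
expr_back    <- as.matrix(read.table(expr_file, sep = "\t"))
exprvar_back <- read.table(exprvar_file, sep = "\t", header = TRUE,
                           stringsAsFactors = FALSE)

# One .exprvar row per .expr column, and p0 never exceeds p1.
stopifnot(nrow(exprvar_back) == ncol(expr_back),
          all(exprvar_back$p0 <= exprvar_back$p1))
```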

Y

A vector of length n containing the phenotype, in the same sample order as the files provided by `pgenfs` (as defined in the .psam or .fam files).
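One way to put a phenotype into the genotype sample order is to match sample ids, as in the sketch below. The id vector stands in for the sample id column of a .fam or .psam file, and the `pheno` table with its column names is hypothetical.

```r
# Toy sample order, standing in for the ids read from a .fam or .psam file.
geno_ids <- c("s3", "s1", "s4", "s2")

# Hypothetical phenotype table, stored in a different order.
pheno <- data.frame(id = c("s1", "s2", "s3", "s4"),
                    y  = c(0.5, -1.2, 2.0, 0.1))

# Position of each genotype sample in the phenotype table.
idx <- match(geno_ids, pheno$id)
stopifnot(!anyNA(idx))  # every genotyped sample needs a phenotype

# Phenotype vector reordered to follow the genotype sample order.
Y <- pheno$y[idx]
```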

ld_regions

A string representing the population to use for defining LD regions. These LD regions were previously defined by ldetect. The user can also provide custom LD regions matching genotype data, see ld_regions_custom.

ld_regions_version

A string representing the genome reference build ("b37", "b38") to use for defining LD regions. See ld_regions.

ld_regions_custom

A bed format file defining LD regions. The default is NULL; when specified, ld_regions and ld_regions_version will be ignored.

thin

The proportion of SNPs to be used for the parameter estimation and initial fine mapping steps. Smaller thin parameters reduce runtime at the expense of accuracy. The fine mapping step is rerun using full SNPs for regions with strong gene signals; see rerun_gene_PIP.

prob_single

Blocks whose probability of having one or fewer effects is greater than prob_single are used for parameter estimation.

max_snp_region

Inf or an integer. The maximum number of SNPs in a region. The default is Inf (no limit). This can be useful if a region contains many SNPs and there is not enough memory to run the program. It applies only to the final rerun step, in which susie is rerun with full SNPs for regions with strong gene signals.

rerun_gene_PIP

If thin < 1, blocks whose maximum gene PIP is greater than rerun_gene_PIP are rerun using full SNPs. If rerun_gene_PIP is 0, all blocks are rerun with full SNPs.
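The selection rule described for thin and rerun_gene_PIP can be illustrated with a toy function. This is a sketch of the rule as worded above, not the package's internal code; the helper `regions_to_rerun` and the per-region PIP values are hypothetical.

```r
# Hypothetical helper: names of regions that would be rerun with full SNPs.
regions_to_rerun <- function(max_gene_pip, thin, rerun_gene_PIP) {
  # With thin = 1 the full SNP set was already used, so nothing is rerun.
  if (thin >= 1) return(character(0))
  names(max_gene_pip)[max_gene_pip > rerun_gene_PIP]
}

# Made-up maximum gene PIP per region.
max_gene_pip <- c(region1 = 0.95, region2 = 0.40,
                  region3 = 0.81, region4 = 0.80)

rerun_default <- regions_to_rerun(max_gene_pip, thin = 0.1, rerun_gene_PIP = 0.8)
rerun_all     <- regions_to_rerun(max_gene_pip, thin = 0.1, rerun_gene_PIP = 0)
rerun_none    <- regions_to_rerun(max_gene_pip, thin = 1,   rerun_gene_PIP = 0.8)
```

Because the comparison is strict, a region whose maximum gene PIP equals the threshold exactly (region4 here) is not selected in this sketch.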

niter1

the number of iterations of the E-M algorithm to perform during the initial parameter estimation step

niter2

the number of iterations of the E-M algorithm to perform during the complete parameter estimation step

L

the number of effects for susie during the fine mapping steps

group_prior

a vector of two prior inclusion probabilities for SNPs and genes. This is ignored if estimate_group_prior = T

group_prior_var

a vector of two prior variances for SNP and gene effects. This is ignored if estimate_group_prior_var = T

estimate_group_prior

TRUE/FALSE. If TRUE, the prior inclusion probabilities for SNPs and genes are estimated using the data. If FALSE, group_prior must be specified

estimate_group_prior_var

TRUE/FALSE. If TRUE, the prior variances for SNPs and genes are estimated using the data. If FALSE, group_prior_var must be specified

use_null_weight

TRUE/FALSE. If TRUE, allow for a probability of no effect in susie

coverage

A number between 0 and 1 specifying the “coverage” of the estimated confidence sets

standardize

TRUE/FALSE. If TRUE, all variables are standardized to unit variance

ncore

The number of cores used to parallelize susie over regions

outputdir

a string, the directory to store output

outname

a string, the output name

logfile

The log file. If NULL, log information is printed to the screen.
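Putting the arguments together, a call might be assembled as in the sketch below. The directory names, the `thin = 0.1` setting, the output name, and the simulated phenotype are all placeholders chosen for illustration. The call itself is guarded so that it only runs when the input files exist and the package is installed.

```r
# Hypothetical input locations; replace with real per-chromosome files.
pgenfs <- file.path("genotypes",  paste0("chr", 1:22, ".pgen"))
exprfs <- file.path("expression", paste0("chr", 1:22, ".expr.gz"))

# Placeholder phenotype; in practice use real values in genotype sample order.
set.seed(1)
Y <- rnorm(100)

# Arguments not listed here keep the defaults shown in the usage above.
ctwas_args <- list(
  pgenfs             = pgenfs,
  exprfs             = exprfs,
  Y                  = Y,
  ld_regions         = "EUR",
  ld_regions_version = "b37",
  thin               = 0.1,      # illustrative value, not a recommendation
  rerun_gene_PIP     = 0.8,
  ncore              = 1,
  outputdir          = tempdir(),
  outname            = "example"
)

# Run only when the inputs are actually available.
inputs_ready <- all(file.exists(c(pgenfs, exprfs)))
if (inputs_ready && requireNamespace("ctwas", quietly = TRUE)) {
  do.call(ctwas::ctwas, ctwas_args)
}
```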