Causal inference for TWAS using summary statistics

ctwas_rss(
  z_gene,
  z_snp,
  ld_exprvarfs,
  ld_exprfs = NULL,
  ld_pgenfs = NULL,
  ld_R_dir = NULL,
  ld_regions = c("EUR", "ASN", "AFR"),
  ld_regions_version = c("b37", "b38"),
  ld_regions_custom = NULL,
  thin = 1,
  prob_single = 0.8,
  rerun_gene_PIP = 0.8,
  niter1 = 3,
  niter2 = 30,
  L = 5,
  group_prior = NULL,
  group_prior_var = NULL,
  estimate_group_prior = T,
  estimate_group_prior_var = T,
  use_null_weight = T,
  coverage = 0.95,
  max_snp_region = Inf,
  ncore = 1,
  ncore.rerun = 1,
  outputdir = getwd(),
  outname = NULL,
  logfile = NULL,
  merge = TRUE
)

Arguments

z_gene

A data frame with two columns, "id" and "z", giving the z-scores for genes.

z_snp

A data frame with four columns, "id", "A1", "A2", and "z", giving the z-scores for SNPs. "A1" is the effect allele and "A2" is the other allele.
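As a rough illustration of the two input tables described above, the sketch below builds toy `z_gene` and `z_snp` data frames. All IDs, alleles, and z-scores are made up; only the column names come from this page.

```r
# Toy inputs; the IDs and z-scores below are invented for illustration.
z_gene <- data.frame(
  id = c("GENE1", "GENE2"),
  z  = c(2.31, -0.47),
  stringsAsFactors = FALSE
)

z_snp <- data.frame(
  id = c("rs0001", "rs0002", "rs0003"),
  A1 = c("A", "G", "T"),   # effect allele
  A2 = c("G", "A", "C"),   # other allele
  z  = c(1.12, -3.05, 0.20),
  stringsAsFactors = FALSE
)

# Check that the required columns are present before calling ctwas_rss().
stopifnot(all(c("id", "z") %in% names(z_gene)))
stopifnot(all(c("id", "A1", "A2", "z") %in% names(z_snp)))
```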

ld_exprvarfs

A character vector of `.exprvar` files, one per chromosome, ordered from chromosome 1 to 22; the vector must therefore have length 22. `.exprvar` files are tab-delimited text files with the following columns:

chrom: chromosome number, numeric
p0: gene boundary position, the smaller value
p1: gene boundary position, the larger value
id: gene id

The rows of each `.exprvar` file should be in the same order as the columns of the corresponding `.expr` file.
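The sketch below shows one way to assemble the 22-element `ld_exprvarfs` vector and to write a toy `.exprvar` table. The directory, file names, gene IDs, and positions are hypothetical, and writing a header row is an assumption rather than something this page states.

```r
# Hypothetical directory and file naming; adjust to your own layout.
expr_dir <- file.path(tempdir(), "expr")
dir.create(expr_dir, showWarnings = FALSE)

# One path per chromosome, ordered 1 to 22.
ld_exprvarfs <- file.path(expr_dir, paste0("toy_chr", 1:22, ".exprvar"))

# A toy .exprvar table for chromosome 1 (made-up genes and positions).
exprvar_chr1 <- data.frame(
  chrom = c(1, 1),
  p0    = c(10000, 52000),   # smaller boundary position
  p1    = c(15000, 60000),   # larger boundary position
  id    = c("GENE1", "GENE2")
)

# Tab-delimited text; the header row is an assumption.
write.table(exprvar_chr1, ld_exprvarfs[1], sep = "\t",
            quote = FALSE, row.names = FALSE)

stopifnot(length(ld_exprvarfs) == 22)
```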

ld_pgenfs

A character vector of .pgen or .bed files, one per chromosome, ordered from chromosome 1 to 22; the vector must therefore have length 22. If .pgen files are given, the matching .pvar and .psam files are assumed to be present in the same directory. If .bed files are given, the matching .bim and .fam files are assumed to be present in the same directory.
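A minimal sketch of building `ld_pgenfs` and listing the companion files expected next to each genotype file. `companion_files()` is a hypothetical helper written for this example, not part of the package, and the `ld_ref` paths are placeholders.

```r
# Hypothetical helper: list the companion files expected beside a genotype file.
companion_files <- function(genofile) {
  base <- tools::file_path_sans_ext(genofile)
  ext  <- tolower(tools::file_ext(genofile))
  if (ext == "pgen") {
    paste0(base, c(".pvar", ".psam"))
  } else if (ext == "bed") {
    paste0(base, c(".bim", ".fam"))
  } else {
    stop("expected a .pgen or .bed file: ", genofile)
  }
}

# Placeholder paths, one per chromosome, ordered 1 to 22.
ld_pgenfs <- file.path("ld_ref", paste0("ref_chr", 1:22, ".pgen"))
stopifnot(length(ld_pgenfs) == 22)

# Files that should sit in the same directory as the chromosome 1 file.
companion_files(ld_pgenfs[1])
```

Running `file.exists()` over these companion paths before a long job can catch a missing file early.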

ld_regions

A string representing the population to use for defining LD regions. These LD regions were previously defined by ldetect. The user can also provide custom LD regions matching genotype data, see ld_regions_custom.

ld_regions_version

A string representing the genome reference build ("b37", "b38") to use for defining LD regions. See ld_regions.
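The defaults for `ld_regions` and `ld_regions_version` are character vectors of the allowed values. Assuming the function resolves them with base R's `match.arg()` (an assumption; this page does not say so), leaving an argument unset selects the first value. `pick_population()` below is a hypothetical stand-in used only to show that behaviour.

```r
# Hypothetical stand-in illustrating match.arg() on a vector-valued default.
pick_population <- function(ld_regions = c("EUR", "ASN", "AFR")) {
  match.arg(ld_regions)
}

pick_population()        # returns "EUR", the first allowed value
pick_population("AFR")   # returns "AFR"
```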

ld_regions_custom

A bed format file defining LD regions. The default is NULL; when specified, ld_regions and ld_regions_version will be ignored.

thin

The proportion of SNPs to be used for the parameter estimation and initial fine-mapping steps. Smaller values of thin reduce runtime at the expense of accuracy. The fine-mapping step is rerun using all SNPs for regions with strong gene signals; see rerun_gene_PIP.

prob_single

Blocks whose probability of having at most one effect is greater than prob_single will be used for parameter estimation.

rerun_gene_PIP

If thin < 1, blocks whose maximum gene PIP is greater than rerun_gene_PIP are rerun using all SNPs. If rerun_gene_PIP is 0, all blocks are rerun using all SNPs.
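The rerun rule described above can be sketched as a simple filter. This is an illustration of the stated rule, not the package's internal code; the region names and PIP values are made up.

```r
# Made-up maximum gene PIP per region from a thinned run.
max_gene_pip <- c(region1 = 0.95, region2 = 0.10, region3 = 0.81)
thin <- 0.1
rerun_gene_PIP <- 0.8

# Regions are rerun with all SNPs only when thinning was applied
# and the region's maximum gene PIP exceeds the threshold.
rerun <- if (thin < 1) {
  names(max_gene_pip)[max_gene_pip > rerun_gene_PIP]
} else {
  character(0)
}
rerun
```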

niter1

The number of iterations of the EM algorithm to perform during the initial parameter estimation step.

niter2

The number of iterations of the EM algorithm to perform during the complete parameter estimation step.

L

The number of effects for susie during the fine-mapping steps.

group_prior

A vector of two prior inclusion probabilities for SNPs and genes. This is ignored if estimate_group_prior = TRUE.

group_prior_var

A vector of two prior variances for SNP and gene effects. This is ignored if estimate_group_prior_var = TRUE.

estimate_group_prior

TRUE/FALSE. If TRUE, the prior inclusion probabilities for SNPs and genes are estimated from the data. If FALSE, group_prior must be specified.

estimate_group_prior_var

TRUE/FALSE. If TRUE, the prior variances for SNPs and genes are estimated from the data. If FALSE, group_prior_var must be specified.

use_null_weight

TRUE/FALSE. If TRUE, allow for a probability of no effect in susie.

coverage

A number between 0 and 1 specifying the “coverage” of the estimated credible sets.

max_snp_region

Inf or an integer. The maximum number of SNPs in a region. The default is Inf (no limit). Setting a limit can help when a region contains so many SNPs that the program runs out of memory. This applies only to the final rerun step, in which susie is rerun using all SNPs for regions with strong gene signals.

ncore

The number of cores used to parallelize susie over regions.

ncore.rerun

An integer, the number of cores used to rerun regions with strong signals using all SNPs.

outputdir

A string, the directory in which to store output.

outname

A string, the output name.

logfile

The log file. If NULL, log information is printed to the screen.

merge

TRUE/FALSE. If TRUE, merge regions when a gene spans a region boundary (i.e. belongs to multiple regions).

ld_R_dir

A string pointing to a directory containing all LD matrix files and variant information. Expects .RDS files, each containing the LD correlation matrix for one region/block. For each .RDS file, a file with the same base name but ending in .Rvar must be present in the same folder. The .Rvar file has 5 required columns: "chrom", "id", "pos", "alt", "ref". If using PredictDB format weights and scale_by_ld_variance=T, a 6th column is also required: "variance", which is the variance of each SNP. The order of rows in the .Rvar file must match the order of rows in the .RDS file.
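A minimal sketch of one .RDS/.Rvar pair laid out as described above. The directory, base name, SNP IDs, positions, and correlation values are all made up, and writing the .Rvar file as tab-delimited text with a header row is an assumption.

```r
# Hypothetical LD directory.
ld_dir <- file.path(tempdir(), "ld_R")
dir.create(ld_dir, showWarnings = FALSE)

# Toy 3-SNP LD correlation matrix (made-up values).
R <- matrix(c(1.0, 0.2, 0.0,
              0.2, 1.0, 0.5,
              0.0, 0.5, 1.0), nrow = 3, byrow = TRUE)

# Variant information, one row per row of R, in the same order.
rvar <- data.frame(
  chrom = c(1, 1, 1),
  id    = c("rs0001", "rs0002", "rs0003"),
  pos   = c(10100, 10250, 10900),
  alt   = c("A", "G", "T"),
  ref   = c("G", "A", "C")
)

# Same base name for both files (hypothetical name).
base <- file.path(ld_dir, "toy_chr1_region1")
saveRDS(R, paste0(base, ".RDS"))
write.table(rvar, paste0(base, ".Rvar"), sep = "\t",
            quote = FALSE, row.names = FALSE)

# The number of matrix rows must equal the number of .Rvar rows.
stopifnot(nrow(readRDS(paste0(base, ".RDS"))) == nrow(rvar))
```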