Preprocess PredictDB/FUSION weights and harmonize with LD reference

preprocess_weights(
  weight_file,
  region_info,
  gwas_snp_ids,
  snp_map,
  LD_map = NULL,
  type,
  context,
  weight_name = paste0(context, "_", type),
  weight_format = c("PredictDB", "FUSION"),
  drop_strand_ambig = TRUE,
  filter_protein_coding_genes = TRUE,
  scale_predictdb_weights = TRUE,
  load_predictdb_LD = TRUE,
  include_weight_LD = TRUE,
  fusion_method = c("lasso", "enet", "top1", "blup", "bslmm", "best.cv"),
  fusion_genome_version = NA,
  top_n_snps = NULL,
  LD_format = c("rds", "rdata", "csv", "txt", "custom"),
  LD_loader_fun = NULL,
  snpinfo_loader_fun = NULL,
  varID_converter_fun = NULL,
  ncore = 1,
  logfile = NULL,
  verbose = FALSE
)

Arguments

weight_file

filename of the '.db' file for PredictDB weights; or the directory containing '.wgt.RDat' files for FUSION weights.

region_info

a data frame of region definitions.

gwas_snp_ids

a vector of SNP IDs in GWAS summary statistics (z_snp$id).

snp_map

a list mapping SNPs to regions for the LD reference.

LD_map

a data frame with filenames of LD matrices and SNP information for the regions. Required when load_predictdb_LD = FALSE.

type

a string, specifying QTL type of the weight file, e.g. expression, splicing, protein.

context

a string, specifying context (tissue/cell type) of the weight file, e.g. Liver, Lung, Brain.

weight_name

a string, specifying the name of the weight file. Defaults to paste0(context, "_", type).

weight_format

a string, specifying the format of the weight file: "PredictDB" or "FUSION".

drop_strand_ambig

If TRUE, remove strand-ambiguous variants (A/T, G/C).
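As an illustration of what "strand ambiguous" means here, the following base R sketch flags variants whose two alleles are complementary bases, so the allele pair looks the same on either strand. The helper name is_strand_ambiguous and the toy variant table are hypothetical and are not part of the package.

```r
# Flag strand-ambiguous SNPs: allele pairs made of complementary bases
# (A/T or G/C), which read the same after a strand flip.
# `is_strand_ambiguous` is a hypothetical helper, not a package function.
is_strand_ambiguous <- function(a1, a2) {
  pair <- paste0(toupper(a1), toupper(a2))
  pair %in% c("AT", "TA", "GC", "CG")
}

# Toy variant table (made-up IDs and alleles).
variants <- data.frame(
  id  = c("rs1", "rs2", "rs3", "rs4"),
  ref = c("A", "A", "G", "C"),
  alt = c("T", "G", "C", "T"),
  stringsAsFactors = FALSE
)

ambiguous <- is_strand_ambiguous(variants$ref, variants$alt)
kept <- variants[!ambiguous, ]  # variants that would survive the filter
```

In this toy table, rs1 (A/T) and rs3 (G/C) are flagged, and rs2 and rs4 are kept.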

filter_protein_coding_genes

If TRUE, keep protein coding genes only. This option is only for PredictDB weights.

scale_predictdb_weights

If TRUE, scale PredictDB weights by the variance. This is because PredictDB weights assume that variant genotypes are not standardized, but our implementation assumes standardized variant genotypes. This option is only for PredictDB weights.
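The relationship between weights on the two genotype scales can be checked with a small simulation. This is a sketch of the general rescaling identity, not the package's internal code: it assumes each weight is multiplied by the variant's standard deviation (the square root of its variance), which is what the algebra requires. The genotypes, allele frequencies, and weights are made up.

```r
set.seed(1)
n <- 200

# Toy genotype dosages (0/1/2) for three variants; frequencies are made up.
X <- cbind(rbinom(n, 2, 0.2), rbinom(n, 2, 0.5), rbinom(n, 2, 0.35))

# Weights defined on the unstandardized genotype scale.
w_raw <- c(0.4, -0.25, 0.1)

# Rescale each weight by its variant's standard deviation so that it applies
# to standardized genotypes (assumption: sd = sqrt(variance) is the factor).
sd_x  <- apply(X, 2, sd)
w_std <- w_raw * sd_x

# Standardized genotypes: centered and divided by the standard deviation.
Z <- scale(X)

pred_raw <- as.vector(X %*% w_raw)
pred_std <- as.vector(Z %*% w_std)

# The two predictions differ only by a constant shift, sum(w_raw * colMeans(X)).
shift <- pred_raw - pred_std
```

Because the shift is the same for every individual, the two sets of predictions are perfectly correlated, which is why rescaled weights can stand in for the original ones when genotypes are standardized.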

load_predictdb_LD

If TRUE, load pre-computed LD among weight SNPs. This option is only for PredictDB weights.

include_weight_LD

If TRUE, include LD of variants in weights (R_wgt) in the weights object. R_wgt is used for computing gene Z-scores. If FALSE, skip computing R_wgt, which can save running time when using precomputed gene Z-scores.

fusion_method

a string, specifying which FUSION model to use. The "best.cv" option uses the best model under cross-validation (smallest p-value).
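The "best.cv" rule amounts to picking the model with the smallest cross-validation p-value. The sketch below shows that selection on a toy table; the layout (one column per model, rows named "rsq" and "pval") and all values are assumptions made for illustration.

```r
# Toy cross-validation table: one column per model, rows "rsq" and "pval".
# The layout and the numbers are assumptions for illustration only.
cv_performance <- matrix(
  c(0.08, 1e-4,   # lasso: rsq, pval
    0.11, 2e-6,   # enet:  rsq, pval
    0.05, 3e-3),  # top1:  rsq, pval
  nrow = 2,
  dimnames = list(c("rsq", "pval"), c("lasso", "enet", "top1"))
)

# Pick the model with the smallest cross-validation p-value.
best_model <- names(which.min(cv_performance["pval", ]))
```

With these made-up values the selected model is "enet".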

fusion_genome_version

a string, specifying the genome version of the FUSION models.

top_n_snps

a number, specifying how many top SNPs to include in each weight model. By default (NULL), all SNPs in the weights are used.
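One plausible reading of "top n SNPs" is the n SNPs with the largest absolute weights. The sketch below implements that reading on a toy weight vector; the ranking criterion is an assumption, since it is not stated here, and the SNP names and weights are made up.

```r
# Toy weights for one gene (made-up values).
weights <- c(rs1 = 0.02, rs2 = -0.40, rs3 = 0.15, rs4 = -0.01, rs5 = 0.30)
top_n_snps <- 3

# Assumption: "top" means largest absolute weight.
n_keep <- min(top_n_snps, length(weights))
keep <- order(abs(weights), decreasing = TRUE)[seq_len(n_keep)]
top_weights <- weights[keep]
```

Here the retained SNPs are rs2, rs5, and rs3, in decreasing order of absolute weight.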

LD_format

file format of the LD matrices. If "custom", use a user-defined LD_loader_fun() function to load the LD matrix.

LD_loader_fun

a user-defined function to load the LD matrix when LD_format = "custom".
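A custom loader might look like the sketch below, which reads a headerless tab-separated file into a numeric matrix. The function name my_LD_loader, the file layout, and the assumed signature (one file path in, one matrix out) are illustrative assumptions and are not taken from the package.

```r
# Hypothetical custom loader: takes the path to one LD file and returns a
# numeric matrix. The signature (file path in, matrix out) is an assumption.
my_LD_loader <- function(file) {
  R <- as.matrix(read.table(file, header = FALSE, sep = "\t"))
  dimnames(R) <- NULL
  storage.mode(R) <- "double"
  R
}

# Round-trip check on a toy 2 x 2 LD matrix written to a temporary file.
R_toy <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
tmp <- tempfile(fileext = ".tsv")
write.table(R_toy, tmp, sep = "\t", row.names = FALSE, col.names = FALSE)
R_loaded <- my_LD_loader(tmp)
unlink(tmp)
```

Such a function would then be supplied as LD_loader_fun = my_LD_loader together with LD_format = "custom".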

snpinfo_loader_fun

a user-defined function to load SNP information files, for use when they are not in the standard cTWAS reference format.

varID_converter_fun

a user-defined function to convert weight variant IDs to the reference variant format.
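As a sketch of what such a converter could do, the function below rewrites IDs of the form "chr1_12345_A_G_b38" as "1:12345:A:G". Both ID formats, the function name convert_varID, and the assumed signature (character vector in, character vector of the same length out) are hypothetical; the formats in your weights and reference may differ.

```r
# Hypothetical converter: "chr1_12345_A_G_b38" -> "1:12345:A:G".
# Both ID formats and the vector-in, vector-out signature are assumptions.
convert_varID <- function(varIDs) {
  parts <- strsplit(varIDs, "_", fixed = TRUE)
  vapply(parts, function(p) {
    # p = chromosome, position, first allele, second allele, genome build
    paste(sub("^chr", "", p[1]), p[2], p[3], p[4], sep = ":")
  }, character(1))
}

converted <- convert_varID(c("chr1_12345_A_G_b38", "chr22_67890_C_T_b38"))
```

The toy input yields "1:12345:A:G" and "22:67890:C:T".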

ncore

The number of cores used to parallelize computation.

logfile

The log filename. If NULL, log messages are printed to the screen.

verbose

If TRUE, print detailed messages.

Value

a list of processed weights.