A few issues can potentially lead to problematic results by cTWAS. In this tutorial, we will show how to address these issues by performing post-processing of cTWAS results.

Load the packages.

Let’s first load the cTWAS input data needed to run this tutorial: GWAS sample size (gwas_n), preprocessed GWAS z-scores (z_snp), prediction models of molecular traits (weights), and the reference data: including information about regions (region_info), reference variant list (snp_map).

gwas_n <- 343621

z_snp <- readRDS(system.file("extdata/sample_data", "LDL_example.preprocessed.z_snp.RDS", package = "ctwas"))

weights <- readRDS(system.file("extdata/sample_data", "LDL_example.preprocessed.weights.RDS", package = "ctwas"))

region_info <- readRDS(system.file("extdata/sample_data", "LDL_example.region_info.RDS", package = "ctwas"))

snp_map <- readRDS(system.file("extdata/sample_data", "LDL_example.snp_map.RDS", package = "ctwas"))

When running with LD, we load the cTWAS result from running the ctwas_sumstats() function. We also load reference LD (LD_map) and the directory where we save the correlation matrices.

ctwas_res <- readRDS(system.file("extdata/sample_data", "LDL_example.ctwas_sumstats_res.RDS", package = "ctwas"))
z_gene <- ctwas_res$z_gene
param <- ctwas_res$param
finemap_res <- ctwas_res$finemap_res
boundary_genes <- ctwas_res$boundary_genes
region_data <- ctwas_res$region_data
screen_res <- ctwas_res$screen_res

LD_map <- readRDS(system.file("extdata/sample_data", "LDL_example.LD_map.RDS", package = "ctwas"))

cor_dir <- system.file("extdata/sample_data", "cor_matrix", package = "ctwas")

When running without LD, we load the cTWAS result from running the ctwas_sumstats_noLD() function:

ctwas_res <- readRDS(system.file("extdata/sample_data", "LDL_example.ctwas_sumstats_noLD_res.RDS", package = "ctwas"))
z_gene <- ctwas_res$z_gene
param <- ctwas_res$param
finemap_res <- ctwas_res$finemap_res
boundary_genes <- ctwas_res$boundary_genes
region_data <- ctwas_res$region_data
screen_res <- ctwas_res$screen_res

Cross-region LD

If the variants in a prediction model of a gene (or molecular trait) span two (or more) regions, it would be unclear to cTWAS what region the gene should be assigned to. If this happens, cTWAS will attempt to assign the gene to one of the two regions that contain most of the weights. Nevertheless, there will be a risk of cross-region LD, where the genetic component of this gene may correlate with variants or genes of both regions. This violates the assumption of cTWAS and can lead to false positive findings.

We address this “cross-region” problem by performing “region merging” as a post-processing step. If any gene has variants in the weights that span two or more regions (“cross-boundary”), those regions will be merged, and cTWAS will rerun the fine-mapping step on the merged regions.

We first select the genes to merge. This could be all genes whose weights span two or more regions. In practice, the genes that are unlikely to be causal for the phenotype generally wouldn’t cause problems. So we can limit to the cross-boundary genes that are plausible risk genes. The results of cTWAS contain the list of genes with cross-boundary weights, boundary_genes. We can then select among these genes, the ones above a PIP cutoff and in credible sets.

high_PIP_finemap_gene_res <- subset(finemap_res, group != "SNP" & susie_pip > 0.5 & !is.na(cs))
high_PIP_genes <- unique(high_PIP_finemap_gene_res$id)

selected_boundary_genes <- boundary_genes[boundary_genes$id %in% high_PIP_genes, , drop=FALSE]

Merge regions

Next, we use the function merge_region_data() to perform region merging. It first identifies overlapping regions from the selected genes, then creates the data about the merged region(s) needed for fine-mapping. The data includes: merged_region_data, merged_region_info, and the variant and LD information of the merged region(s), merged_snp_map and merged_LD_map. The function expands the merged regions with all SNPs if the region_data contains thinned SNPs. It also creates merged_region_id_map, a data frame containing region IDs for merged regions and the original region IDs, so that one could keep track of the regions that were merged.

merge_region_res <- merge_region_data(selected_boundary_genes,
                                      region_data,
                                      region_info = region_info,
                                      LD_map = LD_map,
                                      snp_map = snp_map,
                                      weights = weights,
                                      z_snp = z_snp,
                                      z_gene = z_gene,
                                      estimate_L = TRUE,
                                      maxSNP = 20000)
merged_region_data <- merge_region_res$merged_region_data
merged_region_info <- merge_region_res$merged_region_info
merged_LD_map <- merge_region_res$merged_LD_map
merged_snp_map <- merge_region_res$merged_snp_map
merged_region_id_map <- merge_region_res$merged_region_id_map
merged_region_L <- merge_region_res$merged_region_L

When estimate_L = TRUE, we estimate the number of credible sets (merged_region_L) for the merged regions. This information will be used later in the fine-mapping step. We could set maxSNP to limit the number of SNPs in the region, if merged regions contain too many SNPs,

Next, we run fine-mapping again for these merged regions, similar to the section “Fine-mapping screened regions” in the tutorial “Running cTWAS analysis”.

res <- finemap_regions(merged_region_data,
                       LD_map = merged_LD_map,
                       weights = weights,
                       group_prior = group_prior,
                       group_prior_var = group_prior_var,
                       L = merged_region_L,
                       save_cor = TRUE,
                       cor_dir = "./cor_matrix")
merged_region_finemap_res <- res$finemap_res
merged_region_susie_alpha_res <- res$susie_alpha_res

The output of this function has the same format as those from regular cTWAS analysis. It returns fine-mapping results for merged regions.

Merge regions without LD

When running cTWAS without LD, we can do region merging using the function merge_region_data_noLD().

merge_region_res <- merge_region_data_noLD(selected_boundary_genes,
                                           region_data,
                                           region_info = region_info,
                                           snp_map = snp_map,
                                           z_snp = z_snp,
                                           z_gene = z_gene,
                                           maxSNP = 20000)
merged_region_data <- merge_region_res$merged_region_data
merged_region_info <- merge_region_res$merged_region_info
merged_snp_map <- merge_region_res$merged_snp_map
merged_region_id_map <- merge_region_res$merged_region_id_map

We can then run fine-mapping (without LD) on the merged region data:

res <- finemap_regions_noLD(merged_region_data,                                                     
                            group_prior = group_prior,
                            group_prior_var = group_prior_var)
merged_region_finemap_res <- res$finemap_res
merged_region_susie_alpha_res <- res$susie_alpha_res

Dealing with LD mismatch

LD mismatch between GWAS data (in-sample LD) and the reference LD could lead to false positives in fine-mapping. Diagnostic tools including SuSiE-RSS, and DENTIST, have been developed to check possible LD mismatch. Because it is very time consuming to run the LD mismatch diagnosis for all the regions across the genome, we will perform LD mismatch diagnosis and adjustment only for selected regions with high PIP signals in the post-processing.

Here, we perform the LD mismatch diagnosis using SuSiE-RSS for selected regions. It is an optional step that users could run after finishing the main cTWAS analysis. Users could choose regions of interest.

For example, we could use the function compute_region_nonSNP_PIPs() to compute total non-SNP PIPs for the regions. The non-SNP PIP of a region is the sum of PIPs of all molecular traits in that region. We also limit to credible sets by setting filter_cs = TRUE. We can then select regions with total non-SNP PIPs > 0.8 to run LD mismatch diagnosis:

nonSNP_PIPs <- compute_region_nonSNP_PIPs(finemap_res, filter_cs = TRUE)
selected_region_ids <- names(nonSNP_PIPs)[nonSNP_PIPs > 0.8]

We use the function diagnose_LD_mismatch_susie() to perform LD mismatch diagnosis for these regions. This function uses SuSiE-RSS to detect problematic SNPs among all variants in a region. Basically, it infers the expected statistic of a variant, based on the statistics of nearby variants and their LD from the reference, and compares this with the observed statistic. Inconsistency between the two would suggest potential LD mismatch.

res <- diagnose_LD_mismatch_susie(region_ids = selected_region_ids, 
                                  z_snp = z_snp, 
                                  LD_map = LD_map, 
                                  gwas_n = gwas_n,
                                  p_diff_thresh = 5e-8,
                                  ncore = 4)
problematic_snps <- res$problematic_snps
flipped_snps <- res$flipped_snps
condz_stats <- res$condz_stats

This returns a list of problematic SNPs (diagnostic test p-value < 5e-8, by default), flipped SNPs, and the test statistics of kriging_rss function from SuSiE-RSS.

Our basic strategy of dealing with LD mismatch is: we use the list of problematic SNPs to identify the molecular traits whose results may be affected by LD-mismatch. We would then run SuSiE fine-mapping with L = 1 in the regions containing these molecular traits, assuming a single causal signal in such a region. The fine-mapping results in this setting would be independent of LD.

We choose the genes with some plausibility of being risk genes (gene PIP > 0.5 or abs(Z-score) > 3, by default) and problematic SNPs in their weights. We would then select regions containing these problematic genes.

problematic_genes <- get_problematic_genes(problematic_snps, 
                                           weights, 
                                           finemap_res,
                                           pip_thresh = 0.5,
                                           z_thresh = 3)

problematic_region_ids <- unique(finemap_res[finemap_res$id %in% problematic_genes, "region_id"])

finemap_res[finemap_res$id %in% problematic_genes,]

We then rerun the fine-mapping without LD information for the problematic regions, using the finemap_regions_noLD() function:

if (length(problematic_region_ids) > 0) {
  rerun_region_data <- screen_res$screened_region_data[problematic_region_ids]
  finemap_noLD_res <- finemap_regions_noLD(rerun_region_data, 
                                           group_prior = group_prior,
                                           group_prior_var = group_prior_var)
}

We can check the fine-mapping results (without LD) for the problematic genes:

finemap_noLD_res[finemap_noLD_res$id %in% problematic_genes,]