In this tutorial, we will show how to summarize and visualize cTWAS results.

Load the packages.

library(ctwas)
library(EnsDb.Hsapiens.v86)

We need the Ensembl database EnsDb.Hsapiens.v86 (for hg38) in this tutorial, please choose the version of Ensembl database for your specific data (e.g. EnsDb.Hsapiens.v75 for hg19).

Let’s first load the cTWAS input data needed to run this tutorial: GWAS sample size (gwas_n), prediction models of molecular traits (weights), the reference data: including information about regions (region_info), reference variant list (snp_map) and reference LD (LD_map).

gwas_n <- 343621

weights <- readRDS(system.file("extdata/sample_data", "LDL_example.preprocessed.weights.RDS", package = "ctwas"))

region_info <- readRDS(system.file("extdata/sample_data", "LDL_example.region_info.RDS", package = "ctwas"))

snp_map <- readRDS(system.file("extdata/sample_data", "LDL_example.snp_map.RDS", package = "ctwas"))

LD_map <- readRDS(system.file("extdata/sample_data", "LDL_example.LD_map.RDS", package = "ctwas"))

We then load the cTWAS result for the sample data (chr16) using parameters estimated from the entire genome.

ctwas_res <- readRDS(system.file("extdata/sample_data", "LDL_example.ctwas_sumstats_res_param_allchrs.RDS", package = "ctwas"))
z_gene <- ctwas_res$z_gene
param <- ctwas_res$param
finemap_res <- ctwas_res$finemap_res
susie_alpha_res <- ctwas_res$susie_alpha_res
boundary_genes <- ctwas_res$boundary_genes
region_data <- ctwas_res$region_data
screen_res <- ctwas_res$screen_res

ctwas_sumstats() returns several main output:

  • z_gene: the Z-scores of molecular traits,

  • param: the estimated parameters,

  • finemap_res: fine-mapping results, a data frame of molecular traits and variants, including their cTWAS IDs (“id”), molecular ID (“molecular_ id”), the regions they belong to (“region_id”), Z-scores (“z”), PIPs (“susie_pip”), credible set indices (“cs_index”), and estimated effect size variance (“mu2”).

The IDs of molecular traits (“id”) used in cTWAS is in the format of <molecular_ id|weight_name>. “molecular_id” are the IDs originally used in the PredictDB or FUSION weights for genes or other molecular traits, such as Ensembl gene IDs. weight_name is the user defined weight name when preprocessing weights, see the tutorial “Preparing cTWAS input data”).

It also produces other output:

  • susie_alpha_res: a data frame with finemapping results of molecular traits and the single effect probabilities (alpha) in all credible sets. This will be used when computing combined gene PIPs below.

  • boundary_genes: the genes/molecular traits that cross region boundaries (we will need these in the post-processing step).

  • region_data: assembled region data, including Z scores of SNPs and molecular traits, for all the regions. Note that only data of the subset of SNPs used in screening regions are included.

  • screen_res: screening regions results, including the data of all SNPs and molecular traits of the selected regions, estimated numbers of causal signals (L) for selected regions, and a data frame with estimated L and non-SNP PIPs for all regions.

We focus on examples from running cTWAS with LD here, but the steps below also mostly work for running without LD.

Assessing parameters and computing PVE

First, we make plots using the function make_convergence_plots() to see how estimated parameters converge during the execution of the program:

make_convergence_plots(param, gwas_n)
&nbsp;

 

These plots show the estimated prior inclusion probability, prior effect size variance, enrichment and proportion of variance explained (PVE) over the iterations of parameter estimation. The enrichment is defined as the ratio of the prior inclusion probability of molecular traits over the prior inclusion probability of variants. We generally expect molecular traits to have higher prior inclusion probability than variants. Enrichment values typically range from 20 - 100 for expression traits.

Then, we use summarize_param() to obtain estimated parameters (from the last iteration) and to compute the PVE by variants and molecular traits.

ctwas_parameters <- summarize_param(param, gwas_n)
ctwas_parameters
## $group_size
##   liver|expression adipose|expression                SNP 
##               8775              10538            7556260 
## 
## $group_prior
##   liver|expression adipose|expression                SNP 
##       0.0128593248       0.0011512596       0.0001901423 
## 
## $group_prior_var
##   liver|expression adipose|expression                SNP 
##           27.68179           27.68179           10.08116 
## 
## $enrichment
##   liver|expression adipose|expression 
##          67.630014           6.054727 
## 
## $group_pve
##   liver|expression adipose|expression                SNP 
##       0.0090903336       0.0009773407       0.0421518279 
## 
## $total_pve
## [1] 0.0522195
## 
## $prop_heritability
##   liver|expression adipose|expression                SNP 
##         0.17407928         0.01871601         0.80720471

This function returns several output: group_prior are the estimated prior inclusion probabilities for molecular traits and variants; group_prior_var are estimated prior effect size variance for molecular traits and variants; enrichment are estimated enrichment of molecular traits over variants, defined as ratios of prior inclusion probabilities. Other output parameters are related to the proportion of variance explained (PVE) by molecular traits or variants: group_pve are PVE of molecular traits and variants; total_pve is the total PVE from all molecular traits and variants (sum of group_pve) - basically heritability from all included variables (effectively all common variants); prop_heritability are the proportion of heritability mediated by molecular traits and variants.

These PVE values allow us to compute how heritability of the phenotype is partitioned among variants and groups of molecular traits. From our experience, in single-tissue gene expression analysis, genes typically explain around 5-15% of the total heritability of the trait. In multi-group analysis, the heritability of genes could become lower.

Inspecting and summarizing the cTWAS results

We can add p-values (computed from Z-scores) to the fine-mapping results.

finemap_res$pval <- z2p(finemap_res$z)
head(finemap_res)
##                                    id       molecular_id       type context
## 1 ENSG00000103494.12|liver_expression ENSG00000103494.12 expression   liver
## 2 ENSG00000177508.11|liver_expression ENSG00000177508.11 expression   liver
## 3 ENSG00000087245.12|liver_expression ENSG00000087245.12 expression   liver
## 4 ENSG00000087253.12|liver_expression ENSG00000087253.12 expression   liver
## 5 ENSG00000103546.18|liver_expression ENSG00000103546.18 expression   liver
## 6 ENSG00000198848.12|liver_expression ENSG00000198848.12 expression   liver
##              group            region_id          z    susie_pip      mu2   cs
## 1 liver|expression 16_53348660_55869862 -0.4550967 1.524699e-06 1.158057 <NA>
## 2 liver|expression 16_53348660_55869862 -0.3912807 1.485472e-06 1.107746 <NA>
## 3 liver|expression 16_53348660_55869862  0.5636824 1.608305e-06 1.261103 <NA>
## 4 liver|expression 16_53348660_55869862 -1.3879522 3.495492e-06 2.759558 <NA>
## 5 liver|expression 16_53348660_55869862  0.5507643 1.597171e-06 1.247693 <NA>
## 6 liver|expression 16_53348660_55869862 -0.6042867 1.645537e-06 1.305278 <NA>
##        pval
## 1 0.6490397
## 2 0.6955898
## 3 0.5729703
## 4 0.1651516
## 5 0.5817952
## 6 0.5456531

We can view the molecular traits above a certain PIP cutoff, e.g. PIP > 0.8. We also generally recommend limiting the results to molecular traits within credible sets.

subset(finemap_res, group != "SNP" & susie_pip > 0.8 & !is.na(cs))
##                                          id       molecular_id       type
## 7899  ENSG00000087237.10|adipose_expression ENSG00000087237.10 expression
## 17549    ENSG00000261701.6|liver_expression  ENSG00000261701.6 expression
##       context              group            region_id         z susie_pip
## 7899  adipose adipose|expression 16_55869862_57630418  13.74984 0.9996853
## 17549   liver   liver|expression 16_71020125_72901251 -18.40315 1.0000000
##            mu2 cs         pval
## 7899  177.0699 L1 5.104261e-43
## 17549 430.9403 L1 1.239454e-75

Below we describe how to aggregate the information from molecular traits targeting the same genes, to compute the evidence of a gene being causal (denoted as gene PIPs below). Additionally, we will show how to add annotations (e.g. gene names and positions) to the cTWAS results that may be needed in downstream analysis.

Computing gene PIPs

When we have multiple contexts or different types of molecular traits, it is useful to evaluate the total evidence of a gene being causal, combining evidence of all the molecular traits affecting this gene across all types and contexts.

We use the function combine_gene_pips() to compute combined PIPs. It groups molecular traits targeting the same gene (we use group_by to select the column to group molecular traits). It then computes combined PIPs across contexts (by = "context", default option), types (by = "type"), or both (by = "group").

We have several ways of combining PIPs:

  • “combine_cs” (default option): it first sums PIPs of molecular traits of a genes in each credible set, and then combine PIPs using the following formula: \(1 - \prod_k (1 - \text{PIP}_k)\), where \(\text{PIP}_k\) is the summed PIP of the \(k\)-th credible set of a gene.
  • “sum”: sum over PIPs of all molecular traits for the same gene. This summation is the expected number of causal molecular traits in this gene, and could be higher than 1.

For example, when using gene expression (eQTL) data from multiple tissues, we can use the function to compute combined PIPs across contexts, using the “combine_cs” method. By default, we limit the results to molecular traits within credible sets (filter_cs = TRUE). We also report the credible set information of each molecular trait in the output (include_cs_id = TRUE), which allows us to understand how gene PIPs are obtained.

combined_pip_by_context <- combine_gene_pips(susie_alpha_res, 
                                             group_by = "molecular_id",
                                             by = "context",
                                             method = "combine_cs",
                                             filter_cs = TRUE,
                                             include_cs_id = TRUE)
subset(combined_pip_by_context, combined_pip > 0.8)
##         molecular_id          combined_cs_id combined_pip
## 1  ENSG00000261701.6 16_71020125_72901251.L1    1.0000000
## 2 ENSG00000087237.10 16_55869862_57630418.L1    0.9996853
##               liver_cs_id liver_pip           adipose_cs_id adipose_pip
## 1 16_71020125_72901251.L1         1                    <NA>          NA
## 2                    <NA>        NA 16_55869862_57630418.L1   0.9996853

In this example, we have gene IDs in the “molecular_id” column, so we could use that column to group genes. If the gene IDs are not provided, we will need to map molecular traits to their target genes in order to compute gene PIPs, as described below.

Adding gene annotations to cTWAS results

In the example above, we have Ensembl gene IDs in cTWAS results. It is often helpful to add additional gene information such as gene names, gene types (protein coding, non-coding RNAs, etc.) and gene positions.

We have a helper function get_gene_annot_from_ens_db(), which extracts gene annotations from the Ensembl database for a list of genes. It takes an Ensembl database and a list of Ensembl gene IDs as input, and returns gene_annot, a data frame with gene IDs, gene names, gene types, and gene positions for the genes.

ens_db <- EnsDb.Hsapiens.v86
finemap_gene_res <- subset(finemap_res, group != "SNP")
gene_ids <- unique(finemap_gene_res$molecular_id)
gene_annot <- get_gene_annot_from_ens_db(ens_db, gene_ids)
colnames(gene_annot)[colnames(gene_annot) == "gene_id"] <- "molecular_id"
head(gene_annot)
##         molecular_id gene_name      gene_type chrom    start      end
## 1 ENSG00000103494.12  RPGRIP1L protein_coding    16 53597683 53703938
## 2 ENSG00000177508.11      IRX3 protein_coding    16 54283304 54286763
## 3 ENSG00000087245.12      MMP2 protein_coding    16 55389700 55506691
## 4 ENSG00000087253.12    LPCAT2 protein_coding    16 55508998 55586670
## 5 ENSG00000103546.18    SLC6A2 protein_coding    16 55655604 55706192
## 6 ENSG00000198848.12      CES1 protein_coding    16 55802851 55833337

Note: we used the gene annotations from EnsDb.Hsapiens.v86 for the example data in hg38. You should use the Ensembl database for the genome build of your own data (e.g. EnsDb.Hsapiens.v75 for hg19).

Basically, gene_annot serves as a map between Ensembl gene IDs and the corresponding genes. Using this map, we could use the function anno_finemap_res() to add additional columns of gene annotations and positions of genes and variants to the fine-mapping results.

finemap_res <- anno_finemap_res(finemap_res,
                                snp_map = snp_map,
                                mapping_table = gene_annot,
                                add_gene_annot = TRUE,
                                map_by = "molecular_id",
                                drop_unmapped = TRUE,
                                add_position = TRUE,
                                use_gene_pos = "mid")
## 2024-11-20 14:15:52 INFO::Annotating fine-mapping result ...
## 2024-11-20 14:15:53 INFO::Map molecular traits to genes
## 2024-11-20 14:15:54 INFO::Add gene positions
## 2024-11-20 14:15:54 INFO::Add SNP positions

If add_gene_annot = TRUE, it joins the fine-mapping result table (finemap_res) with the gene_annot table, by the “molecular ids” column (set in map_by), and adds additional columns of gene annotations to the fine-mapping results By default, we drop the genes not found in gene_annot (drop_unmapped = TRUE). If add_position = TRUE, it adds positional information to the fine-mapping results. The positional information of variants can be found in the data structure of reference SNPs: snp_map. The users need to provide positional information of molecular traits, through mapping_table. In our example, we use gene_annot, which has gene positions. By default, we will use the midpoint positions for genes. You could also use “start” or “end” positions by setting the use_gene_pos argument.

With the added gene information, we could limit results to protein coding genes, and view the prioritized genes with additional information:

subset(finemap_res, group != "SNP" & gene_type == "protein_coding" & susie_pip > 0.8 & !is.na(cs))
##                                        id       molecular_id       type context
## 51  ENSG00000087237.10|adipose_expression ENSG00000087237.10 expression adipose
## 111    ENSG00000261701.6|liver_expression  ENSG00000261701.6 expression   liver
##                  group            region_id         z susie_pip      mu2 cs
## 51  adipose|expression 16_55869862_57630418  13.74984 0.9996853 177.0699 L1
## 111   liver|expression 16_71020125_72901251 -18.40315 1.0000000 430.9403 L1
##             pval gene_name      gene_type chrom    start      end      pos
## 51  5.104261e-43      CETP protein_coding    16 56961850 56983845 56972848
## 111 1.239454e-75       HPR protein_coding    16 72063224 72077246 72070235

Integrating multiple types of molecular traits

For more complex settings that involve different types of molecular traits, we will need to map molecular traits to their corresponding genes in order to compute gene PIPs.

To map molecular traits to their corresponding genes, we need a data frame mapping_table, with IDs of molecular traits (“molecular_id”), and the corresponding gene names (“gene_name”). It is helpful to include additional gene information, such as gene types and gene positions. We provide a mapping_table to map introns to genes for PredictDB expression and slicing data here. The mapping file is also available on UChicago RCC server: /project2/xinhe/shared_data/multigroup_ctwas/weights/mapping_files/PredictDB_mapping.RDS .

For example:

mapping_table <- readRDS("/project2/xinhe/shared_data/multigroup_ctwas/weights/mapping_files/PredictDB_mapping.RDS")

head(mapping_table[mapping_table$gene_name == "UBR1",])
##                       molecular_id gene_name      gene_type chrom    start
## 310222          ENSG00000159459.11      UBR1 protein_coding    15 42942897
## 173881 intron_15_43038232_43043215      UBR1 protein_coding    15 43038232
## 173882 intron_15_43043395_43047161      UBR1 protein_coding    15 43043395
## 173883 intron_15_43047289_43048392      UBR1 protein_coding    15 43047289
## 173884 intron_15_43048491_43054742      UBR1 protein_coding    15 43048491
## 173885 intron_15_43054899_43056344      UBR1 protein_coding    15 43054899
##             end
## 310222 43106113
## 173881 43043215
## 173882 43047161
## 173883 43048392
## 173884 43054742
## 173885 43056344

Using this mapping table, we can use the anno_finemap_res() function to add additional columns to the fine-mapping results. If a molecular trait is mapped to multiple genes, it splits the PIP of that molecular trait to its target genes. We use gene_annot below as the mapping table between expression traits and the corresponding gene names.

finemap_res <- anno_finemap_res(finemap_res,
                                snp_map = snp_map,
                                mapping_table = gene_annot,
                                add_gene_annot = TRUE,
                                map_by = "molecular_id",
                                drop_unmapped = TRUE,
                                add_position = TRUE,
                                use_gene_pos = "mid")

Similarly, we use the function anno_susie_alpha_res() to add additional columns to susie_alpha_res.

susie_alpha_res <- anno_susie_alpha_res(susie_alpha_res,
                                        mapping_table = gene_annot,
                                        map_by = "molecular_id",
                                        drop_unmapped = TRUE)
## 2024-11-20 14:15:57 INFO::Annotating susie alpha result ...
## 2024-11-20 14:15:57 INFO::Map molecular traits to genes

We could use the combine_gene_pips() function to compute gene PIPs across different types of molecular traits. Here we use group_by = "gene_name", because molecular traits are mapped to their corresponding genes by the “gene_name” column.

combined_pip_by_type <- combine_gene_pips(susie_alpha_res,
                                          group_by = "gene_name",
                                          by = "type",
                                          method = "combine_cs",
                                          filter_cs = TRUE,
                                          include_cs_id = TRUE)
subset(combined_pip_by_type, combined_pip > 0.8)
##   gene_name          combined_cs_id combined_pip        expression_cs_id
## 1       HPR 16_71020125_72901251.L1    1.0000000 16_71020125_72901251.L1
## 2      CETP 16_55869862_57630418.L1    0.9996853 16_55869862_57630418.L1
##   expression_pip
## 1      1.0000000
## 2      0.9996853

Visualizing the cTWAS results

It is often useful to visualize the results of cTWAS. This would help one understand the rationale of how cTWAS chooses a particular molecular trait and identify issues when cTWAS doesn’t behave properly.

We make locus plots to visualize the association of variants and molecular traits, cTWAS PIP results, and other information, such as QTLs and gene annotations. We illustrate this with an example below.

We use the make_locusplot() function to make locus plots for regions of interest. We need the fine-mapping result with position information, region ID, Ensembl gene annotation database (ens_db) for plotting the gene track, and optionally, preprocessed weights (weights) for plotting the QTL track.

make_locusplot(finemap_res,
               region_id = "16_71020125_72901251",
               ens_db = ens_db,
               weights = weights,
               highlight_pip = 0.8,
               filter_protein_coding_genes = TRUE,
               filter_cs = TRUE,
               color_pval_by = "cs",
               color_pip_by = "cs")
## 2024-11-20 14:15:57 INFO::Limit to protein coding genes
## 2024-11-20 14:15:57 INFO::focal id: ENSG00000261701.6|liver_expression
## 2024-11-20 14:15:57 INFO::focal molecular trait: HPR liver expression
## 2024-11-20 14:15:57 INFO::Range of locus: chr16:71020348-72900542
## 2024-11-20 14:15:57 INFO::focal molecular trait QTL positions: 72063820,72063928
## 2024-11-20 14:15:58 INFO::Limit PIPs to credible sets
&nbsp;

 

The locus plot shows several tracks in the example region. The top one shows -log10(p-value) of the association of variants (from GWAS) and molecular traits (from the package computed z-scores) with the phenotype. The next track shows the PIPs of variants and molecular traits. By default, we limit PIP results to credible sets in the PIP track (filter_cs = TRUE). The next track shows the QTLs of the focal gene. By default, it chooses the molecular trait with the highest PIP (“HPR” in this case). We could specify the focal gene by setting the focal_id to the “ID” of interest, or focal_gene to the gene name of interest. If there are multiple molecular traits or contexts of the focal gene, it will automatically use the one with the highest PIP. The bottom is the gene track. We limit results to protein coding genes by default. We can draw a red dotted line to show the PIP cutoff of 0.8 by setting highlight_pip = 0.8.

There are a few ways to color the data points in the p-value and PIP tracks, by setting color_pval_by and color_pip_by options. “cs” would color the data points by credible sets (default option); “LD” would color data points by correlations with the focal gene; “none” would use the same colors for all data points, except for the focal gene.

To color data points by correlations with the focal gene, We need correlation matrices among molecular traits and SNPs. If we have saved correlation matrices in cor_dir when running cTWAS with LD, we could simply load those precomputed correlation matrices of the region using the function load_region_cor().

region_id <- "16_71020125_72901251"
cor_dir <- system.file("extdata/sample_data", "cor_matrix", package = "ctwas")
cor_res <- load_region_cor(region_id, cor_dir = cor_dir)

If we don’t have precomputed correlation matrices, we could compute the correlation matrices of the region using the function get_region_cor(). We will need the region data with all SNPs (screened_region_data), LD_map and weights.

screened_region_data <- screen_res$screened_region_data
region_id <- "16_71020125_72901251"
cor_res <- get_region_cor(region_id,
                          sids = screened_region_data[[region_id]]$sid,
                          gids = screened_region_data[[region_id]]$gid,
                          LD_map = LD_map,
                          weights = weights)

The locus plot above shows the whole region. We could zoom in a region of interest by specifying the locus_range argument. In the example below, we zoom in the region, color the p-value track by correlations with the focal gene, and color the PIP track by credible sets. We could also highlight positions of interest by setting highlight_pos, for example, we highlight the position of HPR gene below.

make_locusplot(finemap_res,
               region_id = "16_71020125_72901251",
               ens_db = ens_db, 
               weights = weights,
               locus_range = c(71.6e6,72.4e6),
               highlight_pip = 0.8,
               highlight_pos = 72070235,
               R_snp_gene = cor_res$R_snp_gene,
               R_gene = cor_res$R_gene,
               filter_protein_coding_genes = TRUE,
               filter_cs = TRUE,
               color_pval_by = "LD",
               color_pip_by = "cs")
## 2024-11-20 14:16:04 INFO::Limit to protein coding genes
## 2024-11-20 14:16:04 INFO::focal id: ENSG00000261701.6|liver_expression
## 2024-11-20 14:16:04 INFO::focal molecular trait: HPR liver expression
## 2024-11-20 14:16:04 INFO::Range of locus: chr16:71600000-72400000
## 2024-11-20 14:16:04 INFO::focal molecular trait QTL positions: 72063820,72063928
## 2024-11-20 14:16:04 INFO::Limit PIPs to credible sets
## 2024-11-20 14:16:04 INFO::highlight positions: 72070235
&nbsp;

 

In this plot, one can see that there are several genes with good associations with the phenotype (top panel), but only HPR in liver is prioritized by cTWAS (second panel). This suggests that other associations are likely due to their correlations with the prioritized genes rather than representing independent signals.

For the “no-LD” version, we could still make the locus plot. The difference is that it could not color the data points by correlations with the focal gene.

Sample report of the cTWAS results

We present a sample cTWAS report based on real data analysis. The analyzed trait is LDL cholesterol, the prediction models are liver gene expression and splicing models trained on GTEx v8 in the PredictDB format.