vignettes/gene_mapping_tutorial.Rmd
gene_mapping_tutorial.Rmd
Despite our fine-mapping efforts, there remained considerable uncertainty of causal variants in most loci. Even if the causal variants are known, assigning target genes can be difficult due to long-range regulation of enhancers.
Our gene mapping procedure prioritizes target genes:
For every putative causal SNP, we assign a weight to each nearby gene, considering multiple ways the SNP may affect a gene. The weight of a gene can be viewed as the probability that the SNP affects that gene.
For each gene, we then aggregate the causal evidence of all SNPs likely targeting this gene, expressed as the weighted sum of the PIPs of all these SNPs. To ensure that the causal evidence of a variant is not counted multiple times when it targets multiple genes, we normalize the SNP-to-gene weights in this calculation. The resulting “gene PIP” approximates the probability of a gene being causal. Similar to variant-level fine-mapping, we also define a “credible gene set”, the set of genes that capture the causal signal at a locus with high probability.
The weights of SNP-gene pairs reflect the strength of biological evidence linking SNPs to genes. For a SNP in an exon/promoter or in a regulatory region linked to a particular gene (“enhancer loops”), we assign a weight of 1 to that gene.
When a SNP cannot be linked to any gene in these ways, its target genes are assigned using a distance weighted function so that nearby genes receive higher weights.
See the Methods section in our paper for details.
In this tutorial, we show the gene mapping procedure using the fine-mapping result from our AFib study.
Our gene mapping procedure takes the following input data:
Load R packages
Here we use fine-mapping summary statistics from our AFib study.
Keep SNPs with PIP > 1e-5, rename columns, and save as a GRanges object.
finemapstats <- readRDS(system.file("extdata", "AF.finemapping.sumstats.rds", package = "mapgen"))
finemapstats <- process_finemapping_sumstats(finemapstats,
snp = 'snp',
chr = 'chr',
pos = 'pos',
pip = 'susie_pip',
pval = 'pval',
zscore = 'zscore',
cs = 'cs',
locus = 'locus',
pip.thresh = 1e-5)
## Processing fine-mapping summary statistics ...
## Warning: replacing previous import 'utils::download.file' by
## 'restfulr::download.file' when loading 'rtracklayer'
## Select SNPs with PIP > 1e-05
# For simplicity, only keep the following columns in the steps below
cols.to.keep <- c('snp','chr','pos', 'pip', 'pval', 'zscore', 'cs', 'locus')
finemapstats <- finemapstats[, cols.to.keep]
head(finemapstats, 3)
## GRanges object with 3 ranges and 8 metadata columns:
## seqnames ranges strand | snp chr pos pip
## <Rle> <IRanges> <Rle> | <character> <integer> <integer> <numeric>
## [1] chr1 10161265 * | rs189262553 1 10161265 0.005541892
## [2] chr1 10167425 * | rs187585530 1 10167425 0.031227600
## [3] chr1 10727192 * | rs191213578 1 10727192 0.000159041
## pval zscore cs locus
## <numeric> <numeric> <numeric> <numeric>
## [1] 7.89245 -5.69091 0 7
## [2] 8.11250 -5.77273 0 7
## [3] 4.53313 -4.17653 0 7
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
To link SNPs to genes, we use a hierarchical approach to assign the following genomic annotation categories:
You can download gene annotations (GTF file) from GENCODE, then make gene annotations using the GTF file.
gtf_file <- '/project2/xinhe/shared_data/gencode/gencode.v19.annotation.gtf.gz'
genomic.annots <- make_genomic_annots(gtf_file)
We included gene annotations (hg19) in the package, downloaded from GENCODE release 19.
genomic.annots <- readRDS(system.file("extdata", "genomic.annots.hg19.rds", package = "mapgen"))
gene.annots <- genomic.annots$genes
OCRs <- readRDS(system.file("extdata", "OCRs.hg19.gr.rds", package = "mapgen"))
In this example, we include exons and active promoters in this category for gene mapping. We defined active promoters as promoters overlapping with OCRs (overlap at least 100 bp).
active_promoters <- IRanges::subsetByOverlaps(genomic.annots$promoters,
OCRs,
minoverlap = 100)
genomic.annots$exons_promoters <- list(exons = genomic.annots$exons,
active_promoters = active_promoters)
If you don’t have OCR data, you can simply use exons and promoters.
genomic.annots$exons_promoters <- list(exons = genomic.annots$exons,
promoters = genomic.annots$promoters)
We need chromatin loop data, such as PC-HiC or ABC scores, with the following columns, and save as a GRanges object.
Promoter-capture HiC (PC-HiC)
Here we use promoter-capture HiC (PC-HiC) data from cardiomyocytes (CMs). You may skip this if you do not have relevant PC-HiC data.
pcHiC <- readRDS(system.file("extdata", "pcHiC.CM.gr.rds", package = "mapgen"))
pcHiC <- pcHiC[pcHiC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes
head(pcHiC, 3)
## GRanges object with 3 ranges and 4 metadata columns:
## seqnames ranges strand | promoter_start promoter_end
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 10056320-10057012 * | 10002213 10004089
## [2] chr1 10062800-10063261 * | 10002213 10004089
## [3] chr1 10085205-10085681 * | 10002213 10004089
## gene_name score
## <character> <numeric>
## [1] NMNAT1 6.33
## [2] NMNAT1 7.37
## [3] NMNAT1 7.28
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
ABC scores
The process_ABC()
function processes a data frame of ABC scores from Nasser et al. Nature 2021, and converts it to a GRanges object.
Here we use ABC scores from heart ventricle (from Nasser et al. Nature 2021). You may skip this if you do not have relevant ABC scores.
ABC <- data.table::fread(system.file("extdata", "heart_ventricle-ENCODE_ABC.tsv.gz", package = "mapgen"))
ABC <- process_ABC(ABC, full.element = TRUE)
ABC <- ABC[ABC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes
head(ABC, 3)
## GRanges object with 3 ranges and 4 metadata columns:
## seqnames ranges strand | promoter_start promoter_end gene_name
## <Rle> <IRanges> <Rle> | <integer> <integer> <character>
## [1] chr1 888243-888743 * | 894679 894679 NOC2L
## [2] chr1 908361-908861 * | 895966 895966 KLHL17
## [3] chr1 908361-908861 * | 901876 901876 PLEKHN1
## score
## <numeric>
## [1] 0.015224
## [2] 0.017673
## [3] 0.041100
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
Here we define enhancer regions by OCRs. You can also use histone mark data to define enhancer regions.
genomic.annots$enhancer_regions <- OCRs[OCRs$peakType!="Promoter",]
Considering the fact that PC-HiC and HiC loops may miss contacts between close regions due to technical reasons, here we also consider enhancer regions within 20 kb of active promoters as “nearby interactions”.
nearby20kb <- nearby_interactions(genomic.annots$enhancer_regions,
active_promoters,
max.dist = 20000)
We include ABC, PC-HiC and nearby interactions in the “enhancer loop” category.
genomic.annots$enhancer_loops <- list(ABC = ABC, pcHiC = pcHiC, nearby20kb = nearby20kb)
summary(genomic.annots)
## Length Class Mode
## exons 195913 GRanges S4
## introns 189762 GRanges S4
## UTRs 50889 GRanges S4
## promoters 19430 GRanges S4
## genes 19430 GRanges S4
## exons_promoters 2 -none- list
## enhancer_regions 327014 GRanges S4
## enhancer_loops 3 -none- list
Parameters:
TRUE
, assign intronic SNPs to genes containing the introns.
gene.mapping.res <- compute_gene_pip(finemapstats,
genomic.annots,
intron.mode = FALSE,
d0 = 50000,
exon.weight = 1,
loop.weight = 1)
## Map SNPs to genes and assign weights ...
## Assign SNPs in exons and promoters ...
## Assign SNPs to linked genes through enhancer loops ...
## Enhancer loops include: ABC, pcHiC, nearby20kb
## Assign SNPs in enhancer regions to genes by distance weighting ...
## Assign SNPs in UTRs (excluding enhancer regions) to the UTR genes ...
## Assign the rest of SNPs (intergenic) to genes by distance weighting ...
## Compute gene PIP ...
## Done.
head(gene.mapping.res)
## snp chr pos pip pval zscore cs locus gene_name
## 1 rs6658920 1 22257739 3.991827e-03 6.40472 5.057971 0 15 HSPG2
## 2 rs6658920 1 22257739 3.991827e-03 6.40472 5.057971 0 15 USP48
## 3 rs213734 1 41575699 4.560662e-05 3.56067 -3.611940 0 25 SCMH1
## 4 rs2793374 1 205647508 6.666840e-05 4.70421 4.268657 0 106 SLC45A3
## 5 rs12047038 1 205718468 4.506310e-04 4.26074 -4.040541 0 106 NUCKS1
## 6 rs7576657 2 60779478 1.328678e-05 3.19620 -3.432432 0 173 BCL11A
## category weight frac_pip gene_pip
## 1 ABC,nearby20kb 1 0.3333333 0.494801999
## 2 ABC 1 0.3333333 0.001727402
## 3 ABC 1 1.0000000 0.075123247
## 4 ABC 1 1.0000000 0.017269657
## 5 ABC,nearby20kb 1 1.0000000 0.554064134
## 6 ABC,nearby20kb 1 1.0000000 0.000439048
Gene-level result table
gene.pip.res <- extract_gene_level_result(gene.mapping.res, gene.annots)
head(gene.pip.res)
## chr pos gene_name gene_pip
## 1 chr1 22263790 HSPG2 0.494801999
## 2 chr1 22110099 USP48 0.001727402
## 3 chr1 41707826 SCMH1 0.075123247
## 4 chr1 205649587 SLC45A3 0.017269657
## 5 chr1 205719404 NUCKS1 0.554064134
## 6 chr2 60780702 BCL11A 0.000439048
cat(sprintf("%d genes with PIP >= 0.8",
length(gene.pip.res$gene_name[gene.pip.res$gene_pip >= 0.8])))
## 46 genes with PIP >= 0.8
80% credible gene sets
## locus gene_cs gene_cs_locus_pip top_gene top_locus_gene_pip
## 1 7 CASZ1 CASZ1(0.882) CASZ1 0.882
## 2 15 HSPG2,CELA3B HSPG2(0.495),CELA3B(0.486) HSPG2 0.495
## 3 25 SLFNL1,CTPS1 SLFNL1(0.625),CTPS1(0.259) SLFNL1 0.625
## 4 31 BEND5 BEND5(0.997) BEND5 0.997
## 5 32 FAF1,CDKN2C FAF1(0.41),CDKN2C(0.405) FAF1 0.410
## 6 33 TTC39A,EPS15 TTC39A(0.447),EPS15(0.38) TTC39A 0.447
## top_gene_pip
## 1 0.882
## 2 0.495
## 3 0.625
## 4 0.997
## 5 0.426
## 6 0.448
label genes with gene PIP > 0.8.
gene_manhattan_plot(gene.pip.res, gene.pip.thresh = 0.8)
Note: in some cases where a gene spans two nearby LD blocks and/or is linked to SNPs in multiple credible sets (L > 1), the gene PIP may exceed 1, which can be interpreted as the expected number of causal variants targeting the gene (see Methods of our paper for details).