Tutorial for data preparation

In this tutorial, we prepare input data for the analysis.

Load the package.

library(mapgen)

GWAS summary statistics

We need a data frame of GWAS summary statistics with the following columns:

chr: chromosome
pos: position
beta: effect size (if you have Odds Ratio, you will need to transform it to log(Odds Ratio))
se: standard error (SE)
a0: reference allele
a1: association/effect allele
snp: SNP ID (e.g. rsID)
pval: p-value

Let’s load a small example GWAS dataset (test.sumstats.txt.gz), and process the data (see below).

We can use the vroom package to read the GWAS summary statistics.

library(vroom)
gwas.file <- system.file('extdata', 'test.sumstats.txt.gz', package='mapgen')
sumstats <- vroom(gwas.file, col_names = TRUE, show_col_types = FALSE)
head(as.data.frame(sumstats),3)

##                           var_name                              phase3_1kg_id
## 1 1_10616_CCGCCGTTGCAAAGGCGCGCCG_C rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C
## 2                      1_11008_C_G                                1:11008:C:G
## 3                      1_11012_C_G                                1:11012:C:G
##   chr position_b37                     a0 a1 bcac_onco2_r2
## 1   1        10616 CCGCCGTTGCAAAGGCGCGCCG  C      0.837659
## 2   1        11008                      C  G      0.580657
## 3   1        11012                      C  G      0.580657
##   bcac_onco2_eaf_controls bcac_onco2_beta bcac_onco2_se bcac_onco2_P1df_Wald
## 1                  0.9925         0.00009       0.05784             0.998700
## 2                  0.0979        -0.04365       0.02002             0.029280
## 3                  0.0979        -0.04365       0.02002             0.029279
##   bcac_onco2_P1df_LRT bcac_icogs2_r2       rsIDs
## 1            0.998700       0.262460 rs376342519
## 2            0.029368       0.335425           1
## 3            0.029367       0.335425           1

Reference data

Fine-mapping requires linkage-disequilibrium (LD) information. You could use either reference genotype panel or LD matrices for the LD blocks.

LD blocks

We need a data frame defining LD blocks, with four columns: chromosome, start, end positions, and the indices of the LD blocks.

We provided the LD blocks from LDetect for the 1000 Genomes (1KG) European population (in hg19).

LD_Blocks <- readRDS(system.file('extdata', 'LD.blocks.EUR.hg19.rds', package='mapgen'))
head(LD_Blocks, 3)

##   chr   start     end locus
## 1   1   10583 1892607     1
## 2   1 1892607 3582736     2
## 3   1 3582736 4380811     3

Reference panel

To use the reference genotype panel, we utilize the R package bigsnpr to read in PLINK files (bed/bim/fam) into R and match alleles between GWAS summary statistics and reference genotype panel.

If you have reference genotypes in PLINK format, you can use bigsnpr::snp_readBed() to read bed/bim/fam files into a bigSNP object and save as a ‘.rds’ file. This ‘.rds’ file could be used for downstream analyses when attached with .

We provided a bigSNP object of the reference genotype panel from the 1000 Genomes (1KG) European population. If you are in the He lab at UChicago, you can load the bigSNP object from RCC as below.

library(bigsnpr)
bigSNP <- snp_attach(rdsfile = '/project2/xinhe/1kg/bigsnpr/EUR_variable_1kg.rds')

LD matrices

We also have pre-computed LD matrices of European samples from UK Biobank. They can be downloaded here. If you are at UChicago, you can directly access those LD matrices from RCC at /project2/mstephens/wcrouse/UKB_LDR_0.1_b37/.

We create a data frame region_info, with filenames of UKBB reference LD matrices and SNP info adding to the LD_Blocks.

region_info <- get_UKBB_region_info(LD_Blocks,
                                    LDREF.dir = "/project2/mstephens/wcrouse/UKB_LDR_0.1_b37", 
                                    prefix = "ukb_b37_0.1")

* You can skip the LD matrices if you only need to run enrichment analysis, or if you only need to run downstream analysis (e.g. gene mapping) with precomputed fine-mapping result.

Process GWAS summary statistics

We run process_gwas_sumstats() to process the summary statistics. This checks and cleans GWAS summary statistics, add locus ID from the LD_Blocks if available, match alleles in GWAS SNPs with the reference panel from the bigSNP object if bigSNP is specified, or harmonize GWAS SNPs with the reference LD information from the precopmuted LD matrices if region_info is specified.

gwas.sumstats <- process_gwas_sumstats(sumstats, 
                                       chr='chr', 
                                       pos='position_b37', 
                                       beta='bcac_onco2_beta', 
                                       se='bcac_onco2_se',
                                       a0='a0', 
                                       a1='a1', 
                                       snp='phase3_1kg_id', 
                                       pval='bcac_onco2_P1df_Wald',
                                       LD_Blocks=LD_Blocks,
                                       bigSNP=bigSNP)

Check that the summary statistics is processed and has appropriate columns:

head(as.data.frame(gwas.sumstats),3)