vignettes/data_preparation_tutorial.Rmd
data_preparation_tutorial.Rmd
In this tutorial, we prepare input data for the analysis.
Load the package.
We need a data frame of GWAS summary statistics with the following columns:
Let’s load a small example GWAS dataset (test.sumstats.txt.gz
), and process the data (see below).
We can use the vroom
package to read the GWAS summary statistics.
library(vroom)
gwas.file <- system.file('extdata', 'test.sumstats.txt.gz', package='mapgen')
sumstats <- vroom(gwas.file, col_names = TRUE, show_col_types = FALSE)
head(as.data.frame(sumstats),3)
## var_name phase3_1kg_id
## 1 1_10616_CCGCCGTTGCAAAGGCGCGCCG_C rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C
## 2 1_11008_C_G 1:11008:C:G
## 3 1_11012_C_G 1:11012:C:G
## chr position_b37 a0 a1 bcac_onco2_r2
## 1 1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.837659
## 2 1 11008 C G 0.580657
## 3 1 11012 C G 0.580657
## bcac_onco2_eaf_controls bcac_onco2_beta bcac_onco2_se bcac_onco2_P1df_Wald
## 1 0.9925 0.00009 0.05784 0.998700
## 2 0.0979 -0.04365 0.02002 0.029280
## 3 0.0979 -0.04365 0.02002 0.029279
## bcac_onco2_P1df_LRT bcac_icogs2_r2 rsIDs
## 1 0.998700 0.262460 rs376342519
## 2 0.029368 0.335425 1
## 3 0.029367 0.335425 1
Fine-mapping requires linkage-disequilibrium (LD) information. You could use either reference genotype panel or LD matrices for the LD blocks.
We need a data frame defining LD blocks, with four columns: chromosome, start, end positions, and the indices of the LD blocks.
We provided the LD blocks from LDetect
for the 1000 Genomes (1KG) European population (in hg19).
LD_Blocks <- readRDS(system.file('extdata', 'LD.blocks.EUR.hg19.rds', package='mapgen'))
head(LD_Blocks, 3)
## chr start end locus
## 1 1 10583 1892607 1
## 2 1 1892607 3582736 2
## 3 1 3582736 4380811 3
To use the reference genotype panel, we utilize the R package bigsnpr
to read in PLINK files (bed/bim/fam) into R and match alleles between GWAS summary statistics and reference genotype panel.
If you have reference genotypes in PLINK format, you can use bigsnpr::snp_readBed() to read bed/bim/fam
files into a bigSNP object and save as a ‘.rds’ file. This ‘.rds’ file could be used for downstream analyses when attached with .
We provided a bigSNP
object of the reference genotype panel from the 1000 Genomes (1KG) European population. If you are in the He lab at UChicago, you can load the bigSNP
object from RCC as below.
library(bigsnpr)
bigSNP <- snp_attach(rdsfile = '/project2/xinhe/1kg/bigsnpr/EUR_variable_1kg.rds')
We also have pre-computed LD matrices of European samples from UK Biobank. They can be downloaded [here][UKBB_LD]. If you are at UChicago, you can directly access those LD matrices from RCC at /project2/mstephens/wcrouse/UKB_LDR_0.1_b37/
.
We create a data frame region_info
, with filenames of UKBB reference LD matrices and SNP info adding to the LD_Blocks.
region_info <- get_UKBB_region_info(LD_Blocks,
LDREF.dir = "/project2/mstephens/wcrouse/UKB_LDR_0.1_b37",
prefix = "ukb_b37_0.1")
* You can skip the LD matrices if you only need to run enrichment analysis, or if you only need to run downstream analysis (e.g. gene mapping) with precomputed fine-mapping result.
We run process_gwas_sumstats()
to process the summary statistics. This checks and cleans GWAS summary statistics, add locus ID from the LD_Blocks
if available, match alleles in GWAS SNPs with the reference panel from the bigSNP
object if bigSNP
is specified, or harmonize GWAS SNPs with the reference LD information from the precopmuted LD matrices if region_info
is specified.
gwas.sumstats <- process_gwas_sumstats(sumstats,
chr='chr',
pos='position_b37',
beta='bcac_onco2_beta',
se='bcac_onco2_se',
a0='a0',
a1='a1',
snp='phase3_1kg_id',
pval='bcac_onco2_P1df_Wald',
LD_Blocks=LD_Blocks,
bigSNP=bigSNP)
Check that the summary statistics is processed and has appropriate columns:
head(as.data.frame(gwas.sumstats),3)
## chr pos a0 a1 beta se snp pval zscore
## 1 1 13110 A G -0.02066 0.03231 1:13110:G:A 0.52245 -0.6394305
## 2 1 13116 G T -0.01255 0.01895 rs201725126:13116:T:G 0.50772 -0.6622691
## 3 1 13118 G A -0.01255 0.01895 rs200579949:13118:A:G 0.50772 -0.6622691
## locus ss_index bigSNP_index
## 1 1 3 4
## 2 1 4 5
## 3 1 5 6