The pipeline is implemented in SoS, as shown in the rest of this notebook. It can be executed from this notebook directly on a local or remote computer (with caveats; see below). For more information see the SoS website.
Software used by the pipeline is provided via `docker` images. This is me being lazy to avoid installing these tools on a local computer (which should not take much time anyway). But some cluster systems, such as UChicago RCC, do not support `docker`. Also, I did not configure cluster job queues with the `task` option here, so this notebook is intended to run on a local workstation as is, though it would take only minimal changes to run it in a distributed fashion on a cluster, given the required software is installed there.
`bedtools` is needed by `QuASAR` and thus cannot be loaded from a docker image; it is installed from the standard software repository (Debian's `apt-get install bedtools`).
To run the pipeline:
nb=analysis/20180311_ASE_Calling_WASP_QuASAR.ipynb
sos run $nb hg19_reference
sos run $nb obtain_samples
sos run $nb default
You can use `-s force` to force a redo of everything, or `-s build` to skip steps whose output files already exist (checked merely by file name).
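For example, after a first complete run, either of these reruns the main workflow while reusing or discarding existing results:
sos run $nb default -s build   # trust existing output files, only run what is missing
sos run $nb default -s force   # ignore existing results and redo everything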
Other than using a formal pipeline tool, the differences are:
- `STAR` and `WASP` command parameters: I used a global boolean parameter `paired_end` to control for paired-end vs. single-end input.
- the `quasar` input conversion step.
A list of cautions learned from running the pipeline:
The `STAR` aligner is fast (compared to `tophat`) but requires over 30GB of memory; a desktop with 32GB of RAM will not work for the `STAR` steps, and I actually had to use a 64GB desktop. To save memory, one has to use sparse mode for `STAR` (with the caveats pointed out here), which works with under 16GB of memory. This is implemented by the global parameter `sparse_star = True`.
`STAR`, when it fails (due to memory or otherwise), fails silently.
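Because of that, it is worth sanity-checking the BAMs after each `STAR` step. A minimal check could look like the following (it assumes `samtools` is available on the host and the default `qual_cutoff` of 10, so align_1 outputs match samples/*.qual10.bam):
# flag empty or truncated BAMs produced by a silently failed STAR run
for bam in samples/*.qual10.bam; do
    samtools quickcheck -v "$bam" || echo "possibly failed: $bam"
    ls -lh "$bam"
done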
Building the `STAR` genome database takes a long time, but the result is portable as long as one changes the absolute paths in genomeParameters.txt to relative ones (i.e., remove the directories from the file names and keep only the basenames). This is what I did, so the resource prepared by the [star] workflow (below) can be copied and used on other machines.
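As a sketch of that conversion (one possible way to do it; it assumes the paths recorded in genomeParameters.txt contain no whitespace, and keeps a backup copy first):
cd hg19   # the resource_dir defined in the [global] section below
cp genomeParameters.txt genomeParameters.txt.bak
# strip the directory part of every absolute path, keeping only the basename
sed -i 's|/[^[:space:]]*/||g' genomeParameters.txt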
[global]
cwd = path('~/Documents/m6A/Data/ASE')
parameter: ncpu = 20
# Resource files
resource_dir = f"{cwd:a}/hg19"
ref_fa = "hg19.fa"
ref_gtf = 'Homo_sapiens.GRCh38.91.gtf.gz'
wasp_dir = path(f"/opt/WASP/mapping")
samtools_snp_list = f"{resource_dir}/020717_filtered_maf5_blacklist.final.bed"
# Sample files
sample_dir = f"{cwd:a}/samples"
paired_end = True
sparse_star = False
from collections import OrderedDict
## A list of sample names (keys) and their corresponding FASTQ files (values)
# samples = OrderedDict({'ENCLB279NMT': ['ENCFF824TZM', 'ENCFF176JNE'],
# 'IGF2BP2': ['ENCFF307AUO', 'ENCFF563WIX'],
# 'FTO': ['ENCFF374PSI', 'ENCFF877URW'],
# 'HNRNPC': ['ENCFF035QUK', 'ENCFF065BIC']})
samples = OrderedDict({"IGF2BP2": ['ENCFF307AUO', 'ENCFF563WIX'],
"FTO": ['ENCFF374PSI', 'ENCFF877URW'],
"HNRNPC": ['ENCFF035QUK', 'ENCFF065BIC']})
fastq = paths([[f"{sample_dir}/{s}/{q}.fastq.gz" for q in samples[s]] for s in samples])
[default]
sos_run('align+call')
[hg19_reference_1 (download)]
# Download `hg19.2bit` and `twoBitToFa` from {ucsc_url}
ucsc_url = "http://hgdownload.cse.ucsc.edu"
output: f"{resource_dir}/hg19.2bit", f"{resource_dir}/twoBitToFa"
download: dest_dir = resource_dir, expand = True
{ucsc_url}/goldenPath/hg19/bigZips/hg19.2bit
{ucsc_url}/admin/exe/linux.x86_64/twoBitToFa
[hg19_reference_2 (decompress hg19.fa)]
# Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
output: f"{resource_dir}/{ref_fa}"
bash: expand = True
chmod +x {_input[1]}
{_input[1]} {_input[0]} {_output}
[hg19_reference_3 (gene annotations)]
# Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
# https://useast.ensembl.org/info/data/ftp/index.html
ensembl_ftp = 'ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/'
output: f"{resource_dir}/{ref_gtf}"
download: dest_dir = resource_dir, expand = True
{ensembl_ftp}/{ref_gtf}
[star: provides = file_target(f"{resource_dir}/genomeParameters.txt")]
# Quite time & resource consuming (3hrs, 32GB memory if sparse_star is False)
depends: Py_Module('docker')
output: f"{resource_dir}/genomeParameters.txt"
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand=True
STAR --runMode genomeGenerate \
--genomeDir {resource_dir} \
--genomeFastaFiles {resource_dir}/{ref_fa} \
--sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} \
--runThreadN {ncpu} {'--genomeSAsparseD 2' if sparse_star else ''}
[quasar: provides = file_target(f'{resource_dir}/convertPileupToQuasar.R')]
depends: R_library("QuASAR@piquelab/QuASAR")
output: f'{resource_dir}/convertPileupToQuasar.R'
download: dest_file = f'{resource_dir}/convertPileupToQuasar.R'
https://raw.githubusercontent.com/piquelab/QuASAR/master/scripts/convertPileupToQuasar.R
run: expand = True
# patch the downloaded script: use zcat instead of less to read input
perl -pi.bak -e "s/less/zcat/" {_output}
FIXME: add description -- what are these samples?
[obtain_samples]
# Download samples from ENCODE
# https://www.encodeproject.org/experiments/ENCSR384KAN/
encode_url = 'https://www.encodeproject.org/files'
input: for_each = 'fastq', concurrent = True
output: fastq, group_by = 1
download: dest_dir = f'{_fastq:d}', expand = True
{encode_url}/{_fastq:bnn}/@@download/{_fastq:b}
(workflow chart prepared by Yanyu)
%preview ../writeup/wasp.png -n
Align with `STAR`, followed by `samtools` to remove reads with mapping quality below the given cutoff.
[align_1 (STAR prefiltering alignment)]
parameter: qual_cutoff = 10
depends: file_target(f"{resource_dir}/genomeParameters.txt")
input: fastq, group_by = 2 #, concurrent = True
output: [f"{sample_dir}/{x}.qual{qual_cutoff}.bam" for x in samples], group_by = 1
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand=True
rm -rf {_output:n}_STARtmp && \
STAR --genomeDir {resource_dir} \
--readFilesIn {_input[0]} {_input[1]} \
--readFilesCommand zcat \
--outTmpDir {_output:n}_STARtmp/ \
--runThreadN {ncpu} --outStd BAM_SortedByCoordinate \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} |
samtools view -bq {qual_cutoff} > {_output}
Remap reads after `WASP` adjustment. `WASP` uses a pre-defined list of SNPs and removes mapping bias caused by them. The list was generated by Yanyu using 1000 Genomes SNPs in VCF format, filtered by MAF, as input to the WASP/mapping/extract_vcf_snps.sh script. Here I just take this pre-compiled list from Yanyu.
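For reference, a rough sketch of producing such a SNP list directory in the format `find_intersecting_snps.py --snp_dir` expects (one chr<N>.snps.txt.gz file per chromosome with three columns: position, ref allele, alt allele). This is illustrative only, is not run by the pipeline, and the per-chromosome VCF file names are hypothetical:
# not part of this pipeline -- the pre-compiled list from Yanyu is used instead
mkdir -p hg19/wasp_snp_list
for vcf in 1kg_maf5_chr*.vcf.gz; do                  # hypothetical per-chromosome VCFs
    chrom=$(basename "$vcf" .vcf.gz | sed 's/.*_//') # e.g. chr22
    zcat "$vcf" | grep -v '^#' | awk '{print $2, $4, $5}' | gzip \
        > hg19/wasp_snp_list/${chrom}.snps.txt.gz
done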
[align_2 (WASP intersecting SNP): shared = {'wasp_split': 'step_output'}]
# WASP finding unbiased reads intersecting with SNP
input: group_by = 1, concurrent = True
output: f"{_input:n}.remap.fq1.gz", f"{_input:n}.remap.fq2.gz", f"{_input:n}.to.remap.bam", f"{_input:n}.keep.bam"
bash: workdir = f'{cwd:a}', docker_image = 'gaow/wasp', volumes = [f'{resource_dir}/wasp_snp_list:{wasp_dir:d}/examples'], expand = True
python {wasp_dir}/find_intersecting_snps.py {_input} \
--snp_dir {wasp_dir:d}/examples \
--is_sorted {'--is_paired_end' if paired_end else ''}
[align_3 (STAR post alignment)]
# Align the WASP remap reads with STAR
# Followed by samtools to remove reads with quality less than {qual_cutoff}
parameter: qual_cutoff = 10
input: group_by = 4, pattern = '{name}.{qual}.remap.{ext}' #, concurrent = True
output: expand_pattern(f'{_name[0]}.remapped.qual{qual_cutoff}.bam')
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand=True
STAR --genomeDir {resource_dir} \
--readFilesIn {_input[0]} {_input[1]} \
--readFilesCommand zcat \
--runThreadN {ncpu} --outStd BAM_SortedByCoordinate \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} |
samtools view -bq {qual_cutoff} > {_output}
[align_4 (WASP remove ambiguously mapped reads)]
# align_2 produced 4 outputs per sample; take the 3rd one (*.to.remap.bam) for each
to_remap = paths([wasp_split[i:i+4][2] for i in range(0, len(wasp_split), 4)])
input: group_by = 1, paired_with = 'to_remap', pattern = '{name}.remapped.{ext}', concurrent = True
output: expand_pattern(f'{_name[0]}.keep_remapped.{_ext[0]}')
bash: workdir = f'{cwd:a}', docker_image = 'gaow/wasp', expand=True, stderr = f'{_output:n}.log'
python {wasp_dir}/filter_remapped_reads.py {_to_remap} {_input} {_output}
[align_5 (Merge WASP adjusted and originally kept BAM)]
# align_2 produced 4 outputs per sample; take the 4th one (*.keep.bam) for each
kept = paths([wasp_split[i:i+4][3] for i in range(0, len(wasp_split), 4)])
input: group_by = 1, paired_with = 'kept', pattern = '{name}.keep_remapped.{ext}', concurrent = True
output: expand_pattern(f'{_name[0]}.wasp_remapped.{_ext[0]}')
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand=True
samtools merge - {_input} {_kept} | samtools sort -o {_output}
samtools index {_output}
[align_6 (WASP remove duplicate reads)]
input: group_by = 1, pattern = '{name}.bam', concurrent = True
output: expand_pattern(f'{_name[0]}.dedup.bam')
bash: workdir = f'{cwd:a}', docker_image = 'gaow/wasp', expand=True, stderr = f'{_output:n}.log'
python {wasp_dir}/{'rmdup_pe.py' if paired_end else 'rmdup.py'} {_input} {_output}
QuASAR calling
This step extracts count information from the BAM files based on the SNP list and prepares it in `QuASAR` input format, as instructed in the `QuASAR` documentation. Here we need the list of SNPs in BED format and the reference genome `fasta` file. Yanyu has previously prepared this BED file; here I just grab and use it.
Note: the `awk` command from Yanyu is different from the one in the instructions above; using the instructed command, very few variants are left ...
[call_1 (samtools pileup)]
# A standard samtools mpileup call
input: group_by = 1, pattern = '{name}.bam', concurrent = True
output: expand_pattern(f'{_name[0]}.pileup.gz')
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand=True
samtools sort {_input} | samtools mpileup -f {resource_dir}/{ref_fa} -l {samtools_snp_list} - | gzip > {_output}
[call_2 (awk & bedtools filter pileup)]
# Remove bad pileup (indels) and intersect only with list of SNPs provided
input: group_by = 1, pattern = '{name}.pileup.gz', concurrent = True
output: expand_pattern(f'{_name[0]}.pileup.bed.gz')
run: workdir = f'{cwd:a}', docker_image = 'gaow/debian-ngs', expand = '${ }'
# less input.pileup.gz | awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | sortBed -i stdin | intersectBed -a stdin -b snps.af.bed -wo | cut -f 1-7,11-14 | gzip > input.pileup.bed.gz
zcat ${_input} | \
awk -v OFS='\t' '{ if ($4>0 && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | \
sortBed -i stdin | intersectBed -a stdin -b ${samtools_snp_list} -wo | \
cut -f 1-7,11-14 | gzip > ${_output}
[call_3 (convert to quasar format)]
input: group_by = 1, pattern = '{name}.pileup.bed.gz', concurrent = True
output: expand_pattern(f'{_name[0]}.quasar.in.gz')
bash: workdir = f'{_input:d}', expand = True, stdout = f'{_output:n}.log'
R --slave --args {_input} < {resource_dir}/convertPileupToQuasar.R
The `QuASAR` example workflow assumes multiple replicates per sample, as does Yanyu's 2017 analysis. Here we only have one replicate per sample.
[call_4 (quasar genotyping and ASE)]
depends: f'{resource_dir}/convertPileupToQuasar.R'
input: group_by = 1, concurrent = True
output: f"{_input:nn}.genotype.tsv", f"{_input:nn}.ase.rds"
R: workdir = f'{cwd:a}', expand = '${ }', stdout = f'{_input:nn}.out.log'
library('QuASAR')
ase.dat <- UnionExtractFields(${_input:r}, combine=T)
ase.dat.gt <- PrepForGenotyping(ase.dat, min.coverage=5)
sample.names <- colnames(ase.dat.gt$ref)
ase.joint <- fitAseNullMulti(as.matrix(ase.dat.gt$ref), as.matrix(ase.dat.gt$alt), log.gmat=log(ase.dat.gt$gmat))
out_gt <- data.frame(ase.dat.gt$annotations[, -5], map=ase.joint$gt)
out_ase <- aseInference(gts=ase.joint$gt, eps.vect=ase.joint$eps, priors=ase.dat.gt$gmat, ref.mat=as.matrix(ase.dat.gt$ref), alt.mat=as.matrix(ase.dat.gt$alt),
min.cov=10, sample.names=sample.names, annos=ase.dat.gt$annotations)
write.table(out_gt, file=${_output[0]:r}, row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
saveRDS(out_ase, ${_output[1]:r})
The outputs of the alignment and calling pipelines are `*.genotype.tsv` for genotypes and `*.ase.rds` for ASE.
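Not part of the pipeline, but a quick way to inspect these outputs from the shell (assuming they land under samples/ per the sample_dir setting above, and that R is available for the .rds files):
# list the final outputs
ls samples/*.genotype.tsv samples/*.ase.rds
# the genotype table is tab-separated and written without a header
head -n 3 samples/*.genotype.tsv
# summarize the structure of each saved ASE inference object
Rscript -e 'for (f in commandArgs(TRUE)) { cat(f, "\n"); str(readRDS(f), max.level = 1) }' samples/*.ase.rds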
Exported from analysis/20180311_ASE_Calling_WASP_QuASAR.ipynb
committed by Min Qiao on Tue May 8 16:57:41 2018 revision 26, 1eb517c