ChIP-seq coverage analysis
It has been a long time since I wrote this post, and in the meantime a wonderful set of tools for coverage calculation and visualization has been published: deepTools does metagene analysis, and much, much more, better than I could possibly do with my naïve scripts.
Try it. It is worth it.
The problem
There are a few analysis tasks for which one would think a solution, or an easy-to-follow example, is readily available on the web. One such task is coverage analysis of ChIP-seq, or any other data for that matter. That is, how are the reads distributed around a region of interest, commonly the TSS?
HTSeq, the Python library for NGS data analysis, offers that, but I prefer to do my plots in R and ggplot. I also find Bioconductor nice, but some of the objects and the package documentation are hard to understand for someone looking for eminently practical information. So I decided to ask around in my institute and create my own script. I like to write scripts that can be easily modified and that generate plots that can be used with little or no change for presentations.
Create the BED file with TSSs
First, how to create the BED file with the TSSs using a combination of Bioconductor and bedtools - it is quite fast and very well documented. This only needs to be run once, and there is a myriad of ways to do it. Of course, one could also use any BED file with features of interest.
######################
### Load libraries ###
library(GenomicRanges)
library(GenomicFeatures)
library(plyr)
library(reshape)
library(data.table)
library(stringr) # for str_detect below

### custom functions ##
# string concatenation without space gaps
concat <- function(...) paste(..., sep="")

################################
## create TSS bed file (UCSC) ##
# create a database of transcripts
hg19RefGenes <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene") # run only once
hg19RefGenes # summary of the object
saveDb(hg19RefGenes, file='~/Homo_sapiens/hg19_RefSeq_genes.sqlite')
# by saving the database one keeps a record of which version of the genome was used for the analysis. Very important for reproducibility.
hg19RefGenes <- loadDb(file="~/Homo_sapiens/hg19_RefSeq_genes.sqlite") # only needed if the script has to be run again later

# create an object with a list of transcripts grouped by gene
transcripts <- transcriptsBy(hg19RefGenes, by="gene")

# convert to data frame:
gene2tx <- as.data.frame(transcripts)

# Note that transcription_start is always smaller than transcription_end, even when the transcript is on the "-" strand. Hence, we have to use either the start or the end coordinate of the transcript, depending on the strand, to get the actual transcription start sites, i.e., the 5' ends of the transcripts:
# http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Practicals/TSS/TSS_plot.pdf
tss <- ifelse(gene2tx$strand == "+", gene2tx$start, gene2tx$end)
tss.bed <- data.frame(seqnames = gene2tx$seqnames, start = tss, end = tss, strand = gene2tx$strand, transcriptID = gene2tx$tx_name, type = rep("tss", length(tss)))

## just keep proper chromosomes
tss.bed <- subset(tss.bed, str_detect(seqnames, "^[[:alnum:]]*$"))
with(tss.bed, as.data.frame(table(seqnames))) # list the frequency of genes per chromosome

# save bed file
write.table(tss.bed, file = "~/Homo_sapiens/hg19RefSeqTSS.bed", row.names = FALSE,
            col.names = FALSE, quote = FALSE, sep = "\t")

## use bedtools to increase the region around the tss
system('
chromInfo=~/Homo_sapiens/UCSC/hg19/Annotation/Genes/ChromInfo.txt
bedtools slop -i ~/Homo_sapiens/hg19RefSeqTSS.bed -g $chromInfo -b 1000 > ~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
')
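As an aside, if you would rather avoid bedtools, GenomicRanges can produce similar windows directly in R. A minimal sketch, assuming the hg19RefGenes database created above (I have not checked that the coordinates match the bedtools output exactly, and the output file name is just an example):

# sketch of a pure-R alternative to the bedtools slop step;
# promoters() is strand-aware and returns a window around the
# 5' end of each transcript
tx <- transcripts(hg19RefGenes) # GRanges with one entry per transcript
tss.window <- promoters(tx, upstream = 1000, downstream = 1000)
# rtracklayer can write the GRanges straight to a BED file
library(rtracklayer)
export(tss.window, "~/Homo_sapiens/hg19RefSeqTSS_1000kb_R.bed")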
Counting the number of reads per nucleotide
Because I run this over several BAM files and I like to keep the file names tidy, I've wrapped the bedtools call in a shell function so that the several analyses can run in parallel to save time. Because this step takes a lot of time, it sends me an email at the end. This way I can do something else without constantly checking on it. Hint: running it in a screen session also saves a lot of headaches. If you are doing this for a single BAM file, or are not versed in shell scripting, all you need to run is this:
coverageBed -d -abam bamFile -b geneModel > name.cov.bed
But in all likelihood you will have several bam files:
#################################################################################
# This little shell script calculates coverage per base at any given annotation #
# wrapper shell function that does the coverage calculations and returns the files with nice names
coverageAtFeatures(){
    # the input is a bam file passed as the first argument (see below)
    bamFile=$1
    name=$(basename ${bamFile%%.bam}) # to keep the directory tidy I rename the results files using the original bam file
    geneModel=$2 # the bed file, also passed as an argument
    outDir=results/coverage # defines the output directory (make sure it exists)
    coverageBed -d -abam $bamFile -b $geneModel > $outDir/$name.cov.bed # actually performs the coverage calculation.
    # note the -d, very important!
}
export -f coverageAtFeatures # this makes the function available to the subshells spawned by parallel

# Features:
promoters=~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
# bam folder:
bam_folder=~/chip_seq/data/reads/

# now we can run the function defined above:
# ls will list all the bam files in the folder and pass them to parallel.
# Each bam file will be the first argument of coverageAtFeatures;
# the second argument will be the bed file with the genomic regions of interest.
# Parallel will run the analysis for 4 bams simultaneously - this number can vary depending on your computer.
parallel -i -j 4 bash -c "coverageAtFeatures {} $promoters" -- $(ls $bam_folder/*bam)

# when all is done a message will arrive in your inbox.
echo "Subject: Coverage done" | sendmail -v myname@mymail.com > /dev/null
At the end of this we will have a file that contains something like:
head chip_experiment.cov.bed
chr1 19922471 19924471 + NM_001204088 tss 1 1
chr1 19922471 19924471 + NM_001204088 tss 2 1
chr1 19922471 19924471 + NM_001204088 tss 3 1
chr1 19922471 19924471 + NM_001204088 tss 4 1
chr1 19922471 19924471 + NM_001204088 tss 5 1
chr1 19922471 19924471 + NM_001204088 tss 6 0
chr1 19922471 19924471 + NM_001204088 tss 7 0
chr1 19922471 19924471 + NM_001204088 tss 8 0
chr1 19922471 19924471 + NM_001204088 tss 9 0
chr1 19922471 19924471 + NM_001204088 tss 10 0
Columns 1-6 are the original entries of the BED file containing the TSSs, followed by the position of the base within the feature (remember, this is coverage at base-pair resolution) and the number of reads that mapped to that base.
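Before moving on, a quick optional sanity check in R (the file name is the one from the example above): because the original BED entries had start equal to end and we added 1000 bp on each side, every feature should span 2000 positions.

# optional sanity check: the largest within-feature position (column 7)
# should equal the window size, 2000 bp in this case
cov.check <- read.table("chip_experiment.cov.bed", sep = "\t",
                        stringsAsFactors = FALSE)
max(cov.check[[7]])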
Summarize and plot coverage
Now we are ready to calculate some statistics on how many reads map to each base surrounding the TSSs, and to plot these. In the next snippet I'll calculate the sum of reads per base; calculating the average requires only a minor change to the script. Since several of the operations take some time, I use the package doMC to save time, but this should work without it.
############################################
## function to calculate coverage profile ##
## list coverage files
path <- "./"
cov.files <- list.files(path = path, pattern = ".cov.bed")

## for multicore usage:
# if you don't have doMC, skip these two lines and set .parallel = FALSE below
library(doMC)
registerDoMC(cores = 10) # yay for our server

halfWindow <- 1000 # this is half the size of the window around the TSSs

loadCovMat <- function(file){
    # read files
    # http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
    # cov.data <- fread(file) # not used because strand not recognized
    cov.data <- read.table(concat(path, file), header = FALSE, sep = "\t", stringsAsFactors = FALSE)
    names(cov.data) <- c("chr", "start", "end", "strand", "ID", "type", "position", "count")

    # create a new ID to avoid repeated gene names - a common issue with UCSC annotations
    cov.data.n <- transform(cov.data, newID = paste(ID, start, sep = "_"))

    # add base position with reference to the TSS.
    # very important: note the strand operation.
    cov.aligned <- mutate(cov.data.n,
                          position_aligned = position - halfWindow,
                          position_str_cor = ifelse(strand == "+", position_aligned, -1 * position_aligned))

    # the next operation needs a lot of memory, so remove objects no longer needed
    rm("cov.data", "cov.data.n")

    # sums the number of reads per position - the final result!
    res <- ddply(cov.aligned, .(position_str_cor), summarize, sum = sum(count), .parallel = TRUE)

    ## get condition name
    # important to have a column with the name of the experiment when running multiple analyses
    sample <- strsplit(file, "\\.")[[1]][1]
    res$condition <- rep(sample, nrow(res))
    return(res)
}
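As mentioned above, getting the average rather than the sum of reads per position is a minor change to the ddply call inside the function; a sketch (avg is just a label I picked):

# replace the ddply call above with this one to get the mean coverage
# per position instead of the sum ("avg" is an arbitrary column name)
res <- ddply(cov.aligned, .(position_str_cor), summarize,
             avg = mean(count), .parallel = TRUE)
# remember to also change the y aesthetic in the plots below (y = avg)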
## Run the function over all coverage files
# the output is a list of data frames
cov.l <- lapply(cov.files, loadCovMat)

## see the data
lapply(cov.l, head)

## save the coverage data
lapply(cov.l, function(x){
    write.table(x,
                concat(unique(x$condition), ".sum.cov"),
                row.names = FALSE,
                quote = FALSE, sep = "\t")})

# join the data frames in the list into one big data frame for plotting
cov <- do.call("rbind", cov.l)

## 2 different plots:
library(ggplot2)
p.cov.all <- ggplot(data = cov, aes(x = position_str_cor, y = sum, color = condition)) + geom_line()
ggsave(filename = "./plots/coverage_tss_all.pdf", p.cov.all)

p.cov.wrap <- p.cov.all + facet_wrap(~condition)
ggsave(filename = "./plots/coverage_tss_wrap.pdf", p.cov.wrap)
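Since the whole point is to get plots that can go into a presentation with little or no change, a couple of optional ggplot touches can help; the labels and sizes below are just suggestions:

# optional cosmetic tweaks for presentations (labels and sizes are suggestions)
p.cov.pretty <- p.cov.all +
    labs(x = "Distance from TSS (bp)", y = "Total read count") +
    theme_bw(base_size = 16) # larger fonts read better on slides
ggsave(filename = "./plots/coverage_tss_pretty.pdf", p.cov.pretty)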
If all went well, this is the final result: a plot of the read coverage around the TSSs for each condition.
All that is lacking now is to put all of this in a nice wrapper, which could easily be done, but I preferred to show the different stages of the process separately because that makes it easier to explain. I hope it helps.