Merge fastq sample from different lanes and rename them
This is something I need to do often, and a colleague asked me how to do it herself. So the best way to share it is to post it on the blog.
Problem
Quite often samples are sequenced in multiple lanes, or for whatever reason are split into multiple files, which is how we receive them. Before processing, I like to join the fastq files and rename them at that point. Afterwards all plots will carry names that are mostly presentation/publication ready, so that is one less thing to change manually.
1st step: merge
PROJECT='my_project'
# create project folder
mkdir -p ${PROJECT}/data/reads/03_December_2015
cd ${PROJECT}/data/reads/03_December_2015
# copy original reads - these will always be kept!
rsync -r -t -x -v --progress -u -l -z -s /fsimb/exchange/imb-genomicscf/AG_Ketting/imb_ketting_2015_22/ ./
# merge the reads in parallel taking advantage of the cluster
mkdir -p logs
for f in $(find ./ -type d -name "*imb*");
do
    # use the folder name without the leading ./ for job and log names
    name=$(basename "$f")
    echo "$name"
    bsub -q short -n 1 -app Reserve3G -o logs/output.txt -J mergeFastQ_"$name" -e logs/mergeFastQ_"$name".err.log mergeFastQ.sh "$f" "single"
done
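Before submitting, it can help to preview what the loop will send to the scheduler. The sketch below is a dry run on a throwaway directory: the sample folder names are made up for illustration, and the bsub command is only printed, never executed. It also uses the folder's base name for the job and log names, so that the leading ./ returned by find does not end up in the log path.

```shell
#!/usr/bin/env bash
# Dry run: print the bsub commands instead of submitting them.
# The sample directory names below are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir"/Sample_imb_ketting_2015_22_01_wt \
         "$workdir"/Sample_imb_ketting_2015_22_02_mut
cd "$workdir" || exit 1

preview_jobs() {
    find ./ -type d -name "*imb*" | sort | while read -r f; do
        name=$(basename "$f")
        # echo keeps this a preview; drop it to actually submit the job
        echo bsub -q short -n 1 -app Reserve3G -o logs/output.txt \
            -J "mergeFastQ_$name" -e "logs/mergeFastQ_$name.err.log" \
            mergeFastQ.sh "$f" single
    done
}

preview_jobs
```

Each printed line is one job. If the names look right, the same loop can be run without the echo.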
This will call mergeFastQ.sh, which changes to each sample directory and merges the files therein (save it in a folder that is on your $PATH):
#!/usr/bin/env bash
# mergeFastQ.sh
# António Domingues
# 22.01.2015
# stop if the sample directory cannot be entered
cd "$1" || exit 1
# outname=`echo $1 | sed -r 's/_[a-Z]{6}_.*(_R[1-2]).*$/\1/'`
outname=${PWD##*/}
echo $outname
# read1=`find -name "*R1*.fastq.gz"`
# read2=`find -name "*R2*.fastq.gz"`
if [ "$2" = "paired" ]; then
    echo "Paired"
    cat *_R1_*.fastq.gz > "$outname"_R1.fastq.gz
    cat *_R2_*.fastq.gz > "$outname"_R2.fastq.gz
elif [ "$2" = "single" ]; then
    echo "Single"
    cat *_R1_*.fastq.gz > "$outname".fastq.gz
else
    # anything else is a usage error
    echo "Usage: mergeFastQ.sh <sample_dir> <paired|single>" >&2
    exit 1
fi
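Concatenating gzip files with cat works because a gzip file may hold several compressed members back to back, and gzip decompresses them as one stream. A quick way to confirm a merge is to check that the merged file holds as many lines as the lane files combined. The following is a minimal sketch on toy data; the file names and reads are invented for illustration.

```shell
#!/usr/bin/env bash
# Toy check that a cat-merged fastq.gz holds all reads of its lane files.
# File names and reads are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# two fake lane files, one read (4 lines) each
printf '@read1\nACGT\n+\nIIII\n' | gzip > sample_L001_R1_001.fastq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip > sample_L002_R1_001.fastq.gz

# count the lines in the lane files
lanes=$(gzip -dc ./*_R1_*.fastq.gz | wc -l)

# merge as in the script, then count the lines in the result
cat ./*_R1_*.fastq.gz > sample.fastq.gz
merged=$(gzip -dc sample.fastq.gz | wc -l)

# a fastq record is 4 lines, so reads = lines / 4
echo "lane reads: $((lanes / 4)), merged reads: $((merged / 4))"
if [ "$lanes" -eq "$merged" ]; then
    echo "merge OK"
else
    echo "merge FAILED" >&2
fi
```

On real data the same two counts can be compared inside each sample directory before anything is renamed or deleted.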
2nd step: rename
Once the merge is confirmed, the merged files are renamed and moved to a pooled folder. The original files are then archived in a single tar file.
mkdir -p pooled
for f in $(find ./ -type f -name "Sample*.fastq.gz");
do
    # strip the project prefix and sample number from the file name
    n="pooled/"$(basename "$f" | sed -e 's/Sample_imb_ketting_2015_22_[0-9]*_//')
    mv "$f" "$n"
done
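To see what the sed expression does before moving anything, the renaming can be tried on a few names first. This sketch uses hypothetical sample names and a helper function, strip_prefix, that is not part of the original script; the numeric sample index after the project prefix is stripped along with the prefix.

```shell
#!/usr/bin/env bash
# Preview of the new names; the input names are hypothetical.
strip_prefix() {
    basename "$1" | sed -e 's/Sample_imb_ketting_2015_22_[0-9]*_//'
}

for f in \
    ./Sample_imb_ketting_2015_22_01_wt_rep1/Sample_imb_ketting_2015_22_01_wt_rep1.fastq.gz \
    ./Sample_imb_ketting_2015_22_12_mut_rep2/Sample_imb_ketting_2015_22_12_mut_rep2.fastq.gz
do
    # nothing is moved here, the mapping is only printed
    echo "$f -> pooled/$(strip_prefix "$f")"
done
```

If two samples map to the same new name, mv would silently overwrite one of them, so it is worth scanning this list for duplicates.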
# archive the original reads in a tar file and clean up only if tar succeeded
tar cf Sample_imb_ketting_2015_22.tar $(find ./ -type d -name "*imb*") && rm -rf $(find ./ -type d -name "*imb*")
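Since rm -rf is irreversible, one cautious variant is to delete the original folders only after the archive has been written and can be listed. The sketch below runs on a throwaway directory; the folder, file, and archive names are all placeholders.

```shell
#!/usr/bin/env bash
# Archive first, verify, then delete. All names here are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p Sample_imb_demo_01 Sample_imb_demo_02
echo "lane data" > Sample_imb_demo_01/lane1.fastq.gz
echo "lane data" > Sample_imb_demo_02/lane1.fastq.gz

dirs=$(find ./ -maxdepth 1 -type d -name "*imb*")

# $dirs is left unquoted on purpose so that each folder becomes one argument;
# this assumes folder names without spaces
if tar cf originals.tar $dirs && tar tf originals.tar > archive_contents.txt; then
    rm -rf $dirs
    echo "originals archived and removed"
else
    echo "archive failed, originals kept" >&2
fi
```

The listing in archive_contents.txt doubles as a record of what went into the archive.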
3rd step: processing
Once we have this, we start the removal of poor quality reads, fastQC, mapping, etc. For this I take advantage of the Institute's pipelines, NGSpipe2go, to which I added a small RNA-seq pipeline. You can download the whole thing here to your ${PROJECT} folder and uncompress the zip file. Then, to run it, you will only need one command:
bpipe run smallrnaseq_v0.1.txt data/reads/03_December_2015/*.fastq.gz
All the required scripts and files should be accessible.
Citation
@online{domingues2015,
author = {Domingues, António},
title = {Merge Fastq Sample from Different Lanes and Rename Them},
date = {2015-12-03},
url = {https://amjdomingues.com/posts/2015-12-03-renaming-and-merging-fastq/},
langid = {en}
}