Skip to content

Hi seq depth regions

AndyMenzies edited this page May 11, 2022 · 2 revisions

It is known that various regions of any genome have a tendency to attract more than the anticipated fraction of sequencing reads. For Human GRCh37/hg19 UCGC have a Mapping and Sequencing -> Hi Seq Depth table generated from short read sequences and ChIP-seq assays.

Unfortunately this doesn't translate to other species or even the later builds of human.

Here we provide some guidance how you can generate similar information to prevent analysis of computationally intensive low value regions.

Please note that this is not exhaustive and some manual extension/tuning with centromeric regions is generally required.

Generate BigWig coverage files

You need a set of whole genome sequence data files generated from normal samples.

Use the bamToBw.pl script to generate these in an efficient manner:

  Basic usage:
    bamToBw.pl -t 16 -b in.bam -o out.bw

You should ensure that your input data is representative of the data you will want to filter, similar read-length and technology/chemistry are recommended. Data must be mapped with the same algorithm.

Interrogate BigWig for regions of high-depth

Use the script detectExtremeDepth.pl to identify the regions of high depth. By default this will isolate areas which are greater than mean+12sd on the current chromosome.

An example LFS bsub for Mouse GRCm38 would look like:

cd some_scratch_area
mkdir `pwd`/raw_output
mkdir `pwd`/logs
ls -1 input/*.bw | xargs -I {} bash -c "bsub -oo `pwd`/logs/%J.%I.log -q long -J 'mouseDepth[1-21]' -n 1 -R'select[mem>=2000] span[hosts=1] rusage[mem=2000]' -M 2000 'detectExtremeDepth.pl -b {} -o `pwd`/raw_output -d 20:X -d 21:Y -r \$LSB_JOBINDEX'"

Now merge your results back into a per-sample dataset:

mkdir `pwd`/raw_merged
cd `pwd`/raw_output
ls -1 *.1.bed | cut -f 1 -d '.' | xargs -I {} bash -c 'cat {}.merged.*.bed | sort -k1,1 -k2,2n > ../raw_merged/{}.bed'

Then merge with bedtools and apply some filtering:

cd ../
bedtools multiinter -i raw_merged/*.bed > intersect.bed
perl -ane 'next if($F[3] < 3); printf qq{%s\t%d\t%d\n}, @F[0..2];' < intersect.bed | bedtools merge -i stdin -d 250 | perl -ane 'next if($F[2]-$F[1] < 500); print $_;' | bgzip -c > depth_mask.bed.gz
tabix -p bed depth_mask.bed.gz