-
Notifications
You must be signed in to change notification settings - Fork 25
McCortex Pipeline
The scripts/make-pipeline.pl
script makes it simple to quickly generate a script to run all the steps to go from sequence files to VCF files and sample assemblies. Sequence files may be in any of SAM,BAM,FASTA,FASTQ or gzipped FASTA/FASTQ formats. First the script generates a Makefile
. You then run this makefile with the make
command which generates your result VCFs.
scripts/make-pipeline.pl [-r hg19.fa] <list-of-kmers> <out-dir> <samples.txt> > my_job_script.mk
<list-of-kmers>
may be:
- a single kmer size e.g. '31'
- a comma separated list of kmer sizes e.g. '31,41'
- a first and last kmer size with a step size (
first:last:step
): e.g. '29:41:4' => 29,33,37,41
Remember kmer sizes must be odd numbers. Results from various kmer sizes are merged. More on choosing a kmer size.
Example samples.txt
file:
# Comment line start with a #
# Columns are tab or space separated
# file lists are comma separated
# <sample-name> <single-ended-reads> <paired-end-reads> <interleaved-files>
Benjamin ben.fa.gz,ben.fq . .
Jerry . jerry.1.fq.gz:jerry.2.fq.gz .
Notes:
-
Sample Name: This is used as an identifier throughout the processing. May use the characters
a-z0-9_-.
. -
File lists: each column should contain a comma-separated list of files to use or a
.
to signify an empty list. -
Paired-end files: paired read files are separated by a colon
:
e.g.A.1.fq:A.2.fq,B.1.fq:B.2.fq
- Interleaved files: these are files where paired-end reads come one after the other (e.g. a name-sorted bam).
The reference must be compressed with bgzip
to run the final VCF finishing steps with bcftools
. This is a requirement of bcftools
. To do this:
# If compressed with gzip then uncompress first:
gzip -d <ref.fa.gz>
# Then:
mccortex/libs/htslib/bgzip <ref.fa>
McCortex cannot handle paired-end reads whose ends overlap. There are several good programs that already exist to merge overlapping read pairs such as: PEAR and FLASH. Use one of these to merge your paired end reads into single reads before calling McCortex.
If McCortex knows the fragment size it can use this to remove erroneous links in the read threading step. You can pass this by editing the following line in your pipeline Makefile:
THREAD_ARGS=--min-frag-len 150 --max-frag-len 1000 ...
For a fragment size of e.g. 600, the values min=400 and max=800 should be suitable.
You can edit your pipeline Makefile to set additional run time parameters.
## Running the MakefileTo run your pipeline you need to pass two options, CTXDIR
the path to McCortex directory and MEM
the maximum amount of memory to use:
make -f my_job_script.mk CTXDIR=~/bin/mccortex MEM=120G [optional-target]
MEM
is used as a guide and usage may be slightly above the given value. To be safe, pass a value lower than your machine limit.
You can specify what you would like to run by adding a target to the end the command:
-
graphs
- build and clean graphs -
links
- build and clean links -
bubbles
- make bubble calls -
breakpoints
- make breakpoint calls -
bub-vcf
- make bubble VCF -
brk-vcf
- make breakpoint VCF -
bub-geno-vcf
- genotyped bubble VCF -
brk-geno-vcf
- genotyped breakpoint VCF -
vcfs
- make all vcfs [default] -
contigs
- assemble contigs for each sample -
contigs-pop
- assemble contigs after popping bubbles -
unitigs
- dump unitigs for each sample -
<outdir>/k<k>/contigs/<sample>.rmdup.fa.gz
to assemble contigs for a given sample and kmer-size
By default if no target is given, bubbles
is run if there is no reference and vcfs
if there is.
If at some point your job fails, the last line of the output before the error will show the job that died. This line will include a log file (e.g. <some command> >& <outdir>/k31/graphs/ben.raw.ctx.log
). The log file will include the error that caused the pipeline to stop. Once you have fixed the issues, you can resume the job with the same command. It will continue from the point it stopped - magic!
List the commands still to be run:
make -f my_job_script.mk --dry-run <options>
List all the commands in this pipeline (including those already run):
make -f my_job_script.mk --dry-run --always <options>
You can run multiple jobs in parallel using the make
command line option --jobs=<N>
. Remember that if you use this option maximum memory usage will be <N> * <MEM>
.
VCF results can be found in <outdir>/vcfs/
. There is one VCF for results found with the bubble caller and another for results from the breakpoint caller e.g. outdir/vcfs/breakpoints.k29.k31.geno.vcf.gz
, outdir/vcfs/bubbles.k29.k31.geno.vcf.gz
.
Individual sample assemblies can be found at: <outdir>/k<K>/contigs/<sample>.rmdup.fa.gz
.
To generate a report after running, run:
~/mccortex/scripts/make-report.pl proj myreport
where proj
is your run directory and myreport
is a directory (to be created) to put your report in. To run R
and latex
to compile the report, run:
cd myreport
make
Will generate a file myreport/report.pdf
. This may be useful in inspecting cleaning threshold and sample coverage.
Below is a simple example, with two samples (NA12878 and NA12880). McCortex directory is ~/bin/mccortex/
and we are using kmer sizes 27,31,51
with reference /data/references/hg19.fa
:
echo "NA12878 NA12878_unpaired.fq.gz NA12878_1.fq.gz:NA12878_2.fq.gz ." > samples.txt
echo "NA12880 NA12880_unpaired.fq.gz NA12880_1.fq.gz:NA12880_2.fq.gz ." >> samples.txt
~/bin/mccortex/scripts/make-pipeline.pl -r /data/references/hg19.fa 27,31,51 mccortex_job > job.mk
make -f job.mk CTXDIR=~/bin/mccortex MEM=300G
Genotyped VCFs are now in mccortex_job/vcfs/bubbles.k27.k31.k51.geno.vcf.gz
and mccortex_job/vcfs/breakpoints.k27.k31.k51.geno.vcf.gz
Warning: if you modify cortex graph files or link files and re-run your Makefile, it will re-generate the VCFs. This is because it detects that a dependency (the input files for variant calling) have changed. However it is ok to delete (rather than modify) input and intermediate files without triggering a re-run.
Once the link files have been cleaned to remove sequencing error, you can delete the raw link files if you wish to free up disk space. This will not interfere with the running job or any re-started jobs. These files are found in <outdir>/k<K>/links/<sampleid>.raw.ctp.gz
. Raw graph files are used to generate VCFs so you need to keep them until you have genotyped VCFs. Raw graph files are found in <outdir>/k<K>/graphs/<sampleid>.raw.ctx
.