Skip to content
Emeline Favreau edited this page May 1, 2018 · 11 revisions

McCortex Pipeline

The scripts/make-pipeline.pl script makes it simple to quickly generate a script to run all the steps to go from sequence files to VCF files and sample assemblies. Sequence files may be in any of SAM,BAM,FASTA,FASTQ or gzipped FASTA/FASTQ formats. First the script generates a Makefile. You then run this makefile with the make command which generates your result VCFs.

## Setup To generate a Makefile:
scripts/make-pipeline.pl [-r hg19.fa] <list-of-kmers> <out-dir> <samples.txt> > my_job_script.mk

Kmer size

<list-of-kmers> may be:

  • a single kmer size e.g. '31'
  • a comma separated list of kmer sizes e.g. '31,41'
  • a first and last kmer size with a step size (first:last:step): e.g. '29:41:4' => 29,33,37,41

Remember kmer sizes must be odd numbers. Results from various kmer sizes are merged. More on choosing a kmer size.

Specifying samples

Example samples.txt file:

# Comment line start with a #
# Columns are tab or space separated
# file lists are comma separated
# <sample-name> <single-ended-reads> <paired-end-reads> <interleaved-files>
Benjamin	ben.fa.gz,ben.fq	.	.
Jerry	.	jerry.1.fq.gz:jerry.2.fq.gz	.

Notes:

  • Sample Name: This is used as an identifier throughout the processing. May use the characters a-z0-9_-..
  • File lists: each column should contain a comma-separated list of files to use or a . to signify an empty list.
  • Paired-end files: paired read files are separated by a colon : e.g. A.1.fq:A.2.fq,B.1.fq:B.2.fq
  • Interleaved files: these are files where paired-end reads come one after the other (e.g. a name-sorted bam).

Reference

The reference must be compressed with bgzip to run the final VCF finishing steps with bcftools. This is a requirement of bcftools. To do this:

# If compressed with gzip then uncompress first:
gzip -d <ref.fa.gz>
# Then:
mccortex/libs/htslib/bgzip <ref.fa>

Fragment size parameters

McCortex cannot handle paired-end reads whose ends overlap. There are several good programs that already exist to merge overlapping read pairs such as: PEAR and FLASH. Use one of these to merge your paired end reads into single reads before calling McCortex.

If McCortex knows the fragment size it can use this to remove erroneous links in the read threading step. You can pass this by editing the following line in your pipeline Makefile:

THREAD_ARGS=--min-frag-len 150 --max-frag-len 1000 ...

For a fragment size of e.g. 600, the values min=400 and max=800 should be suitable.

You can edit your pipeline Makefile to set additional run time parameters.

## Running the Makefile

To run your pipeline you need to pass two options, CTXDIR the path to McCortex directory and MEM the maximum amount of memory to use:

make -f my_job_script.mk CTXDIR=~/bin/mccortex MEM=120G [optional-target]

MEM is used as a guide and usage may be slightly above the given value. To be safe, pass a value lower than your machine limit.

You can specify what you would like to run by adding a target to the end the command:

  • graphs - build and clean graphs
  • links - build and clean links
  • bubbles - make bubble calls
  • breakpoints - make breakpoint calls
  • bub-vcf - make bubble VCF
  • brk-vcf - make breakpoint VCF
  • bub-geno-vcf - genotyped bubble VCF
  • brk-geno-vcf - genotyped breakpoint VCF
  • vcfs - make all vcfs [default]
  • contigs - assemble contigs for each sample
  • contigs-pop - assemble contigs after popping bubbles
  • unitigs - dump unitigs for each sample
  • <outdir>/k<k>/contigs/<sample>.rmdup.fa.gz to assemble contigs for a given sample and kmer-size

By default if no target is given, bubbles is run if there is no reference and vcfs if there is.

If at some point your job fails, the last line of the output before the error will show the job that died. This line will include a log file (e.g. <some command> >& <outdir>/k31/graphs/ben.raw.ctx.log). The log file will include the error that caused the pipeline to stop. Once you have fixed the issues, you can resume the job with the same command. It will continue from the point it stopped - magic!

List the commands still to be run:

make -f my_job_script.mk --dry-run <options>

List all the commands in this pipeline (including those already run):

make -f my_job_script.mk --dry-run --always <options>

You can run multiple jobs in parallel using the make command line option --jobs=<N>. Remember that if you use this option maximum memory usage will be <N> * <MEM>.

## Where are my results?

VCF results can be found in <outdir>/vcfs/. There is one VCF for results found with the bubble caller and another for results from the breakpoint caller e.g. outdir/vcfs/breakpoints.k29.k31.geno.vcf.gz, outdir/vcfs/bubbles.k29.k31.geno.vcf.gz.

Individual sample assemblies can be found at: <outdir>/k<K>/contigs/<sample>.rmdup.fa.gz.

## Generate a report

To generate a report after running, run:

~/mccortex/scripts/make-report.pl proj myreport

where proj is your run directory and myreport is a directory (to be created) to put your report in. To run R and latex to compile the report, run:

cd myreport
make

Will generate a file myreport/report.pdf. This may be useful in inspecting cleaning threshold and sample coverage.

## Example

Below is a simple example, with two samples (NA12878 and NA12880). McCortex directory is ~/bin/mccortex/ and we are using kmer sizes 27,31,51 with reference /data/references/hg19.fa:

echo "NA12878 NA12878_unpaired.fq.gz NA12878_1.fq.gz:NA12878_2.fq.gz ." > samples.txt
echo "NA12880 NA12880_unpaired.fq.gz NA12880_1.fq.gz:NA12880_2.fq.gz ." >> samples.txt
~/bin/mccortex/scripts/make-pipeline.pl -r /data/references/hg19.fa 27,31,51 mccortex_job > job.mk
make -f job.mk CTXDIR=~/bin/mccortex MEM=300G

Genotyped VCFs are now in mccortex_job/vcfs/bubbles.k27.k31.k51.geno.vcf.gz and mccortex_job/vcfs/breakpoints.k27.k31.k51.geno.vcf.gz

## Final Notes

Warning: if you modify cortex graph files or link files and re-run your Makefile, it will re-generate the VCFs. This is because it detects that a dependency (the input files for variant calling) have changed. However it is ok to delete (rather than modify) input and intermediate files without triggering a re-run.

Once the link files have been cleaned to remove sequencing error, you can delete the raw link files if you wish to free up disk space. This will not interfere with the running job or any re-started jobs. These files are found in <outdir>/k<K>/links/<sampleid>.raw.ctp.gz. Raw graph files are used to generate VCFs so you need to keep them until you have genotyped VCFs. Raw graph files are found in <outdir>/k<K>/graphs/<sampleid>.raw.ctx.

Clone this wiki locally