fmalmeida/MpGAP

MpGAP pipeline

A generic multi-platform genome assembly pipeline


See the documentation »

Report Bug · Request Feature

About

MpGAP is an easy-to-use, Docker-based Nextflow pipeline that adopts well-known software for genome assembly of Illumina, PacBio and Oxford Nanopore sequencing data through Illumina-only, long-reads-only or hybrid modes. It wraps the following software:

Assemblers: Canu, Flye, Raven, Shasta, wtdbg2, Haslr, Unicycler, SPAdes, Shovill
Polishers: Nanopolish, Medaka, gcpp, Pilon
Quality check: QUAST, MultiQC

Further reading

This pipeline has two complementary pipelines (also written in nextflow) for NGS preprocessing and prokaryotic genome annotation that can give the user a complete workflow for bacterial genomics analyses.

Requirements

This pipeline has only two dependencies: Docker and Nextflow.

Installation

  1. If you don't already have it, install Docker on your computer.
    • Once it is installed, download the required Docker image:

      docker pull fmalmeida/mpgap:v2.3
      

Each release is accompanied by a Dockerfile in the docker folder. When using older releases, users can build the correct image from the Dockerfile that ships with that release (remember to give the image the same name that is used on Docker Hub and in the Nextflow script). The latest release will always have its Docker image on Docker Hub.
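As a rough sketch of the paragraph above, an image for an older release could be built locally along these lines. The tag `vX.Y` is a placeholder, not a real release, and two things are assumptions to check against the release you use: that the git tag has the same name as the image tag, and that the Dockerfile sits directly in `docker/`. The script only prints the commands unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Sketch: build the Docker image for an older MpGAP release.
# build_release_image and run are hypothetical helpers, not part of MpGAP.
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_release_image() {
    release=$1               # e.g. vX.Y (placeholder)
    run git clone https://github.com/fmalmeida/MpGAP.git
    run git -C MpGAP checkout "$release"
    # The image name must match the one the Nextflow script expects.
    run docker build -t "fmalmeida/mpgap:$release" MpGAP/docker
}

build_release_image "${RELEASE:-vX.Y}"
```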

  2. Install Nextflow (version 20.01 or higher):

    curl -s https://get.nextflow.io | bash
    
  3. Give it a try:

    nextflow run fmalmeida/mpgap --help
    

Users can keep the pipeline up to date with: nextflow pull fmalmeida/mpgap

Documentation

Explanation of hybrid strategies

Hybrid assemblies can be produced using one of two available strategies:

Strategy 1

By using the hybrid assembly modes of Unicycler, Haslr and/or SPAdes. For instance, Unicycler's hybrid mode first builds a high-quality assembly graph from the Illumina data and then uses the long reads to bridge the gaps. More information about Unicycler's hybrid mode can be found in the Unicycler documentation.

Strategy 2

By polishing a long-reads-only assembly with Illumina reads. To do so, users must set --strategy_2 to true. This tells the pipeline to produce a long-reads-only assembly (with Canu, Flye, Raven or Unicycler) and polish it with Pilon (for unpaired reads) or with the Unicycler-polish program (for paired-end reads).

Note that the --strategy_2 parameter selects an alternative workflow: when it is used, the pipeline executes ONLY strategy 2, not both strategies. When it is false, only strategy 1 is executed.

Example:

    nextflow run fmalmeida/mpgap --outdir output --threads 5 --shortreads_paired "path-to/illumina_r{1,2}.fastq" \
    --shortreads_single "path-to/illumina_unpaired.fastq" --lr_type 'nanopore' --longreads "path-to/ont_reads.fastq" --strategy_2
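
To make the either/or behaviour concrete, the following sketch prints the command for each strategy using the same placeholder paths as the example above. `hybrid_cmd` is a hypothetical helper written for illustration; it only prints the command and does not run Nextflow.

```sh
#!/bin/sh
# Sketch: print the hybrid assembly command with or without --strategy_2.
hybrid_cmd() {
    use_strategy_2=$1
    set -- nextflow run fmalmeida/mpgap --outdir output --threads 5 \
        --shortreads_paired "path-to/illumina_r{1,2}.fastq" \
        --shortreads_single "path-to/illumina_unpaired.fastq" \
        --lr_type nanopore --longreads "path-to/ont_reads.fastq"
    # Without --strategy_2 only strategy 1 runs; with it, only strategy 2.
    if [ "$use_strategy_2" = "true" ]; then
        set -- "$@" --strategy_2
    fi
    # Quote every argument so the {1,2} pattern survives copy-pasting.
    for arg in "$@"; do
        printf "'%s' " "$arg"
    done
    echo
}

hybrid_cmd false   # strategy 1: hybrid assemblers
hybrid_cmd true    # strategy 2: long reads assembly, then polishing
```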

Usage

Users are advised to read the complete documentation »

  • Complete command line explanation of parameters:
    • nextflow run fmalmeida/mpgap --help
  • See usage examples in the command line:
    • nextflow run fmalmeida/mpgap --examples

Command line usage examples

Command line executions are exemplified in the manual.

Warnings

  • Remember to always write input paths inside double quotes.
  • When using paired-end reads, the input must be given with the "{1,2}" pattern. For example, "SRR6307304_{1,2}.fastq" will properly load the reads "SRR6307304_1.fastq" and "SRR6307304_2.fastq".
  • When running hybrid assemblies or mixing short read types, avoid unnecessary patterns and write the full file path, using only the "{1,2}" pattern that paired-end reads require. This prevents the pipeline from loading unrelated reads that happen to match the pattern and avoids confusion about the inputs.
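
Before launching a run, it can help to confirm that both mates behind a "{1,2}" pattern actually exist. The sketch below does that with a hypothetical helper, `check_pairs`, which is not part of MpGAP; the demo at the end uses empty files in a temporary folder.

```sh
#!/bin/sh
# Sketch: check that a quoted "{1,2}" pattern points at two existing files.
check_pairs() {
    pattern=$1                     # e.g. "reads/SRR6307304_{1,2}.fastq"
    token='{1,2}'
    case $pattern in
        *"$token"*) ;;
        *) echo "pattern lacks {1,2}: $pattern" >&2; return 1 ;;
    esac
    prefix=${pattern%%"$token"*}   # text before {1,2}
    suffix=${pattern#*"$token"}    # text after {1,2}
    for mate in 1 2; do
        file="${prefix}${mate}${suffix}"
        if [ ! -f "$file" ]; then
            echo "missing: $file" >&2
            return 1
        fi
    done
    echo "ok: $pattern"
}

# Demo with empty files in a temporary folder.
demo_dir=$(mktemp -d)
touch "$demo_dir/SRR6307304_1.fastq" "$demo_dir/SRR6307304_2.fastq"
check_pairs "$demo_dir/SRR6307304_{1,2}.fastq"
rm -r "$demo_dir"
```

Because the pattern is passed inside double quotes, the shell leaves it untouched, which is also why the pipeline expects quoted paths.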

Using the configuration file

All parameters shown above can be, and are advised to be, set through the configuration file. When a configuration file is used, the pipeline is executed as nextflow run fmalmeida/mpgap -c ./configuration-file. The configuration file tells the pipeline which type of data you have and which processes to execute, so it needs to be correctly filled in.

To create a configuration file in your working directory:

  • For Hybrid assemblies:

    nextflow run fmalmeida/mpgap --get_hybrid_config
    
  • For Long reads only assemblies:

    nextflow run fmalmeida/mpgap --get_lreads_config
    
  • For Illumina only assemblies:

    nextflow run fmalmeida/mpgap --get_sreads_config
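
The overall workflow is therefore: generate a template, edit it, and run with -c. The sketch below only prints those steps for a chosen mode. `config_flag` and the mode names (hybrid, longreads, shortreads) are hypothetical labels introduced for illustration, and `./configuration-file` stands for whatever file the template command writes.

```sh
#!/bin/sh
# Sketch: map an assembly mode to the template flag listed above.
config_flag() {
    case $1 in
        hybrid)     echo "--get_hybrid_config" ;;
        longreads)  echo "--get_lreads_config" ;;
        shortreads) echo "--get_sreads_config" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

mode="${1:-hybrid}"
if flag=$(config_flag "$mode"); then
    echo "1) nextflow run fmalmeida/mpgap $flag"
    echo "2) edit the generated file so it points at your reads"
    echo "3) nextflow run fmalmeida/mpgap -c ./configuration-file"
fi
```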
    

Interactive graphical configuration and execution

Via NF tower launchpad (good for cloud env execution)

Nextflow has a feature called NF tower. It lets users quickly customise and set up the execution and configuration of cloud environments to run any Nextflow pipeline from nf-core, GitHub (this one included), Bitbucket, etc. Because this pipeline has a compliant JSON schema for its configuration, NF tower can render an input form, which makes setting the parameters easier.

Check out more about this feature at: https://seqera.io/blog/orgs-and-launchpad/

Via nf-core launch (good for local execution)

Users can trigger a graphical, interactive pipeline configuration and execution with the nf-core launch utility. nf-core launch starts an interactive form in your web browser or command line so that you can configure the pipeline step by step and start its execution at the end.

# Install nf-core
pip install nf-core

# Launch the pipeline
nf-core launch fmalmeida/mpgap

Known issues

  1. Whenever Unicycler is used with unpaired reads, an odd platform-specific SPAdes-related crash seems to happen at random, as can be seen in the issue discussed at rrwick/Unicycler#188.
  • As a workaround, the Unicycler author suggests using the --no_correct parameter, which solves the issue and does not have a negative impact on assembly quality.
  • Therefore, if you run into this error when using unpaired data, you can activate this workaround with --unicycler_additional_parameters "--no_correct".
  2. Whenever running the pipeline for multiple samples at once using glob patterns such as '*' and '?', users are advised not to perform hybrid assemblies, nor to combine paired and unpaired short reads in short reads only assemblies. The pipeline cannot yet match the correct files to each other and, since Nextflow channels emit items in no guaranteed order, we cannot ensure that the combination of data used in these two assembly types will be right. The pipeline treats each input file as a unique sample and executes it individually.
  • To date, glob patterns only work properly with long reads only assemblies, or with short reads only assemblies that use either paired or unpaired reads, not both at the same time. For example:
    • nextflow run [...] --longreads 'my_data/*.fastq' --lr_type 'nanopore' --outdir my_results
    • The pipeline will load and assemble each fastq in the my_data folder, writing the results for each read file in a sub-folder named after its basename inside the my_results output folder.
    • nextflow run [...] --shortreads_single 'my_data/*.fastq' --outdir my_results
    • As above, each fastq in the my_data folder is loaded and assembled on its own, with results written to a sub-folder named after its basename inside my_results.

However, we are currently working on a proper way to execute hybrid assemblies and mixed short read assemblies for multiple samples at once, so that users can run them without confusion.
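
For the glob-pattern case described in this issue, a quick preview of what a pattern matches can prevent surprises. The sketch below lists each matching file together with its basename. `preview_samples` is a hypothetical helper, and treating the basename without the .fastq extension as the sub-folder name is an assumption; the exact naming is decided by the pipeline.

```sh
#!/bin/sh
# Sketch: list the files a glob would match and their basenames.
preview_samples() {
    dir=$1
    for reads in "$dir"/*.fastq; do
        [ -e "$reads" ] || continue      # the glob matched nothing
        echo "$(basename "$reads" .fastq)  <-  $reads"
    done
}

# my_data is the example folder used above; prints nothing if it is absent.
preview_samples my_data
```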

  3. Sometimes the Shovill assembler fails, and causes the pipeline to fail, because of problems in estimating the genome size. This is simple to solve: instead of letting Shovill estimate the genome size, pass the value to it directly:
    • --shovill_additional_parameters '--gsize 3m'

Citation

To cite this pipeline, users can use our Zenodo tag or cite the GitHub URL directly. Users are encouraged to cite the programs used in this pipeline whenever they are used.