
Releases: broadinstitute/pilon

Pilon version 1.24

28 Jan 18:36

Pilon version 1.24 has one algorithm change which affects local reassembly solutions, particularly when using diploid data. Previously, if there were several equivalent solutions for a local reassembly (gap fill or continuity break fix), it would pick the first available solution. In v1.24, it picks the smallest change among equivalent alternatives, preventing it from including surrounding heterozygous (or ambiguous in the haploid case) SNPs in a larger changed block. This reduces both the size and number of BreakFix solutions in the output from diploid data, since many of them contained spurious differences related to heterozygosity.
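The selection rule described above can be illustrated with a minimal sketch. This is not Pilon's code: the `Fix` record, the `change_size` measure (reference bases replaced plus bases inserted), and the example candidates are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Fix(NamedTuple):
    """Hypothetical record for one candidate reassembly solution."""
    start: int  # 1-based start of the replaced reference span
    ref: str    # reference bases being replaced
    alt: str    # bases that replace them


def change_size(fix: Fix) -> int:
    # Assumed size measure: bases touched on either side of the edit.
    return len(fix.ref) + len(fix.alt)


def pick_smallest(equivalent: List[Fix]) -> Fix:
    # min() returns the first minimal element, so ties keep input order.
    return min(equivalent, key=change_size)


candidates = [
    # Fixes the break but also drags in a flanking heterozygous SNP (A -> G).
    Fix(100, "ACGTTGCA", "ACGTTTGCG"),
    # Same break fix expressed as the minimal edit.
    Fix(104, "T", "TT"),
]
best = pick_smallest(candidates)
print(best)
```

Under the pre-1.24 behavior described above, the first candidate in the list would have been reported instead.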

Support for the experimental --threads option has been removed. It was implemented by an ugly hack which no longer works in modern Scala, and it was a resource hog in any case. I hope to revisit multithreading again in the future.

Otherwise, v1.24 is a maintenance release, updating the code to compile on Scala 2.13 and updating the htsjdk library version to 2.23.0, allowing support for newer file formats such as CSI indexes. Additionally, v1.24 was built using the Java 11 toolchain, so I recommend using a JRE version 11 or greater.

Pilon version 1.23

27 Nov 02:36

Pilon version 1.23 introduces two new experimental arguments to specify long read input BAMs:

--nanopore ont.bam identifies ont.bam as containing long reads from Oxford Nanopore sequencing
--pacbio pb.bam identifies pb.bam as containing long reads from Pacific Biosciences sequencing

In this version, the long read BAMs are only used for SNP and indel calling based on pileups. Long reads are not yet used for local reassembly or gap filling, but that will likely come in a future release. For development, I have been using minimap2 to generate the long read BAM files.

Currently, use of long reads is most effective in combination with Illumina --frags libraries, so that Pilon can use the high base quality of the Illumina libraries for unique sequence and use the long reads to reach into repeat sequence to disambiguate embedded differences. It is possible to use only long reads as input to Pilon, but consider that very experimental.

There are limitations in Pilon's use of long reads in pileups: for both --pacbio and --nanopore libraries, Pilon does not attempt to call indels in homopolymer runs of 4 or more bases (e.g., AAAA...), and for --nanopore sequence, Pilon does not use the long reads to call the middle base of a CCxGG motif, as the ONT base calls can be confused by methylation. So this is very basic long read support, but it has been effectively applied to more than a dozen bacterial genomes in conjunction with Illumina paired-end sequencing.
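The two long read restrictions above can be expressed as simple position checks. The following is a rough sketch, not Pilon's implementation: the function names are hypothetical, positions are 0-based, and the sequence is assumed to be uppercase.

```python
def homopolymer_run_length(seq: str, pos: int) -> int:
    """Length of the run of identical bases containing 0-based index pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right + 1 < len(seq) and seq[right + 1] == base:
        right += 1
    return right - left + 1


def skip_long_read_indel(seq: str, pos: int, min_run: int = 4) -> bool:
    # Indels are not called from long reads inside runs of 4 or more bases.
    return homopolymer_run_length(seq, pos) >= min_run


def is_ccxgg_middle(seq: str, pos: int) -> bool:
    # True when pos is the "x" of a CCxGG motif (nanopore reads are ignored there).
    return (
        2 <= pos <= len(seq) - 3
        and seq[pos - 2:pos] == "CC"
        and seq[pos + 1:pos + 3] == "GG"
    )


seq = "ACAAAAGTCCTGGA"
print(skip_long_read_indel(seq, 3))  # inside the AAAA run
print(is_ccxgg_middle(seq, 10))      # the T in CCTGG
```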

In addition to the long read support, v1.23 fixes a couple of bugs:

  1. Spurious long indels were occasionally called in pileups with minimal evidence
  2. A crash could occur when an indel was called at the beginning of a scaffold

Finally, this version updates the code base to use the Scala 2.12 compiler, uses a newer version of the htsjdk library, and is packaged with the sbt-assembly module instead of sbt-onejar.

Pilon version 1.22

15 Mar 20:52

This is a very minor release incorporating two bug fixes reported by users:

  1. Fixed bug in .bed file coordinates generated by the --tracks option (start coordinate was 1-based rather than 0-based);
  2. More flexibility in --targets specifications is now allowed so that scaffold names may contain colons (apparently Quiver generates these). Thanks to bwlang for sample code!
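Both fixes concern conventions that are easy to get wrong, so here is a small sketch of each. Neither function is taken from Pilon: the names are hypothetical, and the `name:start-stop` target layout is an assumption about how a range is attached to a scaffold name.

```python
import re
from typing import Optional, Tuple


def to_bed_interval(start_1based: int, end_1based: int) -> Tuple[int, int]:
    # BED intervals are 0-based and half-open: shift the start down by one
    # and keep the inclusive 1-based end as the exclusive end.
    return start_1based - 1, end_1based


# Greedy ".+" makes the match bind to the LAST colon, so colons inside the
# scaffold name are kept as part of the name.
_RANGE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<stop>\d+)$")


def parse_target(spec: str) -> Tuple[str, Optional[int], Optional[int]]:
    m = _RANGE.match(spec)
    if m:
        return m["name"], int(m["start"]), int(m["stop"])
    # No trailing range: the whole string is the scaffold name.
    return spec, None, None


print(to_bed_interval(1, 100))
print(parse_target("unitig_0:quiver:10-200"))
print(parse_target("unitig_0:quiver"))
```

A scaffold whose name itself ends in something like `:10-200` would still be ambiguous under this scheme.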

There should be no changes to results in this version other than the above.

--bruce, 15 Mar 2017

Pilon version 1.21

09 Dec 23:44

Version 1.21 introduces two new --fix options for assembly improvement: snps and indels. Prior to this release, the bases fix option was used to control alignment-based (as opposed to assembly-based) fixes of both SNPs and small indels. Now, fixing of SNPs and small indels can be controlled independently. The --fix bases option is retained for back-compatibility, and is equivalent to specifying both snps and indels.

For example, to use Illumina data to try to get rid of potential spurious indels in a PacBio assembly without changing anything else, one could use --fix indels.
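The back-compatibility rule for --fix bases amounts to expanding an alias. A minimal sketch, assuming a comma-separated option string; the `expand_fix` helper is hypothetical and is not how Pilon parses its arguments:

```python
from typing import Set


def expand_fix(spec: str) -> Set[str]:
    """Expand a comma-separated --fix value, treating 'bases' as an alias."""
    selected: Set[str] = set()
    for name in spec.split(","):
        name = name.strip()
        if name == "bases":
            # Retained alias: equivalent to specifying both snps and indels.
            selected.update({"snps", "indels"})
        elif name:
            selected.add(name)
    return selected


print(sorted(expand_fix("bases")))
print(sorted(expand_fix("indels")))
```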

This version also fixes an integer overflow bug which caused genome sizes > 2Gb to be printed as negative sizes.

--bruce, 9 Dec 2016

Pilon version 1.20

21 Sep 00:48

This release only fixes a bug in the experimental PacBio long read circular element closure feature introduced in 1.19.

--bruce, 20 Sep 2016

Pilon version 1.19

14 Sep 19:15

Pilon version 1.19 includes a new experimental feature specifically for improving PacBio bacterial assemblies generated by HGAP/Falcon by identifying circular elements (chromosome or plasmids) and trimming them for circular continuity. If this new option is not used, 1.19 is identical in functionality to 1.18.

There is a new --fix option called circles, and it requires an aligned BAM of PacBio corrected long reads. For development purposes, I have been creating these using a command like:

blasr corrected.fasta submission.contigs.fasta -nproc <N> -sam -clipping soft -minPctIdentity 97

and then sorting and indexing the output BAM. Then Pilon can be called using this as an --unpaired library along with --fix circles. It can be combined with other options (e.g., --fix bases,circles might be a common thing to try), but the circles option will have no effect if you don't feed it an --unpaired corrected long read file.

Pilon uses the long read alignment information to look for potential circular structures, then re-assembles across the ends to ensure correct continuity. At this time, it makes no attempt to join multiple input scaffolds/contigs together to close a circle, it just trims or extends the ends of an existing element. Multiple Pilon users have reported that HGAP PacBio assemblies often have extra stuff off the ends of circular elements, and this attempts to fix that particular issue.

If Pilon thinks an element may be circular and attempts to close it, it will output something like:

Attempting to close circle

fix circle: contig000002 306915 ClosedCircle 1 -6372 +0 313288 -6090 +0 306916

The first number after the contig name indicates the estimated length before reassembly. ClosedCircle means it was successful, and then it prints the trimming changes it made as in other large fixes. In this case, it removed 6372 bases starting at coordinate 1 and 6090 bases starting at coordinate 313288, and the resulting length of the element is 306916. If it can't successfully re-assemble across the ends, it will print NoSolution as it does for other failed reassemblies.
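For anyone post-processing Pilon's output, the fields described above can be pulled out of such a line with a few lines of code. This sketch infers the layout from the single ClosedCircle example shown here, so treat the field order as an assumption; the layout of a NoSolution line is not shown and is not handled. The `CircleFix` record and `parse_circle_fix` function are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class CircleFix(NamedTuple):
    contig: str
    estimated_length: int
    status: str
    edits: List[Tuple[int, int, int]]  # (coordinate, bases removed, bases added)
    final_length: int


def parse_circle_fix(line: str) -> CircleFix:
    prefix = "fix circle:"
    if not line.startswith(prefix):
        raise ValueError("not a 'fix circle' line")
    fields = line[len(prefix):].split()
    contig, estimated, status = fields[0], int(fields[1]), fields[2]
    rest = fields[3:]
    # Expect triples of (coordinate, -removed, +added) plus one final length.
    if len(rest) % 3 != 1:
        raise ValueError("unexpected field count: %d" % len(rest))
    edits = []
    for i in range(0, len(rest) - 1, 3):
        coordinate = int(rest[i])
        removed = -int(rest[i + 1])  # "-6372" -> 6372 bases removed
        added = int(rest[i + 2])     # "+0" -> 0 bases added
        edits.append((coordinate, removed, added))
    return CircleFix(contig, estimated, status, edits, int(rest[-1]))


line = ("fix circle: contig000002 306915 ClosedCircle "
        "1 -6372 +0 313288 -6090 +0 306916")
fix = parse_circle_fix(line)
print(fix)
```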

Please keep in mind this is a first attempt using limited sample data, and I'm happy to try to make improvements based on experience any of you have. Please share your experiences on the pilon-users mailing list.

Pilon version 1.18

12 Jun 05:17

This version adds a new --iupac command line option which enables output of IUPAC ambiguous base codes in the output FASTA file. This will be most useful for diploid assembly improvement, allowing Pilon to include heterozygous SNP codes in the improved assembly FASTA. Pilon currently only makes two-way heterozygous calls (e.g., "C or T" is encoded "Y"), not 3-way.
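For reference, the six two-way IUPAC ambiguity codes can be written as a lookup table. The table follows the standard IUPAC nucleotide code; the `het_code` helper is a hypothetical illustration, not Pilon's code.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_TWO_WAY = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def het_code(a: str, b: str) -> str:
    """Return the IUPAC code for a two-way call; order of a and b is irrelevant."""
    a, b = a.upper(), b.upper()
    if a == b:
        return a  # homozygous: no ambiguity code needed
    return IUPAC_TWO_WAY[frozenset((a, b))]


print(het_code("C", "T"))
```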

--bruce, 12 June 2016

Pilon version 1.17

28 Apr 05:58

This release implements two enhancements requested by users:

  1. A new optional argument --outdir <directory> which specifies a place for Pilon to put all its output files. If you use this option, the naming of the individual files doesn't change, just the location.
  2. In addition to --frags, --jumps, and --unpaired, Pilon now allows aligned reads to be simply specified as --bam <aligned-bam.bam>. If the --bam argument is used, Pilon will scan the BAM file (as it usually does anyway) to gather statistics about the orientation and insert size distribution of the reads in the library, and it will make its best determination as to whether they should be treated as small insert (fragment) pairs, large insert (jump) pairs, or unpaired. The heuristics are pretty simple: if the plurality of reads are unpaired, that's what it will use; otherwise, it determines whether most of the aligned read pairs are in FR or RF orientation, and uses the corresponding mean insert size to determine whether to consider them frags (< 700bp) or jumps (> 700bp). As always, separate libraries should each be in their own BAM file.
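The --bam heuristic described in item 2 can be sketched roughly as follows. This is an illustration of the stated rules, not Pilon's code. The function and parameter names are hypothetical, the counts are assumed to have been gathered already from the BAM scan, and two details the text leaves open are filled in by assumption: unpaired reads are compared directly against each pair count, and an insert size of exactly 700bp is treated as a jump.

```python
def classify_library(
    unpaired: int,
    fr_pairs: int,
    rf_pairs: int,
    fr_mean_insert: float,
    rf_mean_insert: float,
    cutoff: int = 700,
) -> str:
    # Plurality of reads unpaired: treat the whole library as unpaired.
    if unpaired > max(fr_pairs, rf_pairs):
        return "unpaired"
    # Otherwise use the mean insert size of the dominant orientation.
    if fr_pairs >= rf_pairs:
        mean_insert = fr_mean_insert
    else:
        mean_insert = rf_mean_insert
    # Assumption: exactly `cutoff` counts as a jump library.
    return "frags" if mean_insert < cutoff else "jumps"


print(classify_library(10, 900, 20, 350.0, 4000.0))
print(classify_library(10, 30, 900, 350.0, 4000.0))
```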

Thanks to Chris Desjardins and Ashlee Earl for suggesting the automatic BAM type determination and to Torsten Seemann for suggesting the output directory argument.

Pilon version 1.16

07 Dec 15:09

This release fixes a bug introduced in v1.15 which caused Pilon to crash when using input BAM files containing unpaired reads (--unpaired). It is otherwise identical to v1.15.

--bruce, 7 Dec 2015

Pilon version 1.15

22 Nov 23:47

This release is aimed at improving performance and convenience for those applying Pilon to large genomes. There are no changes in functionality from v1.14, though on most of my test cases, v1.15 is 10-15% faster overall because of some efficiency improvements to finding reads for the local reassembly process.

More significantly, v1.15 contains an optimization for those who use the --targets argument to specify a subset of input scaffolds to process during a run in order to reduce memory requirements for handling large genomes. It is my hope that this will make it more viable to use the --targets argument to specify a file with a list of scaffolds to process from a large genome rather than having to split all the input files.

The first thing Pilon normally does is scan the BAMs to compute stats and create an in-memory data structure to hold any "stray" pairs, that is, pairs which are not mapped as "proper" pairs in the BAM. This is necessary to get easy access to mates which may be mapped far away in the input genome (and hence far away in the BAM file), as these faraway mates are prime candidates to be included in the local reassembly process. However, mapping stray pairs can be very memory intensive for large genomes, so starting with this version, Pilon will ignore any stray pairs for which neither read maps to any of the specified --targets scaffolds. This should increase speed and reduce memory for running a scaffold subset.
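The filtering rule can be shown in a few lines. This is a simplified sketch rather than Pilon's implementation: each stray pair is reduced to a hypothetical tuple of the two scaffold names its reads map to, and the target list is a plain set of names.

```python
from typing import Iterable, List, Set, Tuple

# (scaffold of read 1, scaffold of read 2); a simplification for illustration.
StrayPair = Tuple[str, str]


def keep_stray_pairs(pairs: Iterable[StrayPair], targets: Set[str]) -> List[StrayPair]:
    # Keep a pair if at least one of its reads maps to a target scaffold;
    # pairs where neither read touches a target are dropped to save memory.
    return [p for p in pairs if p[0] in targets or p[1] in targets]


pairs = [
    ("scaffold1", "scaffold9"),  # one read on a target: kept
    ("scaffold7", "scaffold8"),  # neither read on a target: dropped
    ("scaffold2", "scaffold2"),  # both reads on a target: kept
]
kept = keep_stray_pairs(pairs, {"scaffold1", "scaffold2"})
print(kept)
```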