Document all cases where Augur reads/writes sequences #637

huddlej · 2020-12-15T23:02:00Z

In preparation for refactoring Augur's logic for reading and writing sequences and then adding support for compressed sequences, the following section documents which subcommands read or write sequences, listed by input argument, and how those sequences are read or written.

Places where Augur reads/writes sequences

parse
- sequences
  - BioPython SeqIO.parse, iterate through all sequences
- output-sequences
  - BioPython SeqIO.write one sequence at a time to a file handle
filter
- sequences
  - BioPython SeqIO.index, random access of specific sequences
- output
  - BioPython SeqIO.write, all sequences at once with an iterator to a filename
mask
- sequences — called “sequences” but the expectation in the code is that this input is an alignment?
  - Multiple OS-level checks for whether the input file exists and is non-zero in size
  - BioPython SeqIO.parse, iterate through all sequences
- output
  - BioPython SeqIO.write, one sequence at a time to a file handle
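The existence and size checks noted for the mask input could be collapsed into a single helper. A minimal sketch, assuming a hypothetical function name (`is_nonempty_file`) that is not part of Augur:

```python
import os


def is_nonempty_file(path):
    """Return True if `path` is an existing regular file larger than zero bytes.

    Hypothetical helper for illustration only; not part of Augur.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

A single check like this would give every subcommand the same answer for missing files, empty files, and directories.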
align
- sequences
  - read_sequences function accepts one or more input filenames, reads each file with SeqIO.parse, and returns a list of distinct sequence records. It raises an AlignmentError exception the first time it encounters a duplicate strain name with a different sequence (records with matching names and sequences are implicitly de-duplicated).
  - write_seqs function writes an iterable all at once to a filename prior to running the alignment command. This function is currently a redundant wrapper around BioPython’s SeqIO.write that catches any FileNotFoundError exceptions and re-raises them as AlignmentError exceptions.
- reference-sequence — expected to be a GenBank file with a name field instead of an id field?
  - read_reference function reads a single sequence from a GenBank or FASTA file (using filename extensions to guess format) using BioPython SeqIO.read
- existing-alignment
  - read_alignment function that redundantly wraps BioPython AlignIO.read and catches all exceptions just to re-raise them as AlignmentError exceptions.
- debug — implicit alignment outputs in FASTA format
  - shutil.copyfile to make copies of the input and/or output FASTA files
- output
  - write_seqs
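The de-duplication behavior described for read_sequences can be sketched roughly as follows. This is not Augur's implementation: to stay dependency-free, it swaps BioPython's SeqIO.parse for a small hand-rolled FASTA reader (`iter_fasta`, a hypothetical name), redefines `AlignmentError` locally, and returns (name, sequence) tuples instead of sequence records.

```python
class AlignmentError(Exception):
    """Local stand-in for Augur's AlignmentError (redefined for this sketch)."""


def iter_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file.

    Hypothetical stand-in for BioPython's SeqIO.parse(path, "fasta").
    """
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                name, chunks = (fields[0] if fields else ""), []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_sequences(*paths):
    """Return distinct (name, sequence) pairs from one or more FASTA files.

    Exact duplicates (same name, same sequence) are dropped silently; the first
    duplicate name with a different sequence raises AlignmentError.
    """
    seen = {}  # name -> sequence, in first-seen order
    for path in paths:
        for name, sequence in iter_fasta(path):
            if name in seen:
                if seen[name] != sequence:
                    raise AlignmentError(
                        f"Duplicate strain {name!r} with different sequences"
                    )
                continue  # identical duplicate: keep the first copy only
            seen[name] = sequence
    return list(seen.items())
```

Keeping the duplicate check in one place like this is what would let a shared read interface make the silent de-duplication explicit or configurable.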
tree
- alignment
  - FASTA input with mask sites
    - BioPython SeqIO.parse to loop through input alignment one record at a time
    - BioPython SeqIO.write, one sequence at a time to a file handle
  - VCF input: FASTA of variable sites created with write_out_informative_fasta function that uses BioPython SeqIO.write to write a list of sequence records to a filename
refine
- alignment
  - Passed as a filename to TreeTime and TreeAnc classes
- vcf-reference
  - Passed as a filename to treetime.vcf_utils.read_vcf
ancestral
- alignment
  - Passed as a filename to TreeAnc class
- vcf-reference
  - Passed as a filename to treetime.vcf_utils.read_vcf
- output-sequences
  - BioPython SeqIO.write a list of all sequences at once to a filename
translate
- reference-sequence — GenBank or GFF file with annotations
  - BCBio GFF.parse for filename with .gff extension.
  - Bio SeqIO.read for all other filenames but assumes the input is in GenBank format (FASTA will not work).
reconstruct-sequences
- vcf-aa-reference
  - BioPython SeqIO.parse, looping through each record from a file handle where sequences are expected to be (but not verified to be) amino acid sequences
clades
- reference
  - Not used.
sequence-traits
- vcf-reference
  - Passed as a filename to treetime.vcf_utils.read_vcf
distance
- alignment
  - reconstruct_sequences.load_alignments function that accepts one or more input FASTA filenames and corresponding gene names, reads each FASTA file with BioPython AlignIO.read, and returns a dictionary of multiple sequence alignment objects indexed by gene name. Strangely, load_alignments is never used in the reconstruct_sequences.py module where it is defined.
titers sub
- alignment
  - Uses reconstruct_sequences.load_alignments function as in distance.py
frequencies
- alignments
  - Uses BioPython AlignIO.read to loop through each sequence and create a new MultipleSeqAlignment instance without internal nodes.
  - Iterates over one or more alignment input files by gene name (analogous to load_alignments but without loading all alignments in memory at once).
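The difference between load_alignments (all alignments in memory) and the per-gene iteration in frequencies can be illustrated with a short sketch. It is only an approximation under stated assumptions: `read_fasta_alignment` is a hypothetical stand-in for BioPython's AlignIO.read, `iter_tip_alignments` is a hypothetical name, and the `NODE_` prefix for internal node names is an assumption rather than something stated in this issue.

```python
def read_fasta_alignment(path):
    """Read a FASTA alignment into a {name: sequence} dict.

    Hypothetical stand-in for BioPython's AlignIO.read(path, "fasta"); like an
    alignment, it requires every sequence to have the same length.
    """
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                records[name] = ""
            elif line and name is not None:
                records[name] += line
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError(f"Sequences in {path} differ in length")
    return records


def load_alignments(paths, gene_names):
    """Eager: hold every gene's alignment in memory at once."""
    return {
        gene: read_fasta_alignment(path)
        for path, gene in zip(paths, gene_names)
    }


def iter_tip_alignments(paths, gene_names, internal_prefix="NODE_"):
    """Lazy: yield one (gene, alignment) pair at a time, without internal nodes.

    The "NODE_" prefix is an assumed naming convention for internal nodes.
    """
    for path, gene in zip(paths, gene_names):
        alignment = read_fasta_alignment(path)
        yield gene, {
            name: seq
            for name, seq in alignment.items()
            if not name.startswith(internal_prefix)
        }
```

With the generator, only one gene's alignment is held at a time, which matters when there are many genes or large alignments.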
export v1
- reference
  - Calls BioPython SeqIO.read from get_root_sequence function to load reference sequence.

Other notes

- 15 of 20 commands read or write sequences!
- “FASTA” is inconsistently written throughout our code and docs as “FASTA”, “fasta”, and “Fasta”.
- Commands most frequently identify sequence file type by extension or not at all (assuming that a given file is the correct format).
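One way a shared read/write interface could centralize extension-based detection and compression handling is sketched below. Everything here is an assumption for illustration: the function names, the extension-to-format map, and the choice of `.gz` and `.xz` as supported compression suffixes are hypothetical and not taken from Augur.

```python
import gzip
import lzma
from pathlib import Path

# Assumed mappings for illustration; Augur may recognize different extensions.
COMPRESSION_OPENERS = {".gz": gzip.open, ".xz": lzma.open}
FORMAT_BY_EXTENSION = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gb": "genbank",
    ".gbk": "genbank",
    ".gff": "gff",
    ".vcf": "vcf",
}


def describe_sequence_file(filename):
    """Guess (format, compression) from a filename's extensions.

    Returns None for either element when the extension is not recognized.
    """
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    compression = None
    if suffixes and suffixes[-1] in COMPRESSION_OPENERS:
        compression = suffixes.pop()
    file_format = FORMAT_BY_EXTENSION.get(suffixes[-1]) if suffixes else None
    return file_format, compression


def open_sequence_file(filename, mode="rt"):
    """Open a plain or compressed file in text mode based on its extension."""
    _, compression = describe_sequence_file(filename)
    opener = COMPRESSION_OPENERS.get(compression, open)
    return opener(filename, mode)
```

The returned file handle could then be passed to whichever parser a subcommand already uses, so the per-command extension checks listed above would live in one place.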

huddlej closed this as completed Dec 18, 2020

huddlej mentioned this issue Dec 31, 2020

Add read/write sequence interface with support for compressed sequences #652

Merged
