Document all cases where Augur reads/writes sequences #637

Closed
huddlej opened this issue Dec 15, 2020 · 0 comments

huddlej commented Dec 15, 2020

In preparation for refactoring Augur's logic for reading and writing sequences, and then adding support for compressed sequences, the following section documents which subcommands read or write sequences, organized by input argument, and how those sequences are read or written.

Places where Augur reads/writes sequences

  • parse
    • sequences
      • BioPython SeqIO.parse, iterate through all sequences
    • output-sequences
      • BioPython SeqIO.write one sequence at a time to a file handle
  • filter
    • sequences
      • BioPython SeqIO.index, random access of specific sequences
    • output
      • BioPython SeqIO.write, all sequences at once with an iterator to a filename
  • mask
    • sequences — called “sequences” but the expectation in the code is that this input is an alignment?
      • Multiple OS-level checks for whether the input file exists and is non-zero in size
      • BioPython SeqIO.parse, iterate through all sequences
    • output
      • BioPython SeqIO.write, one sequence at a time to a file handle
  • align
    • sequences
      • read_sequences function that accepts one or more input filenames, reads each file with SeqIO.parse, and returns a list of distinct sequence records. Raises an AlignmentError exception the first time it encounters a duplicate strain name with a different sequence (implicitly de-duplicates records with matching sequences and names).
      • write_seqs function writes an iterable all at once to a filename prior to running the alignment command. This function is currently a redundant wrapper around BioPython’s SeqIO.write that catches any FileNotFoundError exceptions and re-raises them as AlignmentError exceptions.
    • reference-sequence — expected to be a GenBank file with a name field instead of an id field?
      • read_reference function reads a single sequence from a GenBank or FASTA file (using filename extensions to guess format) using BioPython SeqIO.read
    • existing-alignment
      • read_alignment function that redundantly wraps BioPython AlignIO.read and catches all exceptions just to re-raise them as AlignmentError exceptions.
    • debug — implicit alignment outputs in FASTA format
      • shutil.copyfile to make copies of the input and/or output FASTA files
    • output
      • write_seqs
  • tree
    • alignment
      • FASTA input with mask sites
        • BioPython SeqIO.parse to loop through input alignment one record at a time
        • BioPython SeqIO.write, one sequence at a time to a file handle
      • VCF input: variable FASTA created with write_out_informative_fasta function that uses BioPython SeqIO.write to write a list of sequence records to a filename
  • refine
    • alignment
      • Passed as a filename to TreeTime and TreeAnc classes
    • vcf-reference
      • Passed as a filename to treetime.vcf_utils.read_vcf
  • ancestral
    • alignment
      • Passed as a filename to TreeAnc class
    • vcf-reference
      • Passed as a filename to treetime.vcf_utils.read_vcf
    • output-sequences
      • BioPython SeqIO.write a list of all sequences at once to a filename
  • translate
    • reference-sequence — GenBank or GFF file with annotations
      • BCBio GFF.parse for filename with .gff extension.
      • BioPython SeqIO.read for all other filenames but assumes the input is in GenBank format (FASTA will not work).
  • reconstruct-sequences
    • vcf-aa-reference
      • BioPython SeqIO.parse, looping through each record from a file handle where sequences are expected to be (but not verified to be) amino acid sequences
  • clades
    • reference
      • Not used.
  • sequence-traits
    • vcf-reference
      • Passed as a filename to treetime.vcf_utils.read_vcf
  • distance
    • alignment
      • reconstruct_sequences.load_alignments function that accepts one or more input FASTA filenames and corresponding gene names, reads each FASTA file with BioPython AlignIO.read, and returns a dictionary of multiple sequence alignment objects indexed by gene name. Strangely, load_alignments is never used in the reconstruct_sequences.py module where it is defined.
  • titers sub
    • alignment
      • Uses reconstruct_sequences.load_alignments function as in distance.py
  • frequencies
    • alignments
      • Uses BioPython AlignIO.read to loop through each sequence and create a new MultipleSeqAlignment instance without internal nodes.
      • Iterates over one or more alignment input files by gene name (analogous to load_alignments but without loading all alignments in memory at once).
  • augur export v1
    • reference
      • Calls BioPython SeqIO.read from get_root_sequence function to load reference sequence.
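
The duplicate handling described for align's read_sequences is the least obvious behavior in the list above, so here is a minimal sketch of that logic. It is not Augur's code: it uses only the standard library, iter_fasta is a simplified stand-in for BioPython's SeqIO.parse, AlignmentError is redefined locally, and read_distinct_records is a hypothetical name.

```python
import io


class AlignmentError(Exception):
    """Local stand-in for Augur's AlignmentError so the sketch is self-contained."""


def iter_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle, one record at a time.

    Simplified stand-in for SeqIO.parse(handle, "fasta").
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_distinct_records(handles):
    """Collect distinct records from one or more FASTA handles.

    A repeated name with an identical sequence is dropped silently; a repeated
    name with a different sequence raises AlignmentError at first sight.
    """
    seen = {}
    for handle in handles:
        for name, sequence in iter_fasta(handle):
            if name not in seen:
                seen[name] = sequence
            elif seen[name] != sequence:
                raise AlignmentError(
                    f"Duplicate strain name with different sequences: {name}"
                )
    return list(seen.items())


# Record "A" appears in both inputs with the same sequence, so it is kept once.
first = io.StringIO(">A\nACGT\n>B\nGGCC\n")
second = io.StringIO(">A\nACGT\n>C\nTTAA\n")
records = read_distinct_records([first, second])
```

Because the records are consumed one at a time from a generator, the same loop would work unchanged over any source that yields name and sequence pairs, which is the property a refactored reader would need in order to support compressed inputs.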

Other notes

  • 15 of 20 commands read or write sequences!
  • “FASTA” is inconsistently written throughout our code and docs as “FASTA”, “fasta”, and “Fasta”
  • Commands most frequently identify sequence file type by extension or not at all (assuming that a given file is the correct format).
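
One possible direction for the last note, sketched under assumptions: detect compression from the file's leading bytes rather than its name, and guess the sequence format from the extension only after stripping any compression suffix. The function names open_text and guess_format and the extension lists are hypothetical, not part of Augur, and only gzip and xz are handled.

```python
import gzip
import lzma
import os
import tempfile
from pathlib import Path

# Leading bytes that identify each supported compression format.
MAGIC_BYTES = {
    b"\x1f\x8b": gzip.open,
    b"\xfd7zXZ\x00": lzma.open,
}


def open_text(path):
    """Open a plain, gzip-compressed, or xz-compressed file for reading text.

    The decision is based on the file's content, not on its extension.
    """
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, opener in MAGIC_BYTES.items():
        if head.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")


def guess_format(path):
    """Guess a sequence format from the extension, ignoring .gz/.xz suffixes.

    Returns None when the extension is not recognized, so callers can fail
    loudly instead of assuming a format.
    """
    suffixes = [suffix.lower() for suffix in Path(path).suffixes]
    if suffixes and suffixes[-1] in (".gz", ".xz"):
        suffixes.pop()
    extension = suffixes[-1] if suffixes else ""
    if extension in (".gb", ".gbk", ".genbank"):
        return "genbank"
    if extension in (".fa", ".fasta", ".fna"):
        return "fasta"
    if extension == ".vcf":
        return "vcf"
    return None


# Round trip: write a gzip-compressed FASTA file, then read it back as text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sequences.fasta.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(">A\nACGT\n")
    with open_text(path) as handle:
        content = handle.read()
    detected = guess_format(path)
```

The handle returned by open_text is an ordinary text handle, so it could be passed to any reader that accepts a file handle instead of a filename.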