
Define tests for sequence read/write interface #645

Closed
huddlej opened this issue Dec 21, 2020 · 0 comments


huddlej commented Dec 21, 2020

Use cases

  1. open one FASTA by name or handle and iterate through each record
  2. open one FASTA by name or handle and select specific individual records
  3. open one or more FASTAs by name or handle and construct a list of unique records by name and sequence (deduplicating redundant records)
  4. open one FASTA or GenBank file and select its individual sequence
  5. open any of the above inputs from a file compressed with standard algorithms
  6. open one GenBank or GFF file and load its annotations
  7. open one alignment FASTA by name or handle and parse as a multiple sequence alignment (indexable by column or row)
  8. open one or more alignment FASTAs by name or handle and index the resulting alignments by gene name

Only use cases 1-5 are addressed by the proposed interfaces below. Multiple sequence alignments and sequence annotations require different treatment: the former can require loading sequences into memory, and the latter are a different type of data.

Existing functions

  • SeqIO.parse
  • SeqIO.read
  • read_sequences:

    augur/augur/align.py

    Lines 178 to 192 in efcc8be

    def read_sequences(*fnames):
        """return list of sequences from all fnames"""
        seqs = {}
        try:
            for fname in fnames:
                for record in SeqIO.parse(fname, 'fasta'):
                    if record.name in seqs and record.seq != seqs[record.name].seq:
                        raise AlignmentError("Detected duplicate input strains \"%s\" but the sequences are different." % record.name)
                    # if the same sequence then we can proceed (and we only take one)
                    seqs[record.name] = record
        except FileNotFoundError:
            raise AlignmentError("\nCannot read sequences -- make sure the file %s exists and contains sequences in fasta format" % fname)
        except ValueError as error:
            raise AlignmentError("\nERROR: Problem reading in {}: {}".format(fname, str(error)))
        return list(seqs.values())
  • read_reference:

    augur/augur/align.py

    Lines 222 to 231 in efcc8be

    def read_reference(ref_fname):
        if not os.path.isfile(ref_fname):
            raise AlignmentError("ERROR: Cannot read reference sequence."
                                 "\n\tmake sure the file \"%s\" exists"%ref_fname)
        try:
            ref_seq = SeqIO.read(ref_fname, 'genbank' if ref_fname.split('.')[-1] in ['gb', 'genbank'] else 'fasta')
        except:
            raise AlignmentError("ERROR: Cannot read reference sequence."
                                 "\n\tmake sure the file %s contains one sequence in genbank or fasta format"%ref_fname)
        return ref_seq
  • read_alignment:

    augur/augur/align.py

    Lines 201 to 205 in efcc8be

    def read_alignment(fname):
        try:
            return AlignIO.read(fname, 'fasta')
        except Exception as error:
            raise AlignmentError("\nERROR: Problem reading in {}: {}".format(fname, str(error)))
  • load_alignments:
    def load_alignments(sequence_files, gene_names):
        from Bio import AlignIO
        alignments = {}
        for fname, gene in zip(sequence_files, gene_names):
            alignments[gene] = AlignIO.read(fname, 'fasta')
        return alignments

Ideal interface

SeqIO.parse is a nice generic interface for reading since it accepts a file handle and a given file type. Ideally, augur’s interface would abstract the work of opening a file handle for a given filename (with or without compression) so that the remaining work of iterating through the handle can be delegated to BioPython.
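As a sketch of that abstraction, compression could be inferred from the filename suffix before the open handle is passed along. The helper name and suffix map below are hypothetical, stdlib-only illustrations, not augur's actual implementation:

```python
import gzip
import lzma
from contextlib import contextmanager

# Hypothetical suffix-to-opener map; anything not matched falls back
# to the builtin open().
OPENERS = {
    ".gz": gzip.open,
    ".xz": lzma.open,
}

@contextmanager
def open_maybe_compressed(path, mode="rt"):
    """Open `path`, inferring a decompressing opener from its suffix."""
    opener = open
    for suffix, candidate in OPENERS.items():
        if path.endswith(suffix):
            opener = candidate
            break
    handle = opener(path, mode)
    try:
        yield handle
    finally:
        handle.close()
```

A caller could then pass the yielded handle straight to `SeqIO.parse(handle, "fasta")` regardless of compression.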

Ideally, augur would also handle the logic of identifying the file type by its content instead of requiring users to specify the type (e.g., FASTA, GenBank, etc.).
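In the simplest case, content-based inference could peek at the first non-empty line of the input. The function below is a hypothetical sketch only; real inference would need to cover more formats and corner cases, and peeking is not possible on non-seekable piped inputs:

```python
def sniff_sequence_format(first_line):
    """Guess a sequence file format from its first non-empty line.

    Illustrative sketch, not augur's implementation: returns None when
    the line matches no known format header.
    """
    line = first_line.lstrip()
    if line.startswith(">"):
        return "fasta"
    if line.startswith("LOCUS"):
        return "genbank"
    if line.startswith("##gff-version"):
        return "gff"
    return None
```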

Finally, at least one use case requires the ability to read through multiple files. The read interface should not be concerned with deduplication of those input sequences; that logic should be implemented elsewhere by a function that consumes the sequences iterator.
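Such a consumer could be a generator that passes records through while tracking seen names, mirroring the duplicate check in the existing read_sequences. The sketch below is illustrative only and assumes records expose `.name` and `.seq` attributes:

```python
def deduplicate_sequences(sequences):
    """Yield each distinct record once, erroring on conflicting duplicates.

    Hypothetical sketch: as a generator, it lets the caller control how
    records are produced and consumed; only the name-to-sequence map is
    held in memory.
    """
    seen = {}
    for record in sequences:
        if record.name in seen:
            if str(record.seq) != seen[record.name]:
                raise ValueError(
                    'Detected duplicate input strains "%s" but the '
                    "sequences are different." % record.name
                )
            # Identical duplicate: keep only the first copy.
            continue
        seen[record.name] = str(record.seq)
        yield record
```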

The following code mocks up potential interfaces for reading and writing sequences with and without compression. The write interfaces only consider writing FASTA output, but we may want to add a format argument that allows us to write out other formats like GenBank.

# Return an iterator that loops through sequences in a single
# given file.
sequences = read_sequences("sequences.fasta")

# Iterate through multiple input files in the order they are
# provided. Dynamically infer compression modes.
sequences = read_sequences([
    "sequences.fasta.gz",
    "additional_sequences.fasta"
])
for sequence in sequences:
    print(sequence.id)

# Define additional functions to consume the stream of
# sequences from multiple files and return distinct sequences.
# These functions are responsible for appropriate memory
# management.
distinct_sequences = deduplicate_sequences(sequences)

# Use Python's iterator interface to get the first record from
# a single-record input file. The content type of the given
# file is inferred.
reference = next(read_sequences("reference.gb"))

# Alternately, define a custom function that returns the first
# record by calling `read_sequences` as above.
reference = read_sequence("reference.gb")

# Handle writing one sequence at a time.
with open("new_sequences.fasta", "w") as out_handle:
    for sequence in sequences:
        if sequence.id.startswith("A/"):
            write_sequences(
                sequence,
                out_handle
            )

# Alternately, require a single call to write_sequences
# with a generator input to minimize memory use. This
# interface forces the calling code to abstract the logic
# to build input sequences into a generator function or
# comprehension.
new_sequences = (
    sequence
    for sequence in sequences
    if sequence.id.startswith("A/")
)
sequences_written = write_sequences(
    new_sequences,
    "new_sequences.fasta"
)

# The write interface abstracts compression following
# the pattern used by pandas [1] where the default behavior
# is to infer desired compression from the output filename,
# but the calling code can specify an explicit compression
# mode.
# [1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

# Infer "gzip" compression from the filename.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz"
)

# Request "gzip" compression explicitly.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz",
    compression="gzip"
)

# Handle compression with a handle defined by the calling
# code.
with gzip.open("new_sequences.fasta.gz", "wb") as out_handle:
    sequences_written = write_sequences(
        [...],
        out_handle
    )

# As in the pandas interface, we can allow the calling code
# to pass arguments specific to the given compression mode.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz",
    compression={
        "method": "gzip",
        "level": 7
    }
)

The write interface options are more complicated because writing often happens inside a loop and involves calling the write function multiple times for the same output handle. I prefer the more restricted interface above that expects an iterable as input. This interface forces the calling code to prepare the input as an iterator and likely refactor code originally inside a loop into its own function.
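Under that restricted interface, a minimal write_sequences could consume any iterable of records, accept either a path or an already-open handle, and return the number of records written. This stdlib-only sketch writes plain FASTA and is illustrative, not a definitive implementation:

```python
def write_sequences(sequences, path_or_handle):
    """Write records as FASTA; return the number of records written.

    Hypothetical sketch: `sequences` is any iterable of records with
    `.id` and `.seq` attributes; the second argument is an output path
    or an open handle (compression handling is omitted here).
    """
    def _write(handle):
        count = 0
        for record in sequences:
            handle.write(">%s\n%s\n" % (record.id, record.seq))
            count += 1
        return count

    if hasattr(path_or_handle, "write"):
        return _write(path_or_handle)
    with open(path_or_handle, "w") as handle:
        return _write(handle)
```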

We may want to consider implementing these new functions in a new augur io module and eventually migrating other I/O functions from the generic utils module to this more clearly defined io module.

@huddlej huddlej self-assigned this Dec 21, 2020
huddlej added a commit that referenced this issue Dec 30, 2020
Adds tests for new `read_sequences` method based on proposed API [1].
Ignores BioPython and pycov warnings during unit tests to minimize
output during test-driven development. Adds code to make most tests
pass.

[1] #645
@huddlej huddlej closed this as completed Dec 31, 2020
huddlej added a commit that referenced this issue Mar 10, 2021
Adds tests for new `read_sequences` method based on proposed API [1].
Ignores BioPython and pycov warnings during unit tests to minimize
output during test-driven development. Adds code to make most tests
pass.

[1] #645
huddlej added a commit that referenced this issue Mar 10, 2021
Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not supported by piped inputs that we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type by filename extensions like xopen infers
compression modes. Filename extensions are less standardized across
bioinformatics than we would like for this type of inference to work
robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645
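The path-or-buffer behavior described in the commit above can be approximated with the stdlib alone. This sketch covers gzip only; the committed version wraps xopen, which also handles bzip2, xz, and piped decompression, so the body here is illustrative rather than the actual implementation:

```python
import gzip
from contextlib import contextmanager

@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a handle for a path string, or pass an existing buffer through.

    Stdlib-only sketch: existing IO buffers are yielded as-is so the
    caller manages their lifetime; paths are opened (and closed) here,
    with gzip inferred from a ".gz" suffix.
    """
    if hasattr(path_or_buffer, "read") or hasattr(path_or_buffer, "write"):
        yield path_or_buffer
        return
    opener = gzip.open if str(path_or_buffer).endswith(".gz") else open
    handle = opener(path_or_buffer, mode)
    try:
        yield handle
    finally:
        handle.close()
```

Because the same call works for paths and buffers, writing to one handle many times inside a loop only requires swapping `open` for `open_file`.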