
Define tests for sequence read/write interface #645

Closed
huddlej opened this issue Dec 21, 2020 · 0 comments


huddlej commented Dec 21, 2020

Use cases

  1. open one FASTA by name or handle and iterate through each record
  2. open one FASTA by name or handle and select specific individual records
  3. open one or more FASTAs by name or handle and construct a list of unique records by name and sequence (deduplicating redundant records)
  4. open one FASTA or GenBank file and select its individual sequence
  5. open any of the above inputs from a file compressed with standard algorithms
  6. open one GenBank or GFF file and load its annotations
  7. open one alignment FASTA by name or handle and parse as a multiple sequence alignment (indexable by column or row)
  8. open one or more alignment FASTAs by name or handle and index the resulting alignments by gene name

Only use cases 1-5 are addressed by the proposed interfaces below. Multiple sequence alignments and sequence annotations require different treatment: the former can require loading sequences into memory, and the latter are a different type of data.

Existing functions

  • SeqIO.parse
  • SeqIO.read
  • read_sequences:

    augur/augur/align.py

    Lines 178 to 192 in efcc8be

    def read_sequences(*fnames):
        """return list of sequences from all fnames"""
        seqs = {}
        try:
            for fname in fnames:
                for record in SeqIO.parse(fname, 'fasta'):
                    if record.name in seqs and record.seq != seqs[record.name].seq:
                        raise AlignmentError("Detected duplicate input strains \"%s\" but the sequences are different." % record.name)
                    # if the same sequence then we can proceed (and we only take one)
                    seqs[record.name] = record
        except FileNotFoundError:
            raise AlignmentError("\nCannot read sequences -- make sure the file %s exists and contains sequences in fasta format" % fname)
        except ValueError as error:
            raise AlignmentError("\nERROR: Problem reading in {}: {}".format(fname, str(error)))
        return list(seqs.values())
  • read_reference:

    augur/augur/align.py

    Lines 222 to 231 in efcc8be

    def read_reference(ref_fname):
        if not os.path.isfile(ref_fname):
            raise AlignmentError("ERROR: Cannot read reference sequence."
                                 "\n\tmake sure the file \"%s\" exists"%ref_fname)
        try:
            ref_seq = SeqIO.read(ref_fname, 'genbank' if ref_fname.split('.')[-1] in ['gb', 'genbank'] else 'fasta')
        except:
            raise AlignmentError("ERROR: Cannot read reference sequence."
                                 "\n\tmake sure the file %s contains one sequence in genbank or fasta format"%ref_fname)
        return ref_seq
  • read_alignment:

    augur/augur/align.py

    Lines 201 to 205 in efcc8be

    def read_alignment(fname):
        try:
            return AlignIO.read(fname, 'fasta')
        except Exception as error:
            raise AlignmentError("\nERROR: Problem reading in {}: {}".format(fname, str(error)))
  • load_alignments:
    def load_alignments(sequence_files, gene_names):
        from Bio import AlignIO
        alignments = {}
        for fname, gene in zip(sequence_files, gene_names):
            alignments[gene] = AlignIO.read(fname, 'fasta')
        return alignments

Ideal interface

SeqIO.parse is a nice generic interface for reading since it accepts a file handle and a given file type. Ideally, augur’s interface would abstract the work of opening a file handle for a given filename (with or without compression) so that the remaining work of iterating through the handle can be delegated to BioPython.
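As a sketch of that abstraction, compression could be inferred from the filename suffix before the open handle is passed along. The helper name and suffix map below are hypothetical, stdlib-only illustrations, not augur's actual implementation:

```python
import gzip
import lzma
from contextlib import contextmanager

# Hypothetical suffix-to-opener map; anything not matched falls back
# to the builtin open().
OPENERS = {
    ".gz": gzip.open,
    ".xz": lzma.open,
}

@contextmanager
def open_maybe_compressed(path, mode="rt"):
    """Open `path`, inferring a decompressing opener from its suffix."""
    opener = open
    for suffix, candidate in OPENERS.items():
        if path.endswith(suffix):
            opener = candidate
            break
    handle = opener(path, mode)
    try:
        yield handle
    finally:
        handle.close()
```

A caller could then pass the yielded handle straight to `SeqIO.parse(handle, "fasta")` regardless of compression.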

Ideally, augur would also handle the logic of identifying the file type by its content instead of requiring users to specify the type (e.g., FASTA, GenBank, etc.).
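In the simplest case, content-based inference could peek at the first non-empty line of the input. The function below is a hypothetical sketch only; real inference would need to cover more formats and corner cases, and peeking is not possible on non-seekable piped inputs:

```python
def sniff_sequence_format(first_line):
    """Guess a sequence file format from its first non-empty line.

    Illustrative sketch, not augur's implementation: returns None when
    the line matches no known format header.
    """
    line = first_line.lstrip()
    if line.startswith(">"):
        return "fasta"
    if line.startswith("LOCUS"):
        return "genbank"
    if line.startswith("##gff-version"):
        return "gff"
    return None
```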

Finally, at least one use case requires the ability to read through multiple files. The read interface should not be concerned with deduplication of those input sequences; that logic should be implemented elsewhere by a function that consumes the sequences iterator.
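Such a consumer could be a generator that passes records through while tracking seen names, mirroring the duplicate check in the existing read_sequences. The sketch below is illustrative only and assumes records expose `.name` and `.seq` attributes:

```python
def deduplicate_sequences(sequences):
    """Yield each distinct record once, erroring on conflicting duplicates.

    Hypothetical sketch: as a generator, it lets the caller control how
    records are produced and consumed; only the name-to-sequence map is
    held in memory.
    """
    seen = {}
    for record in sequences:
        if record.name in seen:
            if str(record.seq) != seen[record.name]:
                raise ValueError(
                    'Detected duplicate input strains "%s" but the '
                    "sequences are different." % record.name
                )
            # Identical duplicate: keep only the first copy.
            continue
        seen[record.name] = str(record.seq)
        yield record
```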

The following code mocks up potential interfaces for reading and writing sequences with and without compression. The write interfaces only consider writing FASTA output, but we may want to add a format argument that allows us to write out other formats like GenBank.

# Return an iterator that loops through sequences in a single
# given file.
sequences = read_sequences("sequences.fasta")

# Iterate through multiple input files in the order they are
# provided. Dynamically infer compression modes.
sequences = read_sequences([
    "sequences.fasta.gz",
    "additional_sequences.fasta"
])
for sequence in sequences:
    print(sequence.id)

# Define additional functions to consume the stream of
# sequences from multiple files and return distinct sequences.
# These functions are responsible for appropriate memory
# management.
distinct_sequences = deduplicate_sequences(sequences)

# Use Python's iterator interface to get the first record from
# a single-record input file. The content type of the given
# file is inferred.
reference = next(read_sequences("reference.gb"))

# Alternately, define a custom function that returns the first
# record by calling `read_sequences` as above.
reference = read_sequence("reference.gb")

# Handle writing one sequence at a time.
with open("new_sequences.fasta", "w") as out_handle:
    for sequence in sequences:
        if sequence.id.startswith("A/"):
            write_sequences(
                sequence,
                out_handle
            )

# Alternately, require a single call to write_sequences
# with a generator input to minimize memory use. This
# interface forces the calling code to abstract the logic
# to build input sequences into a generator function or
# comprehension.
new_sequences = (
    sequence
    for sequence in sequences
    if sequence.id.startswith("A/")
)
sequences_written = write_sequences(
    new_sequences,
    "new_sequences.fasta"
)

# The write interface abstracts compression following
# the pattern used by pandas [1] where the default behavior
# is to infer desired compression from the output filename,
# but the calling code can specify an explicit compression
# mode.
# [1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

# Infer "gzip" compression from the filename.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz"
)

# Request "gzip" compression explicitly.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz",
    compression="gzip"
)

# Handle compression with a handle defined by the calling
# code.
with gzip.open("new_sequences.fasta.gz", "wb") as out_handle:
    sequences_written = write_sequences(
        [...],
        out_handle
    )

# As in the pandas interface, we can allow the calling code
# to pass arguments specific to the given compression mode.
sequences_written = write_sequences(
    [...],
    "new_sequences.fasta.gz",
    compression={
        "method": "gzip",
        "level": 7
    }
)

The write interface options are more complicated because writing often happens inside a loop and involves calling the write function multiple times for the same output handle. I prefer the more restricted interface above that expects an iterable as input. This interface forces the calling code to prepare the input as an iterator and likely refactor code originally inside a loop into its own function.
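Under that restricted interface, a minimal write_sequences could consume any iterable of records, accept either a path or an already-open handle, and return the number of records written. This stdlib-only sketch writes plain FASTA and is illustrative, not a definitive implementation:

```python
def write_sequences(sequences, path_or_handle):
    """Write records as FASTA; return the number of records written.

    Hypothetical sketch: `sequences` is any iterable of records with
    `.id` and `.seq` attributes; the second argument is an output path
    or an open handle (compression handling is omitted here).
    """
    def _write(handle):
        count = 0
        for record in sequences:
            handle.write(">%s\n%s\n" % (record.id, record.seq))
            count += 1
        return count

    if hasattr(path_or_handle, "write"):
        return _write(path_or_handle)
    with open(path_or_handle, "w") as handle:
        return _write(handle)
```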

We may want to consider implementing these new functions in a new augur io module and eventually migrating other I/O functions from the generic utils module to this more clearly defined io module.

@huddlej huddlej self-assigned this Dec 21, 2020
huddlej added a commit that referenced this issue Dec 30, 2020
Adds tests for new `read_sequences` method based on proposed API [1].
Ignores BioPython and pycov warnings during unit tests to minimize
output during test-driven development. Adds code to make most tests
pass.

[1] #645
@huddlej huddlej closed this as completed Dec 31, 2020
huddlej added a commit that referenced this issue Mar 10, 2021
Adds tests for new `read_sequences` method based on proposed API [1].
Ignores BioPython and pycov warnings during unit tests to minimize
output during test-driven development. Adds code to make most tests
pass.

[1] #645
huddlej added a commit that referenced this issue Mar 10, 2021
Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not supported by piped inputs that we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type by filename extensions like xopen infers
compression modes. Filename extensions are less standardized across
bioinformatics than we would like for this type of inference to work
robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645
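The path-or-buffer behavior described in the commit above can be approximated with the stdlib alone. This sketch covers gzip only; the committed version wraps xopen, which also handles bzip2, xz, and piped decompression, so the body here is illustrative rather than the actual implementation:

```python
import gzip
from contextlib import contextmanager

@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a handle for a path string, or pass an existing buffer through.

    Stdlib-only sketch: existing IO buffers are yielded as-is so the
    caller manages their lifetime; paths are opened (and closed) here,
    with gzip inferred from a ".gz" suffix.
    """
    if hasattr(path_or_buffer, "read") or hasattr(path_or_buffer, "write"):
        yield path_or_buffer
        return
    opener = gzip.open if str(path_or_buffer).endswith(".gz") else open
    handle = opener(path_or_buffer, mode)
    try:
        yield handle
    finally:
        handle.close()
```

Because the same call works for paths and buffers, writing to one handle many times inside a loop only requires swapping `open` for `open_file`.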