WIP: Support multiple inputs during filter #697

Closed
wants to merge 8 commits into from

Commits on Mar 10, 2021

  1. Add initial I/O interface and tests

    Adds tests and code for new `open_file`, `read_sequences`, and
    `write_sequences` functions loosely based on a proposed API [1]. These
    functions transparently handle compressed inputs and outputs using the
    xopen library.
    
    The `open_file` function is a context manager that lightly wraps the
    `xopen` function and also supports either path strings or existing IO
    buffers. Both the read and write functions use this context manager to
    open files. This manager enables the common use case of writing to the
    same handle many times inside a for loop, by replacing the standard
    `open` call with `open_file`. Doing so, we maintain a Pythonic interface
    that also supports compressed file formats and path-or-buffer inputs.
    This context manager also enables input and output of any other file
    type in compressed formats (e.g., metadata, sequence indices, etc.).
    
    Note that the `read_sequences` and `write_sequences` functions do not
    infer the format of sequence files (e.g., FASTA, GenBank, etc.).
    Inferring file formats requires peeking at the first record in each
    given input, but peeking is not supported by piped inputs that we want
    to support (e.g., piped gzip inputs from xopen). There are also no
    internal use cases for Augur to read multiple sequences of different
    formats, so I can't currently justify the complexity required to support
    type inference. Instead, I opted for the same approach used by BioPython
    where the calling code must know the type of input file being passed.
    This isn't an unreasonable expectation for Augur's internal code. I also
    considered inferring file type from filename extensions, the way xopen
    infers compression modes, but filename extensions are too inconsistent
    across bioinformatics for this type of inference to work robustly.
    
    Tests ignore BioPython and pycov warnings to minimize warning fatigue
    for issues we cannot address during test-driven development.
    
    [1] #645
    huddlej committed Mar 10, 2021
    8a20b4f
  2. Support compressed inputs/outputs for index

    Adds support to augur index for compressed sequence inputs and index
    outputs.
    huddlej committed Mar 10, 2021
    0a9d742
  3. Support compressed inputs/outputs for parse and mask

    Adds tests for augur parse and mask and then refactors these modules to
    use the new read/write interface.
    
    For augur parse, the refactor moves the body of the original for loop
    into its own `parse_sequence` function, adds tests for this new
    function, and updates the `run` function to call it inside the for
    loop. This commit also replaces the Bio.SeqIO read and write functions
    with the new `read_sequences` and `write_sequences` functions. These
    functions support compressed input and output files based on the
    filename extensions.
    
    For augur mask, the refactor moves logic for masking individual
    sequences into its own function and replaces Bio.SeqIO calls with new
    `read_sequences` and `write_sequences` functions. The refactoring of the
    `mask_sequence` function allows us to easily define a generator for the
    output sequences to write and make a single call to `write_sequences`.
    huddlej committed Mar 10, 2021
    c77bcb7
  4. 071023d
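
The I/O pattern described in commits 1 and 3 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this pull request: the standard library's `gzip` module stands in for xopen, sequences are plain `(name, sequence)` tuples rather than BioPython records, and the signatures of `open_file`, `write_sequences`, and `mask_sequence` shown here are hypothetical.

```python
import gzip
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a text handle for a path, or pass an existing buffer through.

    Assumption: gzip stands in for xopen, selected by a ".gz" extension.
    """
    if isinstance(path_or_buffer, (str, os.PathLike)):
        opener = gzip.open if str(path_or_buffer).endswith(".gz") else open
        with opener(path_or_buffer, mode) as handle:
            yield handle
    else:
        # The caller owns an existing buffer, so it is left open on exit.
        yield path_or_buffer


def write_sequences(sequences, path_or_buffer):
    """Write (name, sequence) tuples as FASTA; return the number written."""
    count = 0
    with open_file(path_or_buffer, "wt") as handle:
        for name, sequence in sequences:
            handle.write(f">{name}\n{sequence}\n")
            count += 1
    return count


def mask_sequence(sequence, sites):
    """Replace the characters at the given 0-based sites with "N"."""
    characters = list(sequence)
    for site in sites:
        if 0 <= site < len(characters):
            characters[site] = "N"
    return "".join(characters)


def demo():
    """Round-trip two records through a gzip-compressed file and a buffer."""
    records = [("seq1", "ACGT"), ("seq2", "GGCC")]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "masked.fasta.gz")
        # A generator of masked records feeds a single write_sequences call.
        masked = ((name, mask_sequence(seq, [0, 3])) for name, seq in records)
        write_sequences(masked, path)
        with open_file(path) as handle:
            from_file = handle.read()
    buffer = io.StringIO()
    count = write_sequences(records, buffer)
    return from_file, count, buffer.getvalue()
```

In this sketch the calling code never branches on compression or on whether it holds a path or a buffer; both decisions live in `open_file`. Passing a generator to `write_sequences` keeps masking lazy, so only one record is held in memory at a time.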

Commits on Mar 15, 2021

  1. f6c61f1

Commits on Mar 16, 2021

  1. Add Zika build test for compressed inputs/outputs

    Documents which steps of a standard build support compressed
    inputs/outputs by adding a copy of the Zika build test and corresponding
    expected compressed inputs/outputs.
    huddlej committed Mar 16, 2021
    46b8a65
  2. Support compressed inputs in augur align

    Adds support for compressed inputs (reference files and alignment
    sequences) in augur align by refactoring existing code to use Augur's
    `io` module.
    
    This is a work in progress and still requires focused work to add
    support for compressed output files.
    huddlej committed Mar 16, 2021
    6a71928

Commits on Mar 17, 2021

  1. Support multiple inputs to filter

    Work in progress prototyping how we could add support for multiple
    metadata, sequence, and sequence index inputs to augur filter to
    simplify workflows that aggregate filters across multiple input
    datasets (e.g., the ncov workflow).
    huddlej committed Mar 17, 2021
    f53f921
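
One way to picture reading several sequence inputs as a single stream is sketched below. It is only an illustration of the idea, not the prototype in this commit: the tiny FASTA parser stands in for BioPython, the `read_sequences(*handles)` signature is an assumption, and the sketch does not address metadata, sequence indices, or duplicate names across inputs.

```python
import io
from itertools import chain


def parse_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_sequences(*handles):
    """Lazily yield records from each input, in the order the inputs are given."""
    return chain.from_iterable(parse_fasta(handle) for handle in handles)


def demo():
    """Combine two in-memory FASTA inputs into one list of records."""
    first = io.StringIO(">a\nAC\nGT\n")
    second = io.StringIO(">b\nGG\n>c\nTT\n")
    return list(read_sequences(first, second))
```

Because the inputs are chained lazily, downstream filtering could consume records one at a time without first concatenating the files on disk.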