WIP: Support multiple inputs during filter #697

Closed
wants to merge 8 commits into from

@huddlej (Contributor) commented Mar 17, 2021

Description of proposed changes

Work in progress that builds on the new I/O interface to support multiple metadata, sequence, and sequence index inputs for augur filter.

Related issues

Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).
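
A minimal sketch of such a path-or-buffer context manager is shown below. It is an illustration rather than the code in this PR: to stay dependency-free it swaps `xopen` for the standard library's `gzip` and only recognizes a `.gz` extension, and the `mode` default is an assumption.

```python
import gzip
import io
from contextlib import contextmanager


@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a handle for a path, or pass through an existing IO buffer."""
    if isinstance(path_or_buffer, io.IOBase):
        # The caller owns an existing buffer, so it is not closed here.
        yield path_or_buffer
        return

    # Stand-in for xopen (assumption): choose the opener by file extension.
    opener = gzip.open if str(path_or_buffer).endswith(".gz") else open
    with opener(path_or_buffer, mode) as handle:
        yield handle
```

With this shape, calling code can write `with open_file("out.fasta.gz", "wt") as handle:` and then write to `handle` repeatedly inside a for loop, exactly as it would with the built-in `open`.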

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not supported by piped inputs that we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type from filename extensions, the way xopen
infers compression modes, but filename extensions are less standardized
across bioinformatics than this type of inference needs to work
robustly.
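
The explicit-format approach can be sketched as follows. This is not the PR's implementation, which delegates parsing to BioPython; the hand-written FASTA parser, the `PARSERS` table, and the `fmt` parameter name are all assumptions made to keep the example self-contained. The point is that each handle is read once, front to back, so nothing needs to peek or rewind.

```python
import io


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle in one forward pass."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical registry: the caller names the format, nothing is inferred.
PARSERS = {"fasta": parse_fasta}


def read_sequences(*handles, fmt="fasta"):
    """Return an iterator over records from all handles, in order."""
    if fmt not in PARSERS:
        raise ValueError(f"unsupported sequence format: {fmt!r}")
    parser = PARSERS[fmt]
    return (record for handle in handles for record in parser(handle))
```

Because the format is an argument, an unsupported value fails immediately at the call site instead of partway through a stream.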

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645

Adds support to augur index for compressed sequence inputs and index
outputs.

Adds tests for augur parse and mask and then refactors these modules to
use the new read/write interface.

For augur parse, the refactor moves the body of the original for loop
into its own `parse_sequence` function, adds tests for this new function,
and updates the `run` function to call it inside the for loop. This
commit also replaces the Bio.SeqIO read and write
functions with the new `read_sequences` and `write_sequences` functions.
These functions support compressed input and output files based on the
filename extensions.
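
The shape of that refactor can be sketched as below. The signature of `parse_sequence` here is hypothetical and much simpler than Augur's; it assumes records are `(header, sequence)` tuples, that header fields are separated by `|`, and that the first field is the sequence name.

```python
def parse_sequence(header, sequence, fields, separator="|"):
    """Split one FASTA header into metadata and return a renamed record.

    Hypothetical signature: assumes the first name in `fields` holds the
    sequence identifier.
    """
    values = [value.strip() for value in header.split(separator)]
    metadata = dict(zip(fields, values))
    name = metadata[fields[0]]
    return (name, sequence), metadata


def run(records, fields, separator="|"):
    """Loop over records, delegating per-record work to parse_sequence."""
    sequences, metadata_rows = [], []
    for header, sequence in records:
        record, metadata = parse_sequence(header, sequence, fields, separator)
        sequences.append(record)
        metadata_rows.append(metadata)
    return sequences, metadata_rows
```

Pulling the loop body into its own function is what makes it testable in isolation, without constructing input files.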

For augur mask, the refactor moves logic for masking individual
sequences into its own function and replaces Bio.SeqIO calls with new
`read_sequences` and `write_sequences` functions. The refactoring of the
`mask_sequence` function allows us to easily define a generator for the
output sequences to write and make a single call to `write_sequences`.
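
The generator pattern described above might look like the following sketch. Both function signatures are assumptions for illustration (plain strings and 0-based site indices rather than BioPython records), not the PR's code.

```python
import io


def mask_sequence(sequence, mask_sites, mask_char="N"):
    """Return a copy of `sequence` with the given 0-based sites masked."""
    chars = list(sequence)
    for site in mask_sites:
        if 0 <= site < len(chars):
            chars[site] = mask_char
    return "".join(chars)


def write_sequences(records, handle):
    """Write (name, sequence) pairs as FASTA and return the number written."""
    count = 0
    for name, sequence in records:
        handle.write(f">{name}\n{sequence}\n")
        count += 1
    return count


records = [("s1", "ACGTAC"), ("s2", "GGTT")]
# Lazily mask each record; nothing is computed until write_sequences iterates.
masked = ((name, mask_sequence(seq, [0, 5])) for name, seq in records)
output = io.StringIO()
written = write_sequences(masked, output)
```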

Documents which steps of a standard build support compressed
inputs/outputs by adding a copy of the Zika build test and corresponding
expected compressed inputs/outputs.

Adds support for compressed inputs (reference files and alignment
sequences) in augur align by refactoring existing code to use Augur's
`io` module.

This is a work in progress and still requires focused work to add
support for compressed output files.

Work in progress prototyping how we could add support for multiple
metadata, sequence, and sequence index inputs to augur filter to simplify
workflows that aggregate filters across multiple input datasets (e.g.,
the ncov workflow).
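
One way multiple metadata inputs could be combined is sketched below. The PR text does not describe the merge semantics, so everything here is an assumption: tab-delimited inputs, a `strain` identifier column, and later inputs overriding earlier ones for shared columns. The function name `merge_metadata` is hypothetical.

```python
import csv
import io


def merge_metadata(handles, id_column="strain", delimiter="\t"):
    """Merge rows from several metadata tables keyed on `id_column`.

    Assumption: when the same record appears in more than one input,
    values from later inputs replace values from earlier ones.
    """
    merged = {}
    for handle in handles:
        for row in csv.DictReader(handle, delimiter=delimiter):
            merged.setdefault(row[id_column], {}).update(row)
    return merged


first = io.StringIO("strain\tdate\nA\t2020\nB\t2021\n")
second = io.StringIO("strain\tcountry\tdate\nB\tUSA\t2022\nC\tPeru\t2019\n")
merged = merge_metadata([first, second])
```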

@huddlej self-assigned this Mar 19, 2021
@victorlin self-assigned this May 29, 2024

@victorlin (Member) commented

Closing in favor of new augur merge command implemented in #1563

@victorlin closed this Sep 13, 2024
@victorlin deleted the wip/filter-multiple-inputs branch September 13, 2024 23:48