
Add read/write sequence interface with support for compressed sequences #652

Merged
merged 7 commits on Mar 18, 2021

Conversation

huddlej
Contributor

@huddlej huddlej commented Dec 31, 2020

Description of proposed changes

This PR adds a new io.py module with three new functions:

  • open_file: a context manager that transparently supports reading/writing (see the sketch after this list):
    • compressed files using xopen to infer compression from the filename
    • file paths or file handles
  • read_sequences: a function that streams sequence records from one or more input files of the same format (FASTA, GenBank, etc.). Uses open_file to support compressed inputs.
  • write_sequences: a function that streams sequence records to a single filename or file handle. Uses open_file to support compressed output.
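
A minimal sketch of how such a context manager could wrap xopen follows. This is illustrative only, not the exact implementation in this PR; in particular, the string check used to tell paths from handles is an assumption.

from contextlib import contextmanager
from xopen import xopen

@contextmanager
def open_file(path_or_buffer, mode="r"):
    if isinstance(path_or_buffer, str):
        # Given a path, let xopen infer compression from the extension
        # (.gz, .bz2, .xz) and open the file accordingly.
        with xopen(path_or_buffer, mode) as handle:
            yield handle
    else:
        # Otherwise assume an already-open handle or buffer and yield it as-is.
        yield path_or_buffer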

This PR also refactors the following Augur subcommands to use these new functions:

  • parse
  • index
  • filter
  • mask

Python API example

from augur.io import open_file, read_sequences, write_sequences

# Read sequences from multiple files with a generator.
sequences = read_sequences(*args.sequences)

# Open a file to write to. If `args.output` ends with
# ".gz", for example, its contents will be compressed.
observed_sequence_strains = set()
with open_file(args.output, "wt") as output_handle:
    for sequence in sequences:
        # Track all the strains we've written.
        observed_sequence_strains.add(sequence.id)

        # Write one record at a time to the handle.
        write_sequences(sequence, output_handle)

Command line interface examples

The following command reads in H3N2 HA sequences from a gzip-compressed file and writes out the parsed sequences to an LZMA-compressed file.

augur parse \
  --sequences h3n2_ha.fasta.gz \
  --output-sequences sequences.fasta.xz \
  --output-metadata metadata.tsv \
  --fields strain virus accession date region country division location passage originating_lab submitting_lab age gender

The following command reads LZMA-compressed sequences from the previous command and writes out a gzip-compressed sequence index.

augur index \
    --sequences sequences.fasta.xz \
    --output sequence_index.tsv.gz

Related issue(s)

See the ZenHub Epic for a list of all related issues.
Fixes #644
See #637 for details about how Augur reads and writes sequences.
See #645 for the original proposed interface for the new read/write functions.

Testing

This PR adds functional and unit tests for all new and refactored code.

@codecov

codecov bot commented Dec 31, 2020

Codecov Report

Merging #652 (bbf963f) into master (8df4b4d) will increase coverage by 1.05%.
The diff coverage is 77.21%.


@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
+ Coverage   30.52%   31.57%   +1.05%     
==========================================
  Files          40       41       +1     
  Lines        5615     5779     +164     
  Branches     1363     1436      +73     
==========================================
+ Hits         1714     1825     +111     
- Misses       3830     3848      +18     
- Partials       71      106      +35     
Impacted Files Coverage Δ
augur/index.py 80.70% <33.33%> (-1.45%) ⬇️
augur/parse.py 62.50% <47.82%> (+12.50%) ⬆️
augur/filter.py 48.78% <100.00%> (+2.93%) ⬆️
augur/io.py 100.00% <100.00%> (ø)
augur/mask.py 100.00% <100.00%> (ø)
augur/utils.py 37.76% <0.00%> (-0.61%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@huddlej huddlej requested a review from tsibley December 31, 2020 00:52
@huddlej huddlej assigned jameshadfield and rneher and unassigned jameshadfield and rneher Dec 31, 2020
@huddlej huddlej marked this pull request as ready for review December 31, 2020 00:55
@huddlej huddlej self-assigned this Jan 16, 2021
Contributor

@kairstenfay kairstenfay left a comment

By request, I took a pass at reviewing your patches here. They're perhaps not the most insightful comments but here's what I have for today :)

@huddlej
Contributor Author

huddlej commented Jan 21, 2021

Thank you for the review, @kairstenfay! Do you have any general thoughts on the read/write interfaces? The more I've thought about it, the more I don't like passing the file handle around to write_sequences. These two functions should probably be classes like a SequenceReader and SequenceWriter. This implementation would allow the latter class to track the handle as an attribute. It could also allow the sequence reader to provide fancier functions like streaming out unique sequences, etc.

@kairstenfay
Contributor

Thank you for the review, @kairstenfay! Do you have any general thoughts on the read/write interfaces? The more I've thought about it, the more I don't like passing the file handle around to write_sequences. These two functions should probably be classes like a SequenceReader and SequenceWriter. This implementation would allow the latter class to track the handle as an attribute. It could also allow the sequence reader to provide fancier functions like streaming out unique sequences, etc.

I don't have a lot of context or experience using augur, so I hesitate to comment on the interfaces. However, I did think it was a bit odd to pass the file handle argument around to the write_sequences function. With my superficial understanding, creating SequenceReader & SequenceWriter classes sounds like a good approach to me.

@rneher
Member

rneher commented Jan 30, 2021

This looks good to me, John. I installed this on our cluster. Builds on Monday will use it (and I'll pass in the raw gz files at the top). A bunch of ncov scripts would need adjusting, so I am not changing everything to compressed just yet.

@huddlej
Contributor Author

huddlej commented Feb 1, 2021

Thank you for trying this out, @rneher. I'm interested to know what issues you encounter. :)

From the engineering perspective, I'd like to make a quick attempt at implementing these two new functions as classes, as discussed with @kairstenfay above. Since my hope is to make this part of the minimal public-facing Python API that we've discussed, I want to make sure this interface is as good as it can be from the start.

The last little lift to finish this work is to update all augur commands that read or write sequences to use this new interface. I'd prefer not to make this change piecemeal, so users don't get different experiences with different commands.

@rneher
Member

rneher commented Feb 2, 2021

I merged master into it on scicore and I am running the builds locally w/o problem. But I have only tried compressed files in filter.

Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not supported by piped inputs that we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type by filename extensions like xopen infers
compression modes. Filename extensions are less standardized across
bioinformatics than we would like for this type of inference to work
robustly.
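
To make that convention concrete, the caller names the format explicitly. The format keyword shown here is an assumption based on the proposed interface, not necessarily the final signature:

# Hypothetical calls; the `format` keyword name is assumed.
fasta_records = read_sequences("sequences.fasta.gz", format="fasta")
genbank_records = read_sequences("records.gb.xz", format="genbank")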

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645
Adds support to augur index for compressed sequence inputs and index
outputs.
Adds tests for augur parse and mask and then refactors these modules to
use the new read/write interface.

For augur parse, the refactor moves from an original for loop into its
own `parse_sequence` function, adds tests for this new function, and
updates the body of the `run` function to use this function inside the
for loop. This commit also replaces the Bio.SeqIO read and write
functions with the new `read_sequences` and `write_sequences` functions.
These functions support compressed input and output files based on the
filename extensions.
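
A rough sketch of the resulting loop structure is below. The real parse_sequence takes more arguments (field names, separators, etc.) and also returns the parsed metadata, so treat this as an outline rather than the actual code:

sequences = read_sequences(*args.sequences)

with open_file(args.output_sequences, "wt") as handle:
    for sequence in sequences:
        # Hypothetical, simplified call to the new per-record function.
        parsed_sequence = parse_sequence(sequence)
        write_sequences(parsed_sequence, handle)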

For augur mask, the refactor moves logic for masking individual
sequences into its own function and replaces Bio.SeqIO calls with new
`read_sequences` and `write_sequences` functions. The refactoring of the
`mask_sequence` function allows us to easily define a generator for the
output sequences to write and make a single call to `write_sequences`.
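
The generator-based pattern described there might look roughly like this; mask_sequence's argument names are assumptions:

sequences = read_sequences(args.sequences)

# Build a lazy generator of masked records and write them with one call.
masked_sequences = (
    mask_sequence(sequence, mask_sites)  # hypothetical argument names
    for sequence in sequences
)
write_sequences(masked_sequences, args.output)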
Documents which steps of a standard build support compressed
inputs/outputs by adding a copy of the Zika build test and corresponding
expected compressed inputs/outputs.
@jameshadfield
Member

jameshadfield commented Mar 18, 2021

Thanks @huddlej -- I spent some time looking through this and like the interface.

We have an open_file function in utils which we no longer use; should we add a deprecation warning to it?

Running nCoV with (xz) compressed metadata worked 💯. This actually surprised me as it means read_metadata handles xz files already!


P.S. for future work when we allow augur tree to take compressed input, be aware that VCF inputs, which need a fasta reference, use the TreeTime helper function read_vcf rather than our own helper functions.

@huddlej
Contributor Author

huddlej commented Mar 18, 2021

Thank you for looking through this, @jameshadfield!

We have an open_file function in utils which we no longer use; should we add a deprecation warning to it?

The old open_file function was added last March as part of a mask.py refactor and it has only ever been used in that module. We still use this older function in mask.py to open VCF files.

The old function has some functionality that the new function does not, including:

  1. a check for opening gzipped files in "text" mode. The xopen package implements the "text" mode check already, so we don't need to implement that anymore.
  2. an explicit request for UTF-8 encoding. Based on @tsibley's excellent commit message when this UTF-8 encoding was added, we should continue to use this as an explicit default and still allow the calling code to override the default.

As far as how we handle this older function, I'd prefer to replace all internal references to that function with the new function now and then handle deprecation/migration of these types of I/O functions from utils.py to io.py in a later PR.

Running nCoV with (xz) compressed metadata worked 💯. This actually surprised me as it means read_metadata handles xz files already!

Yeah! We get this functionality "for free" by using pandas to read in the metadata. pandas has an interesting I/O library, too, where they effectively implement their own version of the xopen module (I would guess the pandas code predates the xopen module even though xopen was inspired by Heng Li's function of the same name from 2014!).
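
For context, pandas infers compression from the filename extension by default (compression="infer"), so reading compressed metadata requires nothing extra. An illustrative snippet, not the actual read_metadata code:

import pandas as pd

# The .xz extension is enough; pandas decompresses transparently.
metadata = pd.read_csv("metadata.tsv.xz", sep="\t")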

for future work when we allow augur tree to take compressed input, be aware that VCF inputs, which need a fasta reference, use the TreeTime helper function read_vcf rather than our own helper functions.

That's a good point. I'd vote to implement our own wrapper around any TreeTime functionality for this kind of I/O, so we can more easily replace/update the backend in the future.

@huddlej
Contributor Author

huddlej commented Mar 18, 2021

It turns out xopen does not support passing arguments like encoding through to the internal open function call. This may not be an issue for sequence data, since BioPython appears to do its own string decoding, but it could be an issue for other data types. I'm not sure if this is a deal-breaker or not...

Replaces calls to a similar `open_file` function from `utils.py` with
the new function in `io.py`. Updates the functional tests for the mask
module to confirm that compressed VCF inputs work with the old and new
function alike.
Successfully merging this pull request may close these issues.

Support compressed sequences
4 participants