Improvements to parse.py #496

groutr · 2020-03-29T06:08:39Z

Description of proposed changes

This PR implements several enhancements to parse.py. Some of them improve performance by using common Python idioms. Overall memory use should also drop, since parse.py no longer reads the entire FASTA file into memory.

Testing

The test suite passes. This PR replaces several suboptimal implementations with equivalent ones; I verified the equivalence with manual testing.

The one place where I changed behavior is with prettify. It will now convert "Et Al." to "et al."
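For readers unfamiliar with the idiom behind the "Use str.translate to replace multiple characters" commit, here is a minimal sketch. The character mapping and function names are assumptions made for illustration; they are not the actual list or helpers in parse.py.

```python
# Hypothetical mapping of characters to strip or replace; the real set
# used by parse.py may differ.
FORBIDDEN = {" ": "", "(": "_", ")": "_", ";": "_", "'": "_"}

# str.maketrans builds the lookup table once; dict keys must be single characters.
TABLE = str.maketrans(FORBIDDEN)


def clean_with_replace(name):
    """One full pass over the string per forbidden character."""
    for old, new in FORBIDDEN.items():
        name = name.replace(old, new)
    return name


def clean_with_translate(name):
    """A single pass over the string, whatever the size of the mapping."""
    return name.translate(TABLE)


sample = "A/New York/1 (2020);x"
# The two approaches agree here because no replacement value is itself
# a forbidden character.
assert clean_with_replace(sample) == clean_with_translate(sample)
```

The two functions are only interchangeable while the replacement values contain none of the mapped characters; chained replace calls would otherwise re-process earlier substitutions.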

Thank you for contributing to Nextstrain!

rneher · 2020-05-15T13:27:17Z

Thanks @groutr! Sorry for the delay. This looks pretty good overall, but your implementation has one problem: it doesn't write the sequences out to file. By the time you arrive here:

augur/augur/parse.py

Line 119 in 693372a

SeqIO.write(seqs, args.output_sequences, 'fasta')

the iterator over the sequences is exhausted, so this writes an empty file. You either need to write these sequences to file as you loop over the input, or load them all into memory.
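The problem described above can be reproduced without Biopython. The sketch below uses a hypothetical `read_fasta` generator as a stand-in for the lazy iterator returned by `SeqIO.parse`; the header layout and the `|` separator are assumptions for illustration only.

```python
import io


def read_fasta(handle):
    """Lazily yield (header, sequence) pairs; a stand-in for SeqIO.parse."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


FASTA = ">s1|2020-01-01\nACGT\n>s2|2020-02-01\nGGCC\n"

# Problem: the first loop consumes the generator, so a later write
# over the same object sees no records.
seqs = read_fasta(io.StringIO(FASTA))
names = [header for header, _ in seqs]
leftover = list(seqs)  # nothing remains to be written

# One fix: write each record while looping, so only one record is
# held in memory at a time.
out = io.StringIO()
for header, seq in read_fasta(io.StringIO(FASTA)):
    strain = header.split("|")[0]  # assumed: strain is the first field
    out.write(f">{strain}\n{seq}\n")
streamed = out.getvalue()
```

The alternative fix, `list(read_fasta(...))`, keeps the later write working but gives up the memory saving this PR is after.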

codecov · 2020-05-15T18:38:19Z

Codecov Report

❗ No coverage uploaded for pull request base (master@b17e78b).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master     #496   +/-   ##
=========================================
  Coverage          ?   19.16%           
=========================================
  Files             ?       31           
  Lines             ?     5072           
  Branches          ?     1286           
=========================================
  Hits              ?      972           
  Misses            ?     4077           
  Partials          ?       23

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b17e78b...d9f88ce.

rneher · 2020-05-15T19:05:26Z

Thanks, this looks good. I'll test a few more use cases tomorrow.

groutr added 6 commits March 29, 2020 00:14

Use str.translate to replace multiple characters

722d52b

This is far more efficient than repeated calls to str.replace.

Fix up prettify.

b348379

Avoid loading all sequences into memory at once.

5485162

Use dictionaries for more cleaner field access.

0965def

More efficiently remove strain from tmp_meta

5c48377

Check that strain exists before trying to remove it.

693372a
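As a rough illustration of the dictionary-related commits in this group, the sketch below builds a per-sequence metadata dict from a header and then removes the strain key safely. Only the name `tmp_meta` comes from the commit messages; the field list, separator, and header are assumptions, and the actual code in parse.py may differ.

```python
# Hypothetical inputs: field names, separator, and a FASTA header.
fields = ["strain", "date", "country"]
separator = "|"
header = "s1|2020-01-01|usa"

# Pair each field name with its value instead of tracking list indices.
tmp_meta = dict(zip(fields, header.split(separator)))

# dict.pop with a default removes the key when present and returns the
# default when absent, so no separate membership check or try/except is needed.
strain = tmp_meta.pop("strain", None)

# The same call is safe when the key is missing.
no_strain = {"date": "2020-01-01"}
missing = no_strain.pop("strain", None)
```

An explicit `if "strain" in tmp_meta:` guard before `del` would behave the same way; `pop` with a default just folds the check and the removal into one lookup.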

groutr added 2 commits May 15, 2020 13:27

Manage file open/close outside of SeqIO.

c905e40

Opening/Closing the output file here allows us to avoid loading all of the input file into memory.

Nicer condition checks.

81c8e06

groutr added 2 commits May 15, 2020 13:41

Iterate over intersection of tmp_meta and prettify_fields

8d428dd

No longer need to check if a field is in tmp_meta.

Fix dates on tmp_meta.

d9f88ce
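The "Iterate over intersection of tmp_meta and prettify_fields" commit relies on dictionary key views behaving like sets. A minimal sketch, assuming `prettify_fields` is a set of field names and using a hypothetical stand-in for `prettify` (the real function does more):

```python
tmp_meta = {"strain": "s1", "country": "united_states", "date": "2020-01-01"}
prettify_fields = {"country", "authors"}  # "authors" is absent from tmp_meta


def prettify(value):
    """Hypothetical stand-in: underscores to spaces, then title case."""
    return value.replace("_", " ").title()


# dict.keys() returns a set-like view, so & yields only the fields present
# in both. The result is a new set, which makes it safe to assign into
# tmp_meta inside the loop.
for field in tmp_meta.keys() & prettify_fields:
    tmp_meta[field] = prettify(tmp_meta[field])
```

Because the loop only ever sees keys that exist in `tmp_meta`, the per-field `if field in tmp_meta` check becomes unnecessary, which is what the commit body notes.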

rneher self-requested a review May 15, 2020 19:04

rneher merged commit 0642487 into nextstrain:master May 23, 2020
