Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to parse.py #496

Merged
merged 10 commits into from
May 23, 2020
Merged

Improvements to parse.py #496

merged 10 commits into from
May 23, 2020

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Mar 29, 2020

Description of proposed changes

This PR implements some enhancements for parse.py. Some of the enhancements will improve performance by using common python idioms. Overall, memory used by parse.py should also be less since we don't read the entire fasta file into memory anymore.

Testing

The test suite passes. This PR fixes several suboptimal implementations of things with equivalent replacements as verified with manual testing.

The one place where I changed behavior is with prettify. It will now convert "Et Al." to "et al."

Thank you for contributing to Nextstrain!

@rneher
Copy link
Member

rneher commented May 15, 2020

Thanks @groutr! Sorry for the delay. this looks overall pretty good. But your implementation has one problem. It doesn't write out the sequences to file. By the time you arrive here:

SeqIO.write(seqs, args.output_sequences, 'fasta')

the iterator over the sequences is at the end and this writes an empty file. You either need to write these sequences to file as you loop over the input, or load them all into memory.

Opening/Closing the output file here allows us to avoid loading all of
the input file into memory.
@codecov
Copy link

codecov bot commented May 15, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@b17e78b). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #496   +/-   ##
=========================================
  Coverage          ?   19.16%           
=========================================
  Files             ?       31           
  Lines             ?     5072           
  Branches          ?     1286           
=========================================
  Hits              ?      972           
  Misses            ?     4077           
  Partials          ?       23           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b17e78b...d9f88ce. Read the comment docs.

@rneher rneher self-requested a review May 15, 2020 19:04
@rneher
Copy link
Member

rneher commented May 15, 2020

thanks, this looks good. I'll test a few more use cases tomorrow.

@rneher rneher merged commit 0642487 into nextstrain:master May 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants