Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filter] Index input sequences and stream output sequences #627

Merged
merged 1 commit into from
Nov 6, 2020

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Nov 5, 2020

Description of proposed changes

Reduces memory needed by augur filter to a smaller constant value
instead of loading all input sequences into memory once and then again
when writing sequences out to disk. The primary improvements here are:

  1. Use a BioPython index data structure [1] that tracks where each
    sequence is on disk but does not load sequences into memory. This
    structure acts the same as the dictionary structure we originally used
    except sequences are loaded lazily when they are requested.

  2. Use an iterator to write sequences back to disk. BioPython's SeqIO
    write method accepts an iterator that allows us to stream sequences back
    to disk without first loading them into memory as a list of sequence
    objects.

[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec66

Testing

This code has been tested with a recent ncov dataset to confirm the same number of sequences are output for a given input and filters. It also passes all functional and unit tests.

Reduces memory needed by augur filter to a smaller constant value
instead of loading all input sequences into memory once and then again
when writing sequences out to disk. The primary improvements here are:

1. Use a BioPython index data structure [1] that tracks where each
sequence is on disk but does not load sequences into memory. This
structure acts the same as the dictionary structure we originally used
except sequences are loaded lazily when they are requested.

2. Use an iterator to write sequences back to disk. BioPython's SeqIO
write method accepts an iterator that allows us to stream sequences back
to disk without first loading them into memory as a list of sequence
objects.

[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec66
@codecov
Copy link

codecov bot commented Nov 5, 2020

Codecov Report

Merging #627 into master will increase coverage by 0.02%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #627      +/-   ##
==========================================
+ Coverage   28.53%   28.56%   +0.02%     
==========================================
  Files          39       39              
  Lines        5365     5367       +2     
  Branches     1319     1320       +1     
==========================================
+ Hits         1531     1533       +2     
  Misses       3778     3778              
  Partials       56       56              
Impacted Files Coverage Δ
augur/filter.py 45.16% <83.33%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a712db...9febe77. Read the comment docs.

@huddlej huddlej added this to the Next release 10.x.x milestone Nov 5, 2020
Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially worried that this would come at a big performance cost, but this doesn't seem to be the case.

@huddlej
Copy link
Contributor Author

huddlej commented Nov 6, 2020

Thank you for testing this out, @rneher! I'll merge now.

@huddlej huddlej merged commit 5c1aacc into master Nov 6, 2020
@huddlej huddlej deleted the stream-filter-seqs branch November 6, 2020 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants