[filter] Index input sequences and stream output sequences #627

Merged Nov 6, 2020 (1 commit)

Commits on Nov 5, 2020

  1. Index input sequences and stream output sequences

    Reduces the memory needed by augur filter to a smaller constant value.
    Previously, all input sequences were loaded into memory once when
    reading and then again when writing sequences out to disk. The primary
    improvements here are:
    
    1. Use a BioPython index data structure [1] that tracks where each
    sequence is on disk but does not load sequences into memory. This
    structure acts the same as the dictionary structure we originally used
    except sequences are loaded lazily when they are requested.
    
    2. Use an iterator to write sequences back to disk. BioPython's SeqIO
    write method accepts an iterator that allows us to stream sequences back
    to disk without first loading them into memory as a list of sequence
    objects.
    
    [1] http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec66
    huddlej committed Nov 5, 2020
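
The two ideas in the commit message (an on-disk index with lazy loading, and streaming output through an iterator) can be sketched with the standard library alone. This is not the code from this commit: the commit relies on BioPython's index structure and SeqIO write method, while the helpers below (`build_fasta_index`, `read_record`, `iter_records`, `write_records`) are hypothetical names invented for illustration. The sketch assumes a FASTA file that ends with a newline.

```python
import os
import tempfile


def build_fasta_index(path):
    """Map each sequence id to the byte offset of its '>' header line.

    Only ids and offsets are kept in memory, never the sequences.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    index[parts[0].decode()] = offset
    return index


def read_record(handle, offset):
    """Load one record (header plus sequence lines) starting at offset."""
    handle.seek(offset)
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines)


def iter_records(path, index, keep_ids):
    """Yield the requested records lazily, one at a time."""
    with open(path, "rb") as handle:
        for seq_id in keep_ids:
            yield read_record(handle, index[seq_id])


def write_records(records, out_path):
    """Stream records to disk as they arrive; return how many were written."""
    count = 0
    with open(out_path, "wb") as out:
        for record in records:
            out.write(record)
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "sequences.fasta")
        filtered = os.path.join(tmp, "filtered.fasta")
        with open(source, "wb") as fh:
            fh.write(b">a\nACGT\n>b\nGGGG\n>c\nTT\n")

        index = build_fasta_index(source)
        # Pretend the filter step kept only these ids.
        written = write_records(iter_records(source, index, ["a", "c"]), filtered)
        print(written, "records written")
```

Because `iter_records` is a generator, at most one record is held in memory at a time while writing, which is the property the commit message describes for passing an iterator to the writer.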