-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow filtering by metadata only #679
Conversation
Codecov Report
@@ Coverage Diff @@
## master #679 +/- ##
==========================================
+ Coverage 30.27% 30.62% +0.35%
==========================================
Files 40 40
Lines 5563 5812 +249
Branches 1354 1463 +109
==========================================
+ Hits 1684 1780 +96
- Misses 3819 3946 +127
- Partials 60 86 +26
Continue to review full report at Codecov.
|
@@ -146,7 +188,8 @@ def run(args): | |||
#If VCF, open and get sequence names | |||
if is_vcf: | |||
seq_keep, all_seq = read_vcf(args.sequences) | |||
else: | |||
# TODO: How to best handle "source of truth" for VCF files vs. metadata? | |||
elif args.sequences or args.sequence_index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue I just ran into while mocking up a ncov workflow with this interface was that I had to include a sequence index in the final filter step that pulls in all of the subsampled strains and extracts their FASTA sequences. Even though the index isn't necessary at that stage because there are no sequence-based filters being applied, if I didn't include the sequence index, augur would try to build it for me.
So, this line here should probably also check for whether any sequence-based filters have been requested (as in the argument validation code above) and then skip the sequence index completely if it isn't needed.
An alternative interface for merging the subsampled strains and extracting their sequences could be a combination of UNIX utilities and samtools. For example: # Collect all unique strains from previous filter steps into a single file.
sort -u brazil.txt colombia.txt > strains.txt
# Extract sequences for the given strains.
# This builds a samtools FASTA index (a `sequences.fasta.fai` file) the first time it is run
# and then uses that index for subsequent commands.
samtools faidx -r strains.txt tests/functional/filter/sequences.fasta > filtered.fasta This approach is much faster (1 second to extract sequences compared to 1-2 minutes with augur filter) because samtools is written in C and optimized for random reads in big sequence files by using an index. The downside here is that users can't easily get the metadata for the filtered strains, too. While having a filtered metadata file has been critical for my own flu analyses, that may not be a common use case and I've gotten by with my custom scripts to perform the same function. If we used this approach, we could get rid of the |
Builds on a prototype augur interface for metadata-only filtering [1] to speed up the ncov subsampling steps. Specifically, subsampling writes out lists of selected strain names to text files instead of writing out sequences. Then the "combine samples" step extracts sequences from the unique list of strain names with samtools faidx. The resulting workflow requires substantially less I/O and speeds up subsampling from an average of 163 seconds per job (for 12 jobs in a Nextstrain global build) to 36 seconds per job. One downside to the current metadata-only approach in Augur that reveals itself here is that its possible to have mismatches between metadata and sequences such that a strain has metadata but does not have a sequence in the filtered FASTA. One way around this issue is to export the filtered metadata earlier in the pipeline. For now, we use samtools faidx with the `-c` flag to ignore missing sequences. [1] nextstrain/augur#679
@jameshadfield I've mocked up a ncov workflow using with the proposed metadata-only interface from this PR. After upfront costs of indexing the sequences (6 minutes for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really good @huddlej and the speed improvements for ncov will be immense -- we currently find ourselves in a situation where aligning is faster than filtering!
I tried to cover the different paths available for running augur filter - there are many, and I know I will have missed some, especially around VCF handling. The main confusion I had was the inconsistencies between different sources of truth. For instance, using the ncov pipeline's example_metadata.tsv
echo -e ">something\nATCG" > dummy.fasta
augur filter --metadata example_metadata.tsv --sequences dummy.fasta \
--max-date 2020-01-30 --output-strains result.txt
# results in 14 strains. I argue it should be 0.
augur filter --metadata example_metadata.tsv --sequences dummy.fasta \
--max-date 2020-01-30 --output-strains result.txt --output result.fasta
# results in 0 strains in the FASTA -- which I think is correct
# and result.txt is not modified (so it may still exist from a previous call)
I've made in-line comments which would lead us towards taking the intersection of inputs, but there may be other directions to pursue.
augur/filter.py
Outdated
@@ -91,15 +91,16 @@ def filter_by_query(sequences, metadata_file, query): | |||
return [seq for seq in sequences if seq in filtered_meta_dict] | |||
|
|||
def register_arguments(parser): | |||
parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format") | |||
parser.add_argument('--sequences', '-s', help="sequences in FASTA or VCF format") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The augur filter
interface is becoming rather complicated, but I'm not sure how much we can explain within augur --help
. We could consider using argument groups to separate into "data input args", "parameters", and "outputs", each with a small preamble. This would help with errors such as ERROR: You need to select at least one output.
P.S. the arguement validation & error messages you've added in this PR are really helpful once a user starts to actually run things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea a lot. I'm not sure what those groups would be exactly though, so I'm hesitant to commit any specifics here. Maybe we can discuss this at the next meeting, too?
Thank you for this detailed review, @jameshadfield! Your example commands were helpful for mocking up end-to-end tests that we were missing. I've tried to address most of the comments. The only one I couldn't determine a solution to was the grouping of arguments. The current state of the PR still treats the metadata as "the source of truth". How we choose to handle this could be discussed more. My main thought here is that metadata are the only required input, so they basically have to be prompted to "source of truth". If we don't want this behavior, then we should require We still get the speed benefit of the "metadata-only" approach, if we require a sequence index, so maybe that is a better direction. One benefit of the current approach is it allows us to skip creating a sequence index completely if we don't request any sequence filters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me and works well in most situations, however I think there are still a couple of edge cases with strange behavior.
Firstly, given (e.g) a set of 4 sequences (FASTA), and 4 rows of metadata, where there is only 2 strains in common (i.e. there are 2 sequences without metadata, and two rows of metadata without sequences) then the output is:
2 sequences were dropped during filtering
2 were excluded because they had no sequence data
2 sequences were excluded because they did not have metadata
Ideally this would be:
4 samples were dropped during filtering
2 were excluded because they had no sequence data
2 were excluded because they did not have metadata
Secondly, there is inconsistent output stemming (I think) from the --include
ed samples. E.g. using the multiple-inputs
tutorial from nCoV we have a situation where we start with 92 seqs and 90 metadata rows,
$ augur filter --sequences results/masked_aus.fasta --metadata data/example_metadata_aus.tsv --include defaults/include.txt --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
-3 sequences were dropped during filtering
1 sequences were excluded because they did not have metadata
3 sequences were added back because they were in defaults/include.txt
94 strains have been passed all filters
and end up with a FASTA containing 91 sequences but a strains list containing 94! For consistency, to force-include strains I think they must be in the provided metadata and sequences (if supplied). Would also be nice to separate out "1 strain dropped during filtering" from "n samples added back in ..." (here n=3 but I think it should be 0).
Thank you so much for the detailed inspection of these edge cases, @jameshadfield. This helped me write a more effective functional test that captures these nuances and then update the code to make the test pass in 32a1ff2. The output from your last example above now looks like this: $ augur filter --sequences results/masked_aus.fasta --metadata data/example_metadata_aus.tsv --include defaults/include.txt --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
WARNING: A sequence index was not provided, so we are generating one. Generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
1 strains were dropped during filtering
1 had no metadata
0 strains were added back because they were requested by include files
3 strains from include files were not added because they lacked sequence or metadata
91 strains passed all filters
$ grep "^>" results/filtered_aus.fasta | wc -l
91
$ wc -l results/filtered_aus.txt
91 results/filtered_aus.txt For another example with more extensive reporting, you can change into the
Based on today's meeting discussion, I've left the sequences as optional. I'll test this more completely with the ncov workflow and also open it up for general review. |
Allows users to filter strains using only metadata information and output the strains that pass filters as either a metadata table or a list of names. Additionally, allows users to select only those sequences whose strain names are listed in one or more include files. To enable this functionality, this commit makes the following major changes to the filter interface: - sequence input and output are optional - non-sequence-based filters work without sequence input or index - the include argument accepts multiple inputs whose contents are unioned - a new `--exclude-all` flag excludes all strains by default, allowing only those in the include flags to be added back - new metadata and strain output arguments are supported - the initial complete list of strain names to consider for filtering originates from the metadata instead of the sequences Collectively, these changes allow multiple filter commands to be run in parallel without accessing the original sequence data and then the outputs of these separate commands can be joined by a single final filter command to output the sequences corresponding to all desired strain names. For workflows with large sequence inputs, like ncov, these changes should speed up analyses considerably.
Adds a `read_strains` utility function that can read strain names from one or more files and return the distinct set of all names. Updates the filter logic to use this function to read in names of strains to include.
Updates the `--exclude` argument to allow one or more values and the corresponding logic to read in strains from these files with the new `read_strains` utility function.
When sequences input (or a sequence index) is provided, intersect those strains with the strains in the metadata while also tracking the number of strains filtered by this operation. Note that this approach is not too different from the original filter implementation that required each sequence to have metadata or be filtered out. This approach has the additional effect of filtering out strains that have metadata and no sequences which is a filter that used to happen more implicitly before. This commit adds the number of strains filtered by this operation to the report at the end, so users have a better sense of which strains are missing data. As a side effect of getting the functional tests in this commit to work, this commit also reorders the report output such that the user always sees how many strains were filtered and why before the error message gets printed when all strains are filtered. This commit also adds a functional test for filtering VCFs. Fixes #681
Report the number of strains that had sequences but did not have any metadata.
Previously, augur filter treated comments differently for `include` and `exclude` files. This commit consistently allows the more permissive commenting associated with the `exclude` files such that users can make inline comments on the same line as a strain name. This commit also adds tests to confirm that we get the expected behavior from the current code. This change should be backwards compatible with existing `include` inputs, since the new comment standard is a superset of the older `include` format.
Fixes inconsistencies in numbers from filter reports associated with edge cases where strains are missing metadata or sequence data. Specific changes include: - calculate the intersection of strains with metadata and sequences (whenever sequences are provided) as the "source of truth" - track _strains_ as opposed to _sequences_ through sets instead of lists (enabling clear logic and testing of group membership) - report about strains dropped from either lack of metadata or sequence data - report how many strains could or could not be included from the include file inputs to clarify when requested strains do not have metadata or sequence data This commit also updates functional tests for augur filter to include a more complex example of how these filter behaviors should manifest, based on an excellent example from @jameshadfield.
Outputs how many "available strains" were initially dropped by the `--exclude-all` flag.
2304dd7
to
26ec5b6
Compare
I just hit another edge case where I wanted to filter a subsampled FASTA by date using a sequence index I prebuilt from the whole original set of sequences and the complete metadata. Given these inputs: subsampled FASTA: 6827 strains I ran the following command:
Instead of writing out the 5437 of 6827 sequences from the input FASTA that passed the filter, the report output was:
There are a couple of possible workarounds for this case:
But, the reporting behavior above is still a bug. Ideally, augur would use the number of strains that are actually written to disk as its final number of strains that passed the filter. One problem is that we don't know how many strains are in a sequence file without looping through the whole file again. |
thanks so much @huddlej! This will be very useful (though we should definitely break up the filter function!!) Regarding the edge case above, would a good solution be the following pseudo code?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than the edge case you flagged and the need to refactor, I think this is great. But refactor is probably best left for a later PR.
It is possible for the sequence index to be a superset of (or generally out of sync with) the given sequences output. To handle this case, we track the observed sequence strains when the user requests sequence output and update the statistics for which strains were filtered and why. We also move the handling of sequence outputs before the metadata and strain list outputs, so these outputs reflect any changes in the available sequences that occur when writing sequence output.
It's is a normal use case for the sequence index to contain a superset of strains present in the sequence inputs. Only warn users about a mismatch between the index and sequences when the superset relationship is not true.
Thanks for checking this out, @rneher! I just pushed a fix for the edge case in e1b1153. The output from my example above now looks like this (the total number of strains has changed slightly due to randomness in my subsampling):
I ended up checking the available sequence strains as part of the single pass through the sequences input and updating the
This update to The downside to this approach is that the number of strains dropped for missing sequence data can now overlap with the number of strains filtered for other reasons (like the date filter in the example above). We can continue to refine this reporting in the future, but I think this approach is worth the benefit of not reading through the sequences twice (at the beginning and end). I agree this module is ripe for a refactor! |
Provides a more logical presentation of the many filter arguments by grouping them into inputs, metadata filters, sequence filters, subsampling, and outputs. Reorders some of the arguments such that related arguments are close to each other and required arguments occur before optional arguments (e.g., metadata before sequences). Expands some of the help strings to clarify arguments, too.
@jameshadfield 1665e1b adds clearer argument grouping for inputs, metadata filters, sequence filters, subsampling options, and outputs. If this looks reasonable, I'll plan to merge this PR so we can release this feature and start integrating it into the ncov workflow. If you have any recommendations for changes, feel free to modify and push new commits. |
This is really nice @huddlej. Let's get this merged 😄 |
Description of proposed changes
Allows users to filter strains using only metadata information and output the strains that pass filters as either a metadata table or a list of names. Additionally, allows users to select only those sequences whose strain names are listed in one or more include files. To enable this functionality, this commit makes the following major changes to the filter interface:
--exclude-all
flag excludes all strains by default, allowing only those in the include flags to be added back (this explicit supports functionality that I've often hacked with an all-encompassing exclusion filter like--exclude-where 'country!=fakecountry'
plus the--include
argument)Collectively, these changes allow multiple filter commands to be run in parallel without accessing the original sequence data and then the outputs of these separate commands can be joined by a single final filter command to output the sequences corresponding to all desired strain names. For workflows with large sequence inputs, like ncov, these changes should speed up analyses considerably.
A useful pattern (also shown in the functional tests) that is enabled by this new interface looks like this:
Testing
This PR adds functional tests for the proposed new filter interface in
tests/functional/filter.t
.