-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Augur Mask: Add additional options from NCOV mask-alignment.py script #512
Augur Mask: Add additional options from NCOV mask-alignment.py script #512
Conversation
Codecov Report
@@ Coverage Diff @@
## master #512 +/- ##
==========================================
+ Coverage 18.65% 19.19% +0.54%
==========================================
Files 31 31
Lines 5061 5080 +19
Branches 1282 1289 +7
==========================================
+ Hits 944 975 +31
+ Misses 4090 4082 -8
+ Partials 27 23 -4
Continue to review full report at Codecov.
|
On first pass, this looks great! I'll test it out more tonight and post some feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ncov mask-alignment.py script doesn't explicitly document this, but its "mask_sites" are assumed to be 1-based coordinates. Since the masking operation assumes 0-based coordinates, this means args.mask_sites
values have to be decremented by 1 before adding them to the list of sites to mask.
Our (undocumented) standard for augur has been to keep coordinates 0-based internally and 1-based if specified externally by the user (either by a file or as a list of sites to the CLI). The exception to this 1-based input rule is when the input file format is already 0-based (e.g., BED format).
There is a bigger issue here because VCF files use 1-based coordinates and vcftools correspondingly expects positions to be 1-based. So, for augur mask we need to consistently track sites in 0-based coordinates and then convert these to 1-based coordinates prior to passing them to vcftools. We likely never encountered this bug before with VCF masking because the bug with how BED files are read in has been obscuring it. For FASTA files, we need 0-based coordinates.
I added a test in the inline comments that currently fails and should be fixed when --mask-sites
values are converted to 0-based coordinates. I think we’ll need the corresponding VCF test to explicitly confirm that 1-based positions have been removed as expected in the output from augur mask.
Let me know if the above is not clear. Also, let me know what you think about using your argparser fixture plus the mask.run
function for more general testing of the CLI. I can see how this would not be thread safe if all tests are writing the same output files, but that could probably be addressed…
augur/mask.py
Outdated
mask_sites = read_bed_file(args.mask) | ||
mask_sites = set() | ||
if args.mask_sites: | ||
mask_sites.update(args.mask_sites) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See main review, but this is where mask sites need to be converted from 1-based to 0-based coordinates.
augur/mask.py
Outdated
parser.add_argument('--mask-from-beginning', type=int, help="FASTA Only: Number of sites to mask from beginning") | ||
parser.add_argument('--mask-from-end', type=int, help="FASTA Only: Number of sites to mask from end") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two arguments need to have default values of 0 or they will get passed as None
to the downstream math in mask_fasta
and produce this exception:
Removing masked sites from FASTA file.
Traceback (most recent call last):
File "./bin/augur", line 24, in <module>
exit( main() )
File "./augur/__main__.py", line 10, in main
return augur.run( argv[1:] )
File "./augur/__init__.py", line 74, in run
return args.__command__.run(args)
File "./augur/mask.py", line 197, in run
mask_from_end=args.mask_from_end)
File "./augur/mask.py", line 121, in mask_fasta
if beginning + end > sequence_length:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crap, I missed that. Per your comment below, I've added a couple E2E tests that are exercising the full code path - should've done that in the first place, I've gotten burned by the "unit tests but no integration tests" bug before.
tests/test_mask.py
Outdated
def test_run_with_mask_sites(self, vcf_file, out_file, argparser, mp_context): | ||
args = argparser("--mask-sites 2 8 -s %s -o %s" % (vcf_file, out_file)) | ||
def check_mask_sites(mask_sites, *args, **kwargs): | ||
assert mask_sites == [2,8] | ||
mp_context.setattr(mask, "mask_vcf", check_mask_sites) | ||
mask.run(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would be more helpful if it ran mask_vcf
and confirmed that the requested sites were no longer present in the output file. I actually love the way you've setup the argparser fixture here such that we can essentially run end-to-end tests from the unit testing framework. I know it is a little more computationally expensive, but running the whole program this way can catch more errors.
We also need a corresponding test for masking specific sites with a FASTA input. This test will fail right now because mask_sites values are not converted to 0-based coordinates (and because of the mask from beginning/end None issue mentioned earlier).
def test_run_with_mask_sites_for_fasta(self, fasta_file, out_file, argparser):
# 1-based coordinates from the user should translate to 0-based masking of the FASTA sequence.
args = argparser("--mask-sites 1 -s %s -o %s" % (fasta_file, out_file))
mask.run(args)
output = SeqIO.parse(out_file, "fasta")
for record in output:
# Site 1 in the sequence is the 0-indexed site in the sequence record.
assert record.seq[0] == "N"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually running fully through the functions is a good call. I've added E2E tests for handling both VCF and FASTA masking. The test files are all small enough that it's not really that expensive to do the full test, and the test files are created & written specifically for each test (that's what the pytest tmpdir fixture is giving us), so there shouldn't be any race conditions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is so cool! Thank you!
Ok, so I want to make sure I'm getting this right - VCFTools expects a list of 1-indexed sites to be removed.
And should that then be incremented again before being passed to VCFTools? for instance, if c) is correct, does VCFTools then expect [7,8,9,10]? In terms of differentiating between the two, it's easy enough - mask_vcf writes its own masking file, so I can just add 1 to that and vcftools should "do the right thing", but I want to make sure I've got the right set of sites to begin with. ("Tests all pass - the code's doing the wrong things, but it's doing the correct wrong things.") |
Actually, let me rephrase this to be a bit clearer. Let's say we have the sequence below:
So, here's my understanding of the state of the world: 1. Sites passed into the CLIif someone issues:
2. Sites passed in via a masking file:Someone issues
Our expected output is:
3. Sites passed in via a BED fileThis is where I start to get confused - let's say a user passes in
The current code will output this for FASTA:
And this for VCF:
So it's obviously doing the wrong thing for one of these, but what's the correct output? (I'm noting I missed this because indeed any line that refers to the specified index is gone from the VCF, that just means something different for VCFs) |
The rephrasing in your last comment is super helpful! Parts 1 and 2 are exactly right. Part 3 is complicated by the bug in how BED files are currently read into augur. So there is what the output should be for the current augur code and then there is what the output should be for a properly implemented BED parser. I'll list both outputs below. If the input BED file looks like this:
The FASTA output for the current augur code should be:
The VCF output for the current augur code should be:
The FASTA output for a proper 0-indexed, half-open BED parser should be:
The last position in the interval isn't masked because BED is half-open. Resolving that bug can be a future PR though. Thanks for also adding the end-to-end tests! I'm going to use these to start a proposal for how we more generally implement end-to-end tests in augur along with other possible approaches like the cram tool @tsibley has recommended. |
Ok, that's perfect - thanks! I can fix the BED bug in this pr - it's a pretty quick fix. I'll also fix this in #514 as well. |
@danielsoneg Can you hold off from fixing the BED parsing bug? I think everything you've updated in augur mask so far is still backwards-compatible, but fixing the BED parsing issue will change the data API. I want to make sure we coordinate with @emmahodcroft when she has a free moment, so we don't break her builds that rely on the current way BED files are parsed. |
Ok, updated the diff - For what it's worth, I'm 95% sure this is a bug that's existed in mask.py's BED file handling for a while now - tree.py's been fine because treetime.read_vcf already zero-indexes bed files, but I think even adjusting the indexing for the mask file is a change in behavior. |
Ah, darn, you are totally right. Converting the masks to 1-indexed sites for vcftools is already a change in API, so we might as well fix the half-open bug for BED files here, too. Thanks for catching that! This means I'll need to work through one of @emmahodcroft's affected builds and verify the output with these new modifications before we merge this PR. |
Ok, updated with the BED fixes as well |
Perfect! I will work on testing these tomorrow afternoon. |
augur mask now reads in BED files following the standard expectation of a zero-indexed, half-open interval such that the last value in each interval is _not_ included in the coordinates [1]. This commit updates the mask BED file for this build to increment each interval by one to compensate this change in augur mask. [1] nextstrain/augur#512 (comment)
augur mask now reads in BED files following the standard expectation of a zero-indexed, half-open interval such that the last value in each interval is _not_ included in the coordinates [1]. This commit updates the mask BED file for this build to increment each interval by one to compensate this change in augur mask. [1] nextstrain/augur#512 (comment)
augur mask now reads in BED files following the standard expectation of a zero-indexed, half-open interval such that the last value in each interval is _not_ included in the coordinates [1]. This commit updates the mask BED file for this build to increment each interval by one to compensate this change in augur mask. [1] nextstrain/augur#512 (comment)
augur mask now reads in BED files following the standard expectation of a zero-indexed, half-open interval such that the last value in each interval is not included in the coordinates [1]. This commit updates the mask BED file for this build to decrement each interval's start by one to compensate this change in augur mask. [1] nextstrain/augur#512 (comment)
Sorry for the super delayed response here, @danielsoneg, but this worked well for me locally. I also dropped this new augur mask functionality into a local ncov build to replace the custom script it is currently using and I got exactly the same results. Merging this now and I'll look into the tree PR next. Thanks again for your help here! |
No prob - let me take a look at the tree one before you review, though. It's been a while and I don't think that had the BED file changes - I'll ping you once I'm sure it's in good shape |
Description of proposed changes
This PR adds the
--mask-sites
,--mask-from-beginning
, and--mask-from-end
args toaugur mask
.--mask-from-beginning
and--mask-from-end
are gated to only apply to FASTA files.Related issue(s)
Related to conversation in #459 and #148
Testing
Ran full test suite, updated tests to test for new cases.
Also expanded existing test coverage, now at 100% coverage 🕺
Thank you for contributing to Nextstrain!