Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Augur Mask: Add additional options from NCOV mask-alignment.py script #512

Merged
merged 17 commits into from
Apr 14, 2020

Conversation

danielsoneg
Copy link

@danielsoneg danielsoneg commented Apr 1, 2020

Description of proposed changes

This PR adds the --mask-sites, --mask-from-beginning, and --mask-from-end args to augur mask.

--mask-from-beginning and --mask-from-end are gated to only apply to FASTA files.

Related issue(s)

Related to conversation in #459 and #148

Testing

Ran full test suite, updated tests to test for new cases.

Also expanded existing test coverage, now at 100% coverage 🕺

Thank you for contributing to Nextstrain!

@codecov
Copy link

codecov bot commented Apr 1, 2020

Codecov Report

Merging #512 into master will increase coverage by 0.54%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
+ Coverage   18.65%   19.19%   +0.54%     
==========================================
  Files          31       31              
  Lines        5061     5080      +19     
  Branches     1282     1289       +7     
==========================================
+ Hits          944      975      +31     
+ Misses       4090     4082       -8     
+ Partials       27       23       -4     
Impacted Files Coverage Δ
augur/mask.py 100.00% <100.00%> (+13.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7258b8b...802c271. Read the comment docs.

@huddlej
Copy link
Contributor

huddlej commented Apr 2, 2020

On first pass, this looks great! I'll test it out more tonight and post some feedback.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ncov mask-alignment.py script doesn't explicitly document this, but its "mask_sites" are assumed to be 1-based coordinates. Since the masking operation assumes 0-based coordinates, this means args.mask_sites values have to be decremented by 1 before adding them to the list of sites to mask.

Our (undocumented) standard for augur has been to keep coordinates 0-based internally and 1-based if specified externally by the user (either by a file or as a list of sites to the CLI). The exception to this 1-based input rule is when the input file format is already 0-based (e.g., BED format).

There is a bigger issue here because VCF files use 1-based coordinates and vcftools correspondingly expects positions to be 1-based. So, for augur mask we need to consistently track sites in 0-based coordinates and then convert these to 1-based coordinates prior to passing them to vcftools. We likely never encountered this bug before with VCF masking because the bug with how BED files are read in has been obscuring it. For FASTA files, we need 0-based coordinates.

I added a test in the inline comments that currently fails and should be fixed when --mask-sites values are converted to 0-based coordinates. I think we’ll need the corresponding VCF test to explicitly confirm that 1-based positions have been removed as expected in the output from augur mask.

Let me know if the above is not clear. Also, let me know what you think about using your argparser fixture plus the mask.run function for more general testing of the CLI. I can see how this would not be thread safe if all tests are writing the same output files, but that could probably be addressed…

augur/mask.py Outdated
mask_sites = read_bed_file(args.mask)
mask_sites = set()
if args.mask_sites:
mask_sites.update(args.mask_sites)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See main review, but this is where mask sites need to be converted from 1-based to 0-based coordinates.

augur/mask.py Outdated
Comment on lines 138 to 139
parser.add_argument('--mask-from-beginning', type=int, help="FASTA Only: Number of sites to mask from beginning")
parser.add_argument('--mask-from-end', type=int, help="FASTA Only: Number of sites to mask from end")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two arguments need to have default values of 0 or they will get passed as None to the downstream math in mask_fasta and produce this exception:

Removing masked sites from FASTA file.
Traceback (most recent call last):
  File "./bin/augur", line 24, in <module>
    exit( main() )
  File "./augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "./augur/__init__.py", line 74, in run
    return args.__command__.run(args)
  File "./augur/mask.py", line 197, in run
    mask_from_end=args.mask_from_end)
  File "./augur/mask.py", line 121, in mask_fasta
    if beginning + end > sequence_length:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crap, I missed that. Per your comment below, I've added a couple E2E tests that are exercising the full code path - should've done that in the first place, I've gotten burned by the "unit tests but no integration tests" bug before.

Comment on lines 298 to 303
def test_run_with_mask_sites(self, vcf_file, out_file, argparser, mp_context):
args = argparser("--mask-sites 2 8 -s %s -o %s" % (vcf_file, out_file))
def check_mask_sites(mask_sites, *args, **kwargs):
assert mask_sites == [2,8]
mp_context.setattr(mask, "mask_vcf", check_mask_sites)
mask.run(args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would be more helpful if it ran mask_vcf and confirmed that the requested sites were no longer present in the output file. I actually love the way you've setup the argparser fixture here such that we can essentially run end-to-end tests from the unit testing framework. I know it is a little more computationally expensive, but running the whole program this way can catch more errors.

We also need a corresponding test for masking specific sites with a FASTA input. This test will fail right now because mask_sites values are not converted to 0-based coordinates (and because of the mask from beginning/end None issue mentioned earlier).

def test_run_with_mask_sites_for_fasta(self, fasta_file, out_file, argparser):
    # 1-based coordinates from the user should translate to 0-based masking of the FASTA sequence.
    args = argparser("--mask-sites 1 -s %s -o %s" % (fasta_file, out_file))
    mask.run(args)
    output = SeqIO.parse(out_file, "fasta")
    for record in output:
        # Site 1 in the sequence is the 0-indexed site in the sequence record.
        assert record.seq[0] == "N"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually running fully through the functions is a good call. I've added E2E tests for handling both VCF and FASTA masking. The test files are all small enough that it's not really that expensive to do the full test, and the test files are created & written specifically for each test (that's what the pytest tmpdir fixture is giving us), so there shouldn't be any race conditions here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is so cool! Thank you!

@danielsoneg
Copy link
Author

danielsoneg commented Apr 3, 2020

There is a bigger issue here because VCF files use 1-based coordinates and vcftools correspondingly expects positions to be 1-based. So, for augur mask we need to consistently track sites in 0-based coordinates and then convert these to 1-based coordinates prior to passing them to vcftools. We likely never encountered this bug before with VCF masking because the bug with how BED files are read in has been obscuring it. For FASTA files, we need 0-based coordinates.

Ok, so I want to make sure I'm getting this right -

VCFTools expects a list of 1-indexed sites to be removed.
For FASTA, we're doing it ourselves, so it's 0-indexed.

  • For --mask-sites, we want to take these in as 1-indexed - so --mask-sites 1 means "remove the first character from the sequence". VCFTools should get this as "1".

  • Elsewhere, where we've taken in non-bed mask files, those are expected to be one site per line, 1-indexed?

  • For BED files, a line like:
    SEQ1 7 10 IG18_Rv0018c-Rv0019c
    corresponds to a zero-indexed site list of:
    a) [7,8,9,10]
    b) [7,8,9]
    c) [6,7,8,9]
    d) [6,7,8]

And should that then be incremented again before being passed to VCFTools? for instance, if c) is correct, does VCFTools then expect [7,8,9,10]?

In terms of differentiating between the two, it's easy enough - mask_vcf writes its own masking file, so I can just add 1 to that and vcftools should "do the right thing", but I want to make sure I've got the right set of sites to begin with.

("Tests all pass - the code's doing the wrong things, but it's doing the correct wrong things.")

@danielsoneg
Copy link
Author

Actually, let me rephrase this to be a bit clearer. Let's say we have the sequence below:

CCCCCCCCC # Sequence
012345678 # Zero-based indexes

So, here's my understanding of the state of the world:

1. Sites passed into the CLI

if someone issues: augur mask --mask-sites 1 3, the expected output is:

NCNCCCCCC # Sequence
012345678 # Zero-based indexes

2. Sites passed in via a masking file:

Someone issues augur mask --mask test.not_a_bed_file (I'm adding that next, so)
test.not_a_bed_file contains:

4
5

Our expected output is:

CCCNNCCCC # Sequence
012345678 # Zero-based indexes

3. Sites passed in via a BED file

This is where I start to get confused - let's say a user passes in augur mask --mask test.bed.
test.bed contains:

SEQ	5	7

The current code will output this for FASTA:

CCCCCNNNC # Sequence
012345678 # Zero-based indexes

And this for VCF:

CCCCNNNCC # Sequence
012345678 # Zero-based indexes

So it's obviously doing the wrong thing for one of these, but what's the correct output?

(I'm noting I missed this because indeed any line that refers to the specified index is gone from the VCF, that just means something different for VCFs)

@huddlej
Copy link
Contributor

huddlej commented Apr 4, 2020

The rephrasing in your last comment is super helpful! Parts 1 and 2 are exactly right. Part 3 is complicated by the bug in how BED files are currently read into augur. So there is what the output should be for the current augur code and then there is what the output should be for a properly implemented BED parser. I'll list both outputs below.

If the input BED file looks like this:

SEQ	5	7

The FASTA output for the current augur code should be:

CCCCCNNNC # Sequence
012345678 # Zero-based indexes

The VCF output for the current augur code should be:

CCCCCNNNC # Sequence
012345678 # Zero-based indexes
123456789 # One-based indexes ([6, 7, 8] would be written out for vcftools)

The FASTA output for a proper 0-indexed, half-open BED parser should be:

CCCCCNNCC # Sequence
012345678 # Zero-based indexes

The last position in the interval isn't masked because BED is half-open. Resolving that bug can be a future PR though.

Thanks for also adding the end-to-end tests! I'm going to use these to start a proposal for how we more generally implement end-to-end tests in augur along with other possible approaches like the cram tool @tsibley has recommended.

@danielsoneg
Copy link
Author

Ok, that's perfect - thanks!

I can fix the BED bug in this pr - it's a pretty quick fix. I'll also fix this in #514 as well.

@huddlej
Copy link
Contributor

huddlej commented Apr 4, 2020

@danielsoneg Can you hold off from fixing the BED parsing bug? I think everything you've updated in augur mask so far is still backwards-compatible, but fixing the BED parsing issue will change the data API.

I want to make sure we coordinate with @emmahodcroft when she has a free moment, so we don't break her builds that rely on the current way BED files are parsed.

@danielsoneg
Copy link
Author

Ok, updated the diff - mask_vcf is now taking the zero-indexed mask_site lists and incrementing them all by 1 before writing out the masking file that's passed to vcftools.

For what it's worth, I'm 95% sure this is a bug that's existed in mask.py's BED file handling for a while now - tree.py's been fine because treetime.read_vcf already zero-indexes bed files, but I think even adjusting the indexing for the mask file is a change in behavior.

@huddlej
Copy link
Contributor

huddlej commented Apr 4, 2020

Ah, darn, you are totally right. Converting the masks to 1-indexed sites for vcftools is already a change in API, so we might as well fix the half-open bug for BED files here, too. Thanks for catching that!

This means I'll need to work through one of @emmahodcroft's affected builds and verify the output with these new modifications before we merge this PR.

@danielsoneg
Copy link
Author

Ok, updated with the BED fixes as well

@huddlej
Copy link
Contributor

huddlej commented Apr 7, 2020

Perfect! I will work on testing these tomorrow afternoon.

huddlej added a commit to nextstrain/tb that referenced this pull request Apr 14, 2020
augur mask now reads in BED files following the standard expectation of a
zero-indexed, half-open interval such that the last value in each interval is
_not_ included in the coordinates [1]. This commit updates the mask BED file for
this build to increment each interval by one to compensate this change in augur
mask.

[1] nextstrain/augur#512 (comment)
huddlej added a commit to nextstrain/tb that referenced this pull request Apr 14, 2020
augur mask now reads in BED files following the standard expectation of a
zero-indexed, half-open interval such that the last value in each interval is
_not_ included in the coordinates [1]. This commit updates the mask BED file for
this build to increment each interval by one to compensate this change in augur
mask.

[1] nextstrain/augur#512 (comment)
huddlej added a commit to nextstrain/tb that referenced this pull request Apr 14, 2020
augur mask now reads in BED files following the standard expectation of a
zero-indexed, half-open interval such that the last value in each interval is
_not_ included in the coordinates [1]. This commit updates the mask BED file for
this build to increment each interval by one to compensate this change in augur
mask.

[1] nextstrain/augur#512 (comment)
huddlej added a commit to nextstrain/tb that referenced this pull request Apr 14, 2020
augur mask now reads in BED files following the standard expectation of a
zero-indexed, half-open interval such that the last value in each interval is
not included in the coordinates [1]. This commit updates the mask BED file for
this build to decrement each interval's start by one to compensate this change
in augur mask.

[1] nextstrain/augur#512 (comment)
@huddlej
Copy link
Contributor

huddlej commented Apr 14, 2020

Sorry for the super delayed response here, @danielsoneg, but this worked well for me locally. I also dropped this new augur mask functionality into a local ncov build to replace the custom script it is currently using and I got exactly the same results.

Merging this now and I'll look into the tree PR next. Thanks again for your help here!

@huddlej huddlej merged commit 9cecfa3 into nextstrain:master Apr 14, 2020
@danielsoneg
Copy link
Author

No prob - let me take a look at the tree one before you review, though. It's been a while and I don't think that had the BED file changes - I'll ping you once I'm sure it's in good shape

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants