Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FASTA handling to Augur Mask #493

Merged
merged 26 commits into from
Apr 1, 2020

Conversation

danielsoneg
Copy link

@danielsoneg danielsoneg commented Mar 28, 2020

Description of proposed changes

Add FASTA handling to augur mask and two additional functions to augur.utils to deduplicate some code.

Note this is Just the FASTA handling for augur mask - I'll add the nonsense site handling and the rest of the capabilities from ncov's mask-alignment.py next.

Related issue(s)

Fixes #148

Testing

Ran the full test suite as well as wrote a lot of additional tests for augur.mask

However, as usual, please a) sanity check and b) actually test this with live files. I've got the files in the tests/build directory to work with, but that's about it.

Notes

This is a pretty big overhaul because augur.mask was written solely to handle .vcf files before, so I had to refactor most of the file before I could even start. There's no other new functionality except a debug flag - this is just adding masking to FASTA files.

The VCF-specific parts have been moved to their own functions, and I've ported the FASTA masking from augur.tree in. I added tests for this file, as well as a couple utility functions that I've already had to rewrite half a dozen times in various files.

After this is accepted, I'd like to add the nonsense site masking from #72 as well as the --mask-from-beginning and --mask-from-end functionality from the ncov mask-alignment script, but I wanted to get the basic rework done first.

There's also a bunch of parts of this around error handling and validating inputs I'd like to change as well, but again, I wanted to get the basic refactor in first. This doesn't introduce any regressions and is already a large enough PR.

Thank you for contributing to Nextstrain!

@danielsoneg
Copy link
Author

Also, if anyone’s interested in chatting about this in real-time, happy to hop on whatever medium y’all prefer

@huddlej
Copy link
Contributor

huddlej commented Mar 31, 2020

I'm checking this out now and may need until tomorrow afternoon to have helpful feedback. Chatting about this in real-time could be helpful then, if you're around, @danielsoneg. Thank you for continuing to take on these tricky issues!

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first pass, this makes sense and looks good. The main issue right now is the way BED files are read in assuming a header line that should not exist. I'm also a little concerned that some of the original BED coordinate logic might be incorrectly including extra positions. I'll need to spend a little time playing with this code with real data to figure out whether that's really the case.

augur/mask.py Outdated Show resolved Hide resolved
augur/mask.py Outdated Show resolved Hide resolved
augur/mask.py Show resolved Hide resolved
augur/mask.py Show resolved Hide resolved
augur/utils.py Show resolved Hide resolved
tests/test_mask.py Show resolved Hide resolved
augur/mask.py Outdated
sitesToMask = []
Second column is chromStart, 3rd is chromEnd. Generate a range from these two columns.
"""
sites_to_mask = []
bed = pd.read_csv(mask_file, sep='\t')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic BED format doesn't support an explicit header line, so this line should actually include a header=None argument to avoid skipping the first line. This is a very old bug as you can see from the mask sites for TB that repeat the first site.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. Yeah, makes sense. Per the comment above, looks like the same bug's in tree, happy to fix it when we merge the two

TEST_BED_SEQUENCE = [1,2,4,6,7,8,9,10]
# IF YOU UPDATE ONE OF THESE, UPDATE THE OTHER.
TEST_BED="""\
Chrom ChromStart ChromEnd locus tag Comment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above for the read_bed_file function about not supporting an explicit header in BED files.

augur/utils.py Show resolved Hide resolved
@danielsoneg
Copy link
Author

Hey @huddlej, thanks for the review!
My thinking with this was to get the basic refactor done and then run through on an enhancement/bug sweep - the diff is fairly hard to read right now, and the later changes will be a lot easier to grok as separate reviews. It's your workflow, though, so I'm happy to update this in place if you'd prefer.
I've got a pretty flexible schedule today - anytime after 1pm is good for me to chat.

@huddlej
Copy link
Contributor

huddlej commented Mar 31, 2020

@danielsoneg I'm ok with fixing the BED bug now and doing all of the other changes in separate smaller PRs. Since this PR adds tests with an example BED file, those tests should drop the current header line and then we can add the header=None bit to the read_csv function in read_bed_file to make those tests pass.

I sent a Zoom meeting link to your Twitter DMs and will be available there until 3:30, if you want to chat.

@danielsoneg
Copy link
Author

So the BED file in the TB build test has a header (which, in retrospect, was the one I built my example off) - so I've tweaked this a bit in that Pandas is now ignoring the header line, but also won't vomit if it gets a line that isn't all integers.

Happy to just scrap that and update the test file if that's all a big mistake.

@codecov
Copy link

codecov bot commented Mar 31, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@b17e78b). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #493   +/-   ##
=========================================
  Coverage          ?   18.65%           
=========================================
  Files             ?       31           
  Lines             ?     5061           
  Branches          ?     1282           
=========================================
  Hits              ?      944           
  Misses            ?     4090           
  Partials          ?       27           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b17e78b...e6b73fd. Read the comment docs.

@huddlej
Copy link
Contributor

huddlej commented Mar 31, 2020

Good catch! I've opened issues for this (#510) and how BED coordinates are loaded (#511). Merging this now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Augur mask should be able to take FASTA input
2 participants