-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Augur.utils, augur.tree: Move BED file & masking site loading to utils.py #514
Augur.utils, augur.tree: Move BED file & masking site loading to utils.py #514
Conversation
@emmahodcroft @huddlej specific to BED-file header handling - the code I've got right now is basically: try:
"""read the file, forcing columns 2 and 3 to integers"""
except ValueError: # pandas throws this when it can't convert columns to integers
"""read the file, skipping line 1, still forcing columns 2 and 3 into integers"""
# don't catch any errors here This is in contrast to the approach I took in #493 in that here I'm ONLY discarding errors that happen on line 1, whereas in #493 I'm discarding errors that happen on any line. In other words, we'll handle the header line, but error out if any other line is formatted incorrectly. I prefer this method because I think it will catch errors in the BED files that the other method will not instead of just silently passing over them (principle of least surprise), but I defer to your judgement on the right approach - it's not a lot of work to do it the other way. |
Codecov Report
@@ Coverage Diff @@
## master #514 +/- ##
==========================================
+ Coverage 19.16% 19.53% +0.37%
==========================================
Files 31 31
Lines 5072 5067 -5
Branches 1289 1283 -6
==========================================
+ Hits 972 990 +18
+ Misses 4077 4054 -23
Partials 23 23
Continue to review full report at Codecov.
|
Just realized I missed something here - original tree.load_excluded_sites has three formats - BED file, site-per-line, and DRM, or site-per-line-in-the-second-column. I missed DRM - updating now. |
@huddlej Ok, this is ready for review. Two primary things worth note -
and from that we get |
@danielsoneg I'm sorry that I am terribly behind on PR reviews in the last couple of weeks. Some related science projects have emerged that pulled me away from coding. My hope is to review this and merge as soon as possible, to finish up this thread of BED file processing prior to the next augur release. |
Follows official Python recommendations for installing modules with python3 [1] instead of pip to address issues reported by users who tried to use vanilla pip install where pip was linked to a Python 2 installation and the installation failed. [1] https://docs.python.org/3/installing/index.html#basic-usage
conda does not provide a pip3 command, but the base Travis python does. This means pip3 install places the augur development dependencies in the global python environment instead of the conda environment. As a result, cram tests fail when they rely on python packages that are only installed as dev dependencies because cram's "python" is the conda environment's "python". This commit should install development dependencies in the conda environment instead of the global environment and make those packages available to the python command inside cram tests.
Introduces two different, complementary approaches to functional testing with Cram. The first approach basically copies the commands already executed by the Snakefiles in the tests/builds directory into the Cram format. The zika build, for example, is partially represented by zika.t in that builds directory. The second approach tries to more comprehensively test a specific augur command with a variety of reasonable inputs. The mask.t file represents an example of that type of test for augur mask.
augur mask now supports multiple masking inputs and no longer requires the `--mask` argument. This commit updates the augur mask tests to reflect these new command line arguments and also modifies informational output from the `read_bed_file` function to clarify that the reported number of sites to mask only reflects the BED file contents and not the other mask arguments.
Run functional tests when running the generic "run_tests" script and when full unit tests are also being run. We skip functional tests when the user requests a subset of unit tests to facilitate rapid testing during test-driven development. Cram tests use pushd and popd which are bash-specific and not available in Travis's default shell, so we specify the shell for Cram tests to use.
Also places one sentence per line in the testing section and clarifies how to run a subset of unit tests.
Add functional tests with Cram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielsoneg This generally looks good to me and works well. I only had a minor comment (inline) about the behavior of the read_mask_file
function.
Before we can merge this, can you also address these specific changes?
- Rebase your branch onto master to resolve current conflict with
augur/mask.py
and remove merge commits from the git history - Update
CHANGES.md
to reflect a major change in the Python API for augur related to this new utility for reading masked sites
Thank you again for pushing this through and handling all the fiddly edge cases of the different formats we support. This PR moves us in a much better direction for general file handling tools.
augur/utils.py
Outdated
print("Could not read line %s of %s: '%s' - %s" % | ||
(idx, mask_file, line, err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message is nicer than the standard exception message, but I think we should:
- print this error to sys.stderr
- re-raise the exception after printing the message
Since the BED file function throws errors when it cannot parse a line (other than the header), I would expect the same behavior from the mask file function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense - I think this matched the initial, more permissive behavior on parsing the bed files, you're right that the behaviors should match.
(idx, mask_file, line, err)) | ||
return sorted(set(mask_sites)) | ||
|
||
def load_mask_sites(mask_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this generic function for loading mask sites while still having the separate functions for BED and mask/DRM format that can be called as needed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - if needed, we should be able to add additional formats in the future without much pain or any changes elsewhere in the code.
try: | ||
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2], | ||
dtype={1:int,2:int}) | ||
except ValueError: | ||
# Check if we have a header row. Otherwise, just fail. | ||
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2], | ||
dtype={1:int,2:int}, skiprows=1) | ||
print("Skipped row 1 of %s, assuming it is a header." % bed_file) | ||
for _, row in bed.iterrows(): | ||
mask_sites.extend(range(row[1], row[2])) | ||
return sorted(set(mask_sites)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach of only allowing parse errors on the header is really nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's closer to the Principle of Least Surprise re: masking sites - just ignoring random errors in the bed file seems like it could hide some ugly surprises.
…soneg/augur into egd-move_load_mask_sites_to_utils
Well, that made the diff nice and clean. The update to utils is here: |
@danielsoneg I took a shot at cleaning up the git history by rebasing your branch onto master and pushing a new branch to the nextstrain repo. My GitHub skills aren't quite up to doing the same reorganization for this PR, so I created #550 based on that rebase from your repo. If everything there looks OK to you, I'll merge that PR and close this one. I'm sorry things got so hairy here! |
Resolved by #550 |
Description of proposed changes
This moves the code from
augur.tree
for reading masking sites from bed files and normal text files intoaugur.utils
. Additionally, I've cleaned up the logic and refined how we're handling headers in BED files.Once #512 is merged, I'll replace the masking site loading code from
augur.mask
with a call to this function.Related issue(s)
Related to #510 and conversations in #493
Testing
Added testing for BED file and Masking file reading. Ran full test suite, no regressions. Did not add tests to tree.py, since I'm just deleting code from there.
Thank you for contributing to Nextstrain!
You're welcome! 🤗