Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Augur.utils, augur.tree: Move BED file & masking site loading to utils.py #514

Conversation

danielsoneg
Copy link

Description of proposed changes

This moves the code from augur.tree for reading masking sites from bed files and normal text files into augur.utils. Additionally, I've cleaned up the logic and refined how we're handling headers in BED files.

Once #512 is merged, I'll replace the masking site loading code from augur.mask with a call to this function.

Related issue(s)

Related to #510 and conversations in #493

Testing

Added testing for BED file and Masking file reading. Ran full test suite, no regressions. Did not add tests to tree.py, since I'm just deleting code from there.

Thank you for contributing to Nextstrain!

You're welcome! 🤗

@danielsoneg
Copy link
Author

danielsoneg commented Apr 2, 2020

@emmahodcroft @huddlej specific to BED-file header handling - the code I've got right now is basically:

try: 
  """read the file, forcing columns 2 and 3 to integers"""
except ValueError: # pandas throws this when it can't convert columns to integers
  """read the file, skipping line 1, still forcing columns 2 and 3 into integers"""
  # don't catch any errors here

This is in contrast to the approach I took in #493 in that here I'm ONLY discarding errors that happen on line 1, whereas in #493 I'm discarding errors that happen on any line. In other words, we'll handle the header line, but error out if any other line is formatted incorrectly. I prefer this method because I think it will catch errors in the BED files that the other method will not instead of just silently passing over them (principle of least surprise), but I defer to your judgement on the right approach - it's not a lot of work to do it the other way.

@codecov
Copy link

codecov bot commented Apr 2, 2020

Codecov Report

Merging #514 into master will increase coverage by 0.37%.
The diff coverage is 93.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
+ Coverage   19.16%   19.53%   +0.37%     
==========================================
  Files          31       31              
  Lines        5072     5067       -5     
  Branches     1289     1283       -6     
==========================================
+ Hits          972      990      +18     
+ Misses       4077     4054      -23     
  Partials       23       23              
Impacted Files Coverage Δ
augur/tree.py 9.74% <33.33%> (+0.48%) ⬆️
augur/mask.py 100.00% <100.00%> (ø)
augur/utils.py 27.63% <100.00%> (+4.73%) ⬆️
augur/refine.py 5.03% <0.00%> (-0.50%) ⬇️
augur/frequency_estimators.py 33.84% <0.00%> (+0.12%) ⬆️
augur/titer_model.py 18.90% <0.00%> (+0.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d81882f...d81882f. Read the comment docs.

@danielsoneg
Copy link
Author

Just realized I missed something here - original tree.load_excluded_sites has three formats - BED file, site-per-line, and DRM, or site-per-line-in-the-second-column. I missed DRM - updating now.

@danielsoneg
Copy link
Author

@huddlej Ok, this is ready for review. Two primary things worth note -

  1. As mentioned above, there's a small change to loading bed files between here and what I did in augur.mask - namely, we're only checking for a header line, and if there are any other badly-formatted lines, we'll bomb on those instead of continuing like we do in the other implementation.

  2. I don't know the DRM file format and couldn't find a good reference for it, so the implementation here is just copying from the existing implementation. From reading the code, it assumes the file format is:

SOMETHING	$SITE1
SOMETHING	$SITE2

and from that we get [$SITE1 - 1, $SITE2 -1]. If that's incorrect, both this and the current implementation are incorrect.

@huddlej
Copy link
Contributor

huddlej commented Apr 22, 2020

@danielsoneg I'm sorry that I am terribly behind on PR reviews in the last couple of weeks. Some related science projects have emerged that pulled me away from coding. My hope is to review this and merge as soon as possible, to finish up this thread of BED file processing prior to the next augur release.

Follows official Python recommendations for installing modules with python3 [1]
instead of pip to address issues reported by users who tried to use vanilla pip
install where pip was linked to a Python 2 installation and the installation
failed.

[1] https://docs.python.org/3/installing/index.html#basic-usage
conda does not provide a pip3 command, but the base Travis python does. This
means pip3 install places the augur development dependencies in the global
python environment instead of the conda environment. As a result, cram tests
fail when they rely on python packages that are only installed as dev
dependencies because cram's "python" is the conda environment's "python". This
commit should install development dependencies in the conda environment instead
of the global environment and make those packages available to the python
command inside cram tests.
Introduces two different, complementary approaches to functional testing with
Cram. The first approach basically copies the commands already executed by the
Snakefiles in the tests/builds directory into the Cram format. The zika build,
for example, is partially represented by zika.t in that builds directory.

The second approach tries to more comprehensively test a specific augur command
with a variety of reasonable inputs. The mask.t file represents an example of
that type of test for augur mask.
augur mask now supports multiple masking inputs and no longer requires the
`--mask` argument. This commit updates the augur mask tests to reflect these new
command line arguments and also modifies informational output from the
`read_bed_file` function to clarify that the reported number of sites to mask
only reflects the BED file contents and not the other mask arguments.
Run functional tests when running the generic "run_tests" script and when full
unit tests are also being run. We skip functional tests when the user requests a
subset of unit tests to facilitate rapid testing during test-driven development.

Cram tests use pushd and popd which are bash-specific and not available in
Travis's default shell, so we specify the shell for Cram tests to use.
Also places one sentence per line in the testing section and clarifies how to
run a subset of unit tests.
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielsoneg This generally looks good to me and works well. I only had a minor comment (inline) about the behavior of the read_mask_file function.

Before we can merge this, can you also address these specific changes?

  • Rebase your branch onto master to resolve current conflict with augur/mask.py and remove merge commits from the git history
  • Update CHANGES.md to reflect a major change in the Python API for augur related to this new utility for reading masked sites

Thank you again for pushing this through and handling all the fiddly edge cases of the different formats we support. This PR moves us in a much better direction for general file handling tools.

augur/utils.py Outdated
Comment on lines 870 to 871
print("Could not read line %s of %s: '%s' - %s" %
(idx, mask_file, line, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is nicer than the standard exception message, but I think we should:

  1. print this error to sys.stderr
  2. re-raise the exception after printing the message

Since the BED file function throws errors when it cannot parse a line (other than the header), I would expect the same behavior from the mask file function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense - I think this matched the initial, more permissive behavior on parsing the bed files, you're right that the behaviors should match.

(idx, mask_file, line, err))
return sorted(set(mask_sites))

def load_mask_sites(mask_file):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this generic function for loading mask sites while still having the separate functions for BED and mask/DRM format that can be called as needed!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - if needed, we should be able to add additional formats in the future without much pain or any changes elsewhere in the code.

Comment on lines +832 to +842
try:
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
dtype={1:int,2:int})
except ValueError:
# Check if we have a header row. Otherwise, just fail.
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
dtype={1:int,2:int}, skiprows=1)
print("Skipped row 1 of %s, assuming it is a header." % bed_file)
for _, row in bed.iterrows():
mask_sites.extend(range(row[1], row[2]))
return sorted(set(mask_sites))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach of only allowing parse errors on the header is really nice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's closer to the Principle of Least Surprise re: masking sites - just ignoring random errors in the bed file seems like it could hide some ugly surprises.

@danielsoneg
Copy link
Author

@huddlej
Copy link
Contributor

huddlej commented May 18, 2020

@danielsoneg I took a shot at cleaning up the git history by rebasing your branch onto master and pushing a new branch to the nextstrain repo. My GitHub skills aren't quite up to doing the same reorganization for this PR, so I created #550 based on that rebase from your repo.

If everything there looks OK to you, I'll merge that PR and close this one. I'm sorry things got so hairy here!

@huddlej
Copy link
Contributor

huddlej commented May 19, 2020

Resolved by #550

@huddlej huddlej closed this May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants