Augur.utils, augur.tree: Move BED file & masking site loading to utils.py #514

danielsoneg · 2020-04-02T19:49:58Z

Description of proposed changes

This moves the code for reading masking sites from BED files and plain text files out of augur.tree and into augur.utils. Additionally, I've cleaned up the logic and refined how we're handling headers in BED files.

Once #512 is merged, I'll replace the masking site loading code in augur.mask with a call to this function.

Related issue(s)

Related to #510 and conversations in #493

Testing

Added testing for BED file and Masking file reading. Ran full test suite, no regressions. Did not add tests to tree.py, since I'm just deleting code from there.

Thank you for contributing to Nextstrain!

You're welcome! 🤗

danielsoneg · 2020-04-02T19:51:28Z

@emmahodcroft @huddlej specific to BED-file header handling - the code I've got right now is basically:

try:
    # read the file, forcing columns 2 and 3 to integers
    ...
except ValueError:  # pandas raises this when it can't convert the columns to integers
    # read the file again, skipping line 1, still forcing columns 2 and 3 to integers;
    # don't catch any errors here
    ...

This contrasts with the approach I took in #493: here I'm ONLY discarding errors that happen on line 1, whereas in #493 I'm discarding errors that happen on any line. In other words, we'll handle the header line, but error out if any other line is formatted incorrectly. I prefer this method because it surfaces formatting errors in BED files that the other method would silently pass over (principle of least surprise), but I defer to your judgement on the right approach - it's not a lot of work to do it the other way.
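A minimal sketch of the header-only policy described in this comment. It uses the standard library's csv module instead of pandas so that it runs without dependencies, so it illustrates the retry logic rather than reproducing the PR's code. The names `parse_bed_rows` and `read_bed_intervals` are hypothetical, and the input is passed as a string purely for illustration.

```python
import csv


def parse_bed_rows(lines):
    """Return (start, end) pairs; raise ValueError if a coordinate is not an integer."""
    intervals = []
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:  # ignore blank lines
            continue
        intervals.append((int(fields[1]), int(fields[2])))
    return intervals


def read_bed_intervals(text):
    """Parse BED text, tolerating a bad first line only."""
    lines = text.splitlines()
    try:
        return parse_bed_rows(lines)
    except ValueError:
        # Assume line 1 was a header and retry without it.
        # A second ValueError is deliberately not caught.
        return parse_bed_rows(lines[1:])
```

With this shape, a file whose only problem is a header row parses cleanly, while a malformed line anywhere else fails on the retry and the error reaches the caller.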

codecov · 2020-04-02T19:54:28Z

Codecov Report

Merging #514 into master will increase coverage by 0.37%.
The diff coverage is 93.93%.

@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
+ Coverage   19.16%   19.53%   +0.37%     
==========================================
  Files          31       31              
  Lines        5072     5067       -5     
  Branches     1289     1283       -6     
==========================================
+ Hits          972      990      +18     
+ Misses       4077     4054      -23     
  Partials       23       23

Impacted Files                   Coverage Δ
augur/tree.py                    9.74% <33.33%> (+0.48%) ⬆️
augur/mask.py                    100.00% <100.00%> (ø)
augur/utils.py                   27.63% <100.00%> (+4.73%) ⬆️
augur/refine.py                  5.03% <0.00%> (-0.50%) ⬇️
augur/frequency_estimators.py    33.84% <0.00%> (+0.12%) ⬆️
augur/titer_model.py             18.90% <0.00%> (+0.29%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d81882f...d81882f.

danielsoneg · 2020-04-14T20:44:36Z

Just realized I missed something here - the original tree.load_excluded_sites handles three formats: BED file, site-per-line, and DRM (site-per-line in the second column). I missed DRM - updating now.

danielsoneg · 2020-04-14T21:06:31Z

@huddlej Ok, this is ready for review. Two primary things worth noting:

As mentioned above, there's a small change to loading BED files between here and what I did in augur.mask: we only check for a header line, and if any other line is badly formatted, we'll bomb on it instead of continuing like we do in the other implementation.
I don't know the DRM file format and couldn't find a good reference for it, so the implementation here just copies the existing one. From reading the code, it assumes the file format is:

SOMETHING	$SITE1
SOMETHING	$SITE2

and from that we get [$SITE1 - 1, $SITE2 - 1]. If that's incorrect, both this and the current implementation are incorrect.
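The DRM reading described here can be sketched as follows. This only mirrors the assumption stated in the comment (a tab-separated file whose second column holds the site, shifted down by one); it is not taken from augur, and `parse_drm_sites` is a hypothetical name.

```python
def parse_drm_sites(lines):
    """Turn 'SOMETHING<TAB>SITE' lines into site indices shifted down by one."""
    sites = []
    for line in lines:
        if not line.strip():  # ignore blank lines
            continue
        fields = line.rstrip("\n").split("\t")
        # The site is assumed to live in the second column.
        sites.append(int(fields[1]) - 1)
    return sites


print(parse_drm_sites(["SOMETHING\t10\n", "SOMETHING\t25\n"]))  # [9, 24]
```

The subtraction suggests the file's sites are one-based and the masking code wants zero-based indices, but that reading is an inference from the existing code, as the comment itself notes.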

huddlej · 2020-04-22T19:13:26Z

@danielsoneg I'm sorry that I am terribly behind on PR reviews in the last couple of weeks. Some related science projects have emerged that pulled me away from coding. My hope is to review this and merge as soon as possible, to finish up this thread of BED file processing prior to the next augur release.

Follows official Python recommendations for installing modules with python3 [1] instead of pip to address issues reported by users who tried to use vanilla pip install where pip was linked to a Python 2 installation and the installation failed. [1] https://docs.python.org/3/installing/index.html#basic-usage

conda does not provide a pip3 command, but the base Travis python does. This means pip3 install places the augur development dependencies in the global python environment instead of the conda environment. As a result, cram tests fail when they rely on python packages that are only installed as dev dependencies because cram's "python" is the conda environment's "python". This commit should install development dependencies in the conda environment instead of the global environment and make those packages available to the python command inside cram tests.

Introduces two different, complementary approaches to functional testing with Cram. The first approach basically copies the commands already executed by the Snakefiles in the tests/builds directory into the Cram format. The zika build, for example, is partially represented by zika.t in that builds directory. The second approach tries to more comprehensively test a specific augur command with a variety of reasonable inputs. The mask.t file represents an example of that type of test for augur mask.

augur mask now supports multiple masking inputs and no longer requires the `--mask` argument. This commit updates the augur mask tests to reflect these new command line arguments and also modifies informational output from the `read_bed_file` function to clarify that the reported number of sites to mask only reflects the BED file contents and not the other mask arguments.

Run functional tests when running the generic "run_tests" script and when full unit tests are also being run. We skip functional tests when the user requests a subset of unit tests to facilitate rapid testing during test-driven development. Cram tests use pushd and popd which are bash-specific and not available in Travis's default shell, so we specify the shell for Cram tests to use.

Also places one sentence per line in the testing section and clarifies how to run a subset of unit tests.

Add functional tests with Cram

huddlej

@danielsoneg This generally looks good to me and works well. I only had a minor comment (inline) about the behavior of the read_mask_file function.

Before we can merge this, can you also address these specific changes?

Rebase your branch onto master to resolve current conflict with augur/mask.py and remove merge commits from the git history
Update CHANGES.md to reflect a major change in the Python API for augur related to this new utility for reading masked sites

Thank you again for pushing this through and handling all the fiddly edge cases of the different formats we support. This PR moves us in a much better direction for general file handling tools.

huddlej · 2020-05-02T19:19:02Z

augur/utils.py

+                print("Could not read line %s of %s: '%s' - %s" %
+                      (idx, mask_file, line, err))


This error message is nicer than the standard exception message, but I think we should:

print this error to sys.stderr

re-raise the exception after printing the message

Since the BED file function throws errors when it cannot parse a line (other than the header), I would expect the same behavior from the mask file function.

danielsoneg · 2020-05-04

This makes sense - I think this matched the initial, more permissive behavior when parsing the BED files. You're right that the behaviors should match.
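A minimal sketch of the behavior requested in this review thread: report the offending line on stderr, then re-raise so the caller still sees the failure. The function name `parse_site_lines`, the `source_name` parameter, and the one-based line counter are assumptions for illustration; only the message format comes from the diff above.

```python
import sys


def parse_site_lines(lines, source_name="<mask file>"):
    """Parse one integer site per line; report and re-raise on a bad line."""
    sites = []
    for idx, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        try:
            sites.append(int(line))
        except ValueError as err:
            # Friendlier context goes to stderr, not stdout...
            print("Could not read line %s of %s: '%s' - %s" %
                  (idx, source_name, line, err), file=sys.stderr)
            # ...and the original exception still propagates.
            raise
    return sorted(set(sites))
```

A bare `raise` keeps the original traceback, so the extra message adds context without hiding where the failure happened.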

huddlej · 2020-05-02T19:19:49Z

augur/utils.py

+                      (idx, mask_file, line, err))
+    return sorted(set(mask_sites))
+
+def load_mask_sites(mask_file):


I really like this generic function for loading mask sites while still having the separate functions for BED and mask/DRM format that can be called as needed!

danielsoneg · 2020-05-04

Yeah - if needed, we should be able to add additional formats in the future without much pain or any changes elsewhere in the code.
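One way to picture the "generic loader plus per-format readers" design is a small dispatch table. This thread does not show how the PR's load_mask_sites chooses a reader, so the extension-based lookup below is an assumption, and every name in it (`load_sites`, `bed_reader`, `mask_reader`, `READERS_BY_EXTENSION`, the `.example` extension) is hypothetical. The readers are placeholders that only report which format was chosen.

```python
import os


def bed_reader(path):
    """Placeholder for a BED reader; returns a tag instead of real sites."""
    return ("bed", path)


def mask_reader(path):
    """Placeholder for a site-per-line / DRM reader."""
    return ("mask", path)


# Assumed dispatch rule: pick a reader by file extension.
READERS_BY_EXTENSION = {".bed": bed_reader}


def load_sites(path):
    """Route a file to its format-specific reader, with a default fallback."""
    extension = os.path.splitext(path)[1].lower()
    reader = READERS_BY_EXTENSION.get(extension, mask_reader)
    return reader(path)


# Supporting another format means registering one more reader;
# callers of load_sites do not change.
READERS_BY_EXTENSION[".example"] = lambda path: ("example", path)
```

Callers depend only on the generic entry point, while the format-specific readers stay available for code that already knows which format it has.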

huddlej · 2020-05-02T19:20:39Z

augur/utils.py

+    try:
+        bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
+                          dtype={1:int,2:int})
+    except ValueError:
+        # Check if we have a header row. Otherwise, just fail.
+        bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
+                          dtype={1:int,2:int}, skiprows=1)
+        print("Skipped row 1 of %s, assuming it is a header." % bed_file)
+    for _, row in bed.iterrows():
+        mask_sites.extend(range(row[1], row[2]))
+    return sorted(set(mask_sites))


This approach of only allowing parse errors on the header is really nice.

danielsoneg · 2020-05-04

Yeah, I think it's closer to the Principle of Least Surprise re: masking sites - just ignoring random errors in the BED file seems like it could hide some ugly surprises.
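The last two lines of the diff above expand each BED row into individual sites and then de-duplicate them. A dependency-free sketch of that step, assuming intervals arrive as (start, end) pairs that are zero-based and half-open; `expand_intervals` is a hypothetical name.

```python
def expand_intervals(intervals):
    """Expand zero-based, half-open [start, end) intervals into sorted unique sites."""
    sites = set()
    for start, end in intervals:
        # range() is itself half-open, so `end` is excluded.
        sites.update(range(start, end))
    return sorted(sites)


print(expand_intervals([(2, 5), (4, 7)]))  # [2, 3, 4, 5, 6]
```

Overlapping intervals contribute each shared site once, and an interval whose start equals its end contributes nothing.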

danielsoneg · 2020-05-04T22:10:51Z

Well, that made the diff nice and clean. The update to utils is here:
https://github.com/nextstrain/augur/pull/514/files#diff-b5772e65b5a8175ae1a542a1c7ad0e38R869
and the changelog entry is here:
https://github.com/nextstrain/augur/pull/514/files#diff-8b1c3fd0d4a6765c16dfd18509182f9dR14

huddlej · 2020-05-18T05:33:11Z

@danielsoneg I took a shot at cleaning up the git history by rebasing your branch onto master and pushing a new branch to the nextstrain repo. My GitHub skills aren't quite up to doing the same reorganization for this PR, so I created #550 based on that rebase from your repo.

If everything there looks OK to you, I'll merge that PR and close this one. I'm sorry things got so hairy here!

huddlej · 2020-05-19T04:22:39Z

Resolved by #550

Eric Danielson added 5 commits April 2, 2020 12:22

Add functions for loading masking sites to utils

ca3ce47

load_mask_file, read_mask_file, read_bed_file tests

2a13713

reformat ambiguous_date_to_date_range calls

188244a

modify tree to use load_mask_sites from utils

e43c372

Make pylint happy(er)

ef9560c

danielsoneg mentioned this pull request Apr 4, 2020

Augur Mask: Add additional options from NCOV mask-alignment.py script #512

Merged

Eric Danielson added 5 commits April 14, 2020 12:35

Merge branch 'master' into egd-move_load_mask_sites_to_utils

89a55fb

BED files are zero-indexed half-open intervals

d76e1d3

Update tests for half-open bed intervals

20e66f7

mask: use utils.load_mask_sites

0f053a4

Update mask tests: drop read_bed_files, test loading mask file.

4b29e20

Eric Danielson added 2 commits April 14, 2020 13:54

Add support for drm-file format

7d9de07

Update helpstring for augur mask --mask

0ab9638

jstoja mentioned this pull request Apr 16, 2020

BED file coordinates should be treated as 0-start, half-open intervals #521

Closed

huddlej added 7 commits April 24, 2020 21:44

Describe how to write and run functional tests with Cram

0878320

Also places one sentence per line in the testing section and clarifies how to run a subset of unit tests.

Merge pull request nextstrain#542 from nextstrain/test-with-cram

704f369

Add functional tests with Cram

huddlej requested changes May 2, 2020

Eric Danielson added 2 commits May 4, 2020 14:32

Add functions for loading masking sites to utils

7cef44a

load_mask_file, read_mask_file, read_bed_file tests

5720541

Eric Danielson added 13 commits May 4, 2020 14:32

reformat ambiguous_date_to_date_range calls

08fa2cb

modify tree to use load_mask_sites from utils

d85d9c4

Make pylint happy(er)

6d8ad31

BED files are zero-indexed half-open intervals

864ba28

Update tests for half-open bed intervals

0d70839

mask: use utils.load_mask_sites

b125130

Update mask tests: drop read_bed_files, test loading mask file.

41501c6

Add support for drm-file format

fcfcf8b

Update helpstring for augur mask --mask

cdff7dd

read_mask_file no longer continues on badly-formatted lines

8990bb5

Added changelog entry for load_mask_sites

8383c9a

Merge branch 'egd-move_load_mask_sites_to_utils' of github.com:daniel…

a74d707

…soneg/augur into egd-move_load_mask_sites_to_utils

fix bad merge

d81882f

huddlej mentioned this pull request May 18, 2020

Move BED file and masking site loading from augur tree to utils.py #550

Merged

huddlej closed this May 19, 2020
