Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix parsing of priorities tsv file to allow spaces in sequence IDs #668

Merged
merged 4 commits into from
Feb 12, 2021

Conversation

tomkinsc
Copy link
Contributor

@tomkinsc tomkinsc commented Feb 3, 2021

Description of proposed changes

When parsing priority tsv files, an error could be encountered if a sequence name contain spaces, as .split() breaks not only on the tabs delimiting the file, but any whitespace characters. This replaces .split() with .split('\t') to address this and preserve sequence names containing spaces.

Related issue(s)

Flagged by @dpark01 as a problem.

A priorities tsv file of the following form:

NC_045512.2 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome	-0.20
Algeria/G0638_2264/2020	-235.60
Algeria/G0640_2265/2020	-140.70

Leads to the following failure:

Traceback (most recent call last):
  File "/usr/bin/augur", line 11, in <module>
    load_entry_point('nextstrain-augur', 'console_scripts', 'augur')()
  File "/nextstrain/augur/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/nextstrain/augur/augur/__init__.py", line 74, in run
    return args.__command__.run(args)
  File "/nextstrain/augur/augur/filter.py", line 326, in run
    priorities = read_priority_scores(args.priority)
  File "/nextstrain/augur/augur/filter.py", line 67, in read_priority_scores
    raise e
  File "/nextstrain/augur/augur/filter.py", line 63, in read_priority_scores
    for elems in (line.strip().split() for line in pfile.readlines())
  File "/nextstrain/augur/augur/filter.py", line 63, in <dictcomp>
    for elems in (line.strip().split() for line in pfile.readlines())
ValueError: could not convert string to float: 'Wuhan'

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

No changes made to the tests (so far).

When parsing priorities.tsv files, an error could be encountered if a sequence name contain spaces, as `.split()` breaks not only on the tabs delimiting the file, but any whitespace character. This repalces `.split()` with `.split('\t')` to address this and preserve sequence names containg spaces.
@tsibley
Copy link
Member

tsibley commented Feb 4, 2021

Thanks, @tomkinsc! I agree it's an oversight that split() was used originally instead of split("\t"), but I worry now about priorities files out in the wild which rely on the behaviour. For example, CI is failing with this PR because even in Augur itself, the data used for tests uses a space instead of a tab.

I wonder if we can make this parsing a bit more backwards compat, while still making it more robust, by falling back to split() if there's not a tab in the line. I believe we've taken this approach elsewhere in Augur to preserve original behaviour while not holding back folks who want to write a clean TSV.

@dpark01
Copy link

dpark01 commented Feb 4, 2021

@tsibley historically, did any old versions of priorities.py write out space-delimited files (then yes, this would be a concern), or are you only referring to test input files (maybe backwards compatibility is less of an issue then?)?

@tsibley
Copy link
Member

tsibley commented Feb 4, 2021

@dpark01 VCS history shows that scripts/priorities.py in nextstrain/ncov always used tabs, but my concern is more broad than just the ncov build. The --priority option to augur filter was introduced in 2018 and may be used in other builds.

…resent in the line

in filter.py::read_priority_scores(), only split by tab if a tab is present in the line, to allow backward compatibility with space-delimited files
@tomkinsc
Copy link
Contributor Author

Great point about backward compatibility, @tsibley. Just added a conditional so it splits on tabs if they're present in a line, and otherwise does the generic whitespace .split() as before.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, @tomkinsc! Would you mind adding a test to tests/test_filter.py?

I'm not super pleased with the way test data get mocked up in that test module, but here is an example that works for me locally with your proposed bug fix:

@pytest.fixture
def mock_priorities_file_valid_with_spaces_and_tabs(mocker):
    mocker.patch(
        "builtins.open", mocker.mock_open(read_data="strain 1\t5\nstrain 2\t6\nstrain 3\t8\n")
    )

class TestFilter:
    def test_read_priority_scores_valid_with_spaces_and_tabs(self, mock_priorities_file_valid_with_spaces_and_tabs):
        # builtins.open is stubbed, but we need a valid file to satisfy the existence check
        priorities = augur.filter.read_priority_scores(
            "tests/builds/tb/data/lee_2015.vcf"
        )

        assert priorities == {"strain 1": 5, "strain 2": 6, "strain 3": 8}

Thank you for writing up a fix for this bug!

@huddlej huddlej added this to the Next release 11.X.X milestone Feb 12, 2021
add unit test for parsing tab-delimited priorities file, where column 1  has spaces in the values.  Thanks to @huddlej for providing this.
@tomkinsc
Copy link
Contributor Author

tomkinsc commented Feb 12, 2021

@huddlej added that test, thanks!

@huddlej huddlej merged commit 528cf28 into nextstrain:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants