fix parsing of priorities tsv file to allow spaces in sequence IDs #668

tomkinsc · 2021-02-03T23:25:59Z

Description of proposed changes

When parsing priority tsv files, an error could be encountered if a sequence name contain spaces, as .split() breaks not only on the tabs delimiting the file, but any whitespace characters. This replaces .split() with .split('\t') to address this and preserve sequence names containing spaces.

Related issue(s)

Flagged by @dpark01 as a problem.

A priorities tsv file of the following form:

NC_045512.2 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome	-0.20
Algeria/G0638_2264/2020	-235.60
Algeria/G0640_2265/2020	-140.70

Leads to the following failure:

Traceback (most recent call last):
  File "/usr/bin/augur", line 11, in <module>
    load_entry_point('nextstrain-augur', 'console_scripts', 'augur')()
  File "/nextstrain/augur/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/nextstrain/augur/augur/__init__.py", line 74, in run
    return args.__command__.run(args)
  File "/nextstrain/augur/augur/filter.py", line 326, in run
    priorities = read_priority_scores(args.priority)
  File "/nextstrain/augur/augur/filter.py", line 67, in read_priority_scores
    raise e
  File "/nextstrain/augur/augur/filter.py", line 63, in read_priority_scores
    for elems in (line.strip().split() for line in pfile.readlines())
  File "/nextstrain/augur/augur/filter.py", line 63, in <dictcomp>
    for elems in (line.strip().split() for line in pfile.readlines())
ValueError: could not convert string to float: 'Wuhan'

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

No changes made to the tests (so far).

When parsing priorities.tsv files, an error could be encountered if a sequence name contain spaces, as `.split()` breaks not only on the tabs delimiting the file, but any whitespace character. This repalces `.split()` with `.split('\t')` to address this and preserve sequence names containg spaces.

tsibley · 2021-02-04T18:01:40Z

Thanks, @tomkinsc! I agree it's an oversight that split() was used originally instead of split("\t"), but I worry now about priorities files out in the wild which rely on the behaviour. For example, CI is failing with this PR because even in Augur itself, the data used for tests uses a space instead of a tab.

I wonder if we can make this parsing a bit more backwards compat, while still making it more robust, by falling back to split() if there's not a tab in the line. I believe we've taken this approach elsewhere in Augur to preserve original behaviour while not holding back folks who want to write a clean TSV.

dpark01 · 2021-02-04T18:04:04Z

@tsibley historically, did any old versions of priorities.py write out space-delimited files (then yes, this would be a concern), or are you only referring to test input files (maybe backwards compatibility is less of an issue then?)?

tsibley · 2021-02-04T18:13:54Z

@dpark01 VCS history shows that scripts/priorities.py in nextstrain/ncov always used tabs, but my concern is more broad than just the ncov build. The --priority option to augur filter was introduced in 2018 and may be used in other builds.

…resent in the line in filter.py::read_priority_scores(), only split by tab if a tab is present in the line, to allow backward compatibility with space-delimited files

tomkinsc · 2021-02-11T20:02:04Z

Great point about backward compatibility, @tsibley. Just added a conditional so it splits on tabs if they're present in a line, and otherwise does the generic whitespace .split() as before.

huddlej

This looks good to me, @tomkinsc! Would you mind adding a test to tests/test_filter.py?

I'm not super pleased with the way test data get mocked up in that test module, but here is an example that works for me locally with your proposed bug fix:

@pytest.fixture
def mock_priorities_file_valid_with_spaces_and_tabs(mocker):
    mocker.patch(
        "builtins.open", mocker.mock_open(read_data="strain 1\t5\nstrain 2\t6\nstrain 3\t8\n")
    )

class TestFilter:
    def test_read_priority_scores_valid_with_spaces_and_tabs(self, mock_priorities_file_valid_with_spaces_and_tabs):
        # builtins.open is stubbed, but we need a valid file to satisfy the existence check
        priorities = augur.filter.read_priority_scores(
            "tests/builds/tb/data/lee_2015.vcf"
        )

        assert priorities == {"strain 1": 5, "strain 2": 6, "strain 3": 8}

Thank you for writing up a fix for this bug!

@huddlej

add unit test for parsing tab-delimited priorities file, where column 1 has spaces in the values. Thanks to @huddlej for providing this.

tomkinsc · 2021-02-12T06:03:48Z

@huddlej added that test, thanks!

tomkinsc added 2 commits February 11, 2021 13:32

in filter.py::read_priority_scores(), only split by tab if a tab is p…

9891eac

…resent in the line in filter.py::read_priority_scores(), only split by tab if a tab is present in the line, to allow backward compatibility with space-delimited files

split on tabs if tabs are present as more common case

8ce21a4

huddlej reviewed Feb 12, 2021

View reviewed changes

huddlej added this to the Next release 11.X.X milestone Feb 12, 2021

add unit test for parsing tab-delimited priorities file

190ae73

add unit test for parsing tab-delimited priorities file, where column 1 has spaces in the values. Thanks to @huddlej for providing this.

huddlej merged commit 528cf28 into nextstrain:master Feb 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix parsing of priorities tsv file to allow spaces in sequence IDs #668

fix parsing of priorities tsv file to allow spaces in sequence IDs #668

tomkinsc commented Feb 3, 2021 •

edited

Loading

tsibley commented Feb 4, 2021

dpark01 commented Feb 4, 2021

tsibley commented Feb 4, 2021

tomkinsc commented Feb 11, 2021

huddlej left a comment

tomkinsc commented Feb 12, 2021 •

edited

Loading

fix parsing of priorities tsv file to allow spaces in sequence IDs #668

fix parsing of priorities tsv file to allow spaces in sequence IDs #668

Conversation

tomkinsc commented Feb 3, 2021 • edited Loading

Description of proposed changes

Related issue(s)

Testing

tsibley commented Feb 4, 2021

dpark01 commented Feb 4, 2021

tsibley commented Feb 4, 2021

tomkinsc commented Feb 11, 2021

huddlej left a comment

Choose a reason for hiding this comment

tomkinsc commented Feb 12, 2021 • edited Loading

tomkinsc commented Feb 3, 2021 •

edited

Loading

tomkinsc commented Feb 12, 2021 •

edited

Loading