Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur align: Ensure reference sequence is removed when no gaps are present #456

Conversation

danielsoneg
Copy link

@danielsoneg danielsoneg commented Mar 16, 2020

While investigating #454, I spotted a separate bug - in alignment files in which the reference sequence has no gaps, the strip_non_reference code bails out early and does not remove the reference sequence from the alignment file.

I adjusted the mafft output from the zika test data to remove the gaps in alignment from the reference sequence. Upon running the modified alignment data through the code path, I saw this error:

In [107]: alseq = augur.align.strip_non_reference("./aligned.fasta", reference, keep_reference=False)                                                                                                                        
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [108]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[108]: True

Patch attached corrects that bug:

In [123]: importlib.reload(align)                                                                                                                                                                                      
Out[123]: <module 'augur.align' from '/Users/eric/.venv/__Users__eric__Projects__nextstrain-b3a8/nextstrain/lib/python3.7/site-packages/augur/align.py'>

In [124]: alseq = align.strip_non_reference("./aligned.fasta", reference, keep_reference=False)                                                                                                                        
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [125]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[125]: False

and still keeps the reference if that's preferred:

In [126]: alseq = align.strip_non_reference("./aligned.fasta", reference, keep_reference=True)                                                                                                                         
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [127]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[127]: True

(Also: as mentioned in the last patch, I'm trying to find useful ways to contribute while now sheltering in place, but if having some rando on the internet airdropping patches on you is more distracting than helpful, just let me know and I can find someone else to bother.

I can also start actually commenting when I start looking at these, but I'm new enough to the codebase that I don't want to claim I can fix anything before digging into it.)

@danielsoneg
Copy link
Author

Travis issue is snakemake throwing an error:

Running ./add_to_alignment/Snakefile (quietly)

Error: you need to specify the maximum number of CPU cores to be used at the same time with --cores.

Test Script Failed at Line 22

The command "(bash tests/builds/runner.sh)" exited with 2.

I can reproduce that against master - looks like as of SnakeMake 5.11.0 "--cores" is mandatory:
[https://snakemake.readthedocs.io/en/stable/project_info/history.html#id2]

@tsibley
Copy link
Member

tsibley commented Mar 17, 2020

Re: the test failures: I think the Snakemake change in 5.11 is poorly considered and have asked for it to be reverted.

tsibley added a commit that referenced this pull request Mar 17, 2020
Version 5.11 made the --cores option mandatory, which breaks most
existing invocations of snakemake in our workflows (and the workflow
patterns we recommend).  This is causing Travis CI failures, as noted in
issue #456.

The previous commit makes the same version change for setup.py for when
augur is installed with pip.

If Snakemake reverts this breaking change¹, then we can revert this
change as well, while declaring 5.11.0 and 5.11.1 to be incompatible.

¹ snakemake/snakemake#284
@huddlej
Copy link
Contributor

huddlej commented Mar 18, 2020

@danielsoneg Thank you so much for digging into this issue! I haven't had time to confirm that this patch fixes Trevor's specific bug, but I will try it out today.

Also, we are thrilled to have contributions on these issues from volunteers like yourself. Every effort you make on an issue helps us in some way and we really appreciate it. If you are interested in finding other open issues that we've curated for volunteer contributions, check out the open project boards for Nextstrain.

When you find an issue you'd like to help out with, go ahead and comment on the issue so we know someone is looking into it. Even if you end up not working on that issue, it will help prevent overlapping work from multiple people on the same issue.

@danielsoneg
Copy link
Author

Sounds good, thanks!

@huddlej
Copy link
Contributor

huddlej commented Mar 20, 2020

After digging into this more, I think the bug @danielsoneg identified and corrected is a separate issue that we’ve been blind to. The issue with the Zika and ncov builds appears to be a mismatch in how sequences are identified by name or id depending on their source from FASTA or GenBank files, respectively.

Here is the breakdown for Zika:

  1. BioPython’s sequence record assigns the GenBank record accession as the record id and assigns the LOCUS as the name.

  2. The reference in augur align is tracked by its record id and not the record name. All other sequences in augur align are tracked by their record name.

  3. The Zika reference name is “PF13/251013_18” but its id is “KX369547.1”.

  4. A redundant copy of the reference sequence is included in the downloaded sequences from fauna and this sequence is tracked by its name as “PF13/251013_18”.

  5. Since augur align never knows this redundant sequence has the same name (because name != id), that sequence is never removed.

  6. The GenBank sequence with the accession name is properly removed.

The best solution to the Zika problem might be to filter the redundant reference sequence out prior to alignment by adding it to the dropped strains file.

The same problem might occur with the ncov dataset. The original reference GenBank file had an id of “MN908947.3” and a name of “MN908947”. If the reference appeared in the input sequences, too, as “MN908947”, it would never be removed.

However, @trvrb recently updated the GenBank file for the reference such that the id and name would be identical. This change to the GenBank file should result in redundant sequences being detected prior to alignment or being removed at the end of the alignment.

It might be helpful in the long term to have augur align always use a record’s name instead of its id, to avoid these kinds of issues. It would also be helpful to have unit tests for basic alignment functionality like this.

@trvrb, is this issue still open for you now that you’ve changed the ncov reference?

I think we should still merge this PR, as it fixes a real (if infrequent) bug.

@huddlej
Copy link
Contributor

huddlej commented Mar 20, 2020

@danielsoneg Before I merge this, can you remove the reference to fixing #454, since the issue you found is different from Trevor's original problem?

Can you also add a doctest to the strip_non_reference function to test the functionality you've fixed? You should be able to add your example FASTA file with no gaps to tests/data/ and add an example call to the function with those data to the docstring. See this example from the distance module as a starting place.

You can check your test worked by running the following command from the augur top-level directory:

pytest -c pytest.python3.ini

Thank you!

@danielsoneg
Copy link
Author

Thanks @huddlej - I updated the description and added some doctests and a ludicrously limited test alignment file.

@huddlej
Copy link
Contributor

huddlej commented Mar 24, 2020

@danielsoneg Awesome! This works great for me and is a timely fix, as we've just encountered this bug in the ncov builds.

I've just merged this after removing the "tests/data/align/.test_aligned_sequences.fasta.swp" file and squashing the separate commits into a single commit. Thank you again for your help!

@huddlej huddlej closed this Mar 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants