filter: --output-sequences silently allows duplicates #1602

Closed
victorlin opened this issue Aug 27, 2024 · 0 comments · Fixed by #1613
Labels
bug Something isn't working

Initially reported in #810 (comment).

Current Behavior

When there are duplicate sequence records in the input --sequences, they are propagated to --output-sequences without any warning or error.

Expected behavior

augur filter should exit with an error.

How to reproduce

cat >metadata.tsv <<~~
strain	col1
SEQ1	A
SEQ2	B
~~

cat >sequences.fasta <<~~
>SEQ1
AAAA
>SEQ2
GGGG
>SEQ2
CCCC
~~

augur filter \
    --metadata metadata.tsv \
    --sequences sequences.fasta \
    --output-metadata output-metadata.tsv \
    --output-sequences output-sequences.fasta

cat output-sequences.fasta
# >SEQ1
# AAAA
# >SEQ2
# GGGG
# >SEQ2
# CCCC

Possible solution

Check for duplicate sequence IDs in --sequences and raise an error similar to:

raise AugurError(f"The following strains are duplicated in '{args.metadata}':\n" + "\n".join(sorted(duplicate_strains)))
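
The check described above could be sketched roughly as follows. This is a minimal, standalone illustration and not augur's actual implementation: `DuplicateSequenceError`, `read_fasta_ids`, and `check_for_duplicates` are hypothetical names, the exception stands in for `AugurError`, and the sketch assumes a sequence ID is the first whitespace-delimited token of each FASTA header line.

```python
from collections import Counter
from typing import Iterable, List


class DuplicateSequenceError(Exception):
    """Hypothetical stand-in for augur's AugurError."""


def read_fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the ID of every FASTA header line, in file order.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def check_for_duplicates(lines: Iterable[str], source: str) -> None:
    """Raise DuplicateSequenceError if any sequence ID appears more than once."""
    counts = Counter(read_fasta_ids(lines))
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise DuplicateSequenceError(
            f"The following strains are duplicated in '{source}':\n"
            + "\n".join(duplicates)
        )


# Usage with the sequences from the reproduction steps.
fasta_lines = [">SEQ1", "AAAA", ">SEQ2", "GGGG", ">SEQ2", "CCCC"]
try:
    check_for_duplicates(fasta_lines, "sequences.fasta")
except DuplicateSequenceError as error:
    print(error)
```

Counting IDs in a single pass keeps memory proportional to the number of records rather than to the sequence data, so a check like this could run before any sequences are written to --output-sequences.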

Your environment: if running Nextstrain locally

  • Version: augur 25.3.0