
read_metadata: Allow graceful handling of duplicate strain names #810

Closed
corneliusroemer opened this issue Dec 10, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@corneliusroemer
Member

Context
I'm using GISAID's data export for Augur. Someone uploaded two different sequences with different EPI ISLs but the same strain name. This causes augur's read_metadata to crash here:

def check_metadata_duplicates(self):
    duplicates = (
        self.metadata[self.key_type]
        .value_counts()
        .reset_index()
        .query(f"{self.key_type} > 1")["index"]
        .values
    )

        if len(duplicates) > 0:
            raise ValueError(
                f"Duplicated {self.key_type} in metadata: {', '.join(duplicates)}"
            )

It would be good if there were a flag one could set to accept duplicates and just choose the first (or last, or none).

As it stands, I need to manually remove the duplicate or run a separate script to deduplicate the metadata myself. It can be done, but it would be nice if the util had a flag taking care of this.
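Such a flag could delegate to pandas directly. A minimal sketch of the requested behavior (the function name and keep parameter are illustrative, not part of augur's actual API; keep mirrors pandas.DataFrame.drop_duplicates):

```python
import pandas as pd

def deduplicate_metadata(metadata: pd.DataFrame, key: str, keep="first") -> pd.DataFrame:
    """Drop duplicate rows by the given key column.

    keep may be "first", "last", or False (drop every copy),
    matching the semantics of pandas.DataFrame.drop_duplicates.
    """
    return metadata.drop_duplicates(subset=[key], keep=keep)

# Two records share the strain name "A/1" but have distinct EPI ISLs.
metadata = pd.DataFrame({
    "strain": ["A/1", "A/1", "B/2"],
    "epi_isl": ["EPI_ISL_1", "EPI_ISL_2", "EPI_ISL_3"],
})
print(deduplicate_metadata(metadata, "strain"))  # keeps EPI_ISL_1 and EPI_ISL_3
```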

@corneliusroemer corneliusroemer added the enhancement New feature or request label Dec 10, 2021
@tsibley
Member

tsibley commented Jul 10, 2024

Consider augur curate's standardized duplicate handling option:

shared_inputs.add_argument("--duplicate-reporting",
    type=DataErrorMethod.argtype,
    choices=list(DataErrorMethod),
    default=DataErrorMethod.ERROR_FIRST,
    help="How should duplicate records be reported.")

augur/augur/types.py

Lines 39 to 49 in 18e07b4

@enum.unique
class DataErrorMethod(ArgparseEnum):
    """
    Enum representation of string values that represent how a data error should
    be handled.
    """
    ERROR_FIRST = 'error_first'
    ERROR_ALL = 'error_all'
    WARN = 'warn'
    SILENT = 'silent'
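To sketch how a duplicate check might branch on this enum (augur's ArgparseEnum base class and reporting helpers are not shown here, so this uses a plain enum.Enum stand-in and a hypothetical report_duplicates function):

```python
import enum

@enum.unique
class DataErrorMethod(enum.Enum):
    # Simplified stand-in for augur's ArgparseEnum subclass.
    ERROR_FIRST = 'error_first'
    ERROR_ALL = 'error_all'
    WARN = 'warn'
    SILENT = 'silent'

def report_duplicates(duplicates, method):
    """Report duplicate record keys according to the chosen method (illustrative)."""
    if not duplicates:
        return
    if method is DataErrorMethod.ERROR_FIRST:
        # Fail fast on the first duplicate encountered.
        raise ValueError(f"Duplicate record: {duplicates[0]}")
    if method is DataErrorMethod.ERROR_ALL:
        # Collect and report every duplicate before failing.
        raise ValueError(f"Duplicate records: {', '.join(duplicates)}")
    if method is DataErrorMethod.WARN:
        print(f"WARNING: duplicate records: {', '.join(duplicates)}")
    # SILENT: do nothing.
```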

@trvrb
Member

trvrb commented Aug 13, 2024

Re-upping this here in the context of augur filter. My guess had been that augur filter would de-duplicate entries, but I just discovered in the context of augur merge review that metadata and sequences are handled differently in augur filter.

For metadata I tested by doubling entries in metadata.tsv from zika and then ran a simple augur filter command:

augur filter --sequences sequences.fasta --metadata metadata_doubled.tsv --output-sequences test.fasta --output-metadata test.tsv

This errors out with "ERROR: The following strains are duplicated in 'metadata_doubled.tsv':". That makes sense, as otherwise downstream commands like augur refine won't know which entry to use.

For sequences I tested by doubling entries in sequences.fasta and then ran a similar filter command:

augur filter --sequences sequences_doubled.fasta --metadata metadata.tsv --output-sequences test.fasta --output-metadata test.tsv

Here, there's no complaint. The resulting files have n entries in test.tsv and 2n entries in test.fasta.

I think we should be consistent with behavior for sequences and metadata and error out if there are duplicate sequences when running augur filter.
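The sequence-side check being proposed only needs the FASTA header ids. A minimal sketch of such a duplicate scan (not augur's implementation; the function name is illustrative):

```python
from collections import Counter

def duplicate_fasta_ids(lines):
    """Return sequence ids that appear more than once among FASTA records.

    Accepts any iterable of lines; the id is the first whitespace-delimited
    token after ">" on each header line.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">")
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)

# "seq1" occurs twice, mimicking a doubled sequences.fasta.
fasta = [">seq1", "ACGT", ">seq2", "ACGG", ">seq1", "ACGT"]
print(duplicate_fasta_ids(fasta))  # ['seq1']
```

With such a check, augur filter could raise the same kind of error for sequences that it already raises for metadata.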

@trvrb trvrb mentioned this issue Aug 13, 2024
@victorlin
Member

@trvrb thanks for resurfacing this. Zooming out a bit, there have been several issues where de-duplication has been discussed:

#586, #725, and #810 all seem to be discussing the same thing: allow existing Augur commands to handle de-duplication.

#919 and #616 are separately similar: create a new Augur command to handle de-duplication.

The current status quo and consensus used by most commands is well summarized by @huddlej in #616 (comment):

We should try to keep Augur commands as tightly scoped as possible to solve specific problems. Resolving duplicates should be outside the responsibilities of the parse command; another Augur command should be responsible for deduplicating/sanitizing sequences and metadata.

Based on that, the solution here is to close #586, #725, and #810 as "not planned". Separately, as pointed out in the previous comment, there is a bug in augur filter where --output-sequences allows duplicates to pass through. This should be fixed to error on duplicates.

@trvrb
Member

trvrb commented Aug 15, 2024

Thanks for the write up @victorlin! augur merge --sequences would handle de-duplication naturally and would seem to be a good place for it to live.

I suspect it doesn't make sense to even check for duplicates in many commands. For instance, augur mask could be run generically, and there could be good reason for the sequences passed in to have duplicate FASTA names. However, as above, augur filter already errors on duplicates in metadata, and we should do the same for duplicate sequences.

On the other hand, commands like augur traits would behave strangely if the metadata passed in has duplicates. I'd suggest identifying the commands where duplicates are potentially toxic and making those specific commands error if duplicates are found.

@victorlin victorlin changed the title Allow graceful handling of duplicate strain names read_metadata: Allow graceful handling of duplicate strain names Aug 27, 2024
@victorlin
Member

Closing this as not planned per #810 (comment)

@victorlin victorlin closed this as not planned Aug 27, 2024