
read_metadata: Allow graceful handling of duplicate strain names #810

Closed
corneliusroemer opened this issue Dec 10, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@corneliusroemer
Member

Context
I'm using GISAID's data export for Augur. Someone uploaded two different sequences with different EPI ISLs but the same strain name. This causes augur's read_metadata to crash here:

def check_metadata_duplicates(self):
    duplicates = (
        self.metadata[self.key_type]
        .value_counts()
        .reset_index()
        .query(f"{self.key_type} > 1")["index"]
        .values
    )

        if len(duplicates) > 0:
            raise ValueError(
                f"Duplicated {self.key_type} in metadata: {', '.join(duplicates)}"
            )

It would be good if there were a flag one could set to accept duplicates and just choose the first (or last, or none).

As it stands, I need to manually remove the duplicate or run a separate script to deduplicate the metadata myself. It can be done, but it would be nice if the util had a flag taking care of this.
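Such a flag could delegate to pandas directly. A minimal sketch of the requested behavior (the function name and keep parameter are illustrative, not part of augur's actual API; keep mirrors pandas.DataFrame.drop_duplicates):

```python
import pandas as pd

def deduplicate_metadata(metadata: pd.DataFrame, key: str, keep="first") -> pd.DataFrame:
    """Drop duplicate rows by the given key column.

    keep may be "first", "last", or False (drop every copy),
    matching the semantics of pandas.DataFrame.drop_duplicates.
    """
    return metadata.drop_duplicates(subset=[key], keep=keep)

# Two records share the strain name "A/1" but have distinct EPI ISLs.
metadata = pd.DataFrame({
    "strain": ["A/1", "A/1", "B/2"],
    "epi_isl": ["EPI_ISL_1", "EPI_ISL_2", "EPI_ISL_3"],
})
print(deduplicate_metadata(metadata, "strain"))  # keeps EPI_ISL_1 and EPI_ISL_3
```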

@corneliusroemer corneliusroemer added the enhancement New feature or request label Dec 10, 2021
@tsibley
Member

tsibley commented Jul 10, 2024

Consider augur curate's standardized duplicate handling option:

shared_inputs.add_argument("--duplicate-reporting",
    type=DataErrorMethod.argtype,
    choices=list(DataErrorMethod),
    default=DataErrorMethod.ERROR_FIRST,
    help="How should duplicate records be reported.")

augur/augur/types.py

Lines 39 to 49 in 18e07b4

@enum.unique
class DataErrorMethod(ArgparseEnum):
    """
    Enum representation of string values that represent how a data error should
    be handled.
    """
    ERROR_FIRST = 'error_first'
    ERROR_ALL = 'error_all'
    WARN = 'warn'
    SILENT = 'silent'
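To sketch how a duplicate check might branch on this enum (augur's ArgparseEnum base class and reporting helpers are not shown here, so this uses a plain enum.Enum stand-in and a hypothetical report_duplicates function):

```python
import enum

@enum.unique
class DataErrorMethod(enum.Enum):
    # Simplified stand-in for augur's ArgparseEnum subclass.
    ERROR_FIRST = 'error_first'
    ERROR_ALL = 'error_all'
    WARN = 'warn'
    SILENT = 'silent'

def report_duplicates(duplicates, method):
    """Report duplicate record keys according to the chosen method (illustrative)."""
    if not duplicates:
        return
    if method is DataErrorMethod.ERROR_FIRST:
        # Fail fast on the first duplicate encountered.
        raise ValueError(f"Duplicate record: {duplicates[0]}")
    if method is DataErrorMethod.ERROR_ALL:
        # Collect and report every duplicate before failing.
        raise ValueError(f"Duplicate records: {', '.join(duplicates)}")
    if method is DataErrorMethod.WARN:
        print(f"WARNING: duplicate records: {', '.join(duplicates)}")
    # SILENT: do nothing.
```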

@trvrb
Member

trvrb commented Aug 13, 2024

Re-upping this here in the context of augur filter. My guess had been that augur filter would de-duplicate entries, but I just discovered in the context of augur merge review that metadata and sequences are handled differently in augur filter.

For metadata I tested by doubling entries in metadata.tsv from zika and then ran a simple augur filter command:

augur filter --sequences sequences.fasta --metadata metadata_doubled.tsv --output-sequences test.fasta --output-metadata test.tsv

This errors out with "ERROR: The following strains are duplicated in 'metadata_doubled.tsv':". That makes sense, as otherwise downstream commands like augur refine won't know which entry to use.

For sequences I tested by doubling entries in sequences.fasta and then ran a similar filter command:

augur filter --sequences sequences_doubled.fasta --metadata metadata.tsv --output-sequences test.fasta --output-metadata test.tsv

Here, there's no complaint. The resulting files have n entries in test.tsv and 2n entries in test.fasta.

I think we should be consistent with behavior for sequences and metadata and error out if there are duplicate sequences when running augur filter.
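The sequence-side check being proposed only needs the FASTA header ids. A minimal sketch of such a duplicate scan (not augur's implementation; the function name is illustrative):

```python
from collections import Counter

def duplicate_fasta_ids(lines):
    """Return sequence ids that appear more than once among FASTA records.

    Accepts any iterable of lines; the id is the first whitespace-delimited
    token after ">" on each header line.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">")
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)

# "seq1" occurs twice, mimicking a doubled sequences.fasta.
fasta = [">seq1", "ACGT", ">seq2", "ACGG", ">seq1", "ACGT"]
print(duplicate_fasta_ids(fasta))  # ['seq1']
```

With such a check, augur filter could raise the same kind of error for sequences that it already raises for metadata.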

@trvrb trvrb mentioned this issue Aug 13, 2024
@victorlin
Member

@trvrb thanks for resurfacing this. Zooming out a bit, there have been several issues where de-duplication has been discussed:

#586, #725, and #810 all seem to be discussing the same thing: allow existing Augur commands to handle de-duplication.

#919 and #616 are separately similar: create a new Augur command to handle de-duplication.

The current status quo and consensus used by most commands is well summarized by @huddlej in #616 (comment):

We should try to keep Augur commands as tightly scoped as possible to solve specific problems. Resolving duplicates should be outside the responsibilities of the parse command; another Augur command should be responsible for deduplicating/sanitizing sequences and metadata.

Based on that, the solution here is to close #586, #725, and #810 as "not planned". Separately, as pointed out in the previous comment, there is a bug in augur filter where --output-sequences allows duplicates to pass through. This should be fixed to error on duplicates.

@trvrb
Member

trvrb commented Aug 15, 2024

Thanks for the write up @victorlin! augur merge --sequences would handle de-duplication naturally and would seem to be a good place for it to live.

I suspect it doesn't make sense to even check for duplicates in many commands. For instance, augur mask could be run generically, and there could be good reason for the sequences passed in to have duplicate FASTA names. However, as above, augur filter already errors on duplicates in metadata, and we should do the same for duplicate sequences.

On the other hand, commands like augur traits would behave strangely if the metadata passed in has duplicates. I'd suggest identifying the commands where duplicates are potentially toxic and making those specific commands error if duplicates are found.

@victorlin victorlin changed the title Allow graceful handling of duplicate strain names read_metadata: Allow graceful handling of duplicate strain names Aug 27, 2024
@victorlin
Member

Closing this as not planned per #810 (comment)

@victorlin victorlin closed this as not planned Aug 27, 2024