Group segments by strains #18

jameshadfield · 2024-09-26T04:15:55Z

This PR closes #15 and is the first implementation of nextstrain/pathogen-repo-guide#59. There are still some improvements to be made, but it's at a good stopping point to reflect on the approach. The main outstanding tasks/improvements are around:

how we resolve metadata disagreements across segments (of the same strain), and probably providing a way to manually resolve disagreements (e.g. via a YAML file)
improved logging
potentially change the way segment-specific information is presented in the Auspice JSON

Live builds

https://next.nextstrain.org/staging/oropouche/S/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/M/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/L/grouped-by-strain

Running locally

Trial ingest files are available at:

s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/metadata.tsv.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/S/sequences.fasta.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/M/sequences.fasta.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/L/sequences.fasta.zst

The phylo workflow may be run locally by adding --config ingest_url_prefix:"https://data.nextstrain.org/files/workflows/oropouche/trials/grouped-by-strain/"

Current conflicts in the metadata

The approach is currently overly conservative and removes a number of strains due to conflicting metadata. They fall into three groups, however in total we only drop 21 strains + another 8 segments so I think providing a manually curated conflict-resolution-YAML is the best way to resolve these, where appropriate.

Multiple sequences for a segment
There are 14 of these. Example:

Strain TRVL9760 had multiple sequences for segment S. Dropping this entire strain.

Disagreement of common strain fields
Seven occurrences. One for sampling date, six for authors:

ERROR! 'H498913' sampling dates disagree: {'1988-XX-XX', '1990-XX-XX'}
ERROR! 'LET-2083' Disagreement for 'authors', observed values: {'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.,Perez,L.J.,Perez-Restrepo,L.,Ciuoderis,K.,Usuga,J.,Moreno,I.,Vargas,V.,Arevalo-Arbelaez,A.J.,Berg,M.G.,Cloherty,G.A.,Osorio,J.E.', 'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.', 'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.,Perez,L.J.,Perez-Restrepo,L.S.,Ciuoderis,K.,Usuga,J.,Moreno,I.,Vargas,V.,Arevalo-Arbelaez,A.J.,Berg,M.G.,Cloherty,G.A.,Osorio,J.E.'}

Accession mapped to no segments / multiple segments
n=8, examples:

Accession 'PP952119' (strain 'IRCCS-SCDC_1/2024') mapped to multiple segments: L, S
Accession 'KP347674' (strain 'OROV/MT_CbaH498/2012') mapped to no segments

joverlee521

I haven't fully finished review, but wanted to provide some workflow feedback.

ingest/rules/curate.smk

ingest/rules/nextclade.smk

ingest/scripts/parse_nextclade_tsv.py

ingest/rules/curate.smk

ingest/scripts/group_metadata.py

miparedes · 2024-10-04T21:19:37Z

Maybe I'm missing it somewhere but does this new workflow output a file, or even to the terminal, the strains that are being deduped? I ask because I noticed in the builds that are staged that all the Colombia seqs are missing (which can be seen in the live site) and I can't really tell if that's because they were dropped because theyre duplicates or not. Either way, even if some are duplicated I'd expect at least one of those seqs to make it through?

ingest/defaults/segment_resolutions.yaml

ingest/defaults/annotations.tsv

joverlee521

Thanks for the multiple iterations on this James! I mostly left non-blocking comments.

I think the only things that need to be resolved

move augur curate apply-record-annotations back to the curate chain (as you've noted)
fix the segment sequence filtering

ingest/rules/nextclade.smk

joverlee521 · 2024-10-08T18:38:23Z

ingest/scripts/group_segments.py

+    if len(segments_present)>1:
+        raise AssertionError(f"Accession '{accession}' (strain '{strain}') mapped to multiple segments: {', '.join(segments_present)}. Skipping this accession.")


non-blocking

Huh, could a single sequence align to multiple segments? Did you actually run into this situation in testing?

Yes!

Accession 'PP952119' (strain 'IRCCS-SCDC_1/2024') mapped to multiple segments: L, S. Skipping this accession.

Oh weird, I would have expected the alignment to fail for one of the segments. I wonder if the params for the Nextclade alignment are too lax.

ingest/rules/curate.smk

ingest/rules/group_segments.smk

Produces a single large metadata file keyed off accession with both strain names and segment-level information encoded therein. The following commit will group this "long" metadata TSV into a "wide" TSV where each row represents a strain.

@joverlee521

Allows us to avoid another python script. Original implementation by @joverlee521 in <#18 (comment)>

@joverlee521

Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ ¹ <#18 (comment)>

They're provided in our other runtimes (almost by happenstance, as part of the underlying OS image) and having them available in all runtimes makes it much easier to write portable programs without having to deal with GNU vs. BSD differences. Note that typically these GNU programs would already be available in the Conda runtime on Linux (via the host system), but not the Conda runtime on macOS (unless installed separately, e.g. via Homebrew). So explicitly including the GNU-flavored programs here increases consistency, isolation, and portability of the runtime. These were ostensibly overlooked in "Provide GNU coreutils in the runtime" (a0b6609). Related-to: <nextstrain/oropouche#18 (comment)>

@joverlee521

Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ NOTE: Here we use accession as the ID, however using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs. ¹ <#18 (comment)>

jameshadfield · 2024-10-09T23:37:39Z

@miparedes - I hope to merge this tomorrow, please let me know if you don't want this merged!

miparedes · 2024-10-10T16:45:27Z

I just did a fresh pull and everything seems to be running okay. The trees look pretty different than whats currently on nextstrain.org though. I reran everything using this current branch and uploaded the jsons to staging. If you compare https://nextstrain.org/staging/oropouche/S to https://nextstrain.org/oropouche/S you see a different tree topology plus 400 more seqs although the nextstrain.org one was just updated yesterday it seems. same with https://nextstrain.org/staging/oropouche/M where you see the time tree adn the divergence tree looking pretty different (the lower lineage in M seems to be the less diverged but close to the present in time?) but it's probably due to this weird tip https://nextstrain.org/staging/oropouche/M?l=clock&s=BeH505764 which should be easy to exclude.

There doesnt seem to be a massive upload of new seqs to explain the large difference between the two builds so I guess im not too sure where the discrepancy is coming from? Thoughts @jameshadfield ?

jameshadfield · 2024-10-10T19:09:56Z

There doesnt seem to be a massive upload of new seqs to explain the large difference between the two builds so I guess im not too sure where the discrepancy is coming from?

Thanks for flagging! I forgot to update the phylo exclude/include lists to use strain names. Will sort that out ASAP.

miparedes

This looks great to me. Thanks for all the hard work @jameshadfield ! The only outstanding thing is to figure out why the original build didnt include those recent brazil seqs but it isn't blocking given that it seems like the build from this PR is probably the more accurate one.

Transforms a metadata file with one row per accession into a file with one row per strain. Where a strain contains sequences for multiple segments this will group those segments together. Segment-specific field names (i.e. those not in --common-strain-fields) are modified to ensure their suffix is "_{segment}". For instance "accession → accession_HA". Rows are matched on strain name and basic sanity checking is performed when grouping. Segments with multiple matches for a given strain are dropped, and strains where all segments have either zero or multiple matches are dropped entirely. Manual resolutions may be provided via a `--resolutions` YAML which is a list of dictionaries, each with keys "strain", "accession" and "segment" informing the program which accession (of multiple) to use. The resolutions here are taken both from exploring the data and the existing phylo exclude list. Any disagreement within the "--common-strain-fields" will result in the strain being dropped, however empty values may be replaced and ambiguous dates may be replaced with specific ones (where appropriate).

@joverlee521

Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ NOTE: Here we use accession as the ID, however using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs. ¹ <#18 (comment)>

Updates the files which ingest uploads and makes the corresponding changes to the phylogenetic workflow. As metadata (and sequences) now use "strain" as the unique ID a number of simplifications can be made to the workflow. There is one regression: the "accession" column no longer exists and is thus not exported. We'll fix this in a subsequent commit.

This adds back in segment-specific metadata to the Auspice JSON. There are multiple ways this can be done, each with trade-offs. The approach employed here leaves the "_{segment}" suffix on the field names. Alternatively we could remap the metadata file for each `export` call so that (e.g.) "accession_S" becomes "accession".

To use the new metadata format where we group by strain. Steps to regenerate: 1. Populate 'data/' with metadata & sequences from an ingest run. 2. Subsample this as example data via: ``` augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta ```

genehack mentioned this pull request Oct 2, 2024

Extend oropouche segment-matching approach to avian flu nextstrain/avian-flu#92

Open

jameshadfield force-pushed the james/dedup-segments branch from 1883dbf to cbcc232 Compare October 2, 2024 23:37

joverlee521 reviewed Oct 4, 2024

View reviewed changes

jameshadfield commented Oct 4, 2024

View reviewed changes

ingest/defaults/segment_resolutions.yaml Outdated Show resolved Hide resolved

jameshadfield force-pushed the james/dedup-segments branch from cbcc232 to 7324fd2 Compare October 8, 2024 02:11

jameshadfield commented Oct 8, 2024

View reviewed changes

ingest/defaults/annotations.tsv Outdated Show resolved Hide resolved

joverlee521 requested changes Oct 8, 2024

View reviewed changes

jameshadfield added 2 commits October 9, 2024 14:49

[ingest] Add segment information to metadata

d2b13d1

Produces a single large metadata file keyed off accession with both strain names and segment-level information encoded therein. The following commit will group this "long" metadata TSV into a "wide" TSV where each row represents a strain.

[ingest] parse nextclade TSV using csvtk

9bfce1b

Allows us to avoid another python script. Original implementation by @joverlee521 in <#18 (comment)>

jameshadfield force-pushed the james/dedup-segments branch from 7324fd2 to afa412f Compare October 9, 2024 02:32

tsibley mentioned this pull request Oct 9, 2024

Provide GNU awk, grep, sed, and tar in the runtime nextstrain/conda-base#94

Merged

1 task

joverlee521 approved these changes Oct 9, 2024

View reviewed changes

genehack approved these changes Oct 9, 2024

View reviewed changes

genehack requested a review from miparedes October 9, 2024 23:25

jameshadfield force-pushed the james/dedup-segments branch from afa412f to 2ee0109 Compare October 9, 2024 23:27

miparedes approved these changes Oct 10, 2024

View reviewed changes

jameshadfield added 5 commits October 11, 2024 12:30

jameshadfield force-pushed the james/dedup-segments branch from 6af6b61 to 091dc7f Compare October 10, 2024 23:31

jameshadfield merged commit 06b3e6e into main Oct 11, 2024
5 checks passed

jameshadfield deleted the james/dedup-segments branch October 11, 2024 00:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Group segments by strains #18

Group segments by strains #18

jameshadfield commented Sep 26, 2024

joverlee521 left a comment

miparedes commented Oct 4, 2024 •

edited

Loading

joverlee521 left a comment •

edited

Loading

joverlee521 Oct 8, 2024

jameshadfield Oct 8, 2024

joverlee521 Oct 9, 2024

jameshadfield commented Oct 9, 2024

miparedes commented Oct 10, 2024

jameshadfield commented Oct 10, 2024

miparedes left a comment

		if len(segments_present)>1:
		raise AssertionError(f"Accession '{accession}' (strain '{strain}') mapped to multiple segments: {', '.join(segments_present)}. Skipping this accession.")

Group segments by strains #18

Group segments by strains #18

Conversation

jameshadfield commented Sep 26, 2024

Live builds

Running locally

Current conflicts in the metadata

joverlee521 left a comment

Choose a reason for hiding this comment

miparedes commented Oct 4, 2024 • edited Loading

joverlee521 left a comment • edited Loading

Choose a reason for hiding this comment

joverlee521 Oct 8, 2024

Choose a reason for hiding this comment

jameshadfield Oct 8, 2024

Choose a reason for hiding this comment

joverlee521 Oct 9, 2024

Choose a reason for hiding this comment

jameshadfield commented Oct 9, 2024

miparedes commented Oct 10, 2024

jameshadfield commented Oct 10, 2024

miparedes left a comment

Choose a reason for hiding this comment

miparedes commented Oct 4, 2024 •

edited

Loading

joverlee521 left a comment •

edited

Loading