Filtering by genotype #1265

jameshadfield · 2021-01-14T21:42:16Z

This comment was updated Jan 22nd

PR Status

Ready for review. I believe there are no bugs (see below) but testing is required and performance may not be good enough for release.

Summary

This PR adds genotype filters to the side-bar dropdown allowing filtering of the dataset by any observed genotype, as long as it was at a variable site.

Notes and example URLs

Updated to relect the changes introduced in 3ca2c02

Example of N501Y + D614G + P681H which corresponds to 20I/501Y.v1 / B.1.1.17: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.501Y,614G,681H
Example of N501Y + E484K, a constellation which is so far shared by 20J/501Y.V3 and 20H/501Y.V2: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.501Y,484K
Example of a basal "mutation": https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.982S
Example of a mutation at a variable site but which is not observed in the data, which is silently removed at load: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.982S
Example of a genotype state (nuc:502A) which does exist, but is at a non-variable site. The filter is silently removed: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?c=gt-nuc_502&gt=nuc.502A
"zoom-to-selected" hasn't been considered and doesn't work as it ideally should. This seems to require the same fix as that which @joverlee521 is currently working on.
~~Initial rendering of the filtering-dropdown is very slow. This seems to be one solution, which I may add in this PR or via another PR.~~ See below for update.
We don't yet allow OR filtering for the same position, thus one cannot filter by (e.g.) S 501Y || S 501T https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.501Y,501T

cc @kairstenfay @joverlee521 @trvrb @eharkins @huddlej re: this morning's meeting, any testing would be super useful!

See added code for a number of "todos" before this can be considered ready.

Functions and tests for encoding and decoding genotype filters.

We now remove any genotype filters that we don't observe in the tree. This isn't ideal because we can't filter by basal mutations. Sometimes this just a bit frustrating -- e.g. "ncov/global?gt=ORF1a:4261" just gets removed, because it's a basal nucleotide with no reversions. Other times it's highly confusing, e.g. "ncov/global?gt=S:614D" only filters to tips which have reverted to aspartic acid and ignores all the genomes which have a basal D codon.

Refactor code to expose a function which collects all genotype states in the tree with the exception of non-variable sites (for which the option root-sequence JSON would be needed). This updates the behavior to preserve filter-by-genotype URL queries on load which define basal states for variable sites. If the site is not variable, then the filter is removed as it makes little sense to filter to non-variable sites. Note that the filtering algorithm needs to be updated (future commit) to take into account basal states.

Builds on the previous commit to allow filtering by basal states (of variable sites).

trvrb · 2021-01-22T01:39:49Z

Fantastic work here James! Observations during testing:

(I know you flagged this...) If I click on the "filter data" box here https://nextstrain.org/ncov/global it takes ~0.5 seconds to load a text cursor and provide list of clickable options, while if I click "filter data" box at https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global, it takes ~2 seconds to get to the same place.
Similarly, when deleting elements in the text box, getting back to zero text results in a ~2 second beachball.
It's a bit of a secret language that it needs to be s:417. I started with just 417 and got a bunch of samples with 417 in their name before mutations and then I tried s 417 and nothing came up. I know that we label branches tooltip as "S: K417N...", but our clade spec calls for labels of 20C/S.417N with a . rather than a :. We use the . notation in places like https://nextstrain.org/groups/neherlab/ncov/S.E484?c=gt-S_484.
I know this is special casing but I'd push "mutation" for s in particular above "sample". I wanted to just type 417 (or 501) and something useful to happen. If I just type one of these I have to scroll through a bunch of options to find what I was looking for. This said, the S. entries could just be pushed to the top of the mutation list.
A little strange that the URL for genotype filtering is gt=S:501Y while other filters are of the form f_division=Washington. Maybe f_gt instead? Colorby is c=gt- vs c=division.
The genotype filter pillboxes in the header don't list number of tips with a genotype. We have Washington (20) vs S:501Y. That said I think it's going to be too many numbers if we had S:501Y (429), so agree with this design decision. Side note: I like including sample counts in the footer text, but I'm not sure if it's as useful in the header.
I think that exposing this through magic "filter data" select box is definitely the right place to start, but that said, most people won't discover it this way. I'd imagine immediate use would most come from deep-linking into nextstrain.org by CoVariants, etc... to highlight constellations of interest. Surfacing genotypes as filters falls into our general issue of how to highlight this more via UI (another sidebar tab? or perhaps other options as we've discussed). Though keeping in universal search in the sidebar definitely makes total sense for now.

Generally, this will definitely be useful for flu as well: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/flu/seasonal/h3n2/ha/2y?gt=HA1:159Y,198P. And I can see it working for variants of interest. 20C/S.542R ie CAL.20C is: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_clade_membership=20C&gt=S:452R. Or the "Ohio variant" is seen by searching for US sequences with 501Y: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_country=USA&gt=S:501Y.

The performance issue is blocker for me for release (but not for merge). The magic text boxes really need to be snappy to be truly a good user experience.

trvrb · 2021-01-22T02:05:40Z

Other thought here is that in written text it's most common to say S 417N, so this might be the best way to list these even if our clade labels use .. This is most likely to work with how people would naturally type things into the filter box. Same for HA1 159Y in flu.

Using `react-virtualized-select` here makes the sidebar filtering UI much more responsive. Note that our version of react select is well out of date and we should upgrade sometime in the future to take advantage of more efficient rendering options.

Shifts the URL query syntax to `{gene}.{pos}{residue}` and the rendering in auspice to `{gene} {pos}{residue}`.

It's user-friendly to have mutations appear above strains in the filtering UI. We also sort the mutations for improved usability.

jameshadfield · 2021-01-22T04:02:18Z

Thanks @trvrb!

The performance issue is blocker for me for release (but not for merge). The magic text boxes really need to be snappy to be truly a good user experience.

Absolutely, and thankfully this turned out to be easier than feared 😌 The performance has been greatly improved by f9944eb, and now seems good enough for production IMO.

It's a bit of a secret language that it needs to be s:417 ...

Yeah. I've changed to using . in the URL query (to match prior art) and use a space in the rendering now, which is nicer but still somewhat secret. I think URL sharing will expose these to most users, but agree there's more to be done here.

I know this is special casing but I'd push "mutation" for s in particular above "sample". I wanted to just type 417 (or 501) and something useful to happen. If I just type one of these I have to scroll through a bunch of options to find what I was looking for. This said, the S. entries could just be pushed to the top of the mutation list.

I've pushed mutations above strains, and sorted them, which help a bit. But one still needs to type the gene in most cases. I'll think about this a bit more.

A little strange that the URL for genotype filtering is gt=S:501Y while other filters are of the form f_division=Washington. Maybe f_gt instead? Colorby is c=gt- vs c=division.

So there is a reason for this, but perhaps it's a bit ridiculous. Note that filtering to strain(s) behaves similar -- they get ?s= not ?f_strain=. Since "gt" and "strain" are commenly interpreted as metadata traits (e.g. the "NY99 genotype", or the "NY99 strain"), this allows JSONs to define node-traits "gt" and "strain" without any issues (they would be filterable via f_gt and f_strain).

The genotype filter pillboxes in the header don't list number of tips with a genotype.

Yes. This is inconsistent but I think i'll leave it for another day. And as you say, we may choose to remove them from the header as they don't update when filtering, so are arguably more confusing than helpful.

P.S. Bundlesizes are causing CI to fail, but I'll tackle this with #1266.

P.P.S here are the updated URLs from your comment:

https://auspice-filter-by-genot-lf7x8o.herokuapp.com/flu/seasonal/h3n2/ha/2y?gt=HA1.159Y,198P
CAL.20C is: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_clade_membership=20C&gt=S.452R
"Ohio variant" is seen by searching for US sequences with 501Y: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_country=USA&gt=S.501Y.

trvrb · 2021-01-22T04:34:05Z

This all looks great. Thanks for making these changes. Performance is good now. The URL query is fine as it stands. I don't have any other suggestions for the PR.

huddlej · 2021-01-22T04:59:21Z

This is really cool, @jameshadfield! After playing around with flu data (where I know what to look for), I really like the performance and UI experience, especially with the ability to type "mutation" or "HA1" to shortcut to specific sites.

You mentioned this, but I wish I could select multiple alleles for the same site and have them OR'd in the display. For example, all recent H3N2 strains have a HA1:186 mutation, but some have 186S and others have 186D. The number of results tells me there's an intersection happening behind the scenes, but the UI set logic tells me it should be a union. I don't think it's a blocking issue for a secret new feature, but I could see users getting caught by this when all the other filters within a group (like regions) are displayed additively.

Actually, as I play with this more, I see that even across different positions, the set UI doesn't indicate that mutation filters provide an intersection:

I guess for the OR'ing to work across positions, you'd have to treat each position (HA1:186, etc.) as its own filter category like region, country, etc.? But then you could still intersect across sites (categories intersect) and show multiple alleles per site (within categories union).

emmahodcroft · 2021-01-22T16:48:56Z

This is so very cool James! I am so excited to have this in action 🌮 🎉 (that was supposed to be 2 'tada's, but I decided tacos are equally celebratory).

I didn't read John's comment thoroughly enough and immediately got into the same scenario as him. Though I did figure it out, it confused me initially and took me a few minutes to figure out why. I'm not sure how to handle that - looking to highlight both 501Y and 501T, for example. Perhaps as John says, making it a little clearer at the top would help this?

I also don't know if others would find it helpful or not - or if it would even be possible! - but the other thing I ran into was that I was trying out random mutations I spotted in the tree, and I'd immediately forget while typing which one was wild-type/reference & which was the mutation. I guess if you're looking at this more specifically you'd always know, but it did occur to me that maybe wild-type could be marked by an asterisk or something? But also maybe not worth the trouble!

trvrb · 2021-01-22T17:18:11Z

I guess if you're looking at this more specifically you'd always know, but it did occur to me that maybe wild-type could be marked by an asterisk or something? But also maybe not worth the trouble!

I'm worried that asterisk would be confused with stop codon. Immediate thought here would be to surface the count observed for different genotypes like mutation S 501Y (169) vs mutation S 501N (3851). But this has the same issue of being too many numbers and I don't really like it.

Though the thing that this makes me realize is that mutation S 501N as currently displayed is incorrect. This is not a "mutation" it's a genotype because it's wildtype. Generally I would write the "N501Y mutation" where you state the change from one AA to another vs the "501Y genotype" that just states the state of the AA at a position. I'd change the dropdown text to genotype S 501N and genotype S 501N. This will match how we label color by as well, where this is also Genotype.

As we use both wildtype and alt residues for filters, using "genotype" is preferable over "mutation". Furthermore, we tend to use mutation to refer to things such as "N501Y" whereas "501Y" would be the genotype.

Previously genotypes at the same residue (i.e. same gene and position) would intersect ("AND"), and thus couldn't be used. We now treat these as unions ("OR"). This allows useful queries such as `?gt=HA1.186D,186S` Note: The page info header does not accurately express the behavior of genotype filters. This will be improved in a subsequent commit.

No changes were made but the components which render the dataset info (above the viz panels) were broken out into smaller components and (partially) cleaned up. In general, the styles were shifted into the `<Info>` component so that the child components can focus on returning content.

Makes the union logic we employ for genotypes at the same position clear by rendering these as sets of residues within a set, and providing useful on-hover info boxes. The on-hover info boxes for other filters are also improved in this commit.

jameshadfield · 2021-01-25T04:24:55Z

This is not a "mutation" it's a genotype because it's wildtype.

Agreed - have changed the wording to "genotype" in 0ac6701

I wish I could select multiple alleles for the same site and have them OR'd in the display.

Thanks @huddlej for this feedback -- it made me realise this was important to do right away rather than as a later improvement. I also took the oppertunity to improve the on-hover info boxes for filters in general. This is how it now looks:

emmahodcroft · 2021-01-25T10:25:50Z

@jameshadfield The ability to filter to multiple AAs at the same position is awesome. 🚀

huddlej · 2021-01-25T17:27:57Z

Thank you, @jameshadfield! Totally agree with @emmahodcroft here. This is amazing. I'm going to use this today for our upcoming flu call. :)

trvrb · 2021-01-25T18:44:57Z

Amazing! I don't have further suggestions here. Looking forward to seeing this merged.

trvrb · 2021-01-25T20:38:31Z

Actually... I just found one small issue:

If you navigate to https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_clade_membership=20I/501Y.V1 and click on "zoom to selected" it properly zooms to the clade of interest.
while if you navigate to https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_clade_membership=20C&gt=S.452R and click "zoom to selected" it doesn't properly get the subclade bearing 452R.

Previously we were not recomputing the MRCA of the the filtered nodes if genotype filters were applied, which resulted in the "zoom to selected" button behaving as if the genotype filters did not exist. For non-genotype ("normal") filters, given a set `X` of visible nodes, we simultaneously find the MRCA of `X` and add the paths from the MRCA to `X`. For genotype filtering, we wish to find the MRCA of `X` _without_ modifying `X`. Note that this allows situations where `MRCA \not\in X`. This is introduced in this commit via the function `findFilteredMRCA` which uses a 3 step process: 1. Find the basal-most nodes of each (potentially non-monophyletic) visible cluster 2. Identify the paths from the root to the nodes in (1). 3. Find the first fork in this set of paths

Find smallest subtree with all visible data when using genotype filters

First pass at implementing filtering by genotype

960dfea

See added code for a number of "todos" before this can be considered ready.

jameshadfield temporarily deployed to auspice-filter-by-genot-b4jcxv January 14, 2021 21:42 Inactive

jameshadfield added 4 commits January 21, 2021 15:58

Functions for encoding and decoding genotype filters

47a8181

Functions and tests for encoding and decoding genotype filters.

Update genotype filtering algorithm to consider basal states

e4ecf3e

Builds on the previous commit to allow filtering by basal states (of variable sites).

jameshadfield force-pushed the filter-by-genotype branch from beca0c9 to e4ecf3e Compare January 22, 2021 00:25

jameshadfield changed the title ~~First pass at implementing filtering by genotype~~ Filtering by genotype Jan 22, 2021

jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 00:40 Inactive

jameshadfield mentioned this pull request Jan 22, 2021

Improve rendering performance of filter-by sidebar select box #1270

Closed

jameshadfield added 2 commits January 22, 2021 16:10

Update encoding and display of genotype filters

3ca2c02

Shifts the URL query syntax to `{gene}.{pos}{residue}` and the rendering in auspice to `{gene} {pos}{residue}`.

jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 03:37 Inactive

Genotype filters are sorted and above strains

1431b09

It's user-friendly to have mutations appear above strains in the filtering UI. We also sort the mutations for improved usability.

jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 03:46 Inactive

jameshadfield mentioned this pull request Jan 22, 2021

Improve search algorithm in sidebar filtering UI #1271

Closed

jameshadfield added 4 commits January 25, 2021 10:49

Use "genotype" for filter name

0ac6701

As we use both wildtype and alt residues for filters, using "genotype" is preferable over "mutation". Furthermore, we tend to use mutation to refer to things such as "N501Y" whereas "501Y" would be the genotype.

jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 25, 2021 04:19 Inactive

jameshadfield mentioned this pull request Jan 25, 2021

Filtering does not work as expected if internal traits are defined #1275

Open

jameshadfield mentioned this pull request Jan 26, 2021

Find smallest subtree with all visible data when using genotype filters #1276

Merged

Merge pull request #1276 from nextstrain/mrca-of-visible

6b58c32

Find smallest subtree with all visible data when using genotype filters

jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 26, 2021 04:45 Inactive

jameshadfield merged commit d0232f8 into master Jan 26, 2021

jameshadfield mentioned this pull request Jan 26, 2021

Download Data producing empty trees #1267

Closed

trvrb deleted the filter-by-genotype branch January 27, 2021 03:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering by genotype #1265

Filtering by genotype #1265

jameshadfield commented Jan 14, 2021 •

edited

Loading

trvrb commented Jan 22, 2021

trvrb commented Jan 22, 2021

jameshadfield commented Jan 22, 2021 •

edited

Loading

trvrb commented Jan 22, 2021 •

edited

Loading

huddlej commented Jan 22, 2021

emmahodcroft commented Jan 22, 2021

trvrb commented Jan 22, 2021

jameshadfield commented Jan 25, 2021

emmahodcroft commented Jan 25, 2021

huddlej commented Jan 25, 2021

trvrb commented Jan 25, 2021 •

edited

Loading

trvrb commented Jan 25, 2021

Filtering by genotype #1265

Filtering by genotype #1265

Conversation

jameshadfield commented Jan 14, 2021 • edited Loading

PR Status

Summary

Notes and example URLs

trvrb commented Jan 22, 2021

trvrb commented Jan 22, 2021

jameshadfield commented Jan 22, 2021 • edited Loading

trvrb commented Jan 22, 2021 • edited Loading

huddlej commented Jan 22, 2021

emmahodcroft commented Jan 22, 2021

trvrb commented Jan 22, 2021

jameshadfield commented Jan 25, 2021

emmahodcroft commented Jan 25, 2021

huddlej commented Jan 25, 2021

trvrb commented Jan 25, 2021 • edited Loading

trvrb commented Jan 25, 2021

jameshadfield commented Jan 14, 2021 •

edited

Loading

jameshadfield commented Jan 22, 2021 •

edited

Loading

trvrb commented Jan 22, 2021 •

edited

Loading

trvrb commented Jan 25, 2021 •

edited

Loading