Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering by genotype #1265

Merged
merged 14 commits into from
Jan 26, 2021
Merged

Filtering by genotype #1265

merged 14 commits into from
Jan 26, 2021

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Jan 14, 2021

This comment was updated Jan 22nd

PR Status

Ready for review. I believe there are no bugs (see below) but testing is required and performance may not be good enough for release.

Summary

This PR adds genotype filters to the side-bar dropdown allowing filtering of the dataset by any observed genotype, as long as it was at a variable site.

Notes and example URLs

Updated to relect the changes introduced in 3ca2c02

cc @kairstenfay @joverlee521 @trvrb @eharkins @huddlej re: this morning's meeting, any testing would be super useful!

See added code for a number of "todos" before this can be considered ready.
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-b4jcxv January 14, 2021 21:42 Inactive
Functions and tests for encoding and decoding genotype filters.
We now remove any genotype filters that we don't observe in the tree. This isn't ideal because we can't filter by basal mutations.

Sometimes this just a bit frustrating -- e.g. "ncov/global?gt=ORF1a:4261" just gets removed, because it's a basal nucleotide with no reversions. Other times it's highly confusing, e.g. "ncov/global?gt=S:614D" only filters to tips which have reverted to aspartic acid and ignores all the genomes which have a basal D codon.
Refactor code to expose a function which collects all genotype states in the tree with the exception of non-variable sites (for which the option root-sequence JSON would be needed).

This updates the behavior to preserve filter-by-genotype URL queries on load which define basal states for variable sites. If the site is not variable, then the filter is removed as it makes little sense to filter to non-variable sites.

Note that the filtering algorithm needs to be updated (future commit) to take into account basal states.
Builds on the previous commit to allow filtering by basal states (of variable sites).
@jameshadfield jameshadfield changed the title First pass at implementing filtering by genotype Filtering by genotype Jan 22, 2021
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 00:40 Inactive
@trvrb
Copy link
Member

trvrb commented Jan 22, 2021

Fantastic work here James! Observations during testing:

  1. (I know you flagged this...) If I click on the "filter data" box here https://nextstrain.org/ncov/global it takes ~0.5 seconds to load a text cursor and provide list of clickable options, while if I click "filter data" box at https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global, it takes ~2 seconds to get to the same place.
  2. Similarly, when deleting elements in the text box, getting back to zero text results in a ~2 second beachball.
  3. It's a bit of a secret language that it needs to be s:417. I started with just 417 and got a bunch of samples with 417 in their name before mutations and then I tried s 417 and nothing came up. I know that we label branches tooltip as "S: K417N...", but our clade spec calls for labels of 20C/S.417N with a . rather than a :. We use the . notation in places like https://nextstrain.org/groups/neherlab/ncov/S.E484?c=gt-S_484.
  4. I know this is special casing but I'd push "mutation" for s in particular above "sample". I wanted to just type 417 (or 501) and something useful to happen. If I just type one of these I have to scroll through a bunch of options to find what I was looking for. This said, the S. entries could just be pushed to the top of the mutation list.
  5. A little strange that the URL for genotype filtering is gt=S:501Y while other filters are of the form f_division=Washington. Maybe f_gt instead? Colorby is c=gt- vs c=division.
  6. The genotype filter pillboxes in the header don't list number of tips with a genotype. We have Washington (20) vs S:501Y. That said I think it's going to be too many numbers if we had S:501Y (429), so agree with this design decision. Side note: I like including sample counts in the footer text, but I'm not sure if it's as useful in the header.
  7. I think that exposing this through magic "filter data" select box is definitely the right place to start, but that said, most people won't discover it this way. I'd imagine immediate use would most come from deep-linking into nextstrain.org by CoVariants, etc... to highlight constellations of interest. Surfacing genotypes as filters falls into our general issue of how to highlight this more via UI (another sidebar tab? or perhaps other options as we've discussed). Though keeping in universal search in the sidebar definitely makes total sense for now.

Generally, this will definitely be useful for flu as well: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/flu/seasonal/h3n2/ha/2y?gt=HA1:159Y,198P. And I can see it working for variants of interest. 20C/S.542R ie CAL.20C is: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_clade_membership=20C&gt=S:452R. Or the "Ohio variant" is seen by searching for US sequences with 501Y: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/north-america?f_country=USA&gt=S:501Y.

The performance issue is blocker for me for release (but not for merge). The magic text boxes really need to be snappy to be truly a good user experience.

@trvrb
Copy link
Member

trvrb commented Jan 22, 2021

Other thought here is that in written text it's most common to say S 417N, so this might be the best way to list these even if our clade labels use .. This is most likely to work with how people would naturally type things into the filter box. Same for HA1 159Y in flu.

Using `react-virtualized-select` here makes the sidebar filtering UI much more responsive.

Note that our version of react select is well out of date and we should upgrade sometime in the future to take advantage of more efficient rendering options.
Shifts the URL query syntax to `{gene}.{pos}{residue}` and the rendering in auspice to `{gene} {pos}{residue}`.
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 03:37 Inactive
It's user-friendly to have mutations appear above strains in the filtering UI. We also sort the mutations for improved usability.
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 22, 2021 03:46 Inactive
@jameshadfield
Copy link
Member Author

jameshadfield commented Jan 22, 2021

Thanks @trvrb!

The performance issue is blocker for me for release (but not for merge). The magic text boxes really need to be snappy to be truly a good user experience.

Absolutely, and thankfully this turned out to be easier than feared 😌 The performance has been greatly improved by f9944eb, and now seems good enough for production IMO.

It's a bit of a secret language that it needs to be s:417 ...

Yeah. I've changed to using . in the URL query (to match prior art) and use a space in the rendering now, which is nicer but still somewhat secret. I think URL sharing will expose these to most users, but agree there's more to be done here.

I know this is special casing but I'd push "mutation" for s in particular above "sample". I wanted to just type 417 (or 501) and something useful to happen. If I just type one of these I have to scroll through a bunch of options to find what I was looking for. This said, the S. entries could just be pushed to the top of the mutation list.

I've pushed mutations above strains, and sorted them, which help a bit. But one still needs to type the gene in most cases. I'll think about this a bit more.

A little strange that the URL for genotype filtering is gt=S:501Y while other filters are of the form f_division=Washington. Maybe f_gt instead? Colorby is c=gt- vs c=division.

So there is a reason for this, but perhaps it's a bit ridiculous. Note that filtering to strain(s) behaves similar -- they get ?s= not ?f_strain=. Since "gt" and "strain" are commenly interpreted as metadata traits (e.g. the "NY99 genotype", or the "NY99 strain"), this allows JSONs to define node-traits "gt" and "strain" without any issues (they would be filterable via f_gt and f_strain).

The genotype filter pillboxes in the header don't list number of tips with a genotype.

Yes. This is inconsistent but I think i'll leave it for another day. And as you say, we may choose to remove them from the header as they don't update when filtering, so are arguably more confusing than helpful.

P.S. Bundlesizes are causing CI to fail, but I'll tackle this with #1266.

P.P.S here are the updated URLs from your comment:

@trvrb
Copy link
Member

trvrb commented Jan 22, 2021

This all looks great. Thanks for making these changes. Performance is good now. The URL query is fine as it stands. I don't have any other suggestions for the PR.

@huddlej
Copy link
Contributor

huddlej commented Jan 22, 2021

This is really cool, @jameshadfield! After playing around with flu data (where I know what to look for), I really like the performance and UI experience, especially with the ability to type "mutation" or "HA1" to shortcut to specific sites.

You mentioned this, but I wish I could select multiple alleles for the same site and have them OR'd in the display. For example, all recent H3N2 strains have a HA1:186 mutation, but some have 186S and others have 186D. The number of results tells me there's an intersection happening behind the scenes, but the UI set logic tells me it should be a union. I don't think it's a blocking issue for a secret new feature, but I could see users getting caught by this when all the other filters within a group (like regions) are displayed additively.

image

Actually, as I play with this more, I see that even across different positions, the set UI doesn't indicate that mutation filters provide an intersection:

image

I guess for the OR'ing to work across positions, you'd have to treat each position (HA1:186, etc.) as its own filter category like region, country, etc.? But then you could still intersect across sites (categories intersect) and show multiple alleles per site (within categories union).

@emmahodcroft
Copy link
Member

This is so very cool James! I am so excited to have this in action 🌮 🎉 (that was supposed to be 2 'tada's, but I decided tacos are equally celebratory).

I didn't read John's comment thoroughly enough and immediately got into the same scenario as him. Though I did figure it out, it confused me initially and took me a few minutes to figure out why. I'm not sure how to handle that - looking to highlight both 501Y and 501T, for example. Perhaps as John says, making it a little clearer at the top would help this?

I also don't know if others would find it helpful or not - or if it would even be possible! - but the other thing I ran into was that I was trying out random mutations I spotted in the tree, and I'd immediately forget while typing which one was wild-type/reference & which was the mutation. I guess if you're looking at this more specifically you'd always know, but it did occur to me that maybe wild-type could be marked by an asterisk or something? But also maybe not worth the trouble!

@trvrb
Copy link
Member

trvrb commented Jan 22, 2021

I guess if you're looking at this more specifically you'd always know, but it did occur to me that maybe wild-type could be marked by an asterisk or something? But also maybe not worth the trouble!

I'm worried that asterisk would be confused with stop codon. Immediate thought here would be to surface the count observed for different genotypes like mutation S 501Y (169) vs mutation S 501N (3851). But this has the same issue of being too many numbers and I don't really like it.

Though the thing that this makes me realize is that mutation S 501N as currently displayed is incorrect. This is not a "mutation" it's a genotype because it's wildtype. Generally I would write the "N501Y mutation" where you state the change from one AA to another vs the "501Y genotype" that just states the state of the AA at a position. I'd change the dropdown text to genotype S 501N and genotype S 501N. This will match how we label color by as well, where this is also Genotype.

As we use both wildtype and alt residues for filters, using
"genotype" is preferable over "mutation". Furthermore, we tend
to use mutation to refer to things such as "N501Y" whereas
"501Y" would be the genotype.
Previously genotypes at the same residue (i.e. same gene and position)
would intersect ("AND"), and thus couldn't be used. We now treat these
as unions ("OR"). This allows useful queries such as
`?gt=HA1.186D,186S`

Note: The page info header does not accurately express the behavior of
genotype filters. This will be improved in a subsequent commit.
No changes were made but the components which render the dataset info (above the viz panels) were broken out into smaller components and (partially) cleaned up.

In general, the styles were shifted into the `<Info>` component so that the child components can focus on returning content.
Makes the union logic we employ for genotypes at the same position clear by rendering these as sets of residues within a set, and providing useful on-hover info boxes.

The on-hover info boxes for other filters are also improved in this commit.
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 25, 2021 04:19 Inactive
@jameshadfield
Copy link
Member Author

This is not a "mutation" it's a genotype because it's wildtype.

Agreed - have changed the wording to "genotype" in 0ac6701

I wish I could select multiple alleles for the same site and have them OR'd in the display.

Thanks @huddlej for this feedback -- it made me realise this was important to do right away rather than as a later improvement. I also took the oppertunity to improve the on-hover info boxes for filters in general. This is how it now looks:

image

@emmahodcroft
Copy link
Member

@jameshadfield The ability to filter to multiple AAs at the same position is awesome. 🚀

@huddlej
Copy link
Contributor

huddlej commented Jan 25, 2021

Thank you, @jameshadfield! Totally agree with @emmahodcroft here. This is amazing. I'm going to use this today for our upcoming flu call. :)

@trvrb
Copy link
Member

trvrb commented Jan 25, 2021

Amazing! I don't have further suggestions here. Looking forward to seeing this merged.

@trvrb
Copy link
Member

trvrb commented Jan 25, 2021

Actually... I just found one small issue:

Previously we were not recomputing the MRCA of the the filtered nodes if genotype filters were applied, which resulted in the "zoom to selected" button behaving as if the genotype filters did not exist.

For non-genotype ("normal") filters, given a set `X` of visible nodes, we simultaneously find the MRCA  of `X` and add the paths from the MRCA to `X`.

For genotype filtering, we wish to find the MRCA of `X` _without_ modifying `X`.  Note that this allows situations where `MRCA \not\in X`. This is introduced in this commit via the function `findFilteredMRCA` which uses a 3 step process:

1. Find the basal-most nodes of each (potentially non-monophyletic) visible cluster
2. Identify the paths from the root to the nodes in (1).
3. Find the first fork in this set of paths
Find smallest subtree with all visible data when using genotype filters
@jameshadfield jameshadfield temporarily deployed to auspice-filter-by-genot-lf7x8o January 26, 2021 04:45 Inactive
@jameshadfield jameshadfield merged commit d0232f8 into master Jan 26, 2021
@trvrb trvrb deleted the filter-by-genotype branch January 27, 2021 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants