-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filtering by genotype #1265
Filtering by genotype #1265
Conversation
See added code for a number of "todos" before this can be considered ready.
Functions and tests for encoding and decoding genotype filters.
We now remove any genotype filters that we don't observe in the tree. This isn't ideal because we can't filter by basal mutations. Sometimes this just a bit frustrating -- e.g. "ncov/global?gt=ORF1a:4261" just gets removed, because it's a basal nucleotide with no reversions. Other times it's highly confusing, e.g. "ncov/global?gt=S:614D" only filters to tips which have reverted to aspartic acid and ignores all the genomes which have a basal D codon.
Refactor code to expose a function which collects all genotype states in the tree with the exception of non-variable sites (for which the option root-sequence JSON would be needed). This updates the behavior to preserve filter-by-genotype URL queries on load which define basal states for variable sites. If the site is not variable, then the filter is removed as it makes little sense to filter to non-variable sites. Note that the filtering algorithm needs to be updated (future commit) to take into account basal states.
Builds on the previous commit to allow filtering by basal states (of variable sites).
beca0c9
to
e4ecf3e
Compare
Fantastic work here James! Observations during testing:
Generally, this will definitely be useful for flu as well: https://auspice-filter-by-genot-lf7x8o.herokuapp.com/flu/seasonal/h3n2/ha/2y?gt=HA1:159Y,198P. And I can see it working for variants of interest. The performance issue is blocker for me for release (but not for merge). The magic text boxes really need to be snappy to be truly a good user experience. |
Other thought here is that in written text it's most common to say |
Using `react-virtualized-select` here makes the sidebar filtering UI much more responsive. Note that our version of react select is well out of date and we should upgrade sometime in the future to take advantage of more efficient rendering options.
Shifts the URL query syntax to `{gene}.{pos}{residue}` and the rendering in auspice to `{gene} {pos}{residue}`.
It's user-friendly to have mutations appear above strains in the filtering UI. We also sort the mutations for improved usability.
Thanks @trvrb!
Absolutely, and thankfully this turned out to be easier than feared 😌 The performance has been greatly improved by f9944eb, and now seems good enough for production IMO.
Yeah. I've changed to using
I've pushed mutations above strains, and sorted them, which help a bit. But one still needs to type the gene in most cases. I'll think about this a bit more.
So there is a reason for this, but perhaps it's a bit ridiculous. Note that filtering to strain(s) behaves similar -- they get
Yes. This is inconsistent but I think i'll leave it for another day. And as you say, we may choose to remove them from the header as they don't update when filtering, so are arguably more confusing than helpful. P.S. Bundlesizes are causing CI to fail, but I'll tackle this with #1266. P.P.S here are the updated URLs from your comment:
|
This all looks great. Thanks for making these changes. Performance is good now. The URL query is fine as it stands. I don't have any other suggestions for the PR. |
This is really cool, @jameshadfield! After playing around with flu data (where I know what to look for), I really like the performance and UI experience, especially with the ability to type "mutation" or "HA1" to shortcut to specific sites. You mentioned this, but I wish I could select multiple alleles for the same site and have them OR'd in the display. For example, all recent H3N2 strains have a HA1:186 mutation, but some have 186S and others have 186D. The number of results tells me there's an intersection happening behind the scenes, but the UI set logic tells me it should be a union. I don't think it's a blocking issue for a secret new feature, but I could see users getting caught by this when all the other filters within a group (like regions) are displayed additively. Actually, as I play with this more, I see that even across different positions, the set UI doesn't indicate that mutation filters provide an intersection: I guess for the OR'ing to work across positions, you'd have to treat each position (HA1:186, etc.) as its own filter category like region, country, etc.? But then you could still intersect across sites (categories intersect) and show multiple alleles per site (within categories union). |
This is so very cool James! I am so excited to have this in action 🌮 🎉 (that was supposed to be 2 'tada's, but I decided tacos are equally celebratory). I didn't read John's comment thoroughly enough and immediately got into the same scenario as him. Though I did figure it out, it confused me initially and took me a few minutes to figure out why. I'm not sure how to handle that - looking to highlight both 501Y and 501T, for example. Perhaps as John says, making it a little clearer at the top would help this? I also don't know if others would find it helpful or not - or if it would even be possible! - but the other thing I ran into was that I was trying out random mutations I spotted in the tree, and I'd immediately forget while typing which one was wild-type/reference & which was the mutation. I guess if you're looking at this more specifically you'd always know, but it did occur to me that maybe wild-type could be marked by an asterisk or something? But also maybe not worth the trouble! |
I'm worried that asterisk would be confused with stop codon. Immediate thought here would be to surface the count observed for different genotypes like Though the thing that this makes me realize is that |
As we use both wildtype and alt residues for filters, using "genotype" is preferable over "mutation". Furthermore, we tend to use mutation to refer to things such as "N501Y" whereas "501Y" would be the genotype.
Previously genotypes at the same residue (i.e. same gene and position) would intersect ("AND"), and thus couldn't be used. We now treat these as unions ("OR"). This allows useful queries such as `?gt=HA1.186D,186S` Note: The page info header does not accurately express the behavior of genotype filters. This will be improved in a subsequent commit.
No changes were made but the components which render the dataset info (above the viz panels) were broken out into smaller components and (partially) cleaned up. In general, the styles were shifted into the `<Info>` component so that the child components can focus on returning content.
Makes the union logic we employ for genotypes at the same position clear by rendering these as sets of residues within a set, and providing useful on-hover info boxes. The on-hover info boxes for other filters are also improved in this commit.
Agreed - have changed the wording to "genotype" in 0ac6701
Thanks @huddlej for this feedback -- it made me realise this was important to do right away rather than as a later improvement. I also took the oppertunity to improve the on-hover info boxes for filters in general. This is how it now looks: |
@jameshadfield The ability to filter to multiple AAs at the same position is awesome. 🚀 |
Thank you, @jameshadfield! Totally agree with @emmahodcroft here. This is amazing. I'm going to use this today for our upcoming flu call. :) |
Amazing! I don't have further suggestions here. Looking forward to seeing this merged. |
Actually... I just found one small issue:
|
Previously we were not recomputing the MRCA of the the filtered nodes if genotype filters were applied, which resulted in the "zoom to selected" button behaving as if the genotype filters did not exist. For non-genotype ("normal") filters, given a set `X` of visible nodes, we simultaneously find the MRCA of `X` and add the paths from the MRCA to `X`. For genotype filtering, we wish to find the MRCA of `X` _without_ modifying `X`. Note that this allows situations where `MRCA \not\in X`. This is introduced in this commit via the function `findFilteredMRCA` which uses a 3 step process: 1. Find the basal-most nodes of each (potentially non-monophyletic) visible cluster 2. Identify the paths from the root to the nodes in (1). 3. Find the first fork in this set of paths
Find smallest subtree with all visible data when using genotype filters
This comment was updated Jan 22nd
PR Status
Ready for review. I believe there are no bugs (see below) but testing is required and performance may not be good enough for release.
Summary
This PR adds genotype filters to the side-bar dropdown allowing filtering of the dataset by any observed genotype, as long as it was at a variable site.
Notes and example URLs
Updated to relect the changes introduced in 3ca2c02
Initial rendering of the filtering-dropdown is very slow. This seems to be one solution, which I may add in this PR or via another PR.See below for update.S 501Y || S 501T
https://auspice-filter-by-genot-lf7x8o.herokuapp.com/ncov/global?gt=S.501Y,501Tcc @kairstenfay @joverlee521 @trvrb @eharkins @huddlej re: this morning's meeting, any testing would be super useful!