Decide on graph walking vs EL inference for using taxon constraints to make subsets #2137

Open
cmungall opened this issue Oct 28, 2021 · 14 comments

Comments

@cmungall
Member

cmungall commented Oct 28, 2021

Current strategy to make a taxon subset:

  • Add axiom Thing subClassOf part-of some NCBITaxon:nnnn
  • Remove any inter-species existentials
    • homology represented as reciprocal existentials
    • inter-species edges (less relevant for uberon)
  • Ensure all TCs are EL-ified
    • never_in becomes disjointWith in-taxon some X
    • Ensure NCBITaxon GCIs are added (in-taxon some X disjoint with in-taxon some Y for all sibs)
  • Reason with Elk
  • Eliminate all unsatisfiables
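
The sibling-disjointness step in the list above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the function name, the toy taxon identifiers, and the Manchester-like output strings are all assumptions made for the example.

```python
from itertools import combinations


def sibling_disjointness_gcis(children_by_parent):
    """Yield one disjointness statement per unordered pair of sibling taxa.

    children_by_parent: dict mapping a taxon ID to the list of its direct
    child taxon IDs (hypothetical input shape, for illustration only).
    """
    for _parent, children in sorted(children_by_parent.items()):
        for a, b in combinations(sorted(children), 2):
            yield f"(in-taxon some {a}) DisjointWith (in-taxon some {b})"


# Toy taxonomy with placeholder IDs (not real NCBITaxon identifiers).
toy_taxonomy = {
    "T:root": ["T:a", "T:b", "T:c"],
    "T:a": ["T:a1", "T:a2"],
}
gcis = list(sibling_disjointness_gcis(toy_taxonomy))
for axiom in gcis:
    print(axiom)
```

The number of generated statements grows quadratically with the number of siblings under each parent, which is one plausible contributor to the memory cost mentioned further down in this comment.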

Note this has issues if we have:

  • A subclass part-of some B
  • B never-in-taxon

See
geneontology/go-annotation#3942

@balhoff says he has a solution

Some additional issues with the approach

  • hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO
  • Can require 10s of gigs of memory
    • this gets worse the further away from human we go
  • Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets

Outline alternative strategies in this ticket

@cmungall cmungall added this to To do in Taxon Constraints across AOs via automation Oct 28, 2021
@cmungall
Member Author

Whelk Strategy

@balhoff to fill in

@cmungall
Member Author

cmungall commented Oct 28, 2021

Relation Graph strategy

proponent: @cmungall

  1. Run relation-graph over combined ontology (e.g. uberon + ncbitaxon)
    • no special GCIs required, no pre-processing
    • no taxon property chains required (but of course other prop chains included)
  2. Run SPARQL queries to obtain exclusion criteria for a taxon t and property p
    • EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor only-in-taxon ?t1 [direct] . NOT(?t subclass* ?t1) [inferred]
    • EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor never-in-taxon ?t1 [direct] . ?t subclass* ?t1 [inferred]
    • otherwise INCLUDE
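
The two EXCLUDE rules above can be sketched over in-memory data as follows. This is an illustrative sketch, not the actual SPARQL: the function name, the dictionary-based inputs, and the toy class and taxon names are assumptions. It also assumes each class appears in its own ancestor set, so constraints asserted directly on the class apply.

```python
def is_excluded(cls, taxon, inferred_ancestors, only_in, never_in, taxon_lineage):
    """Return True if cls should be excluded from the view for taxon.

    inferred_ancestors: class -> set of classes reachable via property p
                        (stands in for the inferred relation-graph triples)
    only_in / never_in: class -> set of taxa (direct assertions)
    taxon_lineage:      taxon -> set of its reflexive superclasses (subclass*)
    """
    lineage = taxon_lineage[taxon]
    for ancestor in inferred_ancestors.get(cls, set()):
        # Rule 1: ancestor only-in-taxon t1, and the query taxon is not under t1.
        if any(t1 not in lineage for t1 in only_in.get(ancestor, set())):
            return True
        # Rule 2: ancestor never-in-taxon t1, and the query taxon is under t1.
        if any(t1 in lineage for t1 in never_in.get(ancestor, set())):
            return True
    return False  # otherwise INCLUDE


# Toy data for illustration only; not taken from any ontology release.
TAXON_LINEAGE = {
    "human": {"human", "mammal", "tetrapod", "vertebrate"},
    "zebrafish": {"zebrafish", "teleost", "vertebrate"},
}
INFERRED = {
    "fin_ray": {"fin_ray", "paired_fin"},
    "milk_duct": {"milk_duct", "mammary_gland"},
}
ONLY_IN = {"mammary_gland": {"mammal"}}
NEVER_IN = {"paired_fin": {"tetrapod"}}

for c in ("fin_ray", "milk_duct"):
    for t in ("human", "zebrafish"):
        print(c, t, is_excluded(c, t, INFERRED, ONLY_IN, NEVER_IN, TAXON_LINEAGE))
```

Customizing the view for a different property p amounts to swapping in a different `inferred_ancestors` mapping; the rules themselves stay the same.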

Advantages

  • very simple and easy to mentally reason over and IMO better corresponds to biologists' mental models
  • customizable. For p, plug in the top level relations that make sense. E.g. (overlaps|occurs_in|...)
  • high guarantees of scalability
    • we should already be running RG on our ontologies (at least for subsets of OPs)
  • this is essentially what other groups are doing, e.g. interpro, ensembl (for filtering predicted annotations)

Disadvantage:

  • does not generate unsats hence cannot use robot/protege explanation features. But I think this is OK if we can show the explanation for the core RG triple

@balhoff
Member

balhoff commented Oct 28, 2021

> hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO

The property chains are in RO, right? (although I think Uberon adds some itself). Filters and pre-processing seems to apply to all approaches.

> Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets

One note on this: this is effectively included in the subset computation, because any existential to an unsatisfiable class is unsatisfiable. But it's not included in "normal" reasoning tasks checking the regular ontology classification. The subset computation is more aggressive.
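
The propagation described here (any existential to an unsatisfiable class is itself unsatisfiable) can be pictured as a fixed-point computation. The sketch below is a simplification and not how Elk works internally: the function name and the toy class names are hypothetical, and it deliberately ignores which property each existential uses.

```python
def propagate_unsat(seed_unsat, existentials):
    """Close a set of unsatisfiable classes under existential restrictions.

    existentials: class -> set of filler classes B for which the class has
                  some axiom of the form 'subClassOf p some B' (any property p).
    """
    unsat = set(seed_unsat)
    changed = True
    while changed:  # repeat until no new class becomes unsatisfiable
        changed = False
        for cls, fillers in existentials.items():
            if cls not in unsat and fillers & unsat:
                unsat.add(cls)
                changed = True
    return unsat


# Toy example: C occurs_in B, B part_of A, and A is unsatisfiable in the view.
toy_existentials = {"B": {"A"}, "C": {"B"}, "D": {"E"}}
result = propagate_unsat({"A"}, toy_existentials)
print(sorted(result))
```

Because the property is not inspected, relations such as occurs_in are swept up along with part_of, which is the sense in which the subset computation is more aggressive.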

@dosumis
Contributor

dosumis commented Oct 29, 2021

@cmungall - your graph strategy looks reasonable to me.
@balhoff good point on how aggressive trimming is with the current strategy.

It would be good to see a side-by-side comparison. How many additional unwanted classes make it through if we switch to the graph-based approach? Maybe we could compare on the previous Uberon release?
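
A side-by-side comparison of this kind could start from two plain sets of excluded class IDs, one per approach. The sketch below is only an assumption about how such a report might be shaped; the function name and the toy IDs are hypothetical.

```python
def compare_exclusions(reasoner_excluded, graph_excluded):
    """Partition excluded classes by which approach removed them."""
    r, g = set(reasoner_excluded), set(graph_excluded)
    return {
        "both": r & g,
        # Removed by the reasoner-based approach but kept by the graph-based
        # one: these are the classes that would newly "make it through".
        "reasoner_only": r - g,
        "graph_only": g - r,
    }


# Toy IDs for illustration only.
report = compare_exclusions({"X:1", "X:2", "X:3"}, {"X:2", "X:3", "X:4"})
for bucket, classes in report.items():
    print(bucket, len(classes), sorted(classes))
```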

It looks like the process with unsatisfiables will never scale, even with ELK, but I wonder if we can get a better sense of scaling by running some tests with stripped down OWL files as input.

I appreciate that we're resource constrained right now for running these tests (unless someone in Chris's group can take this on). I think it's something that devs in my group could work on in the new year. Can we get by for now? Does anyone have a juiced up machine or access to a cluster we could use to run the current release?

@matentzn
Contributor

matentzn commented Nov 8, 2021

@dosumis if you want @shawntanzk, @anitacaron and me to act on this, we need some specific instructions, as it will fall to @anitacaron to take the bulk of this work, and it will occupy her for a few weeks, given that she only has a few hours a week to dedicate to this project.

From the meeting I gather we need to:

  • Decide who should document, and where the documentation lives
  • Decide on the curation strategy (how do we get the taxon constraints into the ontologies)
  • Decide on the technical strategy to materialise the logical constraints (DOSDP vs SPARQL)
  • Decide on the technical strategy on extracting taxon views on the basis of these constraints

@shawntanzk I think we can handle this after all, it's just going to be a slow process. If you want, you can put it up on the board again for next week.

@cmungall
Member Author

cmungall commented Nov 8, 2021

For the graph strategy, it is easy to explore this using the existing ubergraph instance, which includes relation-graph inferences

See this query:
https://api.triplydb.com/s/8hs8rvxj3

Which is hardcoded to return classes EXCLUDED from a human view

Scroll up for an explanation of the query

Note that, for demonstration purposes, this query is highly aggressive. For example, annotation shortcuts like spatially-disjoint-with are treated like any other triples in relation-graph (the query should exclude the named graph of triples that are not OWL-entailed). If there were homology assertions in any ontology, these would also be propagated. However, it is trivial to exclude these either at SPARQL time or as a post-processing step.

cmungall added a commit to cmungall/sparqlprog that referenced this issue Nov 9, 2021
@cmungall
Member Author

cmungall commented Nov 9, 2021

I introduced some steps to run RG-based taxon checks into the Makefile here #2160

This is NOT yet part of any build dependency. It is also not fully tested. It relies on certain assumptions, such as that TCs in external ontologies use RO:0002161 triples for never-in-taxon.

Results of running on uberon-edit from Nov 1 here:

https://s3.amazonaws.com/bbop-ontologies/uberon/tmp/class-taxon-exclusions.tsv.gz

URL not guaranteed stable, this is for testing

The query hardcodes Human and Mammal. The TSV lists classes that are excluded for a given taxon, together with the reason.

Apologies for the duplication due to different labels of OPs:

?c ?cLabel ?p ?pLabel ?clsWithConstraint ?clsWithConstraintLabel ?taxonWithConstraint ?taxonWithConstraintLabel ?queryTaxon
http://purl.obolibrary.org/obo/UBERON_0003221 "phalanx" http://purl.obolibrary.org/obo/RO_0002202 "develops_from" http://purl.obolibrary.org/obo/UBERON_2001544 "sublingual cartilage" http://purl.obolibrary.org/obo/NCBITaxon_40674 "Mammalia" http://purl.obolibrary.org/obo/NCBITaxon_9606
http://purl.obolibrary.org/obo/UBERON_0003221 "phalanx" http://purl.obolibrary.org/obo/RO_0002202 "develops from"@en http://purl.obolibrary.org/obo/UBERON_2001544 "sublingual cartilage" http://purl.obolibrary.org/obo/NCBITaxon_40674 "Mammalia" http://purl.obolibrary.org/obo/NCBITaxon_9606
http://purl.obolibrary.org/obo/UBERON_0003221 "phalanx" http://purl.obolibrary.org/obo/RO_0002202 "develops from" http://purl.obolibrary.org/obo/UBERON_2001544 "sublingual cartilage" http://purl.obolibrary.org/obo/NCBITaxon_40674 "Mammalia" http://purl.obolibrary.org/obo/NCBITaxon_9606

This is obviously a false positive, caused by #2159 (if this is fixed, then 16642 lines will disappear from the file)
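
Until the label issue is resolved upstream, the duplication caused by differing OP labels could be collapsed as a post-processing step by keying each row on its IRI columns only. This is a hedged sketch: it assumes the file is tab-separated with the header shown above, and the function name and `KEY_COLUMNS` constant are made up for the example. The sample rows reuse the "phalanx" lines listed above.

```python
import csv
import io

# IRI-valued columns; label columns are ignored when detecting duplicates.
KEY_COLUMNS = ["?c", "?p", "?clsWithConstraint", "?taxonWithConstraint", "?queryTaxon"]


def dedupe_by_iri(tsv_text):
    """Keep the first row for each distinct combination of IRI columns."""
    # QUOTE_NONE keeps values such as '"develops from"@en' exactly as written.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    first_seen = {}
    for row in reader:
        key = tuple(row[col] for col in KEY_COLUMNS)
        first_seen.setdefault(key, row)
    return list(first_seen.values())


OBO = "http://purl.obolibrary.org/obo/"
header = ["?c", "?cLabel", "?p", "?pLabel", "?clsWithConstraint",
          "?clsWithConstraintLabel", "?taxonWithConstraint",
          "?taxonWithConstraintLabel", "?queryTaxon"]


def phalanx_row(p_label):
    return [OBO + "UBERON_0003221", '"phalanx"', OBO + "RO_0002202", p_label,
            OBO + "UBERON_2001544", '"sublingual cartilage"',
            OBO + "NCBITaxon_40674", '"Mammalia"', OBO + "NCBITaxon_9606"]


rows = [header] + [phalanx_row(label) for label in
                   ('"develops_from"', '"develops from"@en', '"develops from"')]
sample_tsv = "\n".join("\t".join(r) for r in rows) + "\n"

unique_rows = dedupe_by_iri(sample_tsv)
print(len(unique_rows), unique_rows[0]["?pLabel"])
```

Keeping the first row seen is an arbitrary choice; a real pipeline might prefer a specific label variant.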

Note: would be great to have robot output saner TSVs, can anyone work on:
ontodev/robot#176

others are as-expected:

?c ?cLabel ?p ?pLabel ?clsWithConstraint ?clsWithConstraintLabel ?taxonWithConstraint ?taxonWithConstraintLabel ?queryTaxon
http://purl.obolibrary.org/obo/UBERON_8200004 "copepodite stage 3" http://purl.obolibrary.org/obo/BFO_0000050 "part of"@en http://purl.obolibrary.org/obo/UBERON_0000069 "larval stage" http://purl.obolibrary.org/obo/NCBITaxon_32524 "Amniota" http://purl.obolibrary.org/obo/NCBITaxon_40674
http://purl.obolibrary.org/obo/UBERON_4200208 "pectoral fin intermediate radial bone" http://purl.obolibrary.org/obo/RO_0002202 "develops from" http://purl.obolibrary.org/obo/UBERON_2001456 "pectoral fin endoskeletal disc" http://purl.obolibrary.org/obo/NCBITaxon_40674 "Mammalia" http://purl.obolibrary.org/obo/NCBITaxon_9606
http://purl.obolibrary.org/obo/UBERON_4500010 "unbranched pectoral fin ray" http://purl.obolibrary.org/obo/BFO_0000050 "part of"@en http://purl.obolibrary.org/obo/UBERON_0002534 "paired fin" http://purl.obolibrary.org/obo/NCBITaxon_32523 "Tetrapoda" http://purl.obolibrary.org/obo/NCBITaxon_40674

@shawntanzk
Collaborator

@dosumis - could you provide some guidance for how the tech team can proceed with this? thanks

@github-actions

This issue has not seen any activity in the past 6 months; it will be closed automatically in one year from now if no action is taken.

@github-actions github-actions bot added the Stale label May 29, 2022
@matentzn matentzn removed the Stale label May 30, 2022
@matentzn
Contributor

Should be reconsidered eventually

@github-actions

github-actions bot commented Feb 8, 2023

This issue has not seen any activity in the past 6 months; it will be closed automatically one year from now if no action is taken.

@anitacaron
Collaborator

Since last year, this has been a low priority. If it should have a higher priority, please give some action items.

@matentzn
Contributor

I think this is covered by the new subset command, we should double check

@github-actions

This issue has not seen any activity in the past 6 months; it will be closed automatically one year from now if no action is taken.


6 participants