Explore hosting KG2 in Plater/Automat #2200

Closed
amykglen opened this issue Nov 13, 2023 · 3 comments

@amykglen
Member

amykglen commented Nov 13, 2023

have been working on this for a few weeks and realized we don't yet have an issue for it

current status is that Evan Morris got a preliminary version of KG2 up using Plater (https://automat.renci.org/#/rtx-kg2/).

but it currently doesn't do category reasoning because Plater expects categories to be pre-expanded to their ancestors in the json lines files it ingests.

I'm going to make that tweak to our json lines files and then play around with the re-deployed Plater KG2 to see how it seems to do.
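
A minimal sketch of that pre-expansion step, assuming each line of the nodes file is a JSON object with a `categories` list. `TOY_PARENTS` is a hand-written, collapsed stand-in for the Biolink class hierarchy (not the real model), and the function names are hypothetical; a real run would derive the parent relationships from the Biolink Model itself.

```python
import json

# Toy child -> parent map standing in for the Biolink hierarchy.
# Assumption: illustrative only, with intermediate classes collapsed;
# load the real hierarchy from the Biolink Model in practice.
TOY_PARENTS = {
    "biolink:Drug": "biolink:ChemicalEntity",
    "biolink:Protein": "biolink:BiologicalEntity",
    "biolink:ChemicalEntity": "biolink:NamedThing",
    "biolink:BiologicalEntity": "biolink:NamedThing",
}


def with_ancestors(categories, parents=TOY_PARENTS):
    """Return the categories plus their ancestors, in order, without duplicates."""
    expanded = []
    for category in categories:
        current = category
        # Climb toward the root; stop early once we reach a class already added,
        # since its ancestors were added when it was first seen.
        while current is not None and current not in expanded:
            expanded.append(current)
            current = parents.get(current)
    return expanded


def expand_node_line(line, parents=TOY_PARENTS):
    """Rewrite one JSON Lines node record so its categories include ancestors."""
    node = json.loads(line)
    node["categories"] = with_ancestors(node.get("categories", []), parents)
    return json.dumps(node)
```

Applying `expand_node_line` to every line of the nodes file and writing the results back out would give Plater the flattened category lists it expects, so no reasoning is needed at query time.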

one interesting difference vs. our KG2 API is that Plater expects queries to come in using only canonical node identifiers. I don't think this should cause a problem for ARAX Expand, which I believe only queries using canonical identifiers anyway, but need to check on that...

this is an example query that produces answers from the dev Plater KG2:

curl -X 'POST' \
  'https://automat.renci.org/rtx-kg2/1.4/query' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
   "message":{
      "query_graph":{
         "nodes":{
            "n0":{
               "ids":[
                  "PUBCHEM.COMPOUND:1983"
               ],
               "categories": ["biolink:Drug"]
            },
            "n1":{
               "categories": ["biolink:Protein"]
            }
         },
         "edges":{
            "e01":{
               "subject":"n0",
               "object":"n1",
               "predicates":[
                  "biolink:physically_interacts_with"
               ]
            }
         }
      }
   }
}'
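
For scripted testing, the same one-hop query can be built and sent from Python. This is only a sketch using the standard library: the URL is copied from the curl command above, while `build_one_hop_query` and `post_query` are hypothetical helper names, not part of Plater or ARAX.

```python
import json
import urllib.request

# Endpoint copied from the curl example; it may change as the deployment moves.
PLATER_URL = "https://automat.renci.org/rtx-kg2/1.4/query"


def build_one_hop_query(curie, subject_category, object_category, predicate):
    """Build a TRAPI message with one pinned subject node and one open object node."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [curie], "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [predicate],
                    }
                },
            }
        }
    }


def post_query(query, url=PLATER_URL, timeout=60):
    """POST the query as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `post_query(build_one_hop_query("PUBCHEM.COMPOUND:1983", "biolink:Drug", "biolink:Protein", "biolink:physically_interacts_with"))` should send the same request body as the curl command.
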
@saramsey
Member

saramsey commented Nov 13, 2023

Thanks for the update. Great progress! Can the PloverDB regression (pytest) suite, perhaps with a bit of modification, be run against the Plater/KG2?

@amykglen
Member Author

yes, with slight modification! that may be a good starting point for tests.

@amykglen
Member Author

amykglen commented Feb 22, 2024

An update here for the record:

  • KG2.8.4c has been successfully hosted in Plater (for now on kg2cplover2.rtx.ai), using the new v1.5.0 Plater code, which seems to be much faster than the previous version (I'm told this is due to changes unrelated to Neo4j - i.e., just to the 'wrapper' sort of code that converts answers into TRAPI format and such)
  • In preliminary testing, Plater is faster than our KG2 API - maybe by about 25% - for single-curie queries (haven't looked into multi-curie queries yet)
    • The Plater system does seem to do some caching (I think in Neo4j), but the ~25% estimate comes from queries where caching isn't at play.
  • I have scripts (verified working) for automating the whole setup/building/hosting of KG2 Plater, starting from KG2c TSV files; for now that code is in this repo, but I'll move it wherever makes sense if we end up using Plater to host KG2
  • I also have a pytest suite set up that can easily be pointed to whatever endpoint (Plater vs. Plover) and records query times as well as other data like number of nodes/edges returned (it also saves responses locally for later analysis)
    • The testing framework is in place (again in this repo) but I need to go through and actually select tests to use for our official comparison
  • Interestingly, Plater returns a lot more nodes/results per single-curie query than our KG2 API does (on the order of 5-10x as many)
    • Looking into this, a lot of it seems to be due to erroneous subclass_of reasoning on Plater's part, which I think is because Plater considers all subclass_of edges in KG2c, whereas we only consider such edges from certain trusted sources. For instance, Plater thinks that "Placental Growth Factor" and "magnesium" are descendants of Aspirin
    • I think this problem is significant enough to make KG2 Plater kind of useless in its current state. To get around it, I'm thinking of changing the predicate of the subclass_of edges that come from non-trusted sources to related_to_at_concept_level (the parent of subclass_of in Biolink) when loading KG2c into Plater. This would at least allow for a much fairer and more useful comparison of the two tools.
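
The proposed workaround for the subclass_of problem could look roughly like the sketch below. It assumes each edge is a dict with `predicate` and `primary_knowledge_source` keys; those field names, the `TRUSTED_SOURCES` allow-list, and the function name are all assumptions for illustration, since the actual list of trusted sources isn't given in this thread.

```python
SUBCLASS_OF = "biolink:subclass_of"
CONCEPT_LEVEL = "biolink:related_to_at_concept_level"

# Hypothetical allow-list; the real set of trusted sources is not stated here.
TRUSTED_SOURCES = {"infores:example-trusted-ontology"}


def demote_untrusted_subclass_edge(edge, trusted_sources=TRUSTED_SOURCES):
    """Return the edge unchanged, or a copy with subclass_of generalized.

    Only subclass_of edges whose source is NOT in the allow-list are changed,
    so they no longer feed subclass reasoning.
    """
    if edge.get("predicate") != SUBCLASS_OF:
        return edge
    if edge.get("primary_knowledge_source") in trusted_sources:
        return edge
    # Copy rather than mutate so the caller's original record is untouched.
    return {**edge, "predicate": CONCEPT_LEVEL}
```

Running this over every edge before loading would keep the untrusted edges in the graph (still queryable via the parent predicate) while taking them out of the subclass hierarchy.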
