Roll out KG2.7.4 (Biolink 2.2.6) #1728

Closed
40 of 44 tasks
amykglen opened this issue Nov 13, 2021 · 30 comments
Comments

@amykglen
Member

amykglen commented Nov 13, 2021

1. Build and load KG2c:
2. Rebuild downstream databases:

Copies of all of these should be put in /data/orangeboard/databases/KG2.X.Y on arax.ncats.io.

  • configv2.json (should point to the new KG2/KG2c/plover)
    • note: save this as config_local.json, since we want it to be used over configv2.json during testing
  • NodeSynonymizer
  • KG2c meta knowledge graph
  • KG2c sqlite
  • KG2c TSV tarball
  • KG2c neo4j dump (this is created on the neo4j hosting instance when loading KG2c, at /home/ubuntu/kg2-build/kg2c.dump)
  • FDA-approved drugs pickle
  • NGD database
  • COHD database @chunyuma
  • refreshed DTD @chunyuma
  • DTD model @chunyuma (may be skipped - depends on the KG2 version)
  • DTD database @chunyuma (may be skipped - depends on the KG2 version)
  • 'slim' databases (used for Travis) @chunyuma / @finnagin

NOTE: As databases are rebuilt, the new copy of config_local.json will need to be updated to point to their new paths. However, if the rollout of KG2 has already occurred, then you should update the master configv2.json directly.

3. Update the ARAX codebase:

Associated code changes should go in the kg2integration branch.

  • update the Biolink version number (to 2.2.6) and KG2 version number (to 2.7.4) in the openapi yaml @edeutsch?
    • update Biolink version in ARAX OpenAPI yaml (so that BiolinkHelper uses the right version)
  • update Expand code as needed
  • update any other modules as needed
  • test everything together (entire ARAX pytest suite should pass when using the new config_local.json - must locally set force_local = True in ARAX_expander.py to avoid using the old KG2 API)
4. Do the rollout:
  • merge master into kg2integration
  • merge kg2integration into master
  • make config_local.json the new master config file on araxconfig.rtx.ai (rename it to configv2.json)
  • roll master out to the various arax.ncats.io endpoints and delete their configv2.jsons
  • run the pytest suite on the various endpoints
5. Final items/clean up:
  • update SmartAPI registration for KG2 @edeutsch
  • update the test triples that go in some NCATS repo @finnagin
  • rename the config_local.json on arax.ncats.io to config_local.json_FROZEN_DO-NOT-EDIT-FURTHER (any additional edits to the config file should be made directly to the master configv2.json on araxconfig.rtx.ai going forward)
  • turn off the old KG2c version's neo4j instance
  • turn off the old KG2pre version's neo4j instance
  • turn off the old KG2 version's plover instance
  • upgrade the NCATS-hosted Plover endpoint (https://kg2cploverdb.ci.transltr.io) to this KG2 version and make the KG2 API start using it (instead of our self-hosted endpoint):
    • update kg_config.json in the main branch of the Plover repo to point to the new kg2c_lite_2.X.Y.json.gz file (push this change)
    • wait about 45 minutes for the endpoint to rebuild and then run Plover tests to verify it's working
    • run the ARAX pytest suite with the NCATS endpoint plugged in: use a config_local.json that points to it and locally set force_local = True in Expand
    • if all tests pass, update the master configv2.json on araxconfig.rtx.ai to point to this Plover endpoint (used by beta endpoints)
    • also update production ARAX/KG2's config files to point to this Plover endpoint
    • delete the arax.ncats.io kg2 endpoint's configv2.json to force it to download the new copy and then verify it's working correctly by running a query
    • verify production ARAX/KG2 are working as well
    • turn off our plover endpoint and verify once more that ARAX is still working ok
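The endpoint checks in the last few steps could be scripted along these lines (a sketch only; the TRAPI query-graph shape matches the Plover examples further down in this thread, and the endpoint URL would come from the config file):

```python
def build_one_hop_qg(curie, category):
    """Build a minimal TRAPI query graph: one pinned node, one category node."""
    return {
        "edges": {"e00": {"subject": "n00", "object": "n01"}},
        "nodes": {
            "n00": {"ids": [curie]},
            "n01": {"categories": [category]},
        },
    }

def smoke_test_plover(plover_url):
    """Run one query against a Plover endpoint and check that it returns answers."""
    import requests  # third-party HTTP client
    qg = build_one_hop_qg("CHEMBL.COMPOUND:CHEMBL112", "biolink:Disease")
    response = requests.post(f"{plover_url}/query", json=qg,
                             headers={"accept": "application/json"})
    response.raise_for_status()
    assert response.json()["nodes"]["n01"], "Plover returned no answer nodes"
```

(The `smoke_test_plover` helper name is made up; the real verification in the checklist is running a query and the pytest suite.)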
@amykglen
Member Author

amykglen commented Nov 13, 2021

alright, the synonymizer+KG2c build is ongoing on buildkg2c.rtx.ai (in a screen session, from the kg2integration branch).

to kick off the build, all I did was 1) update (locally) RTX/code/kg2c/kg2c_config.json to look like this:

{
  "kg2pre_version": "2.7.4",
  "kg2pre_neo4j_endpoint": "kg2endpoint3.rtx.ai",
  "biolink_version": "2.2.6",
  "upload_to_arax.ncats.io": true,
  "upload_directory": "/data/orangeboard/databases/KG2.7.4",
  "synonymizer": {
    "build": true,
    "name": "node_synonymizer_v1.0_KG2.7.4.sqlite"
  },
  "kg2c": {
    "build": true,
    "use_nlp_to_choose_descriptions": true,
    "upload_to_s3": true,
    "start_from_kg2c_json": false,
    "use_local_kg2pre_tsvs": false
  }
}

and 2) run:

python3 RTX/code/kg2c/build_kg2c.py

(note: I made sure to create an (empty) /data/orangeboard/databases/KG2.7.4 directory on arax.ncats.io before starting the build)
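A quick sanity check of `kg2c_config.json` before kicking off a multi-hour build can catch missing keys early (a sketch; the key names match the config above, but the helper itself is hypothetical):

```python
import json

# Top-level keys the build script expects, per the example config above
REQUIRED_KEYS = {"kg2pre_version", "kg2pre_neo4j_endpoint", "biolink_version",
                 "upload_directory", "synonymizer", "kg2c"}

def validate_kg2c_config(path="kg2c_config.json"):
    """Load the build config and verify the expected top-level keys are present."""
    with open(path) as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"kg2c_config.json is missing keys: {sorted(missing)}")
    return config
```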

@amykglen
Member Author

if all goes well the build should be done this evening (at which point I'll take care of loading it into Plover)

@saramsey
Member

Thank you!

@saramsey
Member

saramsey commented Nov 13, 2021

Just an FYI that in KG2pre, the edge property formerly called relation is now called original_predicate, per a change in Biolink 2.2 from Biolink 2.1. Not sure if this will break anything in the KG2c build process. Details in RTX-KG2 issue 165.

@amykglen
Member Author

the synonymizer build completed successfully and things seem fine so far with that, but the KG2c build errored out while using BiolinkHelper due to some strange mixin predicates in 2.2.6. fixed that issue and resumed the build.

@amykglen
Member Author

alright, the new KG2c is ready in Neo4j: http://kg2-7-4c.rtx.ai:7474/browser/

everything looks fine so far on spot checking. upload to PloverDB is in progress.

@amykglen
Member Author

KG2c has been loaded into Plover and all necessary downstream databases have been rebuilt. will test everything together tomorrow morning.

@amykglen
Member Author

amykglen commented Nov 15, 2021

actually, ran the ARAX test suite tonight and all fast tests passed on the first try! I'm impressed.

I'll do some deeper testing (e.g., Expand's slow tests) tomorrow morning.

@saramsey
Member

This is great! Thank you @amykglen !!

@chunyuma
Collaborator

Hi @finnagin, do we need the slim databases for Travis at this point? Currently, due to limited time, I have only built the refreshed database; the full databases may need longer. Just want to check whether you also need a slim version of the refreshed database. Thanks!

@finnagin
Member

@chunyuma We do still need those, but since they're only used for testing and not the actual system, I don't think the slim databases need to make the deadline. Though @amykglen, we will also need to come up with a way to generate slim KG2c and node synonymizer versions if we want Travis to run.

@amykglen
Member Author

ah, yeah, I dropped the ball on the slim database thing. I added an agenda item for this week's AHM to touch base on that! (not a blocker for this KG2 rollout)

amykglen added a commit that referenced this issue Nov 15, 2021
@amykglen
Member Author

everything still looks good on further testing - one slow DTD expand test is failing (test_dtd_expand_2), though perhaps that would be fixed once the full DTD rebuild is done? maybe @chunyuma could take a look, but I don't think it's critical for the rollout...

@chunyuma
Collaborator

Hi @amykglen, sorry for the late response. For test_dtd_expand_2, it seems the error comes from KG2c. Based on this test case's query graph, it generates a Neo4j query, but that query returns nothing from the KG2c Neo4j. Could you please help take a look?

Here is the neo4j query for this test:

MATCH (n0:`biolink:SmallMolecule` {id:'CHEMBL.COMPOUND:CHEMBL112'})-[e0:`['biolink:related_to']`]-(n1) WHERE (n1:`biolink:Disease` OR n1:`biolink:DiseaseOrPhenotypicFeature` OR n1:`biolink:PhenotypicFeature`) WITH collect(distinct n0) as nodes_n0, collect(distinct n1) as nodes_n1, collect(distinct e0{.*, id:ID(e0), n0:n0.id, n1:n1.id}) as edges_e0 RETURN nodes_n0, nodes_n1, edges_e0

It returns nothing:
(screenshot: the query returns no rows in the Neo4j browser)
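For reference, the `` `['biolink:related_to']` `` fragment in the relationship-type position of that query looks like a Python list that was stringified straight into the Cypher, which would match no edges. A helper that formats a predicate list the way Neo4j expects might look like this (hypothetical; this is not the actual deprecated `_get_cypher_for_query_edge`):

```python
def format_predicates_for_cypher(predicates):
    """Render a list of Biolink predicates as a Cypher relationship-type
    alternation, e.g. ['biolink:treats', 'biolink:related_to'] becomes
    `biolink:treats`|`biolink:related_to`. Backticks are required because
    the type names contain a colon."""
    return "|".join(f"`{p}`" for p in predicates)
```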

@chunyuma
Collaborator

@amykglen, I think I've figured out the problem. It seems the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query; perhaps we now have other functions somewhere in Expand that handle this. I'm modifying this function to work around the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

@saramsey
Member

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

@amykglen
Member Author

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

yep, we have an item for slim databases already

@amykglen
Member Author

@amykglen, I think I've figured out the problem. It seems the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query; perhaps we now have other functions somewhere in Expand that handle this. I'm modifying this function to work around the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

@chunyuma
Collaborator

chunyuma commented Nov 17, 2021

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

@amykglen, the DTD querier has two modes: "fast mode" and "slow mode". Fast mode queries the DTD database directly, while slow mode calls the DTD model and computes the drug-repurposing probability on the fly. So when we use slow mode, we need the _get_cypher_for_query_edge function to query the possible subject or object nodes based on the query_graph. Is it possible to query Plover for this?

Take test_dtd_expand_2 as example:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

the "slow mode" needs to know what "n1" nodes should be paired with acetaminophen to compute their probabilities by using the model.

@amykglen
Member Author

amykglen commented Nov 17, 2021

so you mean you need to run the one-hop query on KG2 to get diseases connected to acetaminophen?

you can do that with Plover like so:

trapi_qg = {
        "edges": {
            "e00": {
                "subject": "n00",
                "object": "n01",
            }
        },
        "nodes": {
            "n00": {
                "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
            },
            "n01": {
                "categories": ["biolink:Disease"]
            }
        }
    }
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

by default it will return answers in this format (only including node/edge IDs):

{
   "edges":{
      "e00":[
         19308544,
         26624039,
         11296815,
         12484663,
         15564856,
         9568317,
         12222530,
         23814212,
         12222534,
         11395143,
         11214936,
         16932955,
          ...
      ]
   },
   "nodes":{
      "n00":[
         "CHEMBL.COMPOUND:CHEMBL112"
      ],
      "n01":[
         "MESH:D014886",
         "MONDO:0009323",
         "MONDO:0020722",
         "MONDO:0001384",
         "MONDO:0003406",
         "UMLS:C0429001",
         "MESH:D010539",
         "MONDO:0007254",
         "MESH:D020078",
         "CHEMBL.COMPOUND:CHEMBL326958",
         "UMLS:C0375314",
         "MONDO:0100053",
         "MONDO:0005812",
         "MONDO:0005010",
         "MONDO:0001246",
         "MONDO:0001046",
         "MESH:D048949",
         "MONDO:0002334",
         "MONDO:0004553",
         "MONDO:0007186",
         "MESH:D014950",
         "MONDO:0100192",
         "UMLS:C0442797",
         "UMLS:C0231225",
         "MONDO:0005101",
         "MONDO:0010667",
         "MONDO:0001156",
           ...
      ]
   }
}

but if you want more info included in the results you can add "include_metadata": True to your query graph
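Pulling the answer curies out of that response shape could be as simple as (a sketch based on the example output above; the helper name is made up, and it only handles the default IDs-only shape, not the `include_metadata` one):

```python
def get_answer_node_ids(plover_response, qnode_key="n01"):
    """Extract the curies Plover returned for one query node, given the
    default response shape shown above: {"nodes": {qnode_key: [curie, ...]}}."""
    return plover_response.get("nodes", {}).get(qnode_key, [])
```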

@chunyuma
Collaborator

Thanks @amykglen. If I only want all nodes with category 'biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', or 'biolink:PhenotypicFeature', I think Plover can also do this by modifying the trapi_qg like:

trapi_qg = {
        "nodes": {
            "n00": {
                "categories": ['biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature']
            }
        }
    }
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

Is it right?

@amykglen
Member Author

yep!

@amykglen
Member Author

amykglen commented Nov 17, 2021

or wait, so you're trying to get all disease-like nodes in KG2? (not just connected to acetaminophen?) not sure whether that would work...

@chunyuma
Collaborator

@amykglen, yes, I'm thinking that DTD expand should be independent of RTX-KG2c, right? That means DTD expand can generate edges (from the DTD model, with probability above a certain threshold) that might not exist in RTX-KG2c. So back to the acetaminophen case, DTD expand should consider all disease-like nodes in KG2, use the DTD model to calculate the probabilities, and then expand the edges, right?

@amykglen
Member Author

ah, ok, I didn't realize you're looking up all disease-like nodes. yeah, that won't work with Plover.

so you really only need to get the list of all disease-like node IDs once, right? (for each KG2 version.) not on every query?

could you do that during the building of DTD? (and then just store the list of IDs in one of your DTD databases, or a separate database if you prefer, which could be added to the database manager)
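That pre-computed list could live in a small sqlite table built once per KG2 version, roughly like this (a sketch; the table and column names are made up):

```python
import sqlite3

def store_category_node_ids(db_path, category, node_ids):
    """Save the node IDs for one Biolink category (built once per KG2 version)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS category_nodes "
                "(category TEXT, node_id TEXT)")
    con.executemany("INSERT INTO category_nodes VALUES (?, ?)",
                    [(category, node_id) for node_id in node_ids])
    con.commit()
    con.close()

def load_category_node_ids(db_path, category):
    """Look up the pre-stored node IDs for a category at query time."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT node_id FROM category_nodes "
                       "WHERE category = ? ORDER BY rowid",
                       (category,)).fetchall()
    con.close()
    return [row[0] for row in rows]
```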

@chunyuma
Collaborator

Actually, it's not just all disease-like node IDs. The reason it's a list of all disease-like node IDs here is that in this query, we try to expand n0 ('acetaminophen') to n1 ('disease-like') nodes:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

Perhaps in other queries, people will be interested in expanding acetaminophen to other categories via DTD expand. (Note that we currently don't check the category provided by the user in slow mode; in other words, people are allowed to provide any kind of category via DTD expand.) So actually, we need a function that can extract all nodes corresponding to the categories provided by the user. Do you think that is feasible?

I think we can pre-store the ID lists corresponding to different categories. However, I'm not sure whether we need to consider the hierarchical relations. For example, if the user sets add_qnode(categories=biolink:ChemicalEntity, key=n1), we need to use all nodes corresponding to biolink:ChemicalEntity and also include all of its children. I think this is what Expand currently does for RTX-KG2, right?

@amykglen
Member Author

I think that's right that you'd want hierarchical reasoning for these category ID lists. if you query the KG2c neo4j by label (e.g., MATCH (n:`biolink:ChemicalEntity`) RETURN n.id), that reasoning is done for you (since nodes are labeled with their direct categories as well as the ancestors of those categories).

(Plover can't help here since it doesn't currently allow queries where no qnode is "pinned")
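The category-hierarchy expansion that the Neo4j labels handle for you can also be done directly, given a parent-to-children map like the one BiolinkHelper derives from the Biolink model (a sketch; the toy hierarchy in the test below is truncated, not the real Biolink tree):

```python
def get_category_with_descendants(category, children_map):
    """Return a category plus all of its descendants, breadth-first.

    children_map maps each category to a list of its direct children."""
    results, queue = [], [category]
    while queue:
        current = queue.pop(0)
        results.append(current)
        queue.extend(children_map.get(current, []))
    return results
```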

@amykglen
Member Author

hey @finnagin - have you updated the test triples (for the NCATS repo) for KG2.7.4 yet?

finnagin added a commit that referenced this issue Nov 24, 2021
@finnagin
Member

The pull request for updating the test triples is now in the NCATSTranslator/testing repo.

@finnagin
Member

finnagin commented Feb 2, 2022

Closing, as the SmartAPI registry looks to be updated and everything else not marked as skippable has been checked.

@finnagin finnagin closed this as completed Feb 2, 2022