
Stand up "creative DTD" endpoint #1818

Closed
dkoslicki opened this issue Apr 5, 2022 · 40 comments
Comments

@dkoslicki (Member)

@chunyuma will create a class/method/script/function that has the following structure:
Input: a single disease CURIE and two integers M and N
Output: the top M drugs predicted to treat the disease, along with N explanation paths for each drug
This will be using his reinforcement learning model.
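For concreteness, a hedged sketch of what that interface might look like (the function name, drug CURIE, and return shape here are hypothetical, not the actual model API):

```python
from typing import Dict, List, Tuple

# One hop of an explanation path, as a (subject, predicate, object) triple
Triple = Tuple[str, str, str]

def predict_drug_treatments(disease_curie: str, m: int, n: int) -> Dict[str, List[List[Triple]]]:
    """Hypothetical interface: given a single disease CURIE, return the top m
    predicted drugs, each mapped to up to n explanation paths. This stub returns
    a fixed illustrative answer; the real implementation would wrap the
    reinforcement learning model."""
    example_paths = {
        "DRUG:0001": [[("DRUG:0001", "biolink:affects", disease_curie)]],
    }
    # Truncate to the requested number of drugs and paths per drug
    return {drug: paths[:n] for drug, paths in list(example_paths.items())[:m]}
```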

@finnagin will stand up an endpoint to arax.ncats.io (with a name something like "CreativeDTD") that will:
Take as input a TRAPI query structured like: (x)-[r]-(y) where x is biolink:ChemicalEntity (or any of its descendants), r is any biolink relationship (effectively ignoring the relationship type) and y is biolink:DiseaseOrPhenotypicFeature (or any of its descendants). Everything else will be ignored (including the workflow portion of TRAPI).
As output, it will give a standard TRAPI response. The only nuance here is that the paths that Chunyu's method returns can have variable length: anywhere from 1 to 3 hops. As such, the query graph associated with this may need to be something like:
(x)-[r_opt1]-(y)
(x)-[r_opt2]-()-[r_opt2]-(y)
(x)-[r_opt3]-()-[r_opt3]-()-[r_opt3]-(y)
Or something similar to communicate 1 to 3 hops. Note that the current constraint of expand requiring at least one non-optional edge shouldn't matter here as expand will not be used.
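For concreteness, a minimal sketch of the kind of one-hop query described above (the disease CURIE and predicate are illustrative; real creative-mode queries may differ):

```python
# Minimal TRAPI-style message dict for the one-hop case (x)-[r]-(y)
query = {
    "message": {
        "query_graph": {
            "nodes": {
                "x": {"categories": ["biolink:ChemicalEntity"]},
                "y": {"ids": ["MONDO:0008753"],
                      "categories": ["biolink:DiseaseOrPhenotypicFeature"]},
            },
            "edges": {
                # The relationship type is effectively ignored by the endpoint
                "r": {"subject": "x", "object": "y",
                      "predicates": ["biolink:related_to"]},
            },
        }
    }
}
```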

Timeline: Preliminary implementation by May 3, production ready by May 31. LMK if this timeline is reasonable (of course, the earlier the better, but there are other priorities each of you have as well).

@edeutsch (Collaborator) commented Apr 5, 2022

I'm thinking that it would be easier to implement this with a suitable "knowledge type" flag/constraint (#1815) on [r] using the standard endpoint instead of a separate endpoint.
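For reference, a sketch of what such a flag might look like as a property on the qedge, assuming it ends up spelled like the knowledge_type field later standardized in TRAPI 1.3 (the exact form was still under discussion in #1815):

```python
# Hypothetical qedge carrying the proposed "knowledge type" flag
qedge = {
    "subject": "x",
    "object": "y",
    "predicates": ["biolink:treats"],
    "knowledge_type": "inferred",  # ask the ARA to infer rather than just look up
}
```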

@dkoslicki (Member, Author)

@edeutsch I was imagining a separate endpoint as the output format (and input that triggers it) may change, and figured it would be nice to keep a separate endpoint to fiddle with. I do like your idea of indicating what type of result you want by using flags/constraints. I figure, though, that it will be much faster to implement this one type of query via an operation like infer() rather than trying to get everyone to agree on what flags/constraints to use.

@dkoslicki (Member, Author)

Oh, and another thing @edeutsch: we're going to host some clinical data working group workflows, so I thought we could use the normal route (and query graph interpreter) for that. I was also going to ask during the next AHM if people are cool with us listing this endpoint in the manuscript Chunyu is publishing. Basically for those who want to see it in action, but don't want to learn ARAXi/full TRAPI/Translator to use it.

@finnagin (Member) commented Apr 6, 2022

@edeutsch Fyi, I made the branch issue1818 for work on this.

@chunyuma (Collaborator) commented Apr 8, 2022

Hi @finnagin, I've written some scripts to run my model and uploaded them, with the necessary data, to the arax.ncats.io server:/data/code_creative_DTD_endpoint. Since some datasets are large, I didn't push them to this repo. There is a template called run_template.py within the scripts folder for running the scripts. Please let me know if you run into any problems.

Here are some statistics for generating the top 20 paths for the top 50 drugs predicted to treat Alkaptonuria (MONDO:0008753). The results are here.

Memory cost: 40–50 GB
Running time with CPU: 1158.4 s
Running time with GPU: 1021.7 s

@finnagin (Member) commented Apr 8, 2022

Thanks @chunyuma! Does the model usually take that much RAM to make predictions?

I don't think arax.ncats.io will have enough spare RAM to run that, so we might need to spin up another EC2 instance instead of running it as a separate service on arax.ncats.io.

@chunyuma (Collaborator) commented Apr 8, 2022

@finnagin, right. It generally needs that much RAM to make predictions because it needs to read in some pre-trained embeddings and the whole of RTX-KG2c. Perhaps we need thoughts from @edeutsch, @saramsey, or @dkoslicki on whether it is worth setting up another EC2 instance for this model.

@edeutsch (Collaborator) commented Apr 8, 2022

Running it on arax.ncats.io itself seems risky, because the container has a limit of around 55 GB.

@finnagin (Member)

From meeting with Amy:

  1. Get preferred node IDs using this:

    def get_canonical_curies(self, curies=None, names=None, return_all_categories=False, return_type='canonical_curies'):
        # If the provided curies or names is just a string, turn it into a list
        if isinstance(curies, str):
            curies = [curies]
        if isinstance(names, str):
            names = [names]
        # Set up containers for the batches and results
        batches = []
        results = {}
        # Set up the category manager
        category_manager = CategoryManager()
        # Make sets of comma-separated list strings for the curies and set up the results dict with all the input values
        uc_curies = []
        curie_map = {}
        batch_size = 0
        if curies is not None:
            for curie in curies:
                if curie is None:
                    continue
                results[curie] = None
                uc_curie = curie.upper()
                curie_map[uc_curie] = curie
                uc_curie = re.sub(r"'", "''", uc_curie)  # Replace embedded ' characters with ''
                uc_curies.append(uc_curie)
                batch_size += 1
                if batch_size > 5000:
                    batches.append({'batch_type': 'curies', 'batch_str': "','".join(uc_curies)})
                    uc_curies = []
                    batch_size = 0
            if batch_size > 0:
                batches.append({'batch_type': 'curies', 'batch_str': "','".join(uc_curies)})
        # Make sets of comma-separated list strings for the names
        lc_names = []
        name_map = {}
        batch_size = 0
        if names is not None:
            for name in names:
                if name is None:
                    continue
                results[name] = None
                lc_name = name.lower()
                name_map[lc_name] = name
                lc_name = re.sub(r"'", "''", lc_name)  # Replace embedded ' characters with ''
                lc_names.append(lc_name)
                batch_size += 1
                if batch_size > 5000:
                    batches.append({'batch_type': 'names', 'batch_str': "','".join(lc_names)})
                    lc_names = []
                    batch_size = 0
            if batch_size > 0:
                batches.append({'batch_type': 'names', 'batch_str': "','".join(lc_names)})
        for batch in batches:
            #print(f"INFO: Batch {i_batch} of {batch['batch_type']}")
            #i_batch += 1
            if batch['batch_type'] == 'curies':
                if return_type == 'equivalent_nodes':
                    sql = f"""
                        SELECT C.curie,C.unique_concept_curie,N.curie,N.category,U.category
                        FROM curies AS C
                        INNER JOIN nodes AS N ON C.unique_concept_curie == N.unique_concept_curie
                        INNER JOIN unique_concepts AS U ON C.unique_concept_curie == U.uc_curie
                        WHERE C.uc_curie in ( '{batch['batch_str']}' )"""
                else:
                    sql = f"""
                        SELECT C.curie,C.unique_concept_curie,U.curie,U.name,U.category
                        FROM curies AS C
                        INNER JOIN unique_concepts AS U ON C.unique_concept_curie == U.uc_curie
                        WHERE C.uc_curie in ( '{batch['batch_str']}' )"""
            else:
                sql = f"""
                    SELECT S.name,S.unique_concept_curie,U.curie,U.name,U.category
                    FROM names AS S
                    INNER JOIN unique_concepts AS U ON S.unique_concept_curie == U.uc_curie
                    WHERE S.lc_name in ( '{batch['batch_str']}' )"""
            #print(f"INFO: Processing {batch['batch_type']} batch: {batch['batch_str']}")
            cursor = self.connection.cursor()
            cursor.execute(sql)
            rows = cursor.fetchall()
            # Loop through all rows, building the list
            batch_curie_map = {}
            for row in rows:
                # If the curie or name is not found in results, try to use the curie_map{}/name_map{} to resolve capitalization issues
                entity = row[0]
                if entity not in results:
                    if batch['batch_type'] == 'curies':
                        if entity.upper() in curie_map:
                            entity = curie_map[entity.upper()]
                    else:
                        if entity.lower() in name_map:
                            entity = name_map[entity.lower()]
                # Now store this curie in the list
                if entity in results:
                    if row[1] not in batch_curie_map:
                        batch_curie_map[row[1]] = {}
                    batch_curie_map[row[1]][entity] = 1
                    # If the return type is equivalent_nodes, then add the node curie to the dict
                    if return_type == 'equivalent_nodes':
                        if results[entity] is None:
                            results[entity] = {}
                        node_curie = row[2]
                        results[entity][node_curie] = row[3]
                    # Else the return type is assumed to be the canonical node
                    else:
                        results[entity] = {
                            'preferred_curie': row[2],
                            'preferred_name': row[3],
                            'preferred_category': row[4]
                        }
                        #### Also store tidy categories
                        if return_all_categories:
                            results[entity]['expanded_categories'] = category_manager.get_expansive_categories(row[4])
                else:
                    print(f"ERROR: Unable to find entity {entity}")
            # If all_categories were requested, do another query for those
            if return_all_categories:
                # Create the SQL IN list
                uc_curies_list = []
                for uc_curie in batch_curie_map:
                    uc_curie = re.sub(r"'", "''", uc_curie)  # Replace embedded ' characters with ''
                    uc_curies_list.append(uc_curie)
                curies_list_str = "','".join(uc_curies_list)
                # Get all the curies for these concepts and their categories
                sql = f"""
                    SELECT curie,unique_concept_curie,category
                    FROM curies
                    WHERE unique_concept_curie IN ( '{curies_list_str}' )"""
                cursor = self.connection.cursor()
                cursor.execute(sql)
                rows = cursor.fetchall()
                entity_all_categories = {}
                for row in rows:
                    uc_unique_concept_curie = row[1]
                    node_category = row[2]
                    entities = batch_curie_map[uc_unique_concept_curie]
                    #### Eric says: I'm a little concerned that this entity is stomping on the previous entity. What's really going on here? FIXME
                    for entity in entities:
                        # Now store this category in the list
                        if entity in results:
                            if entity not in entity_all_categories:
                                entity_all_categories[entity] = {}
                            if node_category is None:
                                continue
                            if node_category not in entity_all_categories[entity]:
                                entity_all_categories[entity][node_category] = 0
                            entity_all_categories[entity][node_category] += 1
                        else:
                            print(f"ERROR: Unable to find entity {entity}")
                # Now store the final list of categories into the list
                for entity, all_categories in entity_all_categories.items():
                    if entity in results and results[entity] is not None:
                        results[entity]['all_categories'] = all_categories
        return results
  2. Example of the above here:

    def get_canonical_curies_dict(curie: Union[str, List[str]], log: ARAXResponse) -> Dict[str, Dict[str, str]]:
        curies = convert_to_list(curie)
        try:
            synonymizer = NodeSynonymizer()
            log.debug(f"Sending NodeSynonymizer.get_canonical_curies() a list of {len(curies)} curies")
            canonical_curies_dict = synonymizer.get_canonical_curies(curies)
            log.debug(f"Got response back from NodeSynonymizer")
        except Exception:
            tb = traceback.format_exc()
            error_type, error, _ = sys.exc_info()
            log.error(f"Encountered a problem using NodeSynonymizer: {tb}", error_code=error_type.__name__)
            return {}
        else:
            if canonical_curies_dict is not None:
                unrecognized_curies = {input_curie for input_curie in canonical_curies_dict if not canonical_curies_dict.get(input_curie)}
                if unrecognized_curies:
                    log.warning(f"NodeSynonymizer did not recognize: {unrecognized_curies}")
                return canonical_curies_dict
            else:
                log.error(f"NodeSynonymizer returned None", error_code="NodeNormalizationIssue")
                return {}
  3. Instantiate the class from trapi_querier.py with the kp name "infores:rtx-kg2"
  4. Use the method _get_arax_edge_key from the trapi querier class to get the correct edge key
  5. Add the edge attribute specified here:

    edge.attributes.append(Attribute(attribute_type_id="biolink:aggregator_knowledge_source",
                                     value=self.kg2_infores_curie,
                                     value_type_id="biolink:InformationResource",
                                     attribute_source=self.kg2_infores_curie))
  6. After adding all edges to the knowledge graph, instantiate the ARAX decorator class.
  7. Pass the whole response and use the methods decorate_nodes and decorate_edges to get metadata. (This requires the edge attribute that specifies it came from KG2.)
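As a self-contained illustration of the batching pattern used in step 1's get_canonical_curies (the helper name here is mine, not ARAX's):

```python
import re

def build_batches(values, batch_type, batch_limit=5000):
    """Split input strings into SQL-ready batches, uppercasing them and
    escaping embedded single quotes, mirroring the pattern in
    get_canonical_curies above."""
    batches = []
    current = []
    for value in values:
        if value is None:
            continue  # skip missing inputs, as the original does
        current.append(re.sub(r"'", "''", value.upper()))
        if len(current) > batch_limit:
            batches.append({"batch_type": batch_type, "batch_str": "','".join(current)})
            current = []
    if current:
        batches.append({"batch_type": batch_type, "batch_str": "','".join(current)})
    return batches
```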

@dkoslicki (Member, Author)

Issue: nodes are returned by name from the model.
Solution: the model should return CURIEs and names instead. And instead of a string format encoding the paths, have them encoded via (V, E) (vertices and edges), so that properties (like CURIEs and names) can decorate the nodes and edges.
Tagging @chunyuma so he's aware
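A sketch of what a (V, E)-encoded path might look like (the schema and the drug CURIE/name are hypothetical, not the model's actual output format):

```python
# Hypothetical (V, E) encoding of one explanation path: properties like
# CURIEs and names live on the node/edge dicts rather than in a string
path = {
    "nodes": [
        {"curie": "DRUG:0001", "name": "example drug"},
        {"curie": "MONDO:0008753", "name": "Alkaptonuria"},
    ],
    "edges": [
        {"subject": "DRUG:0001", "predicate": "biolink:affects",
         "object": "MONDO:0008753"},
    ],
}
```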

finnagin added a commit that referenced this issue May 30, 2022
finnagin added a commit that referenced this issue May 30, 2022
finnagin added a commit that referenced this issue Jun 5, 2022
finnagin added a commit that referenced this issue Jun 5, 2022
@dkoslicki (Member, Author)

See brain dump file for current state and what to do moving forward

@dkoslicki (Member, Author)

Multiprocessing may be a red herring: it looks like Finn's code in infer_utilities.py expects the query graph to be empty, since it populates the QG here. It looks like we will need to either a) pass the query edge and nodes that have the inferred property to infer_utilities so it knows where to make its edits, or b) not edit the QG and accept that the results won't match the QG.
I figure option b) may cause issues with resultify if a QG is given with more edges than the inferred one.
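Option a) implies infer_utilities would need to locate the inferred edge(s) in an existing QG. A minimal sketch of that lookup, assuming a TRAPI-style dict QG with a knowledge_type property on qedges (not the actual infer_utilities code):

```python
def get_inferred_qedge_keys(query_graph):
    """Return the keys of qedges marked knowledge_type='inferred', so the
    infer machinery knows where in an existing QG to make its edits."""
    edges = query_graph.get("edges", {})
    return [key for key, qedge in edges.items()
            if qedge.get("knowledge_type") == "inferred"]
```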

dkoslicki added a commit that referenced this issue Jul 2, 2022
… for today. Run test_ARAX_infer -k test_with_qg to see what's wrong
dkoslicki added a commit that referenced this issue Jul 3, 2022
…w probably_treats edge when a treats edge already exists in the QG) #1818
dkoslicki added a commit that referenced this issue Jul 3, 2022
…precisely, replace it with biolink:NamedThing) since the model can return quite the variety of categories, and does not respect the biolink traversal up the hierarchy #1818
@dkoslicki (Member, Author)

@amykglen I think I've fixed everything, as all tests appear to be passing now. I had to do some jiggering with node categories, which nodes and edges are marked as filled, and inserting the optional_group_ids in their proper place if a QG already exists.

LMK if things look good to you

@dkoslicki dkoslicki mentioned this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
…at's not yet implemented, so mark as should fail #1818
dkoslicki added a commit that referenced this issue Jul 3, 2022
@amykglen (Member) commented Jul 3, 2022

awesome, yep, things are looking good to me!

thanks for figuring out the node_curie thing. maybe eventually I'll add a wrapper function somewhere so that the interface around ARAXInfer/XDTD is more convenient for Expand (e.g., ideally Expand could just pass in a QG, like it does for other KPs).

I'm planning to work on handling multi-qedge inferred queries over the next few days, but it may take me a little time, as that makes things quite a bit more complex in Expand (having to merge answers/QGs, etc.). I'm thinking I'll do that in a branch off of issue1818, so that we can still merge issue1818 into master whenever we're ready and have a functioning creative mode, at least for simple single-qedge inferred queries.

random question: does XDTD do any subclass_of reasoning? so if you asked for treatments for Adams-Oliver syndrome it would also give you treatments for Adams-Oliver syndrome 2?

@dkoslicki (Member, Author)

Sounds good; mixed inferred and lookup edges are ahead of the curve, since only the template (single inferred edge) is required by Tuesday, so there's plenty of time to do mixed knowledge types.

Re: subclass reasoning, no, it only does the inference for the exact curie supplied.

@edeutsch (Collaborator) commented Jul 5, 2022

So I am ready/trying to deploy this for dev/testing, but seeing this error:

  File "/mnt/data/orangeboard/devED/RTX/code/RTXConfiguration.py", line 162, in live
    self.explainable_dtd_db_host = self.config["Global"]["explainable_dtd_db"]["host"]
KeyError: 'explainable_dtd_db'

I'm hoping @amykglen or someone can update the central configv2? and that should fix it?

@dkoslicki (Member, Author)

Yes, an update to configv2 should fix it. I just don't know which machine has the "authoritative" version of it. Perhaps @amykglen knows (and I'd like to know too!)

@amykglen (Member) commented Jul 5, 2022

the authoritative configv2.json lives on araxconfig.rtx.ai - I'll put the new configv2.json that David shared in slack on there now

@amykglen (Member) commented Jul 5, 2022

ok, the authoritative configv2.json has been updated now - so if you delete yours to force a redownload it should work

@amykglen (Member) commented Jul 5, 2022

note that when rolling out to prod we'll have to edit the config_local.json that prod uses

@edeutsch (Collaborator) commented Jul 5, 2022

okay, we are deployed to /test and /beta. The endpoints pass our basic test query.
I have not tested creative mode.
Please test if you have time. I am slammed for the rest of the day.

@dkoslicki (Member, Author)

Tested and it's looking good! I will want to make some changes later (à la #1862), but I think it's fine to roll out to all endpoints.

@edeutsch (Collaborator) commented Jul 6, 2022

okay, will do, even production?

@edeutsch (Collaborator) commented Jul 6, 2022

note that when rolling out to prod we'll have to edit the config_local.json that prod uses

Can we do this together at the hackathon tomorrow?

@dkoslicki (Member, Author)

Yup, considering the UI team is expecting creative results, we should roll it out everywhere

@amykglen (Member) commented Jul 6, 2022

sure, sounds good to me!

dkoslicki added a commit that referenced this issue Jul 6, 2022
@dkoslicki (Member, Author)

Deployed everywhere, so closing
