
Stand up "creative DTD" endpoint #1818

Closed
dkoslicki opened this issue Apr 5, 2022 · 40 comments
Comments

@dkoslicki (Member)

@chunyuma will create a class/method/script/function that has the following structure:
Input: a single disease CURIE and two integers M and N
Output: the top M drugs predicted to treat the disease, along with N explanation paths for each drug
This will be using his reinforcement learning model.
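For concreteness, a hedged sketch of what that interface might look like (the function name, drug CURIE, and return shape here are hypothetical, not the actual model API):

```python
from typing import Dict, List, Tuple

# One hop of an explanation path, as a (subject, predicate, object) triple
Triple = Tuple[str, str, str]

def predict_drug_treatments(disease_curie: str, m: int, n: int) -> Dict[str, List[List[Triple]]]:
    """Hypothetical interface: given a single disease CURIE, return the top m
    predicted drugs, each mapped to up to n explanation paths. This stub returns
    a fixed illustrative answer; the real implementation would wrap the
    reinforcement learning model."""
    example_paths = {
        "DRUG:0001": [[("DRUG:0001", "biolink:affects", disease_curie)]],
    }
    # Truncate to the requested number of drugs and paths per drug
    return {drug: paths[:n] for drug, paths in list(example_paths.items())[:m]}
```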

@finnagin will stand up an endpoint to arax.ncats.io (with a name something like "CreativeDTD") that will:
Take as input a TRAPI query structured like: (x)-[r]-(y) where x is biolink:ChemicalEntity (or any of its descendants), r is any biolink relationship (effectively ignoring the relationship type) and y is biolink:DiseaseOrPhenotypicFeature (or any of its descendants). Everything else will be ignored (including the workflow portion of TRAPI).
As output, it will give a standard TRAPI response. The only nuance here is that the paths that Chunyu's method returns can have variable length: anywhere from 1 to 3 hops. As such, the query graph associated with this may need to be something like:
(x)-[r_opt1]-(y)
(x)-[r_opt2]-()-[r_opt2]-(y)
(x)-[r_opt3]-()-[r_opt3]-()-[r_opt3]-(y)
Or something similar to communicate 1 to 3 hops. Note that the current constraint of expand requiring at least one non-optional edge shouldn't matter here as expand will not be used.
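For concreteness, a minimal sketch of the kind of one-hop query described above (the disease CURIE and predicate are illustrative; real creative-mode queries may differ):

```python
# Minimal TRAPI-style message dict for the one-hop case (x)-[r]-(y)
query = {
    "message": {
        "query_graph": {
            "nodes": {
                "x": {"categories": ["biolink:ChemicalEntity"]},
                "y": {"ids": ["MONDO:0008753"],
                      "categories": ["biolink:DiseaseOrPhenotypicFeature"]},
            },
            "edges": {
                # The relationship type is effectively ignored by the endpoint
                "r": {"subject": "x", "object": "y",
                      "predicates": ["biolink:related_to"]},
            },
        }
    }
}
```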

Timeline: Preliminary implementation by May 3, production ready by May 31. LMK if this timeline is reasonable (of course, the earlier the better, but there are other priorities each of you have as well).

@edeutsch (Collaborator) commented Apr 5, 2022

I'm thinking that it would be easier to implement this with a suitable "knowledge type" flag/constraint (#1815) on [r] using the standard endpoint instead of a separate endpoint.
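For reference, a sketch of what such a flag might look like as a property on the qedge, assuming it ends up spelled like the knowledge_type field later standardized in TRAPI 1.3 (the exact form was still under discussion in #1815):

```python
# Hypothetical qedge carrying the proposed "knowledge type" flag
qedge = {
    "subject": "x",
    "object": "y",
    "predicates": ["biolink:treats"],
    "knowledge_type": "inferred",  # ask the ARA to infer rather than just look up
}
```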

@dkoslicki (Member, Author)

@edeutsch I was imagining a separate endpoint as the output format (and input that triggers it) may change, and figured it would be nice to keep a separate endpoint to fiddle with. I do like your idea of indicating what type of result you want by using flags/constraints. I figure, though, that it will be much faster to implement this one type of query via an operation like infer() rather than trying to get everyone to agree on what flags/constraints to use.

@dkoslicki (Member, Author)

Oh, and another thing @edeutsch: we're going to host some clinical data working group workflows, so I thought we could use the normal route (and query graph interpreter) for that. I was also going to ask during the next AHM if people are cool with us listing this endpoint in the manuscript Chunyu is publishing. Basically for those who want to see it in action, but don't want to learn ARAXi/full TRAPI/Translator to use it.

@finnagin (Member) commented Apr 6, 2022

@edeutsch Fyi, I made the branch issue1818 for work on this.

@chunyuma (Collaborator) commented Apr 8, 2022

Hi @finnagin, I've written some scripts to run my model and uploaded them, with the necessary data, to the arax.ncats.io server:/data/code_creative_DTD_endpoint. Since some datasets are large, I didn't push them to this repo. There is a template called run_template.py within the scripts folder for running the scripts. Please let me know if you run into any problems.

Here are some statistics for generating the top 20 paths for the top 50 drugs predicted to treat Alkaptonuria (MONDO:0008753). The results are here.

Memory cost: 40–50 GB
Running time with CPU: 1158.4 s
Running time with GPU: 1021.7 s

@finnagin (Member) commented Apr 8, 2022

Thanks @chunyuma! Does the model usually take that much RAM to make predictions?

I don't think arax.ncats.io will have enough spare RAM to run that, so we might need to spin up another EC2 instance instead of running it as a separate service on arax.ncats.io.

@chunyuma (Collaborator) commented Apr 8, 2022

@finnagin, right. It generally needs that much RAM to make predictions because it needs to read in some pre-trained embeddings and the whole of RTX-KG2c. Perhaps we need thoughts from @edeutsch, @saramsey, or @dkoslicki on whether it is worth setting up another EC2 instance for this model.

@edeutsch (Collaborator) commented Apr 8, 2022

Running it on arax.ncats.io itself seems risky, because the container has a limit of around 55 GB.

@finnagin (Member)

From meeting with Amy:

  1. Get preferred node IDs using this:

    def get_canonical_curies(self, curies=None, names=None, return_all_categories=False, return_type='canonical_curies'):
        # If the provided curies or names is just a string, turn it into a list
        if isinstance(curies, str):
            curies = [curies]
        if isinstance(names, str):
            names = [names]
        # Set up containers for the batches and results
        batches = []
        results = {}
        # Set up the category manager
        category_manager = CategoryManager()
        # Make sets of comma-separated list strings for the curies and set up the results dict with all the input values
        uc_curies = []
        curie_map = {}
        batch_size = 0
        if curies is not None:
            for curie in curies:
                if curie is None:
                    continue
                results[curie] = None
                uc_curie = curie.upper()
                curie_map[uc_curie] = curie
                uc_curie = re.sub(r"'", "''", uc_curie)  # Replace embedded ' characters with ''
                uc_curies.append(uc_curie)
                batch_size += 1
                if batch_size > 5000:
                    batches.append({'batch_type': 'curies', 'batch_str': "','".join(uc_curies)})
                    uc_curies = []
                    batch_size = 0
            if batch_size > 0:
                batches.append({'batch_type': 'curies', 'batch_str': "','".join(uc_curies)})
        # Make sets of comma-separated list strings for the names
        lc_names = []
        name_map = {}
        batch_size = 0
        if names is not None:
            for name in names:
                if name is None:
                    continue
                results[name] = None
                lc_name = name.lower()
                name_map[lc_name] = name
                lc_name = re.sub(r"'", "''", lc_name)  # Replace embedded ' characters with ''
                lc_names.append(lc_name)
                batch_size += 1
                if batch_size > 5000:
                    batches.append({'batch_type': 'names', 'batch_str': "','".join(lc_names)})
                    lc_names = []
                    batch_size = 0
            if batch_size > 0:
                batches.append({'batch_type': 'names', 'batch_str': "','".join(lc_names)})
        for batch in batches:
            #print(f"INFO: Batch {i_batch} of {batch['batch_type']}")
            #i_batch += 1
            if batch['batch_type'] == 'curies':
                if return_type == 'equivalent_nodes':
                    sql = f"""
                        SELECT C.curie,C.unique_concept_curie,N.curie,N.category,U.category
                        FROM curies AS C
                        INNER JOIN nodes AS N ON C.unique_concept_curie == N.unique_concept_curie
                        INNER JOIN unique_concepts AS U ON C.unique_concept_curie == U.uc_curie
                        WHERE C.uc_curie in ( '{batch['batch_str']}' )"""
                else:
                    sql = f"""
                        SELECT C.curie,C.unique_concept_curie,U.curie,U.name,U.category
                        FROM curies AS C
                        INNER JOIN unique_concepts AS U ON C.unique_concept_curie == U.uc_curie
                        WHERE C.uc_curie in ( '{batch['batch_str']}' )"""
            else:
                sql = f"""
                    SELECT S.name,S.unique_concept_curie,U.curie,U.name,U.category
                    FROM names AS S
                    INNER JOIN unique_concepts AS U ON S.unique_concept_curie == U.uc_curie
                    WHERE S.lc_name in ( '{batch['batch_str']}' )"""
            #print(f"INFO: Processing {batch['batch_type']} batch: {batch['batch_str']}")
            cursor = self.connection.cursor()
            cursor.execute(sql)
            rows = cursor.fetchall()
            # Loop through all rows, building the list
            batch_curie_map = {}
            for row in rows:
                # If the curie or name is not found in results, try to use the curie_map{}/name_map{} to resolve capitalization issues
                entity = row[0]
                if entity not in results:
                    if batch['batch_type'] == 'curies':
                        if entity.upper() in curie_map:
                            entity = curie_map[entity.upper()]
                    else:
                        if entity.lower() in name_map:
                            entity = name_map[entity.lower()]
                # Now store this curie in the list
                if entity in results:
                    if row[1] not in batch_curie_map:
                        batch_curie_map[row[1]] = {}
                    batch_curie_map[row[1]][entity] = 1
                    # If the return type is equivalent_nodes, then add the node curie to the dict
                    if return_type == 'equivalent_nodes':
                        if results[entity] is None:
                            results[entity] = {}
                        node_curie = row[2]
                        results[entity][node_curie] = row[3]
                    # Else the return type is assumed to be the canonical node
                    else:
                        results[entity] = {
                            'preferred_curie': row[2],
                            'preferred_name': row[3],
                            'preferred_category': row[4]
                        }
                        #### Also store tidy categories
                        if return_all_categories:
                            results[entity]['expanded_categories'] = category_manager.get_expansive_categories(row[4])
                else:
                    print(f"ERROR: Unable to find entity {entity}")
            # If all_categories were requested, do another query for those
            if return_all_categories:
                # Create the SQL IN list
                uc_curies_list = []
                for uc_curie in batch_curie_map:
                    uc_curie = re.sub(r"'", "''", uc_curie)  # Replace embedded ' characters with ''
                    uc_curies_list.append(uc_curie)
                curies_list_str = "','".join(uc_curies_list)
                # Get all the curies for these concepts and their categories
                sql = f"""
                    SELECT curie,unique_concept_curie,category
                    FROM curies
                    WHERE unique_concept_curie IN ( '{curies_list_str}' )"""
                cursor = self.connection.cursor()
                cursor.execute(sql)
                rows = cursor.fetchall()
                entity_all_categories = {}
                for row in rows:
                    uc_unique_concept_curie = row[1]
                    node_category = row[2]
                    entities = batch_curie_map[uc_unique_concept_curie]
                    #### Eric says: I'm a little concerned that this entity is stomping on the previous entity. What's really going on here? FIXME
                    for entity in entities:
                        # Now store this category in the list
                        if entity in results:
                            if entity not in entity_all_categories:
                                entity_all_categories[entity] = {}
                            if node_category is None:
                                continue
                            if node_category not in entity_all_categories[entity]:
                                entity_all_categories[entity][node_category] = 0
                            entity_all_categories[entity][node_category] += 1
                        else:
                            print(f"ERROR: Unable to find entity {entity}")
                # Now store the final list of categories into the list
                for entity, all_categories in entity_all_categories.items():
                    if entity in results and results[entity] is not None:
                        results[entity]['all_categories'] = all_categories
        return results
  2. Example of the above here:

    def get_canonical_curies_dict(curie: Union[str, List[str]], log: ARAXResponse) -> Dict[str, Dict[str, str]]:
        curies = convert_to_list(curie)
        try:
            synonymizer = NodeSynonymizer()
            log.debug(f"Sending NodeSynonymizer.get_canonical_curies() a list of {len(curies)} curies")
            canonical_curies_dict = synonymizer.get_canonical_curies(curies)
            log.debug(f"Got response back from NodeSynonymizer")
        except Exception:
            tb = traceback.format_exc()
            error_type, error, _ = sys.exc_info()
            log.error(f"Encountered a problem using NodeSynonymizer: {tb}", error_code=error_type.__name__)
            return {}
        else:
            if canonical_curies_dict is not None:
                unrecognized_curies = {input_curie for input_curie in canonical_curies_dict if not canonical_curies_dict.get(input_curie)}
                if unrecognized_curies:
                    log.warning(f"NodeSynonymizer did not recognize: {unrecognized_curies}")
                return canonical_curies_dict
            else:
                log.error(f"NodeSynonymizer returned None", error_code="NodeNormalizationIssue")
                return {}
  3. Instantiate the class from trapi_querier.py with the kp name "infores:rtx-kg2"
  4. Use the method _get_arax_edge_key from the trapi querier class to get the correct edge key
  5. Add the edge attribute specified here:

    edge.attributes.append(Attribute(attribute_type_id="biolink:aggregator_knowledge_source",
                                     value=self.kg2_infores_curie,
                                     value_type_id="biolink:InformationResource",
                                     attribute_source=self.kg2_infores_curie))
  6. After adding all edges to the knowledge graph, instantiate the ARAX decorator class.
  7. Pass the whole response and use the methods decorate_nodes and decorate_edges to get metadata. (This requires the edge attribute that specifies it came from KG2.)
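As a self-contained illustration of the batching pattern used in step 1's get_canonical_curies (the helper name here is mine, not ARAX's):

```python
import re

def build_batches(values, batch_type, batch_limit=5000):
    """Split input strings into SQL-ready batches, uppercasing them and
    escaping embedded single quotes, mirroring the pattern in
    get_canonical_curies above."""
    batches = []
    current = []
    for value in values:
        if value is None:
            continue  # skip missing inputs, as the original does
        current.append(re.sub(r"'", "''", value.upper()))
        if len(current) > batch_limit:
            batches.append({"batch_type": batch_type, "batch_str": "','".join(current)})
            current = []
    if current:
        batches.append({"batch_type": batch_type, "batch_str": "','".join(current)})
    return batches
```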

@dkoslicki (Member, Author)

Issue: nodes are returned by name from the model.
Solution: the model should return CURIEs and names instead. And instead of a string format encoding the paths, have them encoded via (V, E) (vertices and edges), so that properties (like CURIEs and names) can decorate the nodes and edges.
Tagging @chunyuma so he's aware
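A sketch of what a (V, E)-encoded path might look like (the schema and the drug CURIE/name are hypothetical, not the model's actual output format):

```python
# Hypothetical (V, E) encoding of one explanation path: properties like
# CURIEs and names live on the node/edge dicts rather than in a string
path = {
    "nodes": [
        {"curie": "DRUG:0001", "name": "example drug"},
        {"curie": "MONDO:0008753", "name": "Alkaptonuria"},
    ],
    "edges": [
        {"subject": "DRUG:0001", "predicate": "biolink:affects",
         "object": "MONDO:0008753"},
    ],
}
```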

finnagin added a commit that referenced this issue May 30, 2022
finnagin added a commit that referenced this issue May 30, 2022
finnagin added a commit that referenced this issue Jun 5, 2022
finnagin added a commit that referenced this issue Jun 5, 2022
@dkoslicki (Member, Author)

See brain dump file for current state and what to do moving forward

@dkoslicki (Member, Author)

Multiprocessing may be a red herring: it looks like Finn's code in infer_utilities.py expects the query graph to be empty, since it populates the QG here. It looks like we will need to either a) pass the query edge and nodes that have the inferred property to infer_utilities so it knows where to make its edits, or b) not edit the QG and accept that the results won't match the QG.
I figure option b) may cause issues with resultify if a QG is given with more edges than the inferred one.
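Option a) implies infer_utilities would need to locate the inferred edge(s) in an existing QG. A minimal sketch of that lookup, assuming a TRAPI-style dict QG with a knowledge_type property on qedges (not the actual infer_utilities code):

```python
def get_inferred_qedge_keys(query_graph):
    """Return the keys of qedges marked knowledge_type='inferred', so the
    infer machinery knows where in an existing QG to make its edits."""
    edges = query_graph.get("edges", {})
    return [key for key, qedge in edges.items()
            if qedge.get("knowledge_type") == "inferred"]
```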

dkoslicki added a commit that referenced this issue Jul 2, 2022
… for today. Run test_ARAX_infer -k test_with_qg to see what's wrong
dkoslicki added a commit that referenced this issue Jul 3, 2022
…w probably_treats edge when a treats edge already exists in the QG) #1818
dkoslicki added a commit that referenced this issue Jul 3, 2022
…precisely, replace it with biolink:NamedThing) since the model can return quite the variety of categories, and does not respect the biolink traversal up the hierarchy #1818
@dkoslicki (Member, Author)

@amykglen I think I've fixed everything, as all tests appear to be passing now. I had to do some jiggering with node categories, which nodes and edges are marked as filled, and inserting the optional_group_ids in their proper place if a QG already exists.

LMK if things look good to you

@dkoslicki dkoslicki mentioned this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
dkoslicki added a commit that referenced this issue Jul 3, 2022
…at's not yet implemented, so mark as should fail #1818
dkoslicki added a commit that referenced this issue Jul 3, 2022
@amykglen (Member) commented Jul 3, 2022

awesome, yep, things are looking good to me!

thanks for figuring out the node_curie thing. maybe eventually I'll add a wrapper function somewhere so that the interface around ARAXInfer/XDTD is more convenient for Expand (e.g., ideally Expand could just pass in a QG, like it does for other KPs).

I'm planning to work on handling multi-qedge inferred queries over the next few days, but it may take me a little time, as that makes things quite a bit more complex in Expand (having to merge answers/QGs, etc.). I'm thinking I'll do that in a branch off of issue1818, so that we can still merge issue1818 into master whenever we're ready and have a functioning creative mode, at least for simple single-qedge inferred queries.

random question: does XDTD do any subclass_of reasoning? so if you asked for treatments for Adams-Oliver syndrome it would also give you treatments for Adams-Oliver syndrome 2?

@dkoslicki (Member, Author)

Sounds good; mixed inferred and lookup edges are ahead of the curve, since only the template (single inferred edge) is required by Tuesday, so there's plenty of time to do mixed knowledge types.

Re: subclass reasoning, no, it only does the inference for the exact curie supplied.

@edeutsch (Collaborator) commented Jul 5, 2022

So I am ready/trying to deploy this for dev/testing, but seeing this error:

  File "/mnt/data/orangeboard/devED/RTX/code/RTXConfiguration.py", line 162, in live
    self.explainable_dtd_db_host = self.config["Global"]["explainable_dtd_db"]["host"]
KeyError: 'explainable_dtd_db'

I'm hoping @amykglen or someone can update the central configv2? and that should fix it?

@dkoslicki (Member, Author)

Yes, an update to configv2 should fix it. I just don't know which machine has the "authoritative" version of it. Perhaps @amykglen knows (and I'd like to know too!)

@amykglen (Member) commented Jul 5, 2022

the authoritative configv2.json lives on araxconfig.rtx.ai - I'll put the new configv2.json that David shared in slack on there now

@amykglen (Member) commented Jul 5, 2022

ok, the authoritative configv2.json has been updated now - so if you delete yours to force a redownload it should work

@amykglen (Member) commented Jul 5, 2022

note that when rolling out to prod we'll have to edit the config_local.json that prod uses

@edeutsch (Collaborator) commented Jul 5, 2022

okay, we are deployed to /test and /beta. The endpoints pass our basic test query.
I have not tested creative mode.
Please test if you have time. I am slammed for the rest of the day.

@dkoslicki (Member, Author)

Tested and it's looking good! I will want to make some changes later (à la #1862), but I think it's fine to roll out to all endpoints.

@edeutsch (Collaborator) commented Jul 6, 2022

okay, will do, even production?

@edeutsch (Collaborator) commented Jul 6, 2022

note that when rolling out to prod we'll have to edit the config_local.json that prod uses

Can we do this together at the hackathon tomorrow?

@dkoslicki (Member, Author)

Yup, considering the UI team is expecting creative results, we should roll it out everywhere

@amykglen (Member) commented Jul 6, 2022

sure, sounds good to me!

dkoslicki added a commit that referenced this issue Jul 6, 2022
@dkoslicki (Member, Author)

Deployed everywhere, so closing
