Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More filtering errors #1399

Closed
dkoslicki opened this issue Apr 22, 2021 · 20 comments
Closed

More filtering errors #1399

dkoslicki opened this issue Apr 22, 2021 · 20 comments

Comments

@dkoslicki
Copy link
Member

Query:

add_qnode(id=CHEBI:17754, category=biolink:ChemicalSubstance, key=n0)
add_qnode(category=biolink:Gene, key=n1)
add_qedge(subject=n1, object=n0, key=e0,predicate=biolink:negatively_regulates_entity_to_entity)
expand(kp=ARAX/KG2,continue_if_no_results=false,enforce_directionality=true,use_synonyms=true)
filter_kg(,action=remove_edges_by_property,edge_property=biolink:provided_by,property_value=SEMMEDDB,remove_connected_nodes=false)
resultify()
filter_results(action=limit_number_of_results, max_results=30)

Error:
Processing action 'filter_kg' with parameters {'': 'true', 'action': 'remove_edges_by_property', 'edge_property': 'biolink:provided_by', 'property_value': 'SEMMEDDB', 'remove_connected_nodes': 'false'} 2021-04-22T16:31:46.273095 ERROR: An uncaught error occurred: unhashable type: 'list': ['Traceback (most recent call last):\n', ' File "/mnt/data/orangeboard/production/RTX/code/UI/OpenAPI/python-flask-server/openapi_server/../../../../ARAX/ARAXQuery/ARAX_query.py", line 604, in execute_processing_plan\n filter_kg.apply(response, action[\'parameters\'])\n', ' File "/mnt/data/orangeboard/production/RTX/code/UI/OpenAPI/python-flask-server/openapi_server/../../../../ARAX/ARAXQuery/ARAX_filter_kg.py", line 426, in apply\n getattr(self, \'_\' + self.__class__.__name__ + \'__\' + parameters[\'action\'])() # thank you https://stackoverflow.com/questions/11649848/call-methods-by-string\n', ' File "/mnt/data/orangeboard/production/RTX/code/UI/OpenAPI/python-flask-server/openapi_server/../../../../ARAX/ARAXQuery/ARAX_filter_kg.py", line 524, in __remove_edges_by_property\n known_values.add(attribute.value)\n', "TypeError: unhashable type: 'list'\n"]

@finnagin
Copy link
Member

finnagin commented Apr 23, 2021

I've fixed this bug though just as an fyi there are a couple issues with the DSL in this.

  1. the filter_kg command has a leading comma in the parameters which causes an error
  2. In KG2 SemMedDB edges are no longer labeled with provided_by = "SEMMEDDB" it now has a trailing colon so should be "SEMMEDDB:"
  3. It looks like this query only returns SemMedDB edges so when we filter those out no edges are left and this returns 0 results.

Here is the query with the first 2 fixed:

add_qnode(id=CHEBI:17754, category=biolink:ChemicalSubstance, key=n0)
add_qnode(category=biolink:Gene, key=n1)
add_qedge(subject=n1, object=n0, key=e0,predicate=biolink:negatively_regulates_entity_to_entity)
expand(kp=ARAX/KG2,continue_if_no_results=false,enforce_directionality=true,use_synonyms=true)
filter_kg(action=remove_edges_by_property,edge_property=biolink:provided_by,property_value=SEMMEDDB:,remove_connected_nodes=false)
resultify()
filter_results(action=limit_number_of_results, max_results=30)

@dkoslicki
Copy link
Member Author

Thank @finnagin ! If you couldn't tell, this was in the midst of an NCATS hackathon, so typos clearly slipped by me. Good to know about the colon (I had tried a few different ways, but now I know at least). FWIW, will provided_by=SEMMEDDB: work to filter out those edges now? I understand that might be all of them

@finnagin
Copy link
Member

Yes, I've tried running this locally and it filters out all edges.

@dkoslicki
Copy link
Member Author

@finnagin Trying the DSL you wrote above, it appears there are still SEMMEDDB edges in there: https://arax.ncats.io/?r=8208 Can you look into this?

@dkoslicki
Copy link
Member Author

Ah, I think I found the issue: the attribute doesn’t want the biolink: prefix, so this works:


add_qnode(id=CHEBI:17754, category=biolink:ChemicalSubstance, key=n0)
add_qnode(category=biolink:Gene, key=n1)
add_qedge(subject=n1, object=n0, key=e0,predicate=biolink:negatively_regulates_entity_to_entity)
expand(kp=ARAX/KG2,continue_if_no_results=false,enforce_directionality=true,use_synonyms=true)
filter_kg(action=remove_edges_by_property,edge_property=provided_by,property_value=SEMMEDDB:,remove_connected_nodes=false)
resultify()
filter_results(action=limit_number_of_results, max_results=30)

It would be great if we could:

  1. Include in the documentation that the biolink: prefix should not be included and
  2. Change it so we don’t need to specify the trailing :, which is rather confusing

@finnagin
Copy link
Member

  1. Sure! I can do that. Though, this may need updating after we finalize the TRAPI 1.1 attribute stuff.
  2. So, I believe this is coming directly from KG2 since the provided_by field points to curies of biolink:InformationContentEntity nodes in KG2 which for SemMedDB happens to be SEMMEDDB:. (Since each source has a corresponding node in KG2) I could add something hardcoded for SemMedDB specifically but I don't think the other nodes necessarily follow the general pattern of <name_of_source>:. I could hard code in a set of often used ones maybe?
    Another strategy could be to make a dictionary mapping the name of the node in neo4j to the curie for all of these types of nodes but this might be a bit large as there are 96,776 of these nodes. Also, the name might not be what you expect. For example SemMedDB's name is "Semantic Medline Database (SemMedDB)"

@dkoslicki
Copy link
Member Author

I'm seeing 6 such ids that end with : in KG2.5.2 and 10 in KG2.6.1.

match (n:`biolink:InformationContentEntity`) where n.id ends with ":" return count(n)

I think this is probably a bug, so we could just add an exception for these cases.
And I think we should avoid names: a user should be able to look at a set of results and immediately be able to ascertain what they can filter onl

@saramsey
Copy link
Member

saramsey commented May 5, 2021

2\. In KG2 SemMedDB edges are no longer labeled with `provided_by` = "SEMMEDDB" it now has a trailing colon so should be "SEMMEDDB:"

Correct

@saramsey
Copy link
Member

saramsey commented May 5, 2021

I'm seeing 6 such ids that end with : in KG2.5.2 and 10 in KG2.6.1.

match (n:`biolink:InformationContentEntity`) where n.id ends with ":" return count(n)

I think this is probably a bug, so we could just add an exception for these cases.
And I think we should avoid names: a user should be able to look at a set of results and immediately be able to ascertain what they can filter onl

This is intentional, so not a bug (though if it causes problems downstream, we could talk about how to work around it, of course). A bit of background: a CURIE that is just a prefix and a colon is perfectly valid. See the example CURIE with an empty "reference" field, in the W3C Curie Syntax 1.0 document:
https://www.w3.org/TR/2010/NOTE-curie-20101216/#sec_2.1.

Screen Shot 2021-05-05 at 2 35 36 PM

@dkoslicki
Copy link
Member Author

This is intentional, so not a bug

Ah, thanks for the clarification @saramsey! I think we might need to make a decision, since there are some with trailing : and some without, which makes it difficult to figure out what to use for the filter

@saramsey
Copy link
Member

saramsey commented May 5, 2021

I'm confused. Are there provided_by strings in KG2 that do not contain any colon? That's a bug.

Also, the CURIE spec doesn't say that a trailing colon is required for a CURIE to be valid. But a CURIE does have to contain a colon. It is the required separator between the prefix and the reference portions of the CURIE. The reference portion of the CURIE can be empty.

@saramsey
Copy link
Member

saramsey commented May 5, 2021

Hi @finnagin and @dkoslicki, note, anywhere in ARAX's code-base, if you want to translate a provided_by CURIE from KG2 into a valid URI, just do the python-requests version of this POST, against the KG2 API:

curl -X 'POST' \
  'https://arax.ncats.io/api/rtxkg2/v1.0/query?bypass_cache=false' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "asynchronous": "false",
  "bypass_cache": "true",
  "max_results": 100,
  "message": {
    "query_graph": {
      "edges": {
      },
      "nodes": {
        "n00": {
          "id": "SEMMEDDB:"
        }
      }
    }
  },
  "page_number": 1,
  "page_size": 50
}'

you can pull the URI out of the response, see here:

{
  "context": "https://raw.githubusercontent.com/biolink/biolink-model/master/context.jsonld",
  "datetime": "2021-05-05 21:45:32",
  "description": "Normal completion",
  "logs": [
    <STEVE CUT SOME STUFF>
  ],
  "message": {
    "knowledge_graph": {
      "edges": {},
      "nodes": {
        "SEMMEDDB:": {
          "attributes": [
            {
              "name": "iri",
              "type": "biolink:IriType",
              "url": "https://skr3.nlm.nih.gov/SemMedDB",
              "value": "https://skr3.nlm.nih.gov/SemMedDB"
            },
            {
              "name": "all_names",
              "type": "biolink:synonym",
              "value": [
                "Semantic Medline Database (SemMedDB)"
              ]
            },
            {
              "name": "all_categories",
              "type": "biolink:Unknown",
              "value": [
                "biolink:InformationContentEntity"
              ]
            },
            {
              "name": "expanded_categories",
              "type": "biolink:Unknown",
              "value": [
                "biolink:InformationContentEntity",
                "biolink:NamedThing"
              ]
            },
            {
              "name": "equivalent_curies",
              "type": "biolink:synonym",
              "value": [
                "SEMMEDDB:"
              ]
            }
          ],
          "category": [
            "biolink:InformationContentEntity"
          ],
          "name": "Semantic Medline Database (SemMedDB)"
        }
      }
    },
    "query_graph": {
      "edges": {},
      "nodes": {
        "n00": {
          "id": [
            "SEMMEDDB:"
          ],
          "is_set": false
        }
      }
    },
    "results": [
      {
        "confidence": 1,
        "description": "No description available",
        "edge_bindings": {},
        "essence": "Semantic Medline Database (SemMedDB)",
        "essence_category": "None",
        "node_bindings": {
          "n00": [
            {
              "id": "SEMMEDDB:"
            }
          ]
        },
        "reasoner_id": "ARAX",
        "row_data": [
          1,
          "Semantic Medline Database (SemMedDB)",
          "None"
        ]
      }
    ]
  },
  "operations": {
    "actions": [
      "expand(kp=ARAX/KG2)",
      "resultify()",
      "return(store=false)"
    ]
  },
  "query_options": {
    "actions": [
      "expand(kp=ARAX/KG2)",
      "resultify()",
      "return(store=false)"
    ]
  },
  "reasoner_id": "ARAX",
  "schema_version": "1.0.0",
  "status": "OK",
  "table_column_names": [
    "confidence",
    "essence",
    "essence_category"
  ],
  "tool_version": "ARAX 0.7.0",
  "type": "translator_reasoner_response"
}

@saramsey
Copy link
Member

saramsey commented May 5, 2021

This is intentional, so not a bug

I think we might need to make a decision, since there are some with trailing : and some without, which makes it difficult to figure out what to use for the filter

Yeah, the biolink model is a bit formal here and not so helpful, but my understanding was that the KPs were agreed that provided_by should be a CURIE, and that understanding is what I see from the KGX JSON example on the KGX GitHub site:

Screen Shot 2021-05-05 at 2 52 09 PM

Also, if there is any reversal of the decision to have provided_by be a CURIE, that would likely break a lot of KG2 build code.

@dkoslicki
Copy link
Member Author

dkoslicki commented May 5, 2021

@saramsey I think the main issue is a practical one. If a user says "I want to filter out SEMMEDDB edges", it's unclear if they should do this in DSL with:
filter_kg(action=remove_edges_by_property,edge_property=biolink:provided_by, property_value=SEMMEDDB) or filter_kg(action=remove_edges_by_property,edge_property=biolink:provided_by, property_value=SEMMEDDB:) or filter_kg(action=remove_edges_by_property,edge_property=biolink:provided_by, property_value=SemMedDB) or filter_kg(action=remove_edges_by_property,edge_property=biolink:provided_by, property_value=biolink:SEMMEDDB) (since the property starts with biolink) etc.

I see that we have provided_by (biolink:provided_by): ARAX with no colon, and I could swear I say more without a : anywhere, but am having trouble finding them. I'll update if I find any more examples.

@dkoslicki
Copy link
Member Author

Perhaps we can address this issue by:

  1. Stating in the documentation and the UI that the value to property_value= needs to be a CURIE and
  2. Throw an error if the value of property_value= is missing a :
  3. Make sure that all KG2 provided_by entries have a :

@dkoslicki
Copy link
Member Author

@finnagin with the most recent deployment, it looks like the DSL you wrote here doesn’t actually remove the SEMMEDDB: edges. Mind taking a look?

@saramsey
Copy link
Member

saramsey commented May 6, 2021

Hi @dkoslicki thank you for expanding on the issue. Very helpful. Looks like the apple cart may again be turned over by #1414. And on that note, @amykglen has a dictionary of provider strings that we have developed. I wonder if (assuming the change to "infores" described in the parent issue linked in #1414 gets ratified) we should (in the DSL) have the user specify the "infores" string like semmeddb and then the infores: prefix can be prepended in software.

@saramsey
Copy link
Member

saramsey commented May 6, 2021

And I totally get that requiring a "magic" CURIE in the DSL is not user-friendly. I definitely agree we should address this; I think in middleware.

@saramsey
Copy link
Member

saramsey commented May 6, 2021

Current provided_by strings in KG2.6.2, with counts. Based on the following Cypher:

match (n) return distinct n.provided_by, count(*) as c order by c descending;

Results are:

"n.provided_by"                                │"c"    │
╞═══════════════════════════════════════════════╪═══════╡
│"identifiers_org_registry:umls"                │3547714│
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:chembl.compound"     │1976191│
├───────────────────────────────────────────────┼───────┤
│"umls_source:NCBITAXON"                        │1776515│
├───────────────────────────────────────────────┼───────┤
│"PathWhiz:"                                    │373744 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:SNOMEDCT"                         │359358 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:MSH"                              │298495 │
├───────────────────────────────────────────────┼───────┤
│"OBO:pr.owl"                                   │288104 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:LNC"                              │260727 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ICD10PCS"                         │189503 │
├───────────────────────────────────────────────┼───────┤
│"OBO:chebi.owl"                                │156123 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:NCI"                              │147265 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:OMIM"                             │111589 │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:smpdb"               │110217 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:RXNORM"                           │108937 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:FMA"                              │104529 │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ICD10CM"                          │84684  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:MEDDRA"                           │73943  │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:ensembl"             │67993  │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:reactome"            │67306  │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:ncbigene"            │62035  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:MTH"                              │61576  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:GO"                               │45033  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:HGNC"                             │41095  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:NDDF"                             │29990  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:VANDF"                            │28496  │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:uniprot"             │26513  │
├───────────────────────────────────────────────┼───────┤
│"OBO:foodon.owl"                               │24912  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ICD9CM"                           │22414  │
├───────────────────────────────────────────────┼───────┤
│"EFO:efo.owl"                                  │22317  │
├───────────────────────────────────────────────┼───────┤
│"OBO:mondo.owl"                                │21785  │
├───────────────────────────────────────────────┼───────┤
│"KEGG_source:"                                 │18782  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:HPO"                              │14593  │
├───────────────────────────────────────────────┼───────┤
│"OBO:doid.owl"                                 │14356  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:CPT"                              │14335  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:PDQ"                              │13342  │
├───────────────────────────────────────────────┼───────┤
│"OBO:go/extensions/go-plus.owl"                │12957  │
├───────────────────────────────────────────────┼───────┤
│"OBO:uberon/ext.owl"                           │12342  │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ICD10"                            │12319  │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:hmdb"                │10001  │
├───────────────────────────────────────────────┼───────┤
│"ORPHANET:"                                    │8109   │
├───────────────────────────────────────────────┼───────┤
│"umls_source:PSY"                              │7969   │
├───────────────────────────────────────────────┼───────┤
│"umls_source:DRUGBANK"                         │7562   │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:ttd.target"          │7147   │
├───────────────────────────────────────────────┼───────┤
│"umls_source:HCPCS"                            │6968   │
├───────────────────────────────────────────────┼───────┤
│"umls_source:HL7"                              │6594   │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ATC"                              │6335   │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:drugbank"            │6009   │
├───────────────────────────────────────────────┼───────┤
│"DrugCentral:"                                 │4643   │
├───────────────────────────────────────────────┼───────┤
│"OBO:ncbitaxon/subsets/taxslim.owl"            │3616   │
├───────────────────────────────────────────────┼───────┤
│"OBO:ehdaa2.owl"                               │2659   │
├───────────────────────────────────────────────┼───────┤
│"OBO:hp.owl"                                   │2188   │
├───────────────────────────────────────────────┼───────┤
│"OBO:pato.owl"                                 │2142   │
├───────────────────────────────────────────────┼───────┤
│"OBO:genepio.owl"                              │2050   │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:mirbase"             │1918   │
├───────────────────────────────────────────────┼───────┤
│"OBO:mi.owl"                                   │1485   │
├───────────────────────────────────────────────┼───────┤
│"OBO:cl.owl"                                   │798    │
├───────────────────────────────────────────────┼───────┤
│"OBO:nbo.owl"                                  │782    │
├───────────────────────────────────────────────┼───────┤
│"biolink_download_source:biolink-model.owl.ttl"│754    │
├───────────────────────────────────────────────┼───────┤
│"umls_source:MED-RT"                           │680    │
├───────────────────────────────────────────────┼───────┤
│"OBO:fma.owl"                                  │660    │
├───────────────────────────────────────────────┼───────┤
│"OBO:ro.owl"                                   │471    │
├───────────────────────────────────────────────┼───────┤
│"umls_source:MEDLINEPLUS"                      │298    │
├───────────────────────────────────────────────┼───────┤
│"OBO:ino.owl"                                  │295    │
├───────────────────────────────────────────────┼───────┤
│"OBO:bspo.owl"                                 │170    │
├───────────────────────────────────────────────┼───────┤
│"UMLS_STY:"                                    │129    │
├───────────────────────────────────────────────┼───────┤
│"OBO:ddanat.owl"                               │70     │
├───────────────────────────────────────────────┼───────┤
│"OBO:bfo.owl"                                  │51     │
├───────────────────────────────────────────────┼───────┤
│"umls_source:HCPT"                             │44     │
├───────────────────────────────────────────────┼───────┤
│"SEMMEDDB:"                                    │38     │
├───────────────────────────────────────────────┼───────┤
│"umls_source:ICD10AE"                          │1      │
├───────────────────────────────────────────────┼───────┤
│"umls_source:NDFRT"                            │1      │
├───────────────────────────────────────────────┼───────┤
│"UNICHEM_source:"                              │1      │
├───────────────────────────────────────────────┼───────┤
│"REPODB:"                                      │1      │
├───────────────────────────────────────────────┼───────┤
│"JensenLab:"                                   │1      │
├───────────────────────────────────────────────┼───────┤
│"identifiers_org_registry:intact"              │1      │
├───────────────────────────────────────────────┼───────┤
│"DisGeNET:"                                    │1      │
├───────────────────────────────────────────────┼───────┤
│"RTX:"                                         │1      │
└───────────────────────────────────────────────┴─────

@dkoslicki
Copy link
Member Author

Closing this one since attributes were changed around quite a bit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants