Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow FET #1402

Closed
dkoslicki opened this issue Apr 22, 2021 · 20 comments
Closed

Slow FET #1402

dkoslicki opened this issue Apr 22, 2021 · 20 comments

Comments

@dkoslicki
Copy link
Member

I don't know if this is due to how the Fisher exact test is querying expand, but I noticed that a very simple FET query takes quite a bit of time. Eg. the default "protein targets of acetaminophen" query without the results limited gives ~230 proteins. Trying to add FET in there:

add_qnode(name=acetaminophen, category=biolink:ChemicalSubstance, key=n0)
add_qnode(category=biolink:Protein, key=n1)
add_qedge(subject=n0, object=n1, key=e0)
expand(edge_key=e0,kp=ARAX/KG2)
overlay(action=fisher_exact_test,subject_qnode_key=n0,virtual_relation_label=F1,object_qnode_key=n1,filter_type=top_n,value=10)
resultify()

Results in a query that takes 2.5 minutes:

2021-04-22T18:27:06.717156 DEBUG: 233 object nodes with qnode key n1 and node type biolink:Protein was found in message KG and used to calculate Fisher's Exact Test
2021-04-22T1**8:27:06**.717177 DEBUG: ARAX/KG2 was used to calculate total object nodes in Fisher's Exact Test
2021-04-22T1**8:29:29**.394992 DEBUG: ARAX/KG2C were used to calculate total number of node with the same type of source node in Fisher's Exact Test
2021-04-22T18:29:29.395089 DEBUG: Total 2049669 unique concepts with node category biolink:ChemicalSubstance was found in ARAX/KG2C

Given the desire to use FET in the similarity search, we should investigate why this is. It seems it might be due to computing the total number of (in this case), Chemical Substances.

I tried this yesterday with a two hop query and it took over 30min. I no longer have that query, but will update this issue when I can replicate it.

@chunyuma
Copy link
Collaborator

@dkoslicki, I think the most time-consuming part within FET is not due to computing the total number of specific types. Thanks to @edeutsch's nodesynonymizer, we now can easily access this information by using nodesynonymizer.get_total_entity_count(node_type, kg_name=kg). So this should be fast.

The most time-consuming part is that given a bunch of nodes with specific type, we have to count the number of the adjacent nodes with specific type for each of these nodes. Originally, we use cypher query to access this information here (it should be faster) but for some reasons (it seems like we don't want to depend on cypher query) we decided to use DSL query here. Actually this is the most time-consuming part of FET.

@amykglen
Copy link
Member

hmm, yeah, definitely not ideal to have to return all of those neighbor nodes when all FET really needs is the count.

plover has constant-time access to these counts (how many neighbors of each category a given node has). alternatively, I suppose these could be precomputed and put in a sqlite database or something? the KG2c build process could pretty easily output a sqlite file with these counts (so, for example, you could look up a particular node ID and get a little JSON string like "{"biolink:Protein": 21, "biolink:Disease": 10, ...}")

@chunyuma
Copy link
Collaborator

alternatively, I suppose these could be precomputed and put in a sqlite database or something? the KG2c build process could pretty easily output a sqlite file with these counts (so, for example, you could look up a particular node ID and get a little JSON string like "{"biolink:Protein": 21, "biolink:Disease": 10, ...}")

@amykglen, if it is easy to build such database, I think this should be the fastest way to get these counts. If plover only needs constant-time, I think this is also much faster than the DSL query I'm using now. Not sure why one might be better?

@dkoslicki
Copy link
Member Author

I see! Yes, given we just need counts, there should be a way to pre-compute this easily. Here's an example of such a slow query if you'd like it for testing (credit to @amykglen ):

add_qnode(id=MONDO:0001475, key=n0, category=biolink:Disease)
add_qnode(category=biolink:Protein, key=n1, is_set=true)
add_qedge(subject=n0, object=n1, key=e0)
expand(kp=ARAX/KG2,edge_key=e0,continue_if_no_results=false,enforce_directionality=false,use_synonyms=true)
overlay(action=fisher_exact_test,subject_qnode_key=n0,virtual_relation_label=F1,object_qnode_key=n1,filter_type=top_n,value=50)
resultify()
add_qnode(category=biolink:Drug, key=n2)
add_qedge(subject=n1, object=n2, key=e1)
expand(kp=ARAX/KG2,edge_key=e1,continue_if_no_results=false,enforce_directionality=false,use_synonyms=true)
overlay(action=fisher_exact_test,subject_qnode_key=n1,virtual_relation_label=F2,object_qnode_key=n2,filter_type=top_n,value=50)
resultify()
filter_results(action=limit_number_of_results, max_results=30)

30 minutes and it still hasn't terminated.

@amykglen
Copy link
Member

ok, I adjusted the KG2c build process so that it adds these neighbor counts to kg2c.sqlite - now I just need to regenerate the full kg2c.sqlite and put it in place on arax.ncats.io... will let @chunyuma know when that is done!

@amykglen
Copy link
Member

@chunyuma - ok, kg2c_v1.1_KG2.5.2.sqlite has been rolled out (so if you delete your local configv2.json it will auto-download next time you run an ARAX query) and it contains neighbor counts by category for KG2c nodes - the code to get the neighbor counts from it would look something like this (this is what I used to test):

# Get connected to kg2c sqlite
response.debug(f"Looking up neighbor counts in sqlite")
path_list = os.path.realpath(__file__).split(os.path.sep)
rtx_index = path_list.index("RTX")
rtxc = RTXConfiguration()
sqlite_dir_path = os.path.sep.join([*path_list[:(rtx_index + 1)], 'code', 'ARAX', 'KnowledgeSources', 'KG2c'])
sqlite_name = rtxc.kg2c_sqlite_path.split('/')[-1]
sqlite_file_path = f"{sqlite_dir_path}{os.path.sep}{sqlite_name}"
connection = sqlite3.connect(sqlite_file_path)
cursor = connection.cursor()

# Extract the neighbor count data
node_keys = list(message.knowledge_graph.nodes)
node_keys_str = "','".join(node_keys)  # SQL wants ('node1', 'node2') format for string lists
sql_query = f"SELECT N.id, N.neighbor_counts " \
            f"FROM neighbors AS N " \
            f"WHERE N.id IN ('{node_keys_str}')"
cursor.execute(sql_query)
rows = cursor.fetchall()
cursor.close()
connection.close()
neighbor_counts_dict = dict()

# Load the counts into a dictionary
for row in rows:
    node_id = row[0]
    neighbor_counts = ujson.loads(row[1])
    neighbor_counts_dict[node_id] = neighbor_counts
print(neighbor_counts_dict)

the neighbor_counts_dict this code creates would look like:

{'CHEBI:39027': {'biolink:BiologicalEntity': 924, 'biolink:ChemicalSubstance': 265, 'biolink:Drug': 265, 'biolink:MolecularEntity': 566, 'biolink:NamedThing': 1145, 'biolink:AnatomicalEntity': 152, 'biolink:OrganismalEntity': 170, 'biolink:Procedure': 41, 'biolink:GeneFamily': 2, 'biolink:Gene': 115, 'biolink:GenomicEntity': 148, 'biolink:Protein': 115, 'biolink:OntologyClass': 16, 'biolink:CellularComponent': 7, 'biolink:GrossAnatomicalStructure': 11, 'biolink:Disease': 108, 'biolink:DiseaseOrPhenotypicFeature': 108, 'biolink:PhenotypicFeature': 108, 'biolink:RNAProduct': 1, 'biolink:Transcript': 1, 'biolink:BiologicalProcessOrActivity': 65, 'biolink:MolecularActivity': 24, 'biolink:BiologicalProcess': 40, 'biolink:PhysiologicalProcess': 31, 'biolink:PathologicalProcess': 9, 'biolink:AdministrativeEntity': 6, 'biolink:Agent': 6, 'biolink:Activity': 7, 'biolink:InformationContentEntity': 6, 'biolink:IndividualOrganism': 11, 'biolink:PopulationOfIndividualOrganisms': 5, 'biolink:OrganismTaxon': 14, 'biolink:Cell': 9, 'biolink:Phenomenon': 2, 'biolink:PhysicalEntity': 1, 'biolink:Metabolite': 1}, 'CHEBI:4056': {'biolink:BiologicalEntity': 2030, 'biolink:ChemicalSubstance': 829, 'biolink:Drug': 829, 'biolink:MolecularEntity': 1386, 'biolink:NamedThing': 2552, 'biolink:AnatomicalEntity': 261, 'biolink:OrganismalEntity': 370, 'biolink:Gene': 177, 'biolink:GenomicEntity': 222, 'biolink:Protein': 177, 'biolink:Procedure': 46, 'biolink:Cell': 21, 'biolink:OntologyClass': 20, 'biolink:GrossAnatomicalStructure': 25, 'biolink:PhysicalEntity': 1, 'biolink:CellularComponent': 23, 'biolink:RNAProduct': 1, 'biolink:Transcript': 1, 'biolink:BiologicalProcessOrActivity': 122, 'biolink:MolecularActivity': 40, 'biolink:BiologicalProcess': 79, 'biolink:PhysiologicalProcess': 65, 'biolink:PathologicalProcess': 14, 'biolink:Disease': 135, 'biolink:DiseaseOrPhenotypicFeature': 135, 'biolink:PhenotypicFeature': 135, 'biolink:IndividualOrganism': 101, 'biolink:InformationContentEntity': 7, 'biolink:AdministrativeEntity': 6, 'biolink:Agent': 6, 'biolink:Activity': 9, 'biolink:PopulationOfIndividualOrganisms': 6, 'biolink:OrganismTaxon': 84, 'biolink:GeneFamily': 5, 'biolink:Metabolite': 2, 'biolink:Phenomenon': 7, 'biolink:Device': 1}}

let me know if you have any questions, @chunyuma! (fyi, if a node has no neighbors of a particular category, then there will be no entry for that category under that node.)

@chunyuma
Copy link
Collaborator

Thanks so much @amykglen! I will try it later today and let you know if I have any problems.

@edeutsch
Copy link
Collaborator

Do I need to run the ARAX_database_manager or something on all the deployments in response to this?

@amykglen
Copy link
Member

amykglen commented Apr 27, 2021

it's not required at this point, since there's no code yet that actually uses these counts. if you're worried about it, you can delete each endpoint's configv2.json and all the rest will happen automatically.

(basically, I'm pretty sure the configv2.json will be auto-downloaded again onto all endpoints before @chunyuma's code that will actually use these counts will make it out to any endpoints. or, if we want to be more precise, I suppose the point at which to delete each endpoint's configv2.json would be whenever @chunyuma's eventual code for this issue is rolled out, if it's been less than 24 hours from now.)

@chunyuma
Copy link
Collaborator

Hi @edeutsch, can I know if the nodesynonymizer.get_total_entity_count(node_type, kg_name=KG2) returns the total entity count of given node type in kg2? If so, I'm wondering if it can return the count in kg2c. Since the counts stored in kg2c_v1.1_KG2.5.2.sqlite are based on kg2c. In order to make FET calculation correct, we need to make sure all counts are computed from kg2c.

@chunyuma
Copy link
Collaborator

chunyuma commented Apr 27, 2021

One thing that I concerned is that if the users specify use_synonyms=false in expand(), then the curies it returns are not the canonicalized and then the message.knowledge_graph.edges might not be the canonicalized but the kg2c_v1.1_KG2.5.2.sqlite is based on kg2c. This might cause some biases to FET calculation because we need to calculate the ratio of the count of adjacent nodes with specified node type of a given node in message.knowledge_graph and the count of adjacent nodes with that node type of that given node in kg2c_v1.1_KG2.5.2.sqlite. If kg2c_v1.1_KG2.5.2.sqlite is based on kg2c but message.knowledge_graph is based on kg1 or kg2 (if use_synonyms=false is specified), it might cause error for FET probability.

@amykglen
Copy link
Member

amykglen commented Apr 27, 2021

was this little discrepancy already at play with the old method, @chunyuma? (meaning, with the old method where you used expand() to get the neighbor counts, were you passing use_synonyms=false if the user had specified use_synonyms=false in their query? or were you just letting it use the default? (which is use_synonyms=true))

I'm also not totally clear on what your concern is relating to use_synonyms=false..

@edeutsch
Copy link
Collaborator

edeutsch commented Apr 27, 2021

Hi @edeutsch, can I know if the nodesynonymizer.get_total_entity_count(node_type, kg_name=KG2) returns the total entity count of given node type in kg2? If so, I'm wondering if it can return the count in kg2c. Since the counts stored in kg2c_v1.1_KG2.5.2.sqlite are based on kg2c. In order to make FET calculation correct, we need to make sure all counts are computed from kg2c.

I am fairly certain (by looking at the code; I have not performed an experiment to verify) that nodesynonymizer.get_total_entity_count() returns the total number of coalesced concepts, not nodes. And KG2c is based on concepts, so the method should currently be returning results that match KG2c. This was done intentionally because it seems most useful to know the non-redundant number of concepts rather than the highly redundant number of nodes.

So I think everything is as you want it? Let me know if you have evidence to the contrary.

Note that the kg_name='KG2' parameter is deprecated and ignored if supplied (for backward compatibility). There is only one source for the data now and that is (effectively) KG2c.

@chunyuma
Copy link
Collaborator

Yes, I think it has been little discrepancy at playing with old method. FET didn't check the use_synonyms but it checked the kg. If expand(kg=ARAX/KG1) is specified, then within FET it also uses expand(kg=ARAX/KG1) to get the neighbor counts. But right now, no matter the user use expand(kg=ARAX/KG1) or expand(kg=ARAX/KG2), if we use kg2c_v1.1_KG2.5.2.sqlite, it is always based on kg2c. So it is the thing that I concern.

@chunyuma
Copy link
Collaborator

Note that the kg_name='KG2' parameter is deprecated and ignored if supplied (for backward compatibility). There is only one source for the data now and that is (effectively) KG2c.

Ah, I see, thanks @edeutsch!

@amykglen
Copy link
Member

ok, I see. I thought that I recall that being desired, to always use KG2c to count neighbors, regardless of whether KG1/KG2/KG2c is used? but maybe I'm just remembering wrong.

but regardless, I think it's not much of a concern right now since KG1 will be deprecated soon and the regular KG2 can't be used through the KG2 API anyway (since the default is to use KG2c).. maybe if you're worried about it though you could use the old method if kp=ARAX/KG1?

@chunyuma
Copy link
Collaborator

ok, based on the feedback from @edeutsch and @amykglen, in order to make FET run faster, I will update the FET code on NewFmt branch and decide to always use the counts based on kg2c regardless of whether expand(kg=KG1/KG2/KG2c) is used or use_synonyms=false is specified. It might have little discrepancy for the FET probability right now, but once KG1 is deprecated, the FET calculation should be correct.

@edeutsch
Copy link
Collaborator

edeutsch commented Apr 27, 2021

ok, I see. I thought that I recall that being desired, to always use KG2c to count neighbors, regardless of whether KG1/KG2/KG2c is used? but maybe I'm just remembering wrong.

I think the discussion we once had was way back when we were querying KG2 directly (before KG2c existed), it was preferable to use NodeSynonymizer concepts instead of KG2 nodes because coalesced concepts are non-redundant and that's good (e.g. there were 20,000 protein concepts but 200,000 nodes, and 20,000 is approximately the right answer). Since we no longer query KG2 directly, the point is somewhat moot since KG2c entity counts is the only thing that makes sense. Or maybe I'm misremembering history..

@chunyuma
Copy link
Collaborator

ok, FET script is updated in NewFmt branch and it now runs faster. Thanks to @amykglen's kg2c_v1.1_KG2.5.2.sqlite, running the below DSL query posted by @dkoslicki only needs ~10 seconds now instead of 2.5 minutes:

"add_qnode(name=acetaminophen, categories=biolink:ChemicalSubstance, key=n0)",
"add_qnode(categories=biolink:Protein, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0,kp=ARAX/KG2)",
"overlay(action=fisher_exact_test,subject_qnode_key=n0,virtual_relation_label=F1,object_qnode_key=n1,filter_type=top_n,value=10)",
"resultify()"

On thing I have to point out is that if we query DSL with add_qedge(subject=n00, object=n01, key=e00, predicates=biolink:___) and expand(edge_key=e00, kp=ARAX/KG1), the FET result might not be accurate because this predicate might only exist in kg1 rather than kg2c. We have to query the neighbor count by using DSL query within FET for this case but nodesynonymizer.get_total_entity_count(node_type, kg_name=KG2) and kg2c.sqlite are based on kg2c, this might cause the FET probability inaccurate. This can be fixed once kg1 is deprecated.

@dkoslicki
Copy link
Member Author

Confirmed working in the deployment at https://arax.ncats.io/NewFmt/. Great work @chunyuma !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants