-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow FET #1402
Comments
@dkoslicki, I think the most time-consuming part within The most time-consuming part is that given a bunch of nodes with specific type, we have to count the number of the adjacent nodes with specific type for each of these nodes. Originally, we use |
hmm, yeah, definitely not ideal to have to return all of those neighbor nodes when all FET really needs is the count. plover has constant-time access to these counts (how many neighbors of each category a given node has). alternatively, I suppose these could be precomputed and put in a sqlite database or something? the KG2c build process could pretty easily output a sqlite file with these counts (so, for example, you could look up a particular node ID and get a little JSON string like "{"biolink:Protein": 21, "biolink:Disease": 10, ...}") |
@amykglen, if it is easy to build such database, I think this should be the fastest way to get these counts. If plover only needs constant-time, I think this is also much faster than the DSL query I'm using now. Not sure why one might be better? |
I see! Yes, given we just need counts, there should be a way to pre-compute this easily. Here's an example of such a slow query if you'd like it for testing (credit to @amykglen ):
30 minutes and it still hasn't terminated. |
ok, I adjusted the KG2c build process so that it adds these neighbor counts to |
@chunyuma - ok,
the
let me know if you have any questions, @chunyuma! (fyi, if a node has no neighbors of a particular category, then there will be no entry for that category under that node.) |
Thanks so much @amykglen! I will try it later today and let you know if I have any problems. |
Do I need to run the ARAX_database_manager or something on all the deployments in response to this? |
it's not required at this point, since there's no code yet that actually uses these counts. if you're worried about it, you can delete each endpoint's (basically, I'm pretty sure the |
Hi @edeutsch, can I know if the |
One thing that I concerned is that if the users specify |
was this little discrepancy already at play with the old method, @chunyuma? (meaning, with the old method where you used I'm also not totally clear on what your concern is relating to |
I am fairly certain (by looking at the code; I have not performed an experiment to verify) that nodesynonymizer.get_total_entity_count() returns the total number of coalesced concepts, not nodes. And KG2c is based on concepts, so the method should currently be returning results that match KG2c. This was done intentionally because it seems most useful to know the non-redundant number of concepts rather than the highly redundant number of nodes. So I think everything is as you want it? Let me know if you have evidence to the contrary. Note that the kg_name='KG2' parameter is deprecated and ignored if supplied (for backward compatibility). There is only one source for the data now and that is (effectively) KG2c. |
Yes, I think it has been little discrepancy at playing with old method. FET didn't check the |
Ah, I see, thanks @edeutsch! |
ok, I see. I thought that I recall that being desired, to always use KG2c to count neighbors, regardless of whether KG1/KG2/KG2c is used? but maybe I'm just remembering wrong. but regardless, I think it's not much of a concern right now since KG1 will be deprecated soon and the regular KG2 can't be used through the KG2 API anyway (since the default is to use KG2c).. maybe if you're worried about it though you could use the old method if |
ok, based on the feedback from @edeutsch and @amykglen, in order to make FET run faster, I will update the FET code on |
I think the discussion we once had was way back when we were querying KG2 directly (before KG2c existed), it was preferable to use NodeSynonymizer concepts instead of KG2 nodes because coalesced concepts are non-redundant and that's good (e.g. there were 20,000 protein concepts but 200,000 nodes, and 20,000 is approximately the right answer). Since we no longer query KG2 directly, the point is somewhat moot since KG2c entity counts is the only thing that makes sense. Or maybe I'm misremembering history.. |
ok, FET script is updated in
On thing I have to point out is that if we query DSL with |
Confirmed working in the deployment at https://arax.ncats.io/NewFmt/. Great work @chunyuma ! |
I don't know if this is due to how the Fisher exact test is querying
expand
, but I noticed that a very simple FET query takes quite a bit of time. Eg. the default "protein targets of acetaminophen" query without the results limited gives ~230 proteins. Trying to add FET in there:Results in a query that takes 2.5 minutes:
Given the desire to use FET in the similarity search, we should investigate why this is. It seems it might be due to computing the total number of (in this case), Chemical Substances.
I tried this yesterday with a two hop query and it took over 30min. I no longer have that query, but will update this issue when I can replicate it.
The text was updated successfully, but these errors were encountered: