RTX-KG2 endpoint queries load and slowness and BTE #1680
Following up via Slack DM.
Issue starting up again right now. Also, should issue #1678 be closed in favor of this?
Here is a block of several long-running KG2 API queries with not many results (few results are often a correlate of long-running queries, though not in this case). For convenience, I will link the first few here: query 23217, query 23128, query 23219.
I have re-blocked the offending BTE IP and restarted the kg2 and arax services, hopefully clearing everything.
The query_graph for query 23217 seems innocent enough at first glance (screenshot of the query_graph omitted).
Hi @saramsey. The query above yields 345 results from kg2, but it was submitted dozens of times in a row in the past hour (yes, the identical query every few seconds; why??). Also, notice the short completion times (screenshot omitted). Would it be useful to pull the submitted queries and look for the most common/duplicates, @edeutsch?
@isbluis I am guessing that will not help very much.
What might be useful to know is whether that was the first time that query appeared. It probably was. If so, that would lend credence to the idea that this is a caching problem: once a query has run once, subsequent instances are fine.
Hi @edeutsch: a few findings from the host OS on the server in question:
Assuming that the kernel ring buffer timestamps are in UTC, I am seeing that the OOM killer terminated processes at the following timestamps (list omitted):
@edeutsch do those timestamps correlate at all with system slowdowns? I thought we had a major slowdown event at 2021.09.29 03:47, and oddly enough, I don't see the OOM killer having kicked in at that time. So maybe these things are uncorrelated. Nevertheless, it is a bit concerning that the OOM killer is having to intervene several times a day.
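For reference, a minimal sketch of pulling those OOM events out of the kernel ring buffer programmatically (assuming a Linux host where `dmesg -T` is available and readable by the current user):

```python
import re
import subprocess

# Read the kernel ring buffer with human-readable timestamps (-T) and keep
# only the lines where the OOM killer terminated a process.
dmesg = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True)
oom_lines = [line for line in dmesg.stdout.splitlines()
             if re.search(r"Out of memory|oom-kill|Killed process", line)]
for line in oom_lines:
    print(line)
```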
From the nginx logfile (log excerpt omitted):
Interestingly, if we look at bracketing log entries in that logfile, it looks like the slowdown occurred right when there was a …
Is it possible that the …?
It would seem highly unlikely to me.
I think we would see a 499 in the nginx logfile if the client had closed the connection before the server responded; see https://stackoverflow.com/questions/12973304/possible-reason-for-nginx-499-error-codes
There was another GET query to … (log excerpt omitted).
ARAX production was just noticed to be dead. I have restarted it.
When I add a fifth client on another system (a …): (test output omitted)
Conclusions so far:
With five clients loading the system simultaneously, I am getting an overall (combined) throughput rate of one query serviced per two seconds, which is about what I got with one client. BTE was loading the system with one query per three seconds, so we are only loading the system slightly more than BTE was. I am not currently seeing catastrophic slowdowns, however. I will do some more testing with other types of ARAX queries to see if they can trip a catastrophic slowdown.
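For the record, the gist of this kind of load test is simple enough to sketch. The endpoint URL and query payload below are stand-ins, not the exact ones used in the test above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://arax.ncats.io/api/rtxkg2/v1.1/query"  # stand-in URL
QUERY = {"message": {"query_graph": {}}}  # substitute a real query_graph here

def one_client(n_queries: int) -> float:
    """Submit n_queries back to back; return mean seconds per query."""
    start = time.time()
    for _ in range(n_queries):
        requests.post(ENDPOINT, json=QUERY, timeout=300)
    return (time.time() - start) / n_queries

# Five concurrent clients, mimicking the test described above.
with ThreadPoolExecutor(max_workers=5) as pool:
    per_client = list(pool.map(one_client, [20] * 5))
print("mean seconds/query, per client:", per_client)
```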
Hopefully not me! I am running against …
I am unfortunately not able to test the effect of clicking the "system activity" button in the ARAX browser UI on the functioning of the RTX-KG2 API on the same server (arax-backup). Hey @finnagin, do you know what might be going on there?
It does seem that the KG2 process takes up a lot of RAM (process listing omitted; PID 5628 is the KG2 process).
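A quick way to confirm how much resident memory that process is using (the PID is of course machine-specific, and `psutil` is assumed to be installed):

```python
import psutil

# PID 5628 was the KG2 process in the snapshot above.
proc = psutil.Process(5628)
rss_gib = proc.memory_info().rss / 2**30
print(f"{proc.name()} (pid {proc.pid}): {rss_gib:.1f} GiB resident")
```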
I thought the system activity UI tab required the MySQL database. Is that correct, @edeutsch? If so, that would be why it doesn't work on arax-backup.
Thanks @finnagin, that is helpful.
correct, it does. But isn't the MySQL database replicated on arax-backup?
The "system activity" may be a red herring in this investigation, as @edeutsch has posited.
Ah, let me start mysql...
Are you saying that (in the context of a fork scenario) you want to have a Unix pipe connecting the child process's STDOUT to a file descriptor in the parent process that the parent can read from? That is doable. The …
Or are you wanting the parent process to "fire and forget", i.e., fork and have the child go off and do the work, while the parent process meanwhile goes on (simultaneously) to handle the next incoming HTTP request? I assume you are talking about a change to this function: RTX/code/UI/OpenAPI/python-flask-server/KG2/openapi_server/controllers/query_controller.py, line 14 (at commit d01fb6f).
well, yes and no. Similar to your "fire and forget", but it's deep in the bowels of the system. The system is already multithreaded, so each request is its own thread (but subject to the GIL); the idea would be to fork a child, have the child do the work while the parent waits for it to finish, and then have the child exit, taking all its problems with it, leaving the parent fresh and unsullied to rejoin the pool. The tricky part is having the child communicate the results back while the parent waits quietly.
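A minimal sketch of that pattern, using hypothetical names (`handle_request`, `run_query`) rather than the actual query_controller code: the parent forks, blocks until the child finishes, and reads the result back through a pipe, so whatever memory the child bloats or corrupts dies with it.

```python
import json
import os

def run_query(query: dict) -> dict:
    # Stand-in for the real ARAX/KG2 query machinery.
    return {"results": [], "echo": query}

def handle_request(query: dict) -> dict:
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the heavy work, ship the result back, then exit hard
        # (os._exit skips the parent's atexit/cleanup handlers).
        os.close(read_fd)
        result = run_query(query)
        with os.fdopen(write_fd, "w") as pipe_out:
            json.dump(result, pipe_out)
        os._exit(0)
    # Parent: wait quietly, reading until the child closes its end of the pipe.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe_in:
        payload = pipe_in.read()
    os.waitpid(pid, 0)  # reap the child after its output is drained
    return json.loads(payload)

print(handle_request({"message": {}}))
```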
Ah, I understand! Thank you for patiently explaining. Yes, I think …
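The tail of this comment was lost; judging from the reply below, the suggestion involved `subprocess`. A sketch of that style of hand-off, with a hypothetical worker script name, for contrast with the fork approach:

```python
import json
import subprocess
import sys

# Run the worker in a fresh interpreter instead of forking; it reads the
# query on stdin and writes the result as JSON on stdout.
# "run_query_worker.py" is a hypothetical script name.
completed = subprocess.run(
    [sys.executable, "run_query_worker.py"],
    input=json.dumps({"message": {}}),  # placeholder query
    capture_output=True,
    text=True,
    check=True,
)
result = json.loads(completed.stdout)
```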
thanks, but I am forking, not running a subprocess command.
I finally got a workable setup in devED and was all happy, only to discover that the child processes intermittently seg fault. I have no solution for that. I wonder if a Python upgrade from 3.7.3 to 3.7.9 might fix that.
It took me a while to figure out that the definition for the class …
I think …
I think the main flaw in this approach (forking in query_controller.py) is that the streaming features that we have won't work; the streaming is performed below this layer. I have implemented in devED something similar to what you have shown above, and it usually works (go ahead and try a query in devED and you will be using a child process to do the work). But sometimes (~10% of the time) it appears that the child seg faults while doing Expand asyncio. This is making me sad. I am wondering if a Python version upgrade might fix this. If I can fix that, then we may be in good shape.
Perhaps we could delete …
interesting about the seg faults, @edeutsch - that reminds me that a couple of weeks ago I was experiencing intermittent seg faults in my dev setup when running the pytest suite. I thought maybe something was messed up with my Python install, so I upgraded from 3.7.8 to 3.7.9, and I haven't seen any seg faults since. (Not sure if that's related to what you're seeing, but I thought I'd note it just in case.)
ah, interesting indeed. I don't think there's a problem in our code, but I am uncertain. Is anyone skillful enough and willing to install Python 3.7.12 and its modules in our container? I did a bit of a hack job compiling 3.7.3 myself in /mnt/data/python/ a couple of years ago; maybe it would be good to do it better this time?
yes, we should; that was the intent, but I forgot. Anyone want to try?
I have deleted it in the …. I have logged this specific request (to remove …).
Fixed (I think?) in #1691 (see PR #1692). I think if we can do it, it is safer to …
When I try to run the standard example JSON query through the fork-in-ARAXQuery code (via https://arax-backup.rtx.ai/devED), I am seeing this error: (error output omitted). Running the same query with the "fork in query_controller" code (via https://arax-backup.rtx.ai/fork-test) produces: (output omitted). Obviously, more testing is needed; this is just a single test run, etc. See my specific testing recommendations in #1691 and #1692.
yes, seems to be. Although there is a minor issue of apparent buffering through the pipe.
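The buffering is expected: stdio writing into a pipe is block-buffered rather than line-buffered, so output can arrive in bursts. Flushing after each progress message, or writing to the raw descriptor, keeps the reader's view current; a small sketch:

```python
import os
import sys

# Buffered stream: force each progress message out immediately.
print("Expand: 50% complete", flush=True)

# Or bypass stdio buffering entirely with a raw write to the pipe fd.
os.write(sys.stdout.fileno(), b"Expand: 51% complete\n")
```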
This sounds reasonable. |
Python 3.7.12 has been installed in /mnt/data/orangeboard/python/Python-3.7.12 in arax-backup.rtx.ai inside the rtx1 container. |
Thanks. Although I don't really use arax-backup; I could test 3.7.12 on the main arax.
OK, installing now...
thanks!
It looks like the forking query_controller.py has been working for a while, so I am going to close this issue out.
As I posted on Slack, Chunlei requested that we unblock his BTE servers, and I have done so. Shortly thereafter we received a (light) barrage of queries from them, from query 23158 (2021-09-29 03:46:57 UTC) to query 23236 (2021-09-29 03:50:39 UTC), visible in the table at https://arax.ncats.io/?recent=1.
Many of the queries complete in less than 10 seconds and we can sustain this load.
But then toward the end of the barrage there were a set of queries that took 200+ seconds.
It would be super useful if someone can try to figure out what's going on here.
Are these legitimately hard queries? Is there some strange bottleneck causing a problem where there shouldn't be one?
Amy previously observed a possible NodeSynonymizer problem where it suddenly took dozens of seconds to respond to what should have been a subsecond query. Is that the source?
We need more diagnostics to understand what's going on.
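As one cheap diagnostic along those lines, the suspect call could be wrapped with a timer so that intermittent multi-second NodeSynonymizer responses get logged when they happen (the wrapped method name below is an assumption about the API):

```python
import time
from functools import wraps

def timed(fn, threshold_s: float = 1.0):
    """Wrap fn so that any call exceeding threshold_s gets logged."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        elapsed = time.time() - start
        if elapsed > threshold_s:
            print(f"SLOW: {fn.__name__} took {elapsed:.1f} s")
        return result
    return wrapper

# Hypothetical usage, assuming a get_canonical_curies method:
# synonymizer.get_canonical_curies = timed(synonymizer.get_canonical_curies)
```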
I suspect we can sustain this load with current hardware if we can figure out what's going wrong here.
I need help.
thanks!