Investigating mysterious KG2 dying issue #1708
Interestingly, after posting the last message, I noticed that KG2 was dead again! I restarted it. nginx log again:
Unclear where the issue lies. The 499s seem to be an immediate cause or symptom of the failure. After googling around, I tried adding this to the nginx config:
We'll see if that has any effect.
So, on
I think we should change that to add the
Reverse DNS lookup of
Similarly, reverse DNS lookup of
ARAX also mysteriously died at around 8:50am. The log is at: . Restarted.
Done. Running this way now.
Regarding the specific IP address, a Slack message from Chunlei said:
OK, perhaps we should get the specific JSON query that Chunlei's team is running, so we can test it ourselves?
The intent of the "fork" architecture was to insulate the parent process from the child process dying, but perhaps that insulation is not working quite right and is taking the parent process down with it. This can be tested; I can set up a child process to do all manner of bad things (including SIGKILL itself).
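For reference, here's a minimal sketch of that kind of test: standalone Python, not the actual RTX forking code, just a parent that forks a child, lets the child SIGKILL itself, and confirms the parent survives and can see how the child died.

```python
# Standalone sketch: fork a child that SIGKILLs itself and confirm the
# parent process survives and can report the child's fate.
import os
import signal
import sys

pid = os.fork()
if pid == 0:
    # Child: do the worst possible thing.
    os.kill(os.getpid(), signal.SIGKILL)
    os._exit(1)  # never reached
else:
    # Parent: wait for the child and report what happened to it.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        print(f"child killed by signal {os.WTERMSIG(status)}; parent still running")
    sys.exit(0)
```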
It seems to me that the last couple of deaths have occurred during times when there were no active queries, but rather some obstreperous /meta_knowledge_graph calls. It might be that this is the result of some instability caused by forking. Or it might be completely unrelated.
KG2 died again:
nginx log (kg2 calls only):
So it would appear that the KG2 API was running until this request came in:
at which point it died. I wonder if high concurrency to the /meta_knowledge_graph endpoint is a killer?
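If we want to test that hypothesis directly, something along these lines might do it; the endpoint URL, worker count, and request count below are my guesses, not values taken from the logs.

```python
# Rough concurrency hammer for the /meta_knowledge_graph hypothesis.
import concurrent.futures
import requests

URL = "https://arax.ncats.io/test/api/rtxkg2/v1.2/meta_knowledge_graph"  # assumed path

def hit(i):
    try:
        return i, requests.get(URL, timeout=60).status_code
    except requests.RequestException as exc:
        return i, repr(exc)

# Fire 100 requests with 20 in flight at a time and print each outcome.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for i, outcome in pool.map(hit, range(100)):
        print(i, outcome)
```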
All incoming queries are logged here: unless the failure occurs so early that it doesn't reach ARAXQuery.query().
This time ARAX died instead of KG2. The log is here: The very last line of the log file was:
so it was apparently working on something at the time. It is an interesting coincidence that the nginx log shows this:
The only other request in the same second is a hit to /meta_knowledge_graph that failed. Hmm.
So the good news is that I've figured out how to repro the problem! The bad news is that I don't know how to fix it, but I'm hoping one of the experts out there can help! The problem is indeed the client hanging up prematurely. I can repro the problem with a little script that hits the /query endpoint in streaming mode but then hangs up. The easy way is with something like this:

curl -X POST "https://arax.ncats.io/test/api/arax/v1.2/query" -H "accept: application/json" -H "Content-Type: application/json" -d '{"stream_progress":true,"bypass_cache":true,"max_results":100,"submitter":"ISB_watchdog","message":{"query_graph":{"edges":{"e00":{"object":"n01","predicates":["biolink:physically_interacts_with"],"subject":"n00"}},"nodes":{"n00":{"categories":["biolink:ChemicalEntity"],"ids":["CHEMBL.COMPOUND:CHEMBL112"]},"n01":{"categories":["biolink:Protein"]}}}},"page_number":1,"page_size":50}' | head

The piping into head means that after the first few lines curl just hangs up on the server. I read some things on the Internet (haha) that suggested that the yield() was dying. I tried putting try-except around the yield()s, but that didn't help. One further possible clue is that it often dies at the same place, near
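For anyone who prefers Python to curl, here is a rough equivalent of that repro. The URL and payload are the same ones used in the curl command above; breaking out of the loop after a few lines is what simulates head hanging up on the server.

```python
# Python equivalent of the curl | head repro: stream the /query response,
# read a few lines, then hang up on the server.
import requests

url = "https://arax.ncats.io/test/api/arax/v1.2/query"
payload = {
    "stream_progress": True, "bypass_cache": True, "max_results": 100,
    "submitter": "ISB_watchdog",
    "message": {"query_graph": {
        "edges": {"e00": {"object": "n01",
                          "predicates": ["biolink:physically_interacts_with"],
                          "subject": "n00"}},
        "nodes": {"n00": {"categories": ["biolink:ChemicalEntity"],
                          "ids": ["CHEMBL.COMPOUND:CHEMBL112"]},
                  "n01": {"categories": ["biolink:Protein"]}}}},
    "page_number": 1, "page_size": 50,
}

with requests.post(url, json=payload, stream=True) as resp:
    for i, line in enumerate(resp.iter_lines()):
        print(line[:120])
        if i >= 5:
            break  # hang up early, like piping curl into head
```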
It's possible that it's somehow dying around asyncio getting info from KPs. It might be fruitful to try stopping the use of asyncio and just doing these calls serially, single-threaded, to see if that fixes the problem. Not that we want to leave it that way, but it would be a good test. Anyone willing to work on this problem?
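To make the proposed test concrete, here is an illustrative sketch; kp_request() and the KP names are hypothetical stand-ins, not the actual Expand code. The idea is simply to swap the concurrent gather for a plain serial loop and see whether the crashes stop.

```python
import asyncio

async def kp_request(kp):
    # Hypothetical stand-in for a real KP call.
    await asyncio.sleep(0.1)
    return f"result from {kp}"

# Concurrent pattern (roughly the kind of thing under suspicion):
async def query_kps_concurrently(kps):
    return await asyncio.gather(*(kp_request(kp) for kp in kps))

# Serial test version: one KP at a time, no gather, no overlapping awaits.
async def query_kps_serially(kps):
    return [await kp_request(kp) for kp in kps]

print(asyncio.run(query_kps_serially(["KP1", "KP2", "KP3"])))
```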
Interesting! Nice work figuring that out... I don't think it's related to use of
which doesn't trigger any Expand code, I think...
Hmm, /test was not dead. And I just did:
a bunch of times in a row and it survived.
Ah, whoops, I should've checked more carefully. I ran a query through the UI on /test, then ran that curl command, then reran the query in the UI, but that time I got an "end of json input" error, which I assumed meant the service was broken. But I guess it wasn't. Who knows what I did wrong! Interesting. For whoever works this issue, in case it's helpful, if you want to use RTX/code/ARAX/ARAXQuery/ARAX_expander.py Line 286 in 19fe4ac
So the bad news is that the upgrade of packages and Python 3.7.12 did not fix the problem. But perhaps the good news is that I just switched us back to non-forking mode in test and it still dies. So it would seem that forking is not the source of our problem. |
Let's check out
OK, so it is looking like |
I have managed to reproduce this bug on
and the script
and the test-running command:
Creating a branch
OK, I have switched the
OK, neither tracing using the
Confirming that this bug happens even when Flask is run with threading turned off, and even when fork-mode is turned off in
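For anyone following along, this is roughly what "threading turned off" means here; a minimal standalone Flask illustration, not the actual RTX startup code.

```python
# Minimal standalone illustration of running Flask with threading disabled,
# so requests are handled strictly one at a time.
from flask import Flask

app = Flask(__name__)

@app.route("/meta_knowledge_graph")
def meta_knowledge_graph():
    return {"nodes": {}, "edges": []}

if __name__ == "__main__":
    # threaded=False and processes=1 force serial request handling.
    app.run(host="0.0.0.0", port=5000, threaded=False, processes=1)
```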
OK, I believe that commit 14575ef fixes the issue. |
I was alerted by the watchdog that the KG2 service was dead at 12m. Restarted. The STDERR log was saved here: /tmp/RTX_OpenAPI_kg2.elog-foundDead1
No obvious sign of trauma. In fact, the last recorded events were quite sedate. Just a lot of checking of /meta_knowledge_graph by some service apparently with a very short memory. Here's what Flask showed for requests:
Here's what nginx showed:
This is a bit puzzling.