
Investigating mysterious KG2 dying issue #1708

Closed
edeutsch opened this issue Oct 21, 2021 · 26 comments

Comments

@edeutsch
Collaborator

I was alerted by the watchdog that the KG2 service was dead at 12m. Restarted. The STDERR log was saved here: /tmp/RTX_OpenAPI_kg2.elog-foundDead1

No obvious sign of trauma. In fact, the last recorded events were quite sedate: just a lot of checking of /meta_knowledge_graph by some service apparently with a very short memory. Here's what Flask showed for requests:

127.0.0.1 - - [21/Oct/2021 06:10:07] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:10:07] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:10:07] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:10:08] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:11:28] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:19:02] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:19:29] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:20:05] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:20:08] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:20:08] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:20:08] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:30:06] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -
127.0.0.1 - - [21/Oct/2021 06:30:07] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 -

Here's what nginx showed:

35.161.155.225 - - [21/Oct/2021:06:30:06 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 18045432 "-" "axios/0.21.4"
54.149.211.163 - - [21/Oct/2021:06:30:06 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 499 0 "-" "axios/0.21.4"
54.189.90.50 - - [21/Oct/2021:06:30:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "axios/0.21.4"
52.4.10.150 - - [21/Oct/2021:06:30:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "axios/0.21.4"
209.97.185.16 - - [21/Oct/2021:06:34:25 +0000] "GET / HTTP/1.1" 302 170 "http://35.81.149.105:80/left.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"
54.189.90.50 - - [21/Oct/2021:06:40:01 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
35.161.155.225 - - [21/Oct/2021:06:40:02 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
54.149.211.163 - - [21/Oct/2021:06:40:03 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
52.4.10.150 - - [21/Oct/2021:06:40:04 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"

This is a bit puzzling.

@edeutsch
Collaborator Author

Interestingly, after posting the last message, I noticed that KG2 was dead again! I restarted it. nginx log again:

54.186.177.111 - - [21/Oct/2021:07:09:22 +0000] "POST /api/rtxkg2/v1.2/asyncquery HTTP/1.1" 400 115 "-" "SmartAPI API status monitor"
54.186.177.111 - - [21/Oct/2021:07:09:22 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 400 115 "-" "SmartAPI API status monitor"
54.186.177.111 - - [21/Oct/2021:07:09:22 +0000] "GET /api/rtxkg2/v1.2/status HTTP/1.1" 200 17 "-" "SmartAPI API status monitor"
54.189.90.50 - - [21/Oct/2021:07:10:06 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 18045432 "-" "axios/0.21.4"
54.149.211.163 - - [21/Oct/2021:07:10:06 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [21/Oct/2021:07:10:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 499 0 "-" "axios/0.21.4"
52.4.10.150 - - [21/Oct/2021:07:10:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "axios/0.21.4"
152.54.3.118 - - [21/Oct/2021:07:12:40 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:07:12:49 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:07:12:50 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:07:12:57 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
54.189.90.50 - - [21/Oct/2021:07:20:01 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
54.149.211.163 - - [21/Oct/2021:07:20:01 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
35.161.155.225 - - [21/Oct/2021:07:20:02 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"
52.4.10.150 - - [21/Oct/2021:07:20:04 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "axios/0.21.4"

It's unclear where the issue lies. The 499s seem to be an immediate cause or symptom of the failure.

After googling around, I tried adding this to the nginx config:

proxy_ignore_client_abort on;

We'll see if that has any effect.

@saramsey
Member

saramsey commented Oct 21, 2021

So, on arax.ncats.io, inside the rtx1 container, the script /mnt/data/orangeboard/kg2/RTX/code/UI/OpenAPI/python-flask-server/KG2/RTX_OpenAPI_kg2.start has, on line 15,

exec python3 -m openapi_server 1>$LOGFILE 2>$ELOGFILE

I think we should change that to add the -u option, for debugging purposes, so that STDERR is always flushed immediately. I suspect (theorize?) we are missing STDERR print statements that are sitting in the stream buffer and never get flushed because python3 terminates so precipitously.
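For concreteness, with -u added (and keeping the same redirection), that line would become something like:

exec python3 -u -m openapi_server 1>$LOGFILE 2>$ELOGFILE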

@saramsey
Member

Reverse DNS lookup of 54.149.211.163 reveals the probably-not-helpful hostname ec2-54-149-211-163.us-west-2.compute.amazonaws.com (but it is in us-west-2, so maybe coming from one of our tools?).

@saramsey
Member

Similarly, reverse DNS lookup of 35.161.155.225 reveals the hostname ec2-35-161-155-225.us-west-2.compute.amazonaws.com

@edeutsch
Collaborator Author

ARAX also mysteriously died at around 8:50am. The log is at:
/tmp/RTX_OpenAPI_production.elog-foundDead3

Restarted.

@edeutsch
Collaborator Author

I think we should change that to add the -u option, for debugging purposes, so that STDERR is always flushed immediately. I suspect (theorize?) we are missing STDERR print statements that are sitting in the stream buffer and never get flushed because python3 terminates so precipitously.

Done. Running this way now.

@edeutsch
Collaborator Author

Regarding the specific IP addresses, a Slack message from Chunlei said:
My machine 66.27.73.63
api.bte.ncats.io (54.189.90.50)
dev.api.bte.ncats.io (54.149.211.163)
35.161.155.225 (this is another server where we run asyncquery right now)

@saramsey
Member

OK, perhaps we should get the specific JSON query that Chunlei's team is running, so we can test it ourselves?

@saramsey
Member

The intent of the "fork" architecture was to insulate the parent process from the child process dying, but perhaps that insulation is not working quite right and is taking the parent process down with it. This can be tested; I can set up a child process to do all manner of bad things (including sending itself a SIGKILL).
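For reference, a minimal standalone sketch of that kind of test (plain os.fork() outside of Flask, not the actual ARAX code) could look like this:

import os
import signal

# Fork a child that immediately SIGKILLs itself, then check that the
# parent survives and can report how the child died.
pid = os.fork()
if pid == 0:
    # Child: do something "bad" -- send ourselves SIGKILL.
    os.kill(os.getpid(), signal.SIGKILL)
else:
    # Parent: reap the child and keep going.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        print(f"child died from signal {os.WTERMSIG(status)}; parent is still alive")
    print("parent exiting normally")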

@edeutsch
Collaborator Author

It seems to me that the last couple of deaths have occurred during a time when there were no active queries, just some obstreperous /meta_knowledge_graph calls. It might be that this is the result of some instability caused by forking. Or it might be completely unrelated.

@edeutsch
Collaborator Author

KG2 died again:

Thu Oct 21 20:30:03 UTC 2021
 * /mnt/data/orangeboard/kg2/RTX/code/UI/OpenAPI/python-flask-server/KG2/RTX_OpenAPI_kg2.start is running
Thu Oct 21 20:30:13 UTC 2021
 * /mnt/data/orangeboard/kg2/RTX/code/UI/OpenAPI/python-flask-server/KG2/RTX_OpenAPI_kg2.start is not running
Thu Oct 21 20:30:23 UTC 2021

nginx log (kg2 calls only):

152.54.3.118 - - [21/Oct/2021:20:27:29 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 18045432 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:20:27:42 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 22496 "-" "python-httpx/0.18.2"
99.150.229.254 - - [21/Oct/2021:20:27:48 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 18045432 "-" "python-httpx/0.18.2"
99.150.229.254 - - [21/Oct/2021:20:28:02 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13914 "-" "python-httpx/0.18.2"
54.149.211.163 - - [21/Oct/2021:20:30:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 0 "-" "axios/0.21.4"
35.161.155.225 - - [21/Oct/2021:20:30:08 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 0 "-" "axios/0.21.4"
99.150.229.254 - - [21/Oct/2021:20:30:08 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "python-httpx/0.18.2"
52.4.10.150 - - [21/Oct/2021:20:30:08 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "axios/0.21.4"
99.150.229.254 - - [21/Oct/2021:20:30:14 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:20:33:19 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"
152.54.3.118 - - [21/Oct/2021:20:33:26 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 503 378 "-" "python-httpx/0.18.2"

@edeutsch
Collaborator Author

So it would appear that the KG2 API was running until these requests came in:

54.149.211.163 - - [21/Oct/2021:20:30:07 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 0 "-" "axios/0.21.4"
35.161.155.225 - - [21/Oct/2021:20:30:08 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 200 0 "-" "axios/0.21.4"
99.150.229.254 - - [21/Oct/2021:20:30:08 +0000] "GET /api/rtxkg2/v1.2/meta_knowledge_graph HTTP/1.1" 502 530 "-" "python-httpx/0.18.2"

at which point it died.

I wonder if high concurrency to the /meta_knowledge_graph endpoint is a killer?
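One rough way to test that hypothesis (a hypothetical harness, not part of the codebase) would be to fire several simultaneous GETs at the endpoint and see whether the service survives:

import concurrent.futures

import requests

# Hit /meta_knowledge_graph from several threads at once, roughly mimicking
# the burst of requests seen in the nginx log.
URL = "https://arax.ncats.io/api/rtxkg2/v1.2/meta_knowledge_graph"

def fetch(_):
    return requests.get(URL, timeout=120).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(fetch, range(4))))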

@edeutsch
Collaborator Author

OK, perhaps we should get the specific JSON query that Chunlei's team is running, so we can test it ourselves?

All incoming queries are logged here:
https://arax.ncats.io/?recent=1

unless the failure occurs so early that the query never reaches ARAXQuery.query().

@edeutsch
Collaborator Author

This time ARAX died instead of KG2. The log is here:
/tmp/RTX_OpenAPI_production.elog-foundDead4
Jared was using the system at the time, as described in Slack #deployment, but that is probably not directly related.

The very last line of the log file was:

2021-10-22T02:55:12.644803 DEBUG: [] BTE: Converted n3's 491 curies to a list of 586 curies tailored for BTE

so it was apparently working on something at the time.

It is an interesting coincidence that the nginx log shows this:

174.61.152.86 - - [22/Oct/2021:02:55:10 +0000] "GET /rtx.version HTTP/1.1" 304 0 "https://arax.ncats.io/index.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15"
174.61.152.86 - - [22/Oct/2021:02:55:12 +0000] "GET /api/arax/v1.2/meta_knowledge_graph HTTP/1.1" 502 526 "https://arax.ncats.io/index.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15"
71.217.37.211 - - [22/Oct/2021:02:55:14 +0000] "GET /api/arax/v1.2/response/3151a980-4990-4a57-8592-7e4e7f8aa728 HTTP/1.1" 503 378 "https://arax.ncats.io/?r=3151a980-4990-4a57-8592-7e4e7f8aa728" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0"
174.61.152.86 - - [22/Oct/2021:02:55:35 +0000] "GET /?r=30506 HTTP/1.1" 200 6599 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15"

The only other request in the same second is a hit to /meta_knowledge_graph that failed. Hmm.

edeutsch added a commit that referenced this issue Oct 22, 2021
edeutsch added a commit that referenced this issue Oct 22, 2021
@edeutsch
Collaborator Author

So the good news is that I've figured out how to repro the problem! The bad news is that I don't know how to fix it. But I'm hoping one of the experts out there can help!

The problem is indeed the client hanging up prematurely.

I can repro the problem by writing a little script that will hit a /query endpoint in streaming mode but then hang up. The easy way is with something like this:

curl -X POST "https://arax.ncats.io/test/api/arax/v1.2/query" -H "accept: application/json" -H "Content-Type: application/json" -d '{"stream_progress":true,"bypass_cache":true,"max_results":100,"submitter":"ISB_watchdog","message":{"query_graph":{"edges":{"e00":{"object":"n01","predicates":["biolink:physically_interacts_with"],"subject":"n00"}},"nodes":{"n00":{"categories":["biolink:ChemicalEntity"],"ids":["CHEMBL.COMPOUND:CHEMBL112"]},"n01":{"categories":["biolink:Protein"]}}}},"page_number":1,"page_size":50}' | head

Piping into head means that after the first few lines, curl just hangs up on the server.
This reliably kills the Flask process.

I read some things on the Internet (haha) that suggested that the yield() was dying. I tried putting try-except around the yield()s but that didn't help.
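The kind of wrapper I mean is roughly this (a sketch with hypothetical names, not the actual ARAX code); when the client hangs up mid-stream, the WSGI server closes the generator, which raises GeneratorExit at the active yield:

from flask import Response

def stream_response(envelope):
    # Hypothetical streaming /query handler.
    def generate():
        try:
            for chunk in run_query_with_progress(envelope):  # hypothetical helper
                yield chunk
        except GeneratorExit:
            # Raised at the active yield when the client disconnects and the
            # WSGI server closes the generator.
            print("client disconnected mid-stream", flush=True)
            raise
    return Response(generate(), mimetype="application/json")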

One further possible clue is that it often dies at the same place, near

2021-10-23T03:14:30.319695 INFO: [] The KPs Expand decided to answer e00 with are: {'BTE', 'RTX-KG2', 'MolePro'}
2021-10-23T03:14:30.319848 DEBUG: [] Loading record of KP timeouts
2021-10-23T03:14:30.320038 DEBUG: [] Loading meta map (already exists and isn't due for a refresh)
2021-10-23T03:14:30.982789 DEBUG: [] Will use asyncio to run KP queries concurrently
2021-10-23T03:14:30.983290 INFO: [] Expanding qedge e00 using BTE
2021-10-23T03:14:31.055709 DEBUG: [] Loading record of KP timeouts
2021-10-23T03:14:31.055957 DEBUG: [] Loading meta map (already exists and isn't due for a refresh)
2021-10-23T03:14:32.376700 DEBUG: [] BTE: All n00 curies use prefix(es) BTE supports; no conversion necessary
2021-10-23T03:14:32.376943 DEBUG: [] BTE: Sending query to BTE API
2021-10-23T03:14:32.377872 INFO: [] Expanding qedge e00 using MolePro
2021-10-23T03:14:33.042828 DEBUG: [] MolePro: One or more n00 curies use a prefix MolePro doesn't support; will convert these
2021-10-23T03:14:33.042980 DEBUG: [] MolePro: Converting curies in the QG to kinds that MolePro can answer

It's possible that it's somehow dying around asyncio fetching info from the KPs. It might be fruitful to try stopping the use of asyncio and just doing these calls serially, single-threaded, to see if that fixes the problem. Not that we want to leave it that way, but it would be a good test.
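A generic sketch of what that serial fallback could look like (hypothetical names, not the actual Expand code):

import asyncio

USE_ASYNCIO = True  # flip to False to query the KPs one at a time

async def _expand_concurrently(kps, qedge):
    # Run one coroutine per KP and gather the answers (hypothetical helpers).
    return await asyncio.gather(*(query_kp_async(kp, qedge) for kp in kps))

def expand_edge(kps, qedge):
    if USE_ASYNCIO:
        return asyncio.run(_expand_concurrently(kps, qedge))
    # Serial, single-threaded fallback for debugging.
    return [query_kp(kp, qedge) for kp in kps]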

Anyone willing to work on this problem?

@amykglen
Member

amykglen commented Oct 23, 2021

Interesting! Nice work figuring that out... I don't think it's related to the use of asyncio, though, because I think I just killed the /test ARAX service (sorry!) with this variation:

curl -X GET "https://arax.ncats.io/test/api/arax/v1.2/meta_knowledge_graph" -H "accept: application/json" | head

which doesn't trigger any Expand code, I think.

@edeutsch
Collaborator Author

Hmm, /test was not dead, and I just did:

curl -X GET "https://arax.ncats.io/test/api/arax/v1.2/meta_knowledge_graph" -H "accept: application/json" | head

a bunch of times in a row and it survived.
Try again?
What made you think it was dead?

@amykglen
Member

Ah, whoops, should've checked more carefully. I ran a query through the UI on /test, then ran that curl command, then reran the query in the UI, but that time I got an "end of json input" error, which I assumed meant the service was broken. But I guess it wasn't. Who knows what I did wrong!

Interesting. For whoever works this issue, in case it's helpful: if you want to use multiprocessing instead of asyncio in Expand, you can just flip this variable to False:

use_asyncio = True # Flip this to False if you want to use multiprocessing instead

@edeutsch
Collaborator Author

edeutsch commented Nov 4, 2021

So the bad news is that the package upgrades and the move to Python 3.7.12 did not fix the problem.

But perhaps the good news is that I just switched us back to non-forking mode on /test and it still dies. So it would seem that forking is not the source of our problem.


@saramsey
Member

saramsey commented Nov 7, 2021

OK, so it is looking like socketio will not enable us to immediately detect a client disconnect.

@saramsey
Member

saramsey commented Nov 7, 2021

I have managed to reproduce this bug on arax-backup.rtx.ai (in the production DEVAREA), using the JSON query file query.json:

{ "stream_progress": true,
  "message": { "query_graph":
{
   "edges": {
      "e00": {
         "subject":   "n00",
         "object":    "n01",
         "predicates": ["biolink:physically_interacts_with"]
      }
   },
   "nodes": {
      "n00": {
         "ids":        ["CHEMBL.COMPOUND:CHEMBL112"]
      },
      "n01": {
         "categories":  ["biolink:Protein"]
      }
   }
}}}

and the script run-test.sh:

#!/bin/bash

curl -s -X POST https://arax-backup.rtx.ai/api/arax/v1.2/query \
     -H  "accept: application/json" \
     -H  "Content-Type: application/json" \
     -d @query.json

and the test-running command:

bash -x ./run-test.sh | head

@saramsey
Member

saramsey commented Nov 7, 2021

Creating a branch issue-1708 to work this issue.

@saramsey
Member

saramsey commented Nov 7, 2021

OK, I have switched the production DEVAREA on arax-backup.rtx.ai to branch issue-1708.

@saramsey
Member

saramsey commented Nov 7, 2021

OK, neither tracing using the trace module nor setting an exit handler with the atexit module in Python can catch control before CPython exits. My feeling is that some kind of thread corruption is causing the instant death of CPython.

Confirming that this bug happens even when Flask is run with threading turned off and even when fork-mode is turned off in query_controller.py.
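For reference, the exit-handler attempt amounted to something like this (a sketch; in this failure mode the handler never runs, because the interpreter dies before normal shutdown):

import atexit
import sys

def _report_shutdown():
    # Runs during normal interpreter shutdown; in the crash being
    # investigated here it never gets the chance to fire.
    print("interpreter shutting down normally", file=sys.stderr, flush=True)

atexit.register(_report_shutdown)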

saramsey pushed a commit that referenced this issue Nov 8, 2021
@saramsey
Member

saramsey commented Nov 8, 2021

OK, I believe that commit 14575ef fixes the issue.

saramsey added a commit that referenced this issue Nov 8, 2021
saramsey pushed a commit that referenced this issue Nov 8, 2021
saramsey pushed a commit that referenced this issue Nov 11, 2021