RTX-KG2 endpoint queries load and slowness and BTE #1680

Closed
edeutsch opened this issue Sep 29, 2021 · 62 comments
@edeutsch
Collaborator

As I posted on Slack, Chunlei requested that we unblock his BTE servers, and I have done so. Shortly thereafter we received a (light) barrage of queries from them, from around
23158 2021-09-29 03:46:57
to
23236 2021-09-29 03:50:39
(UTC) in the table at:
https://arax.ncats.io/?recent=1

Many of the queries complete in less than 10 seconds, and we can sustain this load.
But toward the end of the barrage there was a set of queries that took 200+ seconds.
It would be super useful if someone could try to figure out what's going on here.
Are these legitimately hard queries? Is there some strange bottleneck causing a problem where there shouldn't be one?
Amy previously observed a possible NodeSynonymizer problem where it suddenly took dozens of seconds to respond to something that should have been a subsecond query... Is that the source?
We need more diagnostics to understand what's going on.
I suspect we can sustain this load with current hardware if we can figure out what's going wrong here.
I need help.
thanks!

@saramsey
Member

Following up via Slack DM

@isbluis
Member

isbluis commented Sep 29, 2021

Issue starting up again right now:

[screenshot]

Also, should issue #1678 be closed in favor of this?

@saramsey
Member

Here is a block of several long-running KG2 API queries, each returning few results (a large result count is often what makes a query long-running, but not in this case):
[screenshot: Screen Shot 2021-09-28 at 9.38.03 PM]

For convenience, I will link the first few here: query 23217, query 23128, query 23219

@edeutsch
Collaborator Author

I have re-blocked the offending BTE IP and restarted kg2 and arax services, hopefully clearing everything.

@saramsey
Member

The query_graph for query 23217 seems innocent enough at first glance:

{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "object": "n1",
          "predicates": [
            "biolink:physically_interacts_with"
          ],
          "subject": "n0"
        }
      },
      "nodes": {
        "n0": {
          "categories": [
            "biolink:Gene"
          ],
          "ids": [
            "NCBIGene:23221",
            "NCBIGene:10765",
            "NCBIGene:22836",
            "NCBIGene:8065",
            "NCBIGene:8452",
            "NCBIGene:23145",
            "NCBIGene:284217",
            "NCBIGene:114784",
            "NCBIGene:3645",
            "NCBIGene:1462",
            "NCBIGene:29919",
            "NCBIGene:672",
            "NCBIGene:65267",
            "NCBIGene:26960",
            "NCBIGene:23379",
            "HGNC:39004",
            "UMLS:C4085573",
            "NCBIGene:1869",
            "NCBIGene:5241",
            "HGNC:9335",
            "NCBIGene:5555",
            "NCBIGene:7332",
            "UMLS:C0079427"
          ]
        },
        "n1": {
          "categories": [
            "biolink:SmallMolecule"
          ]
        }
      }
    }
  }
}

@isbluis
Member

isbluis commented Sep 29, 2021

Hi @saramsey. The query above yields 345 results from KG2, but it was submitted dozens of times in a row in the past hour, as you can see below (yes, the identical query every few seconds; why??). Also, notice the short completion times:

[screenshot]

Would it be useful to pull the submitted queries and look for the most common/duplicate ones, @edeutsch?

@edeutsch
Collaborator Author

@isbluis I am guessing that will not help very much.
If I am reading the above thread correctly, this query normally completes in 1-3 seconds. However, at 2021-09-29 03:49:39 UTC, that same query took 266 seconds. We need to know why. And fix it.

@edeutsch
Collaborator Author

What might be useful to know is whether that was the first time that query appeared. It probably was. If so, that would lend credence to the idea that this is a caching problem: once the query has run once, subsequent instances are fine.

@saramsey
Member

saramsey commented Sep 29, 2021

Hi @edeutsch: in the host OS on arax.ncats.io, I checked the kernel ring buffer using

dmesg -k -T > /tmp/dmesg-dump.txt

(the -T option converts the kernel's seconds-since-boot timestamps to human-readable clock timestamps). In that kernel ring buffer dump, I am seeing python3 processes killed by the kernel's out-of-memory (OOM) killer, such as this one at 07:12:20 UTC, which would have been 00:12 PDT this morning (I think):

[Wed Sep 29 07:12:20 2021] python3 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[Wed Sep 29 07:12:20 2021] CPU: 5 PID: 1975 Comm: python3 Not tainted 5.4.0-1056-aws #59~18.04.1-Ubuntu
[Wed Sep 29 07:12:20 2021] Hardware name: Amazon EC2 m5a.4xlarge/, BIOS 1.0 10/16/2017
[Wed Sep 29 07:12:20 2021] Call Trace:
[Wed Sep 29 07:12:20 2021]  dump_stack+0x6d/0x8b
[Wed Sep 29 07:12:20 2021]  dump_header+0x4f/0x200
[Wed Sep 29 07:12:20 2021]  oom_kill_process+0xe6/0x120
[Wed Sep 29 07:12:20 2021]  out_of_memory+0x109/0x510
[Wed Sep 29 07:12:20 2021]  mem_cgroup_out_of_memory+0xbb/0xd0
[Wed Sep 29 07:12:20 2021]  try_charge+0x79a/0x7d0
[Wed Sep 29 07:12:20 2021]  ? __alloc_pages_nodemask+0x153/0x320
[Wed Sep 29 07:12:20 2021]  mem_cgroup_try_charge+0x75/0x190
[Wed Sep 29 07:12:20 2021]  mem_cgroup_try_charge_delay+0x22/0x50
[Wed Sep 29 07:12:20 2021]  __handle_mm_fault+0x8a5/0x1240
[Wed Sep 29 07:12:20 2021]  handle_mm_fault+0xcb/0x210
[Wed Sep 29 07:12:20 2021]  __do_page_fault+0x2a1/0x4d0
[Wed Sep 29 07:12:20 2021]  do_page_fault+0x2c/0xe0
[Wed Sep 29 07:12:20 2021]  do_async_page_fault+0x54/0x70
[Wed Sep 29 07:12:20 2021]  async_page_fault+0x34/0x40
[Wed Sep 29 07:12:20 2021] RIP: 0033:0x47ccef
[Wed Sep 29 07:12:20 2021] Code: 8b 17 48 89 50 08 e9 ec fe ff ff 41 8b 45 10 8d 50 ff 49 8b 45 08 4c 89 e9 48 2b 0d 1b fc 47 00 48 be ab aa aa aa aa aa aa aa <c7> 40 24 ff ff 00 00 48 c1 f9 04 48 0f af ce 85 d2 89 48 20 48 8d
[Wed Sep 29 07:12:20 2021] RSP: 002b:00007f9f1d5a6000 EFLAGS: 00010202
[Wed Sep 29 07:12:20 2021] RAX: 00007f9d476a7000 RBX: 000000000000001b RCX: 0000000000125970
[Wed Sep 29 07:12:20 2021] RDX: 0000000000000018 RSI: aaaaaaaaaaaaaaab RDI: 00007f9f1d5a6030
[Wed Sep 29 07:12:20 2021] RBP: 0000000000000036 R08: 0000000000000000 R09: 00007f9eb3bea228
[Wed Sep 29 07:12:20 2021] R10: 0000000000000000 R11: 00007f9e8c6a93f0 R12: 00007f9f1d5a6030
[Wed Sep 29 07:12:20 2021] R13: 0000000003d438e0 R14: 00000000008a3620 R15: 00007f9ece3cfa58
[Wed Sep 29 07:12:20 2021] memory: usage 57671680kB, limit 57671680kB, failcnt 23866557
[Wed Sep 29 07:12:20 2021] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Wed Sep 29 07:12:20 2021] kmem: usage 255440kB, limit 9007199254740988kB, failcnt 0
[Wed Sep 29 07:12:20 2021] Memory cgroup stats for /docker/d1fd345478a0243463d02b95ca125557641c39539709e39d29ab0f35fea6d895:
[Wed Sep 29 07:12:20 2021] anon 58793193472
                           file 1302528
                           kernel_stack 3723264
                           slab 125009920
                           sock 0
                           shmem 0
                           file_mapped 811008
                           file_dirty 675840
                           file_writeback 0
                           anon_thp 0
                           inactive_anon 0
                           active_anon 58793496576
                           inactive_file 471040
                           active_file 581632
                           unevictable 0
                           slab_reclaimable 106176512
                           slab_unreclaimable 18833408
                           pgfault 594070389
                           pgmajfault 8976
                           workingset_refault 24113529
                           workingset_activate 559350
                           workingset_nodereclaim 0
                           pgrefill 26206520
                           pgscan 32787662
                           pgsteal 32620105
                           pgactivate 6963
                           pgdeactivate 26203655
                           pglazyfree 0
                           pglazyfreed 0
                           thp_fault_alloc 0
                           thp_collapse_alloc 33
[Wed Sep 29 07:12:20 2021] Tasks state (memory values in pages):
[Wed Sep 29 07:12:20 2021] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Sep 29 07:12:20 2021] [  21338]     0 21338     4605      874    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  21755]     0 21755    20003     1014   208896        0             0 apache2
[Wed Sep 29 07:12:20 2021] [  21759]    33 21759   318232     2441   360448        0             0 apache2
[Wed Sep 29 07:12:20 2021] [  21760]    33 21760   318250     2523   360448        0             0 apache2
[Wed Sep 29 07:12:20 2021] [  21842]   107 21842     1127      416    57344        0             0 mysqld_safe
[Wed Sep 29 07:12:20 2021] [  22264]   107 22264  1538927   128706  1601536        0             0 mysqld
[Wed Sep 29 07:12:20 2021] [  22377]  1025 22377    28734     7835   258048        0             0 python3
[Wed Sep 29 07:12:20 2021] [  22438]  1025 22438   260662    25121   905216        0             0 python3
[Wed Sep 29 07:12:20 2021] [  22484]  1025 22484   290455    30973  1134592        0             0 python3
[Wed Sep 29 07:12:20 2021] [  29343]     0 29343     4605      858    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [  29358]     0 29358    12336      661   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [  29359]  1025 29359     4608      874    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [    824]     0   824     4606      874    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [    839]     0   839    12336      657   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [    840]  1025   840     4608      914    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [    984]     0   984     4606      878    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [   1000]     0  1000    12336      671   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [   1001]  1025  1001     4608      913    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [   1141]     0  1141     4606      860    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [   1158]     0  1158    12336      657   143360        0             0 su
[Wed Sep 29 07:12:20 2021] [   1161]  1025  1161     4608      913    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [   1340]     0  1340     4606      869    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [   2694]     0  2694     4606      865    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [   2710]     0  2710    12336      632   143360        0             0 su
[Wed Sep 29 07:12:20 2021] [   2711]  1025  2711     4608      875    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [   2745]  1025  2745    97812     3731   573440        0             0 emacs
[Wed Sep 29 07:12:20 2021] [   3885]     0  3885     4605      880    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [   3905]     0  3905    12336      658   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [   3906]  1025  3906     4608      914    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [   5391]     0  5391     4606      859    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [   5406]     0  5406    12336      626   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [   5407]  1025  5407     4608      912    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [  12803]     0 12803     4606      873    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  12913]     0 12913     1103      212    57344        0             0 tail
[Wed Sep 29 07:12:20 2021] [  26119]  1025 26119     9689      776   114688        0             0 top
[Wed Sep 29 07:12:20 2021] [  28239]     0 28239     4605      856    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [  28255]     0 28255    12336      672   143360        0             0 su
[Wed Sep 29 07:12:20 2021] [  28256]  1025 28256     4608      875    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [   9443]  1025  9443   293583    32176  1171456        0             0 python3
[Wed Sep 29 07:12:20 2021] [   9500]  1025  9500   293581    31918  1175552        0             0 python3
[Wed Sep 29 07:12:20 2021] [   9558]  1025  9558   312336    32420  1179648        0             0 python3
[Wed Sep 29 07:12:20 2021] [  10135]     0 10135     4606      858    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  10151]     0 10151    12336      653   139264        0             0 su
[Wed Sep 29 07:12:20 2021] [  10152]  1025 10152     4608      886    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  25116]  1025 25116   354897    44198  1642496        0             0 python3
[Wed Sep 29 07:12:20 2021] [  27573]     0 27573     4604      865    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [  27589]     0 27589    12336      674   143360        0             0 su
[Wed Sep 29 07:12:20 2021] [  27590]  1025 27590     4608      895    81920        0             0 bash
[Wed Sep 29 07:12:20 2021] [  30568]  1025 30568  2097938  1715372 14794752        0             0 python3
[Wed Sep 29 07:12:20 2021] [  30613]  1025 30613   429368   137088  4186112        0             0 python3
[Wed Sep 29 07:12:20 2021] [  30638]     0 30638     1103      180    57344        0             0 tail
[Wed Sep 29 07:12:20 2021] [  30858]     0 30858     4606      873    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  31002]     0 31002     1103      211    53248        0             0 tail
[Wed Sep 29 07:12:20 2021] [  31003]     0 31003     1686      183    53248        0             0 less
[Wed Sep 29 07:12:20 2021] [  31398]     0 31398     4606      874    86016        0             0 bash
[Wed Sep 29 07:12:20 2021] [  31413]     0 31413    12336      664   147456        0             0 su
[Wed Sep 29 07:12:20 2021] [  31414]  1025 31414     4608      875    77824        0             0 bash
[Wed Sep 29 07:12:20 2021] [  31684]  1025 31684 12591380 12246047 99069952        0             0 python3
[Wed Sep 29 07:12:20 2021] [   1823]     0  1823     4606      894    90112        0             0 bash
[Wed Sep 29 07:12:20 2021] [   1846]     0  1846    12336      640   147456        0             0 su
[Wed Sep 29 07:12:20 2021] [   1847]  1025  1847     4608      876    77824        0             0 bash

@saramsey
Member

saramsey commented Sep 29, 2021

Assuming that the kernel ring buffer timestamps are in UTC, I am seeing that the kernel OOM killer terminated processes at the following timestamps:

sramsey@ip-172-31-53-16:~$ sudo dmesg -T | grep "Memory cgroup out"
[Wed Sep 22 18:33:42 2021] Memory cgroup out of memory: Killed process 26901 (python3) total-vm:48387708kB, anon-rss:47049328kB, file-rss:0kB, shmem-rss:0kB, UID:1025 pgtables:93188kB oom_score_adj:0
[Sat Sep 25 23:48:08 2021] Memory cgroup out of memory: Killed process 5925 (python3) total-vm:46499008kB, anon-rss:44982540kB, file-rss:39716kB, shmem-rss:0kB, UID:1025 pgtables:89376kB oom_score_adj:0
[Sun Sep 26 08:47:33 2021] Memory cgroup out of memory: Killed process 17645 (python3) total-vm:54812236kB, anon-rss:53387872kB, file-rss:15468kB, shmem-rss:0kB, UID:1025 pgtables:105616kB oom_score_adj:0
[Mon Sep 27 00:14:53 2021] Memory cgroup out of memory: Killed process 12906 (python3) total-vm:46982208kB, anon-rss:45306688kB, file-rss:40328kB, shmem-rss:0kB, UID:1025 pgtables:90128kB oom_score_adj:0
[Tue Sep 28 00:06:01 2021] Memory cgroup out of memory: Killed process 32316 (python3) total-vm:54573572kB, anon-rss:52936380kB, file-rss:38936kB, shmem-rss:0kB, UID:1025 pgtables:104828kB oom_score_adj:0
[Tue Sep 28 08:34:21 2021] Memory cgroup out of memory: Killed process 18104 (python3) total-vm:57197696kB, anon-rss:55634568kB, file-rss:15700kB, shmem-rss:0kB, UID:1025 pgtables:110052kB oom_score_adj:0
[Wed Sep 29 01:10:23 2021] Memory cgroup out of memory: Killed process 9673 (python3) total-vm:49202556kB, anon-rss:47877236kB, file-rss:38504kB, shmem-rss:0kB, UID:1025 pgtables:94900kB oom_score_adj:0
[Wed Sep 29 07:12:20 2021] Memory cgroup out of memory: Killed process 31684 (python3) total-vm:50365520kB, anon-rss:48968692kB, file-rss:15496kB, shmem-rss:0kB, UID:1025 pgtables:96748kB oom_score_adj:0
[Wed Sep 29 07:48:29 2021] Memory cgroup out of memory: Killed process 30568 (python3) total-vm:57268280kB, anon-rss:55696068kB, file-rss:39736kB, shmem-rss:0kB, UID:1025 pgtables:110104kB oom_score_adj:0

@edeutsch do those timestamps correlate at all with system slowdowns? I thought we had a major slowdown event at 2021.09.29 03:47, and oddly enough, I don't see the OOM killer having kicked in at that time. So maybe these things are uncorrelated. Nevertheless, it is a bit concerning that the OOM killer is having to intervene several times a day.
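
In case it helps with that correlation question, here is a small (hypothetical, not in the repo) script to pull the OOM-kill timestamps and process sizes out of the dmesg -T output so they can be lined up against the query table:

#!/usr/bin/env python3
# Sketch only: extract OOM-kill events from `dmesg -T` output so the timestamps
# can be compared against the query table at arax.ncats.io/?recent=1.
import re
import sys

OOM_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] Memory cgroup out of memory: Killed process "
    r"(?P<pid>\d+) \((?P<name>[^)]+)\).*?anon-rss:(?P<rss>\d+)kB"
)

for line in sys.stdin:                      # e.g. `sudo dmesg -T | python3 oom_events.py`
    m = OOM_RE.match(line)
    if m:
        rss_gib = int(m.group("rss")) / (1024 * 1024)
        print(f"{m.group('when')}  pid={m.group('pid')}  {m.group('name')}  anon-rss={rss_gib:.1f} GiB")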

@saramsey
Member

saramsey commented Sep 29, 2021

On arax.ncats.io, I checked /var/log/syslog.1 and I don't see any relevant log entries during the slowdown event on 2021.09.29 03:49. The nearest bracketing logfile entries are:

Sep 29 03:28:16 ip-172-31-53-16 systemd-timesyncd[735]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Sep 29 03:58:15 ip-172-31-53-16 systemd-timesyncd[735]: Network configuration changed, trying to establish connection.

@saramsey
Member

saramsey commented Sep 29, 2021

From the nginx logfile /var/log/nginx/access.log.1 in the host OS on arax.ncats.io, I'm seeing that BTE made 77 POST queries to the /api/rtxkg2 API within 223 seconds, so the query rate is not as high as I first thought last night when I initially looked at the logfile. That's a query rate of 0.35 queries per second (or about 21 queries per minute); I imagine we would in principle want our system to be able to handle queries at that kind of rate. But I note that many of the queries returned a 499 status code, which means the client closed the connection before the server finished responding... perhaps because our system was so slow in responding:

root@ip-172-31-53-16:/var/log/nginx# grep 35.161.155.225 access.log.1 | grep '03:' | grep axios | grep -v "04:03" | grep POST
35.161.155.225 - - [29/Sep/2021:03:47:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13586 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13582 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 19036 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13610 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13570 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13558 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13598 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:21 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 61679 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:27 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13566 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:27 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13558 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:28 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 19039 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:34 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13566 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:35 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:35 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13562 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:43 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13582 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:44 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13602 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:44 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 20412 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:51 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13550 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:51 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13570 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:51 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13614 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:57 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13558 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13546 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13550 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 36339 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:12 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13618 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13574 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13586 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:19 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:48 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13562 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:49 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:49 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13610 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:55 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:55 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13642 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:56 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13546 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:07 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14710 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:09 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:09 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14782 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:19 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:19 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:28 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14674 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:29 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14678 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:29 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14690 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14674 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14734 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:48 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:48 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:48 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14690 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:49:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:08 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:08 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:08 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:28 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14694 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:28 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14734 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:28 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14682 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:38 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:47 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 14746 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:50:48 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
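
As a quick sanity check on the 200/499 mix in a block like the one above, a tiny (hypothetical, not in the repo) Python tally over the access log would be:

#!/usr/bin/env python3
# Sketch only: count HTTP status codes per client IP in an nginx access log,
# assuming the default "combined" log format shown above.
import collections
import re
import sys

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

counts = collections.Counter()
with open(sys.argv[1]) as log:   # e.g. /var/log/nginx/access.log.1
    for line in log:
        m = LOG_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1

for (ip, status), n in sorted(counts.items()):
    print(f"{ip}\t{status}\t{n}")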

@saramsey
Member

saramsey commented Sep 29, 2021

Interestingly, if we look at bracketing log entries in /var/log/nginx/access.log.1 for the start of the 2021.09.29 03:48 slowdown event,

35.161.155.225 - - [29/Sep/2021:03:47:57 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13558 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13546 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:47:58 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13550 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:05 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 36339 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:12 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13618 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13574 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:13 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13586 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:19 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13554 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:20 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13606 "-" "axios/0.21.4"
73.140.212.108 - - [29/Sep/2021:03:48:29 +0000] "GET /api/arax/v1.2/status?last_n_hours=8 HTTP/1.1" 200 89346 "https://arax.ncats.io/?recent=1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:03:48:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"

it looks like the slowdown occurred right when there was a /status GET query from 73.140.212.108 (which is in Comcast address space) from a Firefox browser.

@saramsey
Member

saramsey commented Sep 29, 2021

Is it possible that the /status handler itself is causing the slowdowns??

@edeutsch
Collaborator Author

It would seem highly unlikely to me.
499 seems like a client problem.

@saramsey
Member

saramsey commented Sep 29, 2021

I think we would see a 499 in the nginx logfile if the server (i.e., arax.ncats.io) took too long to respond and the client gave up, based on its internal programmed response timeout.

https://stackoverflow.com/questions/12973304/possible-reason-for-nginx-499-error-codes

@saramsey
Member

Posting in the issue thread in case it is useful:
[screenshot: Screen Shot 2021-09-29 at 8.54.52 AM]

@saramsey
Member

saramsey commented Sep 29, 2021

There was another GET query to /status (as well as to /meta_knowledge_graph) 30 seconds before the slowdown at 2021.09.29 04:31 UTC:

73.240.98.16 - - [29/Sep/2021:04:31:16 +0000] "GET /api/arax/v1.2/status?last_n_hours=24 HTTP/1.1" 200 377882 "https://arax.ncats.io/?recent=1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:92.0) Gecko/20100101 Firefox/92.0"
73.240.98.16 - - [29/Sep/2021:04:31:17 +0000] "GET /api/arax/v1.2/meta_knowledge_graph HTTP/1.1" 200 18045432 "https://arax.ncats.io/?recent=1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:92.0) Gecko/20100101 Firefox/92.0"
35.161.155.225 - - [29/Sep/2021:04:31:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13823 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 24239 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:18 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13783 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:22 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13787 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:22 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13771 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:22 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13819 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:26 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 19593 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:26 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13799 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:26 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13779 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13791 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13815 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:30 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13823 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:34 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13771 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:34 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13803 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:34 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13791 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13793 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:40 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13859 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:41 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13829 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13761 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13789 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13793 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:46 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:54 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13761 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:54 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13769 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:54 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13741 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:56 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:56 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:31:56 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:02 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13769 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:04 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13761 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:04 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 200 13757 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:06 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:14 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:14 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:14 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:16 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:16 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"
35.161.155.225 - - [29/Sep/2021:04:32:16 +0000] "POST /api/rtxkg2/v1.2/query HTTP/1.1" 499 0 "-" "axios/0.21.4"

@saramsey
Member

I ran a test using the following shell script, test-curl.sh:

# Repeatedly POST the same TRAPI query and print the wall-clock time of each request
for i in $(seq 1 2 1000)
do
    { time curl -s -X POST "https://arax-backup.rtx.ai/api/rtxkg2/v1.2/query?bypass_cache=false" -H "accept: application/json" -H "Content-Type: application/json" -d @test.json >/dev/null; } 2>&1 | grep real
done

on my 2018 MBP (macOS 11.6) at the bash terminal. The file test.json contains:

{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "object": "n1",
          "predicates": [
            "biolink:physically_interacts_with"
          ],
          "subject": "n0"
        }
      },
      "nodes": {
        "n0": {
          "categories": [
            "biolink:Gene"
          ],
          "ids": [
            "NCBIGene:23221",
            "NCBIGene:10765",
            "NCBIGene:22836",
            "NCBIGene:8065",
            "NCBIGene:8452",
            "NCBIGene:23145",
            "NCBIGene:284217",
            "NCBIGene:114784",
            "NCBIGene:3645",
            "NCBIGene:1462",
            "NCBIGene:29919",
            "NCBIGene:672",
            "NCBIGene:65267",
            "NCBIGene:26960",
            "NCBIGene:23379",
            "HGNC:39004",
            "UMLS:C4085573",
            "NCBIGene:1869",
            "NCBIGene:5241",
            "HGNC:9335",
            "NCBIGene:5555",
            "NCBIGene:7332",
            "UMLS:C0079427"
          ]
        },
        "n1": {
          "categories": [
            "biolink:SmallMolecule"
          ]
        }
      }
    }
  }
}

The arax-backup.rtx.ai system started out idle with negligible load:

ubuntu@ip-172-31-48-139:~$ uptime
 16:51:23 up 5 days,  5:55,  3 users,  load average: 0.12, 0.45, 0.30

When I run test-curl.sh in one terminal window, the average time for a single curl to complete is 2.3 seconds, as shown here:

(base) sramsey-laptop:kg2-issue sramsey$ ./test-curl.sh
real	0m2.325s
real	0m2.256s
real	0m2.367s
real	0m2.497s
real	0m2.334s
real	0m2.430s
real	0m2.324s
real	0m2.292s
real	0m2.273s
real	0m2.298s
real	0m2.376s
real	0m2.410s
real	0m2.304s
real	0m2.453s

during which time the server load average is 0.53:

ubuntu@ip-172-31-48-139:~$ uptime
 16:53:25 up 5 days,  5:57,  3 users,  load average: 0.53, 0.48, 0.32

with a single process using most of the CPU:
[screenshot: Screen Shot 2021-09-29 at 9.52.48 AM]

Now, what happens when I run this script in two terminal sessions simultaneously on my MBP?

Well, each curl command now has an average running time of about 3.6 seconds:

(base) sramsey-laptop:kg2-issue sramsey$ ./test-curl.sh
real	0m2.383s
real	0m2.366s
real	0m3.698s
real	0m3.669s
real	0m3.587s
real	0m3.582s
real	0m3.644s
real	0m3.669s
real	0m3.586s
real	0m3.625s

and the server load is about 0.90, with only one process serving the queries:
[screenshot: Screen Shot 2021-09-29 at 9.55.46 AM]

What happens when I run this script in three terminal sessions simultaneously on my MBP?

Each curl command now has an average running time of about 5 seconds,

(base) sramsey-laptop:kg2-issue sramsey$ ./test-curl.sh
real	0m3.911s
real	0m4.933s
real	0m4.968s
real	0m5.138s
real	0m5.365s
real	0m5.410s
real	0m4.887s

and the load on the server is 0.91

ubuntu@ip-172-31-48-139:~$ uptime
 16:58:09 up 5 days,  6:02,  3 users,  load average: 0.91, 0.72, 0.47

and it is still being serviced by one python3 process,
[screenshot: Screen Shot 2021-09-29 at 9.57.44 AM]

With four test-curl.sh scripts running simultaneously on my MBP, the time to run a query increases to about 7 seconds,

(base) sramsey-laptop:kg2-issue sramsey$ ./test-curl.sh
real	0m7.826s
real	0m5.165s
real	0m6.421s
real	0m7.390s
real	0m7.403s
real	0m6.154s
real	0m7.726s
real	0m10.920s
real	0m4.829s
real	0m8.129s
real	0m7.825s

(with the variance in response times really starting to tick up) and the load is about 1.0

ubuntu@ip-172-31-48-139:~$ uptime
 16:59:35 up 5 days,  6:03,  3 users,  load average: 0.98, 0.79, 0.52

but still pegging only one processor,
[screenshot: Screen Shot 2021-09-29 at 10.00.01 AM]

@edeutsch
Collaborator Author

I just noticed that ARAX production was dead. I have restarted it.

@saramsey
Member

When I add a fifth client on another system (a t2.micro instance in us-west-2) running test-curl.sh, the average response times go up to about 10 seconds, again with high variance:

real	0m14.156s
real	0m7.616s
real	0m9.171s
real	0m6.990s
real	0m6.467s
real	0m9.141s
real	0m10.283s
real	0m7.141s
real	0m9.755s
real	0m13.382s
real	0m9.941s
real	0m10.022s
real	0m6.346s
real	0m8.661s

with the system load being on average about 104%,
[screenshot: Screen Shot 2021-09-29 at 10.10.48 AM]

@saramsey
Member

Conclusions so far:

  1. The bottleneck task (i.e., the longest-running work) for the /api/rtxkg2/v1.2/query handler appears to be effectively single-threaded; i.e., under concurrent load, it doesn't in practice use more than one processor at a time.
  2. From the client's perspective, average response time therefore scales roughly linearly with request frequency, and the variance of the response time also grows with request frequency.
  3. So when BTE fires off several requests in quick succession, we see a slowdown in response times.
  4. Memory usage in this test scenario is negligible, though of course this system is not otherwise loaded by ARAX queries currently.

With five clients loading the system simultaneously, I am getting an overall (combined) throughput rate of one query serviced per two seconds, which is about what I got with one client. BTE was loading the system with one query per three seconds, so we are only loading the system slightly more than BTE. I am not currently seeing catastrophic slowdowns, however. I will do some more testing with other types of ARAX queries to see if they can trip a catastrophic slowdown.
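
For completeness, the same concurrency sweep could be scripted from Python rather than multiple terminal windows; a minimal sketch (assuming the requests library is available, and using the same test.json payload and arax-backup endpoint as above):

#!/usr/bin/env python3
# Sketch only: POST the same TRAPI query with N parallel clients and report latency stats.
import json
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://arax-backup.rtx.ai/api/rtxkg2/v1.2/query?bypass_cache=false"
N_CLIENTS = 5             # number of concurrent clients (adjust as needed)
REQUESTS_PER_CLIENT = 20

with open("test.json") as f:
    payload = json.load(f)

def one_client(_):
    times = []
    for _ in range(REQUESTS_PER_CLIENT):
        start = time.time()
        requests.post(URL, json=payload, timeout=300)
        times.append(time.time() - start)
    return times

with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
    all_times = [t for times in pool.map(one_client, range(N_CLIENTS)) for t in times]

print(f"mean {statistics.mean(all_times):.2f}s  "
      f"stdev {statistics.stdev(all_times):.2f}s  "
      f"max {max(all_times):.2f}s")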

@saramsey
Member

saramsey commented Sep 29, 2021

I just noticed that ARAX production was dead. I have restarted it.

Hopefully not me! I am running against arax-backup.rtx.ai. Thank you for restarting it.

@saramsey
Member

I am unfortunately not able to test the effect of clicking the "system activity" button in the ARAX browser UI on the functioning of the RTX-KG2 API on the same server (arax-backup.rtx.ai), because doing so immediately gives some kind of JSON error (so it probably isn't doing any of the work to gather the status information, and thus isn't loading the server at all):

[screenshot: Screen Shot 2021-09-29 at 10.24.38 AM]

Hey @finnagin do you know what might be going on there?

@edeutsch
Collaborator Author

It does seem that the KG2 process takes up a lot of RAM:

top - 17:40:18 up 8 days,  3:12,  0 users,  load average: 1.00, 1.28, 1.47
Tasks:  57 total,   1 running,  56 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.2 us,  0.3 sy,  0.0 ni, 92.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 64705984 total, 21095084 free, 12154436 used, 31456464 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 51627920 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                         
 5628 rt        20   0 10.202g 9.145g  30084 S 104.7 14.8  96:47.42 python3                                                                                         
 7115 rt        20   0 3436476 998620  39928 S  16.3  1.5  15:52.26 python3                       

PID 5628 is the KG2 process.

@finnagin
Member

finnagin commented Sep 29, 2021

I thought the system activity UI tab required the MySQL database. Is that correct, @edeutsch?

If so, that would be why it doesn't work on arax-backup.

@saramsey
Member

Thanks @finnagin that is helpful.

@edeutsch
Collaborator Author

Correct, it does. But isn't MySQL replicated on arax-backup? Maybe it's just not started?

@saramsey
Member

The "system activity" may be a red herring in this investigation, as @edeutsch has posited.

@saramsey
Member

Ah, let me start mysql....

@saramsey
Member

Another possible solution that I am thinking about is to do most of the work in forks that terminate. That should make better use of the multiple cores and terminate bloat. But I am uncertain if the forks can maintain the connection to stdout. I will try.

Are you saying that (in a fork scenario) you want a Unix pipe connecting the child process's STDOUT to a file descriptor that the parent process can read from? That is doable. I think subprocess.Popen is the preferred Python tool for that.

@saramsey
Member

Or are you wanting the parent process to "fire and forget", i.e., fork and have the child go off and do the work, and the parent process meanwhile goes on (simultaneously) to handle the next incoming HTTP request?

I assume you are talking about a change to this function?

@edeutsch
Collaborator Author

Well, yes and no. Similar to your "fire and forget", but it's deep in the bowels of the system. The system is already multithreaded, so each request is its own thread (but subject to the GIL); the idea would be to fork a child, have the child do the work while the parent waits for it to finish, and then have the child exit, taking all its problems with it and leaving the parent fresh and unsullied to rejoin the pool. The tricky part is having the child communicate the results back while the parent waits quietly.
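
For illustration only (this is a sketch, not the actual controller code), the pattern looks roughly like this with os.fork and a pipe; work_fn here is a hypothetical stand-in for the real query processing:

import json
import os

def run_in_child(work_fn):
    """Run work_fn in a forked child; the parent blocks until the child exits
    and returns the JSON result the child wrote into a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the memory-hungry work, send the result back, then exit.
        os.close(read_fd)
        result = work_fn()
        with os.fdopen(write_fd, "w") as pipe_out:
            json.dump(result, pipe_out)
        os._exit(0)          # exit immediately, skipping interpreter cleanup
    # Parent: read the child's result, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe_in:
        result = json.load(pipe_in)
    os.waitpid(pid, 0)
    return result

# Hypothetical usage:
# envelope = run_in_child(lambda: {"status": "OK", "n_results": 345})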

@saramsey
Member

Ah, I understand! Thank you for patiently explaining. Yes, I think subprocess.Popen is the ticket for this: you get a pipe to the child process so you can read its output. As a concrete example, here is Python forking and running bash in a child process to use curl and aws to copy a file from a URL to an S3 bucket, while continuously reading the (line-buffered) stderr from curl via a pipe:

#!/usr/bin/env python3

import re
import subprocess

download_url = "https://arax-public.s3.us-west-2.amazonaws.com/test.json"
upload_to_bucket_url = "s3://arax/"

command_str = "curl --progress-bar -L " + download_url + " | aws s3 cp - " + upload_to_bucket_url

with subprocess.Popen(command_str,
                      bufsize=1,
                      universal_newlines=True,
                      stderr=subprocess.PIPE,
                      shell=True).stderr as proc_stderr:
    for line in proc_stderr:
        if line.startswith("##O#"):
            continue  # skip curl's non-progress control lines
        print("the length of the progress bar is: " + str(len(re.findall("#", line))))

@edeutsch
Collaborator Author

Thanks, but I am forking, not running a subprocess command.

@edeutsch
Collaborator Author

I finally got a workable setup in devED and was all happy, only to discover that the child processes intermittently seg fault. I have no solution for that. I wonder if a Python upgrade from 3.7.3 to 3.7.9 might fix that...

@saramsey
Member

saramsey commented Oct 2, 2021

It took me a while to figure out that the definition of the class Response in the module RTX/code/UI/OpenAPI/python-flask-server/openapi_server/models/response.py is not the same as, and actually can be masked by, the definition of the class Response in the module (with the same name!) RTX/code/ARAX/ARAXQuery/response.py. It would be nice if I could ask Python via some kind of reflection, "in what module was this type defined?". Have to say, Python's documentation for the sys and os libraries is pretty darn good.
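
(For what it's worth, Python can answer that via reflection; a minimal sketch using the standard inspect module, with the commented-out import being purely hypothetical:)

import inspect

def where_defined(cls):
    """Report the module name and source file in which a class was defined."""
    return cls.__module__, inspect.getfile(cls)

# Hypothetical usage, depending on which Response ends up on the import path:
# from openapi_server.models.response import Response
# print(where_defined(Response))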

@edeutsch
Collaborator Author

edeutsch commented Oct 2, 2021

I think RTX/code/ARAX/ARAXQuery/response.py was deprecated in favor of RTX/code/ARAX/ARAXQuery/ARAX_response.py for this very reason. Is there still usage of RTX/code/ARAX/ARAXQuery/response.py? If there is, I think it should be switched over and the old response.py should be deleted, but I'm not completely certain.

@edeutsch
Collaborator Author

edeutsch commented Oct 2, 2021

I think the main flaw in this approach (forking in query_controller.py) is that the streaming features that we have won't work. The streaming is performed below this layer.

I have implemented in devED something similar to what you have shown above. And it usually works (go ahead and try a query in devED and you will be using a child process to do the work). But sometimes (~10% of the time) it appears that the child seg faults while doing Expand asyncio. This is making me sad. I am wondering if a Python version upgrade might fix this. If I can fix that, then we may be in good shape.

@saramsey
Member

saramsey commented Oct 2, 2021

Perhaps we could delete RTX/code/ARAX/ARAXQuery/response.py and just see if anything breaks.

@amykglen
Member

amykglen commented Oct 2, 2021

interesting about the seg faults, @edeutsch - that reminds me that a couple weeks ago I was experiencing intermittent seg faults in my dev setup when running the pytest suite - I thought maybe something was messed up with my Python install and so I upgraded from 3.7.8 to 3.7.9... haven't seen any seg faults since. (not sure if that's related to what you're seeing, but thought I'd note in case..)

@edeutsch
Collaborator Author

edeutsch commented Oct 2, 2021

interesting about the seg faults, @edeutsch - that reminds me that a few weeks ago I was experiencing intermittent seg faults in my dev setup when running the pytest suite - I thought maybe something was messed up with my Python install and so I upgraded from 3.7.8 to 3.7.9... haven't seen any seg faults since. (not sure if that's related to what you're seeing, but thought I'd note in case..)

ah, interesting indeed. I don't think there's a problem in our code, but I am uncertain.

Is anyone skillful enough and willing to install Python 3.7.12 and modules in our container? I did a bit of a hack job compiling 3.7.3 myself in /mnt/data/python/ a couple years ago. But maybe it would be good to do it better this time?

@edeutsch
Collaborator Author

edeutsch commented Oct 2, 2021

Perhaps we could delete RTX/code/ARAX/ARAXQuery/response.py and just see if anything breaks.

yes, we should, that was the intent, but I forgot. Anyone want to try?

@saramsey
Member

saramsey commented Oct 4, 2021

Perhaps we could delete RTX/code/ARAX/ARAXQuery/response.py and just see if anything breaks.

yes, we should, that was the intent, but I forgot. Anyone want to try?

I have deleted it in the fork-test devarea on arax-backup.rtx.ai, to test out this change.

I have logged this specific request (to remove code/ARAX/ARAXQuery/response.py) in #1690

@saramsey
Member

saramsey commented Oct 5, 2021

I think the main flaw in this approach (forking in query_controller.py) is that the streaming features that we have won't work. The streaming is performed below this layer.

Fixed (I think?) in #1691 (see PR #1692). I think if we can do it, it is safer to fork() in query_controller.py so that the ARAX_query object (which has a ton of internal state) is only created inside the disposable child process, and it burps out only the relatively slim JSONified response(s) to the server. I think this is likely to be safer and more stable than if we fork within the ARAX_query object (and thus have two ARAX_query objects, one in the parent and one in the child process).

@saramsey
Member

saramsey commented Oct 5, 2021

When I try to run the standard example JSON query through the fork-in-ARAXQuery code (via https://arax-backup.rtx.ai/devED), I am seeing this error:
[screenshot: Screen Shot 2021-10-05 at 1.44.25 PM]

Running the same query with the "fork in query_controller" code (via https://arax-backup.rtx.ai/fork-test) produces:
[screenshot: Screen Shot 2021-10-05 at 1.45.56 PM]

Obviously, more testing is needed, this is just a single test run, etc. See my specific testing recommendations in #1691 and #1692.

@edeutsch
Collaborator Author

edeutsch commented Oct 6, 2021

I think the main flaw in this approach (forking in query_controller.py) is that the streaming features that we have won't work. The streaming is performed below this layer.

Fixed (I think?) in #1691 (see PR #1692).

Yes, it seems to be, although there is a minor issue of apparent buffering through the pipe.

I think if we can do it, it is safer to fork() in query_controller.py so that the ARAX_query object (which has a ton of internal state) is only created inside the disposable child process, and it burps out only the relatively slim JSONified response(s) to the server. I think this is likely to be safer and more stable than if we fork within the ARAX_query object (and thus have two ARAX_query objects, one in the parent and one in the child process).

This sounds reasonable.
We should ponder how this will interact with the /asyncquery endpoint. There we do fork in ARAXQuery so that the processing happens detached. The question is if we apply your changes to the asyncquery_controller.py or not. I'm thinking probably not, because I think that was already working fine (it seemed) and there's likely no reason to fork for the small amount of work that is done before the fork.

@saramsey
Member

saramsey commented Oct 6, 2021

Python 3.7.12 has been installed in /mnt/data/orangeboard/python/Python-3.7.12 in arax-backup.rtx.ai inside the rtx1 container.

@edeutsch
Collaborator Author

edeutsch commented Oct 6, 2021

Thanks. Although I don't really use arax-backup. I could test 3.7.12 on the main arax.

@saramsey
Member

saramsey commented Oct 6, 2021

Thanks. Although I don't really use arax-backup. I could test 3.7.12 on the main arax.

OK, installing now....

@saramsey
Member

saramsey commented Oct 6, 2021

Regarding the buffering issue, see commit bf641b6 as referenced in #1691. I think it's fixed now?

@saramsey
Member

saramsey commented Oct 6, 2021

Python 3.7.12 is now installed in the rtx1 container on arax.ncats.io in /mnt/data/python/Python-3.7.12
[screenshot: Screen Shot 2021-10-05 at 9.00.44 PM]

@edeutsch
Collaborator Author

edeutsch commented Oct 6, 2021

thanks!

@finnagin
Member

finnagin commented Feb 2, 2022

@edeutsch @saramsey is this issue still relevant or are we good to close?

@saramsey
Member

It looks like the forking query_controller.py has been working for a while, so I am going to close this issue out.
