
Discuss breaking the response database out into its own central thing. #1685

Closed
finnagin opened this issue Oct 1, 2021 · 15 comments

@finnagin (Member) commented Oct 1, 2021

We had touched on this a bit previously, but I wanted to make an issue to keep track of progress on the discussion. Essentially the idea would be to remove the MySQL response database from inside the container and replace it with a separate central response database that the other instances of ARAX could also use at the same time.

@edeutsch (Collaborator) commented Oct 1, 2021

Yeah, I think this would be good practice for the ITRB deployment. I'm thinking they would have a completely separate MySQL instance on their services. But doing it ourselves would be nice practice, I think.

Coupled with this we should probably have a place to store the Responses and maybe Queries. These are JSON blobs. Seems like the cool kids use MongoDB for this sort of thing. Although any blob store could be fine. I was originally using MySQL but it proved slow. One concern is speed. We don't want it to take more than a couple seconds to store a 20 MB JSON blob. The store doesn't need to query over the content. It's just central storage. Any other recommended remotely accessible fast blob store?
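
For concreteness, a minimal sketch of what the MongoDB route could look like, assuming pymongo and a hypothetical responses collection on the new host (all names here are illustrative, nothing is decided):

import time
from pymongo import MongoClient

# Hypothetical host and collection names.
client = MongoClient("mongodb://responsedb.rtx.ai:27017/")
responses = client["arax"]["responses"]

def store_response(response_id, envelope):
    """Store a response envelope (a dict parsed from JSON) as a single document."""
    t0 = time.time()
    responses.replace_one(
        {"_id": response_id},                       # key on the ARAX response id
        {"_id": response_id, "envelope": envelope},
        upsert=True,
    )
    return time.time() - t0                         # crude check against the few-seconds budget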

I'm thinking of setting up something like responsedb.rtx.ai that hosts both the MySQL and the MongoDB, and we migrate to that, as practice for having a similar system on ITRB that those systems would use.

Thoughts?

@finnagin (Member, Author) commented Oct 1, 2021

Yeah I think that makes sense to keep them separate.

I like the idea of standing up a responsedb.rtx.ai. So would the purpose of having both be to store the response JSON blobs in MongoDB and the server load info in MySQL?

@edeutsch (Collaborator) commented Oct 1, 2021

> Yeah I think that makes sense to keep them separate.

I am not sure I really understand what "them" is...

> I like the idea of standing up a responsedb.rtx.ai. So would the purpose of having both be to store the response JSON blobs in MongoDB and the server load info in MySQL?

The MySQL instance actually has several purposes. Most importantly, it also stores the response metadata, so there are several tables in MySQL. And we'd use MongoDB to store the Response JSON blobs, and maybe the Query blobs as well. At the moment I am using MySQL for those, but when large Queries with a fully formed KG come in for workflow steps, that is going to fail.
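
So a write would touch both stores, roughly like this sketch (the table and column names here are placeholders, not the actual schema):

def save_response(mysql_conn, mongo_responses, response_id, envelope, metadata):
    # Small, queryable metadata goes into MySQL (placeholder table/columns).
    cursor = mysql_conn.cursor()
    cursor.execute(
        "INSERT INTO response (response_id, tool_version, datetime, status) "
        "VALUES (%s, %s, %s, %s)",
        (response_id, metadata["tool_version"], metadata["datetime"], metadata["status"]),
    )
    mysql_conn.commit()
    cursor.close()
    # The big JSON blob (and eventually the Query blob too) goes into MongoDB.
    mongo_responses.replace_one(
        {"_id": response_id},
        {"_id": response_id, "envelope": envelope},
        upsert=True,
    )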

@finnagin (Member, Author) commented Oct 1, 2021

Sorry, to clarify: by "them" I meant the MySQL instance for ITRB and the one for our own ARAX deployment.

Oh ok. Thanks for breaking that down for me! So it sounds like we have a couple of things we are storing (query and response) that are JSON blobs, which makes sense to put in MongoDB, and then the metadata for those things will live in MySQL along with the server load info.

@edeutsch (Collaborator) commented Oct 1, 2021

Great. And one other thing to consider is whether we can make a clean break with the past and start our Response counter back at 1 with the fresh database/server, or whether it seems important to preserve r=28548 et al. by migrating the existing database. We have 39 GB of responses accumulated since Jan 2021; maybe we can flush and do a fresh start?

We can probably expect our MongoDB database to accumulate at least 50 GB of responses per year, probably increasing. So perhaps a purging strategy should be designed. Keeping the metadata about each response in MySQL seems pretty trivial (perhaps 50,000 per year). But sunsetting the JSON contents after 6 months or something may be good.
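
If we go the MongoDB route, a TTL index could even do the sunsetting for us automatically. A sketch, reusing the responses collection from the sketch above and assuming we add a stored_datetime field (our own naming) to each document:

from datetime import datetime, timedelta, timezone

# One-time setup: MongoDB then deletes documents automatically once
# stored_datetime is older than expireAfterSeconds (~6 months here).
responses.create_index(
    "stored_datetime",
    expireAfterSeconds=int(timedelta(days=183).total_seconds()),
)

def store_with_timestamp(responses, response_id, envelope):
    """Each stored response carries the timestamp the TTL index sweeps on."""
    responses.replace_one(
        {"_id": response_id},
        {"_id": response_id,
         "stored_datetime": datetime.now(timezone.utc),
         "envelope": envelope},
        upsert=True,
    )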

If we have tests that begin with a cached response by id, we may need to consider how to handle that.

@dkoslicki (Member) commented:

One thing to consider, though: some users have been relying quite a bit on the persistence of the r=123's. E.g., Kara Fecho demonstrated as much during her demo portion of the December relay. If we plan a periodic purge, we should somehow communicate this to users so they don't get a nasty surprise (e.g., pre-computing results for a talk or demo, then finding those results have disappeared).

@edeutsch (Collaborator) commented Oct 1, 2021

Okay, thanks. If we determine that it is important to retain these, that is certainly quite possible, just a bit more work.

One idea that I implemented long ago on a different system is the ability to "name" cached responses, and IFF they are so named, then they are excepted from purging sweeps.
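
A purge sweep with that exemption could be as simple as this sketch (the name and stored_datetime fields are hypothetical). Note it would have to be a manual sweep rather than a TTL index, since a TTL index cannot exempt documents that still carry the timestamp field:

from datetime import datetime, timedelta, timezone

def purge_old_responses(responses, max_age_days=183):
    """Delete old, unnamed responses; named ones survive the sweep."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    result = responses.delete_many({
        "stored_datetime": {"$lt": cutoff},
        "name": {"$exists": False},   # IFF a response is named, it is skipped
    })
    return result.deleted_count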

@isbluis (Member) commented Oct 3, 2021

Is it feasible to start by only removing queries that are pre-TRAPI 1.2, while leaving the others in the database with ids intact, @edeutsch ?

@edeutsch (Collaborator) commented Oct 4, 2021

Yes, this is certainly feasible and quite sensible. Is everyone else okay with losing all the old TRAPI 1.0 and 1.1 results?

@saramsey (Member) commented:

Some facts about MongoDB, taken from MongoDB Limits and Thresholds:

  • Maximum BSON document size is 16 MB (maybe we can get away with that if we have one BSON document per result?)
  • Maximum collection size is 32 TB, so that seems, um, ample
  • Limit on the number of documents is 2^32, which seems, um, ample

Obligatory disclaimer: I am not a MongoDB person, YMMV.
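
For what it's worth, checking whether a given envelope would fit under that cap is cheap with pymongo's bundled bson module (just a sketch):

import bson  # ships with pymongo

MAX_BSON_BYTES = 16 * 1024 * 1024   # MongoDB's per-document limit

def fits_in_one_document(envelope):
    """True if the envelope, encoded as BSON, comes in under the 16 MB cap."""
    return len(bson.encode(envelope)) < MAX_BSON_BYTES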

@saramsey (Member) commented Oct 13, 2021

It looks like we are no longer storing response.envelope in MySQL, per lines 188-195 of response_cache.py:

#### Instead of storing the large response object in the MySQL database as a blob
#### now store it as a JSON file on the filesystem
response_dir = os.path.dirname(os.path.abspath(__file__)) + '/../../../data/responses_1_0'
if not os.path.exists(response_dir):
    try:
        os.mkdir(response_dir)
    except:
        eprint(f"ERROR: Unable to create dir {response_dir}")

So I am not sure that MongoDB would have a significant speed advantage over reading the JSON blob from the filesystem, unless the directory of responses had a huge number of response files in it (in which case, yes, MongoDB's lookup would be much faster than the filesystem's).
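
If a single huge directory ever did become the bottleneck, fanning the files out by response id would be a cheap fix; a sketch (the layout is just an idea):

import os

def response_path(response_dir, response_id):
    """Shard response files into subdirectories (e.g. 28548 -> 28/28548.json)
    so no single directory accumulates tens of thousands of entries."""
    shard = str(response_id).zfill(2)[:2]
    subdir = os.path.join(response_dir, shard)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, f"{response_id}.json")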

@saramsey (Member) commented:

Not sure I have much else to add here, so I will take myself off the assignment for this issue, for now.

saramsey removed their assignment Oct 13, 2021
@edeutsch (Collaborator) commented Oct 13, 2021

ooo, thanks, a max BSON of 16 MB is potentially a problem. However, maybe we can just let it slide. I'm guessing that BSON is more compact than JSON, and relatively few JSONs are larger than 16 MB (maybe 5%?). If those just silently fell on the floor, that might not be terrible. On the other hand, MongoDB does fancy indexing into the BSON, I think, which we totally don't need.

What do you think of throwing all the JSON blobs into S3? Would that be performant? It also wouldn't have the file size limit.

@edeutsch (Collaborator) commented:

> So I am not sure that MongoDB would have a significant speed advantage over reading the JSON blob from the filesystem, unless the directory of responses had a huge number of response files in it (in which case, yes, MongoDB's lookup would be much faster than the filesystem's).

Note that we are not looking for a speed advantage over the local filesystem (that would be very hard to beat). We are looking for a mechanism by which multiple worker nodes can write to a shared space (instead of their local ephemeral filesystems) that is not that much slower than writing to a local filesystem. S3?
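
For the record, the S3 version would be about this simple (bucket and key names here are hypothetical), and a single put_object call handles objects up to 5 GB, comfortably above our 20 MB blobs:

import json
import boto3

s3 = boto3.client("s3")  # credentials via the usual AWS config/env vars

def store_response_s3(response_id, envelope, bucket="arax-responses"):
    """Write one response envelope to S3; the bucket name is hypothetical."""
    s3.put_object(
        Bucket=bucket,
        Key=f"responses/{response_id}.json",
        Body=json.dumps(envelope).encode("utf-8"),
        ContentType="application/json",
    )

def fetch_response_s3(response_id, bucket="arax-responses"):
    obj = s3.get_object(Bucket=bucket, Key=f"responses/{response_id}.json")
    return json.loads(obj["Body"].read())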

@finnagin (Member, Author) commented:

Closing since this was done in #1702
