Discuss breaking the response database out into its own central thing. #1685
Comments
Yeah, I think this would be good practice for the ITRB deployment. I'm thinking they would have a completely separate MySQL instance on their services, but doing it ourselves would be nice practice, I think. Coupled with this, we should probably have a place to store the Responses and maybe the Queries. These are JSON blobs. Seems like the cool kids use MongoDB for this sort of thing, although any blob store could be fine. I was originally using MySQL, but it proved slow.

One concern is speed: we don't want it to take more than a couple of seconds to store a 20 MB JSON blob. The store doesn't need to query over the content; it's just central storage. Any other recommended remotely accessible, fast blob store?

I'm thinking of setting up something like responsedb.rtx.ai that hosts both the MySQL and the MongoDB, and we migrate to that, as practice for having a similar system on ITRB that those systems use. Thoughts?
Yeah, I think it makes sense to keep them separate. I like the idea of standing up a responsedb.rtx.ai. So would the purpose of having both be to store the Response JSON blobs in MongoDB and the server load info in MySQL?
I am not sure I really understand what "them" refers to here.

The MySQL instance actually has several purposes; most importantly, it also stores the response metadata, so there are several tables in MySQL. We would use MongoDB to store the Response JSON blobs, and maybe we should also use it to store the Query blobs as well. At the moment I am using MySQL for that, but when large Queries with a fully formed KG come in for workflow steps, this is going to fail.
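To make the proposed split concrete (small metadata rows in MySQL, full JSON blobs in MongoDB), here is a minimal sketch. All names are hypothetical, and plain dataclasses stand in for the real MySQL/pymongo calls; this is not the actual ARAX schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical shape of the split discussed above: a compact metadata row
# (destined for a MySQL table) plus the full serialized JSON blob
# (destined for MongoDB). Field names are illustrative assumptions.
@dataclass
class ResponseMetadata:
    response_id: int
    stored_at: str      # ISO 8601 timestamp of when the blob was stored
    blob_bytes: int     # size of the serialized JSON blob in bytes
    trapi_version: str

def split_response(response_id: int, envelope: dict,
                   trapi_version: str = "1.2") -> tuple:
    """Separate a TRAPI response into a metadata row and a JSON blob."""
    blob = json.dumps(envelope)
    meta = ResponseMetadata(
        response_id=response_id,
        stored_at=datetime.now(timezone.utc).isoformat(),
        blob_bytes=len(blob.encode("utf-8")),
        trapi_version=trapi_version,
    )
    return meta, blob

# In a real deployment, `meta` would become an INSERT into MySQL and
# `blob` an insert into a MongoDB collection.
```

The point of the split is that queries like "list my recent responses" only ever touch the small MySQL rows; the large blob is fetched from MongoDB only when a specific response id is requested.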
Sorry, to clarify: by "them" I meant the MySQL instance for ITRB and the one for our own ARAX deployment. Oh, OK. Thanks for breaking that down for me! So it sounds like we have a couple of things (Query and Response) that we are storing as JSON blobs, which makes sense to put in MongoDB, and then the metadata for those things will live in MySQL along with the server load info.
Great. And one other thing to consider is whether we can make a clean break with the past and start our Response counter back at 1 with the fresh database/server, or whether it seems important to preserve r=28548 et al. by migrating the existing database. We have 39 GB of responses accumulated since January 2021; maybe we can flush and do a fresh start?

We can probably expect our MongoDB database to accumulate at least 50 GB of responses per year, probably increasing, so perhaps a purging strategy should be designed. Keeping the metadata about each response in MySQL seems pretty trivial (perhaps 50,000 rows per year), but sunsetting the JSON contents after 6 months or something may be good. If we have tests that begin with a cached response fetched by id, we may need to consider how to handle that.
One thing to consider though: some users have been relying quite a bit on the persistence of the |
Okay, thanks. If we determine that it is important to retain these, it is certainly quite possible, just a bit more work. One idea that I implemented long ago on a different system is the ability to "name" cached responses; IFF they are so named, they are exempted from purging sweeps.
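A purge sweep combining the 6-month sunset with the naming exemption described above could look like this sketch. It is a pure-Python eligibility check; the record field names and the retention period are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=180)  # the ~6-month sunset floated above

def purge_candidates(responses: list, now: datetime) -> list:
    """Return ids of responses whose JSON contents can be sunsetted.

    Each response is a dict with hypothetical fields:
      id        - response id
      stored_at - datetime the blob was stored
      name      - optional label; named responses are exempt from purging
    """
    return [
        r["id"]
        for r in responses
        if r.get("name") is None and now - r["stored_at"] > RETENTION
    ]

# A real sweep would run this over the MySQL metadata table and delete
# only the MongoDB blobs, leaving the metadata rows intact so the
# response id history is preserved.
```

Keeping the metadata row after the blob is purged means an old id like r=28548 can still resolve to a "content expired" page instead of a 404.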
Is it feasible to start by only removing queries that are pre-TRAPI 1.2, while leaving the others in the database with ids intact, @edeutsch?
Yes, this is certainly feasible and quite sensible. If everyone else is okay with losing all old TRAPI 1.0 and 1.1 results? |
Some facts about MongoDB, taken from MongoDB Limits and Thresholds:

- The maximum BSON document size is 16 MB.

Obligatory disclaimer: I am not a MongoDB person, YMMV.
It looks like we are no longer storing the response in the database itself (see RTX/code/ARAX/ResponseCache/response_cache.py, lines 188 to 195, at a6e5ed8), so I am not sure MongoDB would be expected to have a significant speed advantage over reading the JSON blob from the file system, unless the directory of responses had a huge count of response files in it (in which case, yes, MongoDB would be expected to be much faster than the filesystem in terms of lookup time to find the response).
Not sure I have much else to add here, so I will take myself off the assignment for this issue, for now. |
Ooo, thanks; a max BSON size of 16 MB is potentially a problem. However, maybe we can just let it slide. I'm guessing that BSON is more compact than JSON, and relatively few JSONs are larger than 16 MB (maybe 5%?). If those just silently fell on the floor, that might not be terrible. On the other hand, MongoDB does fancy indexing into the BSON, I think, which we totally don't need. What do you think of throwing all the JSON blobs into S3? Would that be performant? And it doesn't have a file size limit.
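Rather than letting oversized blobs silently fall on the floor, one option is to route by size: blobs safely under MongoDB's 16 MB BSON cap go into an ordinary document, and anything larger goes to an overflow store (GridFS or S3). A hypothetical routing sketch; the margin factor and store names are assumptions, not measured figures.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's documented BSON document cap

def choose_store(blob: bytes, margin: float = 0.95) -> str:
    """Pick a destination for a serialized JSON blob.

    Blobs comfortably under the BSON limit (with a margin left for
    BSON encoding and sibling fields in the document) can live in a
    plain MongoDB document; anything larger goes to an overflow store
    such as GridFS or S3. Both the margin and the returned labels are
    illustrative assumptions.
    """
    if len(blob) < MAX_BSON_BYTES * margin:
        return "mongodb"
    return "overflow"
```

With a rule like this, the roughly 5% of oversized responses mentioned above are still retained, at the cost of the retrieval path having to check which store a given response id landed in.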
Note that we are not looking for a speed advantage over the local file system (that would be very hard to achieve). We are looking for a mechanism by which multiple worker nodes can write to a shared space (instead of their local ephemeral filesystems) that is not much slower than writing to a local filesystem. S3?
Closing since this was done in #1702 |
We had touched on this a bit previously, but I wanted to make an issue to keep track of progress on the discussion. Essentially, the idea would be to remove the MySQL response database from inside the container and replace it with a separate central response database that the other instances of ARAX could also use at the same time.