
Discuss breaking the response database out into its own central thing. #1685

Closed
finnagin opened this issue Oct 1, 2021 · 15 comments

@finnagin (Member) commented Oct 1, 2021

We had touched on this a bit previously, but I wanted to make an issue to keep track of progress on the discussion. Essentially the idea would be to remove the MySQL response database from inside the container and replace it with a separate central response database that the other instances of ARAX could also use at the same time.

@edeutsch (Collaborator) commented Oct 1, 2021

Yeah, I think this would be good practice for the ITRB deployment. I'm thinking they would have a completely separate MySQL instance on their services. But doing it ourselves would be nice practice, I think.

Coupled with this we should probably have a place to store the Responses and maybe Queries. These are JSON blobs. Seems like the cool kids use MongoDB for this sort of thing. Although any blob store could be fine. I was originally using MySQL but it proved slow. One concern is speed. We don't want it to take more than a couple seconds to store a 20 MB JSON blob. The store doesn't need to query over the content. It's just central storage. Any other recommended remotely accessible fast blob store?
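
For concreteness, a minimal sketch of what the MongoDB route could look like, assuming pymongo and a hypothetical responses collection on the new host (all names here are illustrative, nothing is decided):

import time
from pymongo import MongoClient

# Hypothetical host and collection names.
client = MongoClient("mongodb://responsedb.rtx.ai:27017/")
responses = client["arax"]["responses"]

def store_response(response_id, envelope):
    """Store a response envelope (a dict parsed from JSON) as a single document."""
    t0 = time.time()
    responses.replace_one(
        {"_id": response_id},                       # key on the ARAX response id
        {"_id": response_id, "envelope": envelope},
        upsert=True,
    )
    return time.time() - t0                         # crude check against the few-seconds budget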

I'm thinking of setting up something like responsedb.rtx.ai that hosts both the MySQL and the MongoDB, and we migrate to that, as practice for having a similar system on ITRB that those systems would use.

Thoughts?

@finnagin (Member, Author) commented Oct 1, 2021

Yeah I think that makes sense to keep them separate.

I like the idea of standing up a responsedb.rtx.ai. So would the purpose of having both be to store the response JSON blobs in MongoDB and the server load info in MySQL?

@edeutsch (Collaborator) commented Oct 1, 2021

> Yeah I think that makes sense to keep them separate.

I am not sure I really understand what "them" is...

> I like the idea of standing up a responsedb.rtx.ai. So would the purpose of having both be to store the response JSON blobs in MongoDB and the server load info in MySQL?

The MySQL instance actually has several purposes. Most importantly, it also stores the response metadata, so there are several tables in MySQL. And we'd use MongoDB to store the Response JSON blobs, and maybe the Query blobs as well. At the moment I am using MySQL for those, but when large Queries with a fully formed KG come in for workflow steps, that is going to fail.
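
So a write would touch both stores, roughly like this sketch (the table and column names here are placeholders, not the actual schema):

def save_response(mysql_conn, mongo_responses, response_id, envelope, metadata):
    # Small, queryable metadata goes into MySQL (placeholder table/columns).
    cursor = mysql_conn.cursor()
    cursor.execute(
        "INSERT INTO response (response_id, tool_version, datetime, status) "
        "VALUES (%s, %s, %s, %s)",
        (response_id, metadata["tool_version"], metadata["datetime"], metadata["status"]),
    )
    mysql_conn.commit()
    cursor.close()
    # The big JSON blob (and eventually the Query blob too) goes into MongoDB.
    mongo_responses.replace_one(
        {"_id": response_id},
        {"_id": response_id, "envelope": envelope},
        upsert=True,
    )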

@finnagin (Member, Author) commented Oct 1, 2021

Sorry, to clarify: by "them" I meant the MySQL instance for ITRB and the one for our own ARAX deployment.

Oh ok. Thanks for breaking that down for me! So it sounds like we have a couple of things we are storing (query and response) that are JSON blobs, which makes sense to put in MongoDB, and then the metadata for those things will live in MySQL along with the server load info.

@edeutsch (Collaborator) commented Oct 1, 2021

Great. And one other thing to consider is whether we can make a clean break with the past and start our Response counter back at 1 with the fresh database/server, or whether it seems important to preserve r=28548 et al. by migrating the existing database. We have 39 GB of responses accumulated since Jan 2021; maybe we can flush and do a fresh start?

We can probably expect our MongoDB database to accumulate at least 50 GB of responses per year, probably increasing. So perhaps a purging strategy should be designed. Keeping the metadata about each response in MySQL seems pretty trivial (perhaps 50,000 per year). But sunsetting the JSON contents after 6 months or something may be good.
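
If we go the MongoDB route, a TTL index could even do the sunsetting for us automatically. A sketch, reusing the responses collection from the sketch above and assuming we add a stored_datetime field (our own naming) to each document:

from datetime import datetime, timedelta, timezone

# One-time setup: MongoDB then deletes documents automatically once
# stored_datetime is older than expireAfterSeconds (~6 months here).
responses.create_index(
    "stored_datetime",
    expireAfterSeconds=int(timedelta(days=183).total_seconds()),
)

def store_with_timestamp(responses, response_id, envelope):
    """Each stored response carries the timestamp the TTL index sweeps on."""
    responses.replace_one(
        {"_id": response_id},
        {"_id": response_id,
         "stored_datetime": datetime.now(timezone.utc),
         "envelope": envelope},
        upsert=True,
    )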

If we have tests that begin with a cached response by id, we may need to consider how to handle that.

@dkoslicki (Member) commented:

One thing to consider, though: some users have been relying quite a bit on the persistence of the r=123's. E.g., Kara Fecho demonstrated as much during her demo portion of the December relay. If we plan a periodic purge, we should somehow communicate this to users so they don't get a nasty surprise (e.g., pre-computing results for a talk or demo, then finding those results have disappeared).

@edeutsch (Collaborator) commented Oct 1, 2021

Okay, thanks. If we determine that it is important to retain these, that is certainly quite possible, just a bit more work.

One idea that I implemented long ago on a different system is the ability to "name" cached responses, and IFF they are so named, then they are excepted from purging sweeps.
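
A purge sweep with that exemption could be as simple as this sketch (the name and stored_datetime fields are hypothetical). Note it would have to be a manual sweep rather than a TTL index, since a TTL index cannot exempt documents that still carry the timestamp field:

from datetime import datetime, timedelta, timezone

def purge_old_responses(responses, max_age_days=183):
    """Delete old, unnamed responses; named ones survive the sweep."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    result = responses.delete_many({
        "stored_datetime": {"$lt": cutoff},
        "name": {"$exists": False},   # IFF a response is named, it is skipped
    })
    return result.deleted_count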

@isbluis (Member) commented Oct 3, 2021

Is it feasible to start by only removing queries that are pre-TRAPI 1.2, while leaving the others in the database with ids intact, @edeutsch ?

@edeutsch (Collaborator) commented Oct 4, 2021

Yes, this is certainly feasible and quite sensible. Is everyone else okay with losing all the old TRAPI 1.0 and 1.1 results?

@saramsey (Member) commented:

Some facts about MongoDB, taken from MongoDB Limits and Thresholds:

  • Maximum BSON document size is 16 MB (maybe we can get away with that if we have one BSON document per result?)
  • Maximum collection size is 32 TB, so that seems, um, ample
  • Limit on the number of documents is 2^32, which seems, um, ample

Obligatory disclaimer: I am not a MongoDB person, YMMV.
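
For what it's worth, checking whether a given envelope would fit under that cap is cheap with pymongo's bundled bson module (just a sketch):

import bson  # ships with pymongo

MAX_BSON_BYTES = 16 * 1024 * 1024   # MongoDB's per-document limit

def fits_in_one_document(envelope):
    """True if the envelope, encoded as BSON, comes in under the 16 MB cap."""
    return len(bson.encode(envelope)) < MAX_BSON_BYTES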

@saramsey (Member) commented Oct 13, 2021

It looks like we are no longer storing response.envelope in MySQL, per lines 188-195 of response_cache.py:

#### Instead of storing the large response object in the MySQL database as a blob
#### now store it as a JSON file on the filesystem
response_dir = os.path.dirname(os.path.abspath(__file__)) + '/../../../data/responses_1_0'
if not os.path.exists(response_dir):
    try:
        os.mkdir(response_dir)
    except:
        eprint(f"ERROR: Unable to create dir {response_dir}")

So I am not sure that MongoDB would have a significant speed advantage over reading the JSON blob from the filesystem, unless the directory of responses had a huge number of response files in it (in which case, yes, MongoDB's lookup would be much faster than the filesystem's).
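
If a single huge directory ever did become the bottleneck, fanning the files out by response id would be a cheap fix; a sketch (the layout is just an idea):

import os

def response_path(response_dir, response_id):
    """Shard response files into subdirectories (e.g. 28548 -> 28/28548.json)
    so no single directory accumulates tens of thousands of entries."""
    shard = str(response_id).zfill(2)[:2]
    subdir = os.path.join(response_dir, shard)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, f"{response_id}.json")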

@saramsey (Member) commented:

Not sure I have much else to add here, so I will take myself off the assignment for this issue, for now.

saramsey removed their assignment Oct 13, 2021
@edeutsch (Collaborator) commented Oct 13, 2021

ooo, thanks, a max BSON of 16 MB is potentially a problem. However, maybe we can just let it slide. I'm guessing that BSON is more compact than JSON, and relatively few JSONs are larger than 16 MB (maybe 5%?). If those just silently fell on the floor, that might not be terrible. On the other hand, MongoDB does fancy indexing into the BSON, I think, which we totally don't need.

What do you think of throwing all the JSON blobs into S3? Would that be performant? It also wouldn't have the file size limit.

@edeutsch (Collaborator) commented:

> So I am not sure that MongoDB would have a significant speed advantage over reading the JSON blob from the filesystem, unless the directory of responses had a huge number of response files in it (in which case, yes, MongoDB's lookup would be much faster than the filesystem's).

Note that we are not looking for a speed advantage over the local filesystem (that would be very hard to beat). We are looking for a mechanism by which multiple worker nodes can write to a shared space (instead of their local ephemeral filesystems) that is not that much slower than writing to a local filesystem. S3?
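
For the record, the S3 version would be about this simple (bucket and key names here are hypothetical), and a single put_object call handles objects up to 5 GB, comfortably above our 20 MB blobs:

import json
import boto3

s3 = boto3.client("s3")  # credentials via the usual AWS config/env vars

def store_response_s3(response_id, envelope, bucket="arax-responses"):
    """Write one response envelope to S3; the bucket name is hypothetical."""
    s3.put_object(
        Bucket=bucket,
        Key=f"responses/{response_id}.json",
        Body=json.dumps(envelope).encode("utf-8"),
        ContentType="application/json",
    )

def fetch_response_s3(response_id, bucket="arax-responses"):
    obj = s3.get_object(Bucket=bucket, Key=f"responses/{response_id}.json")
    return json.loads(obj["Body"].read())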

@finnagin (Member, Author) commented:

Closing since this was done in #1702
