Purge old data #65
In what situations might apparently-junk posts actually be useful? (I'll edit this as I think of them.)
Partially implemented in 2b02579. The service start timestamp is used to filter the temporary post_with_context table (which is what gets iterated), reducing the number of posts that need to be looked at. This is a really easy, hacky way of doing it and doesn't solve the problem; it just offsets it. It also doesn't delete any data, so no storage space is saved, and time will still be wasted downloading posts that are permanently cut off by this filter (except when they eventually become parent posts, in which case they would be redownloaded at that point anyway).
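A minimal sketch of that filter, assuming a SQLite-style backing store. Only post_with_context comes from the comment above; the posts schema, the `created` column, and the cutoff value are my own illustrative guesses.

```python
import sqlite3
import time

# Hypothetical service start timestamp used as the cutoff.
SERVICE_START = time.time()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created REAL)")
conn.executemany(
    "INSERT INTO posts (created) VALUES (?)",
    [(SERVICE_START - 100,),   # older than the cutoff: filtered out
     (SERVICE_START + 100,)],  # newer than the cutoff: kept
)

# Only posts at or after the service start make it into the temp table,
# so the iteration below touches fewer rows.
conn.execute("CREATE TEMP TABLE post_with_context (id INTEGER, created REAL)")
conn.execute(
    "INSERT INTO post_with_context "
    "SELECT id, created FROM posts WHERE created >= ?",
    (SERVICE_START,),
)

for (post_id,) in conn.execute("SELECT id FROM post_with_context"):
    pass  # ...per-post processing would go here
```

As the comment notes, this only shrinks the working set for iteration; the filtered-out rows still exist in posts and still cost storage.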
After plenty of thought, here's my new plan for going about this:
A couple of things I think are worth noting:
Deciding whether to bother keeping separate tables for thread context (wiki and category info), or to just merge them into the new thread context table. Upside of having separate tables:
Downside:
Data:
Sticking with a 1-year threshold, I think that attaching pretty much whatever data I want to 375 posts is perfectly fine. Note that 375 isn't the total number of context entries; rather, the number of context entries per type of context must necessarily be less than 375. So I'm not concerned about data usage at all: 375 * ~5 < 450,000.
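A quick sanity check of the figures quoted above. 375 is the number of posts past the 1-year threshold; ~5 is my reading of the number of context types; the thread doesn't say exactly what the 450,000 counts, so I treat it purely as an upper-bound comparison.

```python
# Back-of-envelope check of the estimate in the comment above.
posts_past_threshold = 375   # posts older than the 1-year threshold
context_types = 5            # approximate number of context types (my guess)
upper_bound = 450_000        # comparison figure quoted in the thread

worst_case_entries = posts_past_threshold * context_types
print(worst_case_entries)    # 1875, tiny next to 450,000
assert worst_case_entries < upper_bound
```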
The context_thread table is going to be a bit different from the current thread table.
I'm a bit confused about why I've decided to do it this way, so let's backtrack through my reasoning:
I keep coming back to read the comments here to answer the question, 'Why define tables for contexts? Why not just use columns of the main posts table?' Two answers:
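The trade-off between dedicated context tables and extra columns on the main posts table can be pictured roughly like this. All schema details here are my own illustration, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option A: context as nullable columns on the posts table. Every post row
# carries the columns even though only the handful of posts past the
# threshold ever needs them.
conn.execute("""
    CREATE TABLE posts_flat (
        id INTEGER PRIMARY KEY,
        body TEXT,
        wiki_context TEXT,      -- NULL for the vast majority of rows
        category_context TEXT   -- NULL for the vast majority of rows
    )
""")

# Option B: a separate context table keyed by post id, so only the ~375
# posts that actually need context get any context rows at all.
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("""
    CREATE TABLE context_thread (
        post_id INTEGER REFERENCES posts(id),
        context_type TEXT,      -- e.g. 'wiki' or 'category'
        value TEXT
    )
""")
```

Option B also lets new context types be added later without altering the main table.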
Execution time tends to increase:
Linear increase suggests to me that this is growing with the number of downloaded posts (as opposed to the number of users, which I would expect to elicit irregular jumps). (It's worth noting that the large jumps depicted are from me refining and optimising the application.)
If this assumption is true, could it be the case that a lot of the posts I have stored are junk data which will never be useful?
If so, and if they can be detected, could clearing them out on a regular basis help to flatten out the execution time increase?
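One way such a regular clear-out could look: delete posts past the threshold that no other post references as a parent, so thread context is never lost. This is only a sketch of the idea; the schema, column names, and junk criterion are all assumptions, not the project's actual code.

```python
import sqlite3
import time

ONE_YEAR = 365 * 24 * 3600

def purge_old_posts(conn, now):
    """Delete posts older than the threshold that are not the parent of
    any other post, and return how many rows were removed."""
    cur = conn.execute(
        """
        DELETE FROM posts
        WHERE created < ?
          AND id NOT IN (SELECT parent_id FROM posts
                         WHERE parent_id IS NOT NULL)
        """,
        (now - ONE_YEAR,),
    )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY, parent_id INTEGER, created REAL)""")
now = time.time()
conn.executemany(
    "INSERT INTO posts (id, parent_id, created) VALUES (?, ?, ?)",
    [
        (1, None, now - 2 * ONE_YEAR),  # old, but parent of post 3: kept
        (2, None, now - 2 * ONE_YEAR),  # old with no children: purged
        (3, 1, now - 100),              # recent reply to post 1
    ],
)
print(purge_old_posts(conn, now))  # 1
```

Run periodically, something like this would cap growth of the iterated set rather than merely offsetting it the way the timestamp filter does.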