
Purge old data #65

Closed
rossjrw opened this issue Jun 8, 2023 · 7 comments · Fixed by #85
Labels
optimisation Make an existing feature faster or smaller

Comments

rossjrw (Member) commented Jun 8, 2023

Execution time tends to increase:

[image: chart of execution time, increasing roughly linearly over time]

Linear increase suggests to me that this is growing with the number of downloaded posts (as opposed to the number of users, which I would expect to elicit irregular jumps). (It's worth noting that the large jumps depicted are from me refining and optimising the application.)

If this assumption is true, could it be the case that a lot of the posts I have stored are junk data which will never be useful?

If so, and if they can be detected, could clearing them out on a regular basis help to flatten out the execution time increase?

rossjrw added the 'optimisation' label (Make an existing feature faster or smaller) on Jun 8, 2023
rossjrw (Member, Author) commented Aug 7, 2023

In what situations might apparently-junk posts actually be useful? (Editing this list as I think of them.)

  • New notifier users who are old Wikidot users with long post histories - many of their posts won't be in the database
    • Does that matter? notifier will become aware of these posts as they receive notification-emitting replies
    • Actually I think this is the assumed use case for notifier so this is solved by design
  • Error users who have a waiting notification and have had it for longer than whatever the new post expiry period would be

rossjrw (Member, Author) commented Sep 29, 2023

Partially implemented in 2b02579. The service start timestamp is used to filter the temporary post_with_context table, which is what gets iterated, reducing the number of posts that need to be looked at. This is a really easy, hacky way of doing it that doesn't solve the problem; it just offsets it. It also doesn't delete any data, so no storage space is saved, and time will still be wasted downloading posts that are permanently cut off by this filter (except in the case that they eventually become parent posts, in which case they would be redownloaded at that time anyway).
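For illustration, the filter described above might look something like this sketch (using `sqlite3` in-memory; the real schema and column names are assumptions, not the actual migration):

```python
import sqlite3

# Hypothetical sketch of the 2b02579 approach: filter the temporary
# post_with_context table by the service start timestamp, so only posts
# made since the service started are iterated. Table and column names
# here are illustrative.
SERVICE_START = 1_600_000_000  # assumed epoch seconds

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_with_context (post_id TEXT, posted_timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO post_with_context VALUES (?, ?)",
    [
        ("old", SERVICE_START - 1),      # permanently cut off by the filter
        ("new1", SERVICE_START + 10),
        ("new2", SERVICE_START + 20),
    ],
)

# Only posts at or after the service start are considered for notifications
rows = conn.execute(
    "SELECT post_id FROM post_with_context"
    " WHERE posted_timestamp >= ? ORDER BY posted_timestamp",
    (SERVICE_START,),
).fetchall()
```

Note that the "old" row is still stored and still downloaded; it's just skipped at iteration time, which is exactly why this only offsets the problem.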

rossjrw (Member, Author) commented Sep 30, 2023

After plenty of thought, here's my new plan for going about this:

  • Posts are now split into 'notifiable posts' and 'context posts'. A 'notifiable post' is only ever kept in the database if someone is waiting to be notified about it. A 'context post' is only kept in the database if it's needed to contextualise a notifiable post.
  • Deprecate the post table.
  • Track notifiable posts in their own table.
  • Create a 1:many table tracking notifiable posts and their context.
  • Track context posts in their own table.
    • There'll be some data overlap between this and the notifiable posts table. This is worth the tradeoff in order to achieve total semantic separation between notifiable posts and context posts - even if they'll sometimes refer to the same post. This is fine.
  • Track context threads in their own table.
    • I think this is no change from current behaviour.
  • Never iterate threads, except where absolutely necessary to acquire context.
    • Currently this is the first post and the parent post, so 2 pages per thread per new post max.
  • Posts are evaluated for relevancy at download time. If they're not relevant to a stored user, they're not stored in the database.
    • I'm still going to need some way of avoiding re-downloading all 60 posts in the RSS feed even if none were stored last time.

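The notifiable/context split above could be sketched roughly like this (in-memory `sqlite3`; every table and column name here is an assumption about the eventual schema, not the real one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notifiable_post (
        id TEXT PRIMARY KEY,
        snippet TEXT
    );
    CREATE TABLE context_post (
        id TEXT PRIMARY KEY,
        snippet TEXT
    );
    -- 1:many link: one context post can contextualise many notifiable posts
    CREATE TABLE notifiable_post_context (
        notifiable_post_id TEXT REFERENCES notifiable_post(id),
        context_post_id TEXT REFERENCES context_post(id)
    );
""")

# Data overlap is accepted for semantic separation: here the parent of
# 'p2' is 'p1', which is itself notifiable, so 'p1' appears in both tables.
conn.execute("INSERT INTO notifiable_post VALUES ('p1', 'hello'), ('p2', 'reply')")
conn.execute("INSERT INTO context_post VALUES ('p1', 'hello')")
conn.execute("INSERT INTO notifiable_post_context VALUES ('p2', 'p1')")

# Fetch the context for a notifiable post via the link table
parents = conn.execute("""
    SELECT c.snippet FROM notifiable_post n
    JOIN notifiable_post_context l ON l.notifiable_post_id = n.id
    JOIN context_post c ON c.id = l.context_post_id
    WHERE n.id = 'p2'
""").fetchall()
```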
A couple of things I think are worth noting:

  • I would be checking if a post is considered 'relevant' when it is downloaded, and this might change by the time it comes to notify a user about it.
    • If a user has subscribed to a post that would have been downloaded but is by now lost to time, this is fine - we can now only guarantee notifying users of posts made after they subscribed, and this is considered intentional behaviour.
    • If a user has unsubscribed from a post that was downloaded for them, I just need to recheck before attempting to notify them about it, and remove it when I find out. The database is a bit less efficient in the interim but who gives a fuck.
    • If a user has deleted their config, I'll have no prompt to check that a post downloaded for them is still wanted. Will probably have to periodically check for this situation, given that it would happen when they switch to the 'never' channel too.
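The unsubscribe recheck described above amounts to re-filtering at send time. A minimal sketch, assuming hypothetical names (`posts_to_notify`, `still_subscribed` are illustrative, not real functions in notifier):

```python
# Hypothetical recheck before sending: a post was downloaded for a user,
# but the user may have unsubscribed in the interim, so filter the queue
# against their current subscriptions rather than trusting download-time
# relevancy.
def posts_to_notify(queued_posts, still_subscribed):
    # Posts the user unsubscribed from are removed rather than notified
    return [post for post in queued_posts if post in still_subscribed]

queued = ["p1", "p2", "p3"]
current_subs = {"p1", "p3"}  # the user unsubscribed from p2 after download
to_send = posts_to_notify(queued, current_subs)
```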

rossjrw (Member, Author) commented Oct 1, 2023

Deciding whether to bother keeping separate tables for thread context (wiki and category info), or just merge them into the new thread context table.

Upside to having separate tables:

  • Less data usage overall.

Downside:

  • Need to join the tables every time I want information about a thread.
  • I'm trying to make context more generic. 'thread' is already a context. Wiki and category are therefore context that's 2 levels deep and I don't want to have this unique behaviour.
    • Counterpoint: I eventually want to add pages as a kind of context. Pages are literally thread context, not post context.
      • Countercounterpoint: It's negligible data space to store page ID against posts and then pages as a context_page table. Exactly like what I'm already doing for the other kinds of context.
    • Solution: add wiki and category as contexts for posts. Problem solved.

Data:

  • In the test database there are 456529 posts.
  • After applying the filter (which includes removing notifiable posts that have been waiting for more than one year), there are 375 notifiable posts. Of those, 5 are intended for users who are not error users.
    • If I lower the threshold from 1y to 6mo, the number of notifiable posts is 178.

Sticking with a 1 year threshold, I think that attaching pretty much whatever data I want to 375 posts is perfectly fine. Note that 375 isn't the number of context entries, but the number of context entries per type of context must necessarily be no more than 375. So I'm not concerned about data usage at all. 375 * ~5 < 450,000.
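The back-of-envelope arithmetic above, written out (numbers are from this comment; the ~5 context types per post is the author's estimate):

```python
# Data-volume sanity check for the proposed context tables.
total_posts_before = 456_529    # posts in the test database today
notifiable_posts = 375          # remaining after the 1-year expiry filter
context_types = 5               # e.g. thread, parent post, wiki, category, page

# Each context type contributes at most one row per notifiable post, and
# deduplication can only shrink that, so the context tables are bounded
# far below the old post table size.
max_context_rows = notifiable_posts * context_types
assert max_context_rows < total_posts_before  # 1875 << 456529
```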

rossjrw (Member, Author) commented Oct 2, 2023

context_thread table is going to be a bit different from the current thread table.

  • It's going to contain information about the first post in the thread. That means that when constructing the table from current data, I can use information I already have about either the thread or the first post as I see fit.
  • First post author and thread creator will be tracked separately.
  • A thread will have a 'snippet'. For now this will be the first post's snippet; in the future, I might want it to be the snippet as defined by the thread create UI on Wikidot.
  • For some reason I have not stored thread creator user ID. It's possible that I don't actually need it for anything.
    • I'll generate it from the post info that I have on hand right now, but I'll make it nullable in case the creator is Wikidot or deleted.
    • Posts can have null author IDs when the author is a guest account, but I currently appear to store those as empty strings.
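An illustrative shape for the table described above (column names are assumptions, not the real migration): first-post info lives on the thread row, first-post author and thread creator are separate columns, and the creator ID is nullable for Wikidot/deleted accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE context_thread (
        thread_id TEXT PRIMARY KEY,
        snippet TEXT,                  -- for now, the first post's snippet
        first_post_author_id TEXT,     -- '' for guest accounts, per current data
        creator_user_id TEXT           -- NULL when creator is Wikidot or deleted
    )
""")
conn.execute(
    "INSERT INTO context_thread VALUES (?, ?, ?, ?)",
    ("t-1", "First post text...", "", None),
)
row = conn.execute("SELECT * FROM context_thread").fetchone()
```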

rossjrw (Member, Author) commented Jan 23, 2024

I'm a bit confused on why I've decided to make a context_forum_category table. Similar logic can be applied to context_wiki, but that table is actually needed for storing HTTP/S settings for the wikis that notifier is configured for.

So let's backtrack through my reasoning:

  • There's no need for a context_forum_category table to exist, because that data can just be attached directly to the post in notifiable_posts.
  • All the category table is storing is the category name, which isn't really that much deduplicated data.
  • Looking back at this thread, the idea behind splitting out 'contexts' into different tables (of which the forum category is one of four, currently) is that I want a generic backend for different kinds of contexts. In the future, all I would need to do is create a table for a new context, add a foreign key column to notifiable_posts, and then JOIN the posts to that context table each time I want to use it. Theoretically that would scale even to heavily-duplicated contexts with a lot of their own data (such as pages!), because a dedicated table is an easy place to store per-context data in a deduplicated manner.
  • That's all well and good, but, forum categories do not have a lot of data - they have a single string. There is practically no downside to storing it directly in the post table, and it saves a JOIN down the line.
  • Ok, but, there's an argument to be made that forum category is not context for a post, technically; it's context for a thread. Likewise, wiki is context for a category. We're not doing context chaining, so wikis/categories/threads all count as post contexts. Given that, it doesn't make sense for some of those things to have their own tables and some not, and it makes even less sense for the outer two to have their own tables but the middle one to not.
  • So you're arguing that forum categories should be in their own table for consistency's sake, rather than any preference effected by the data?
  • Yes, and now you've made this self-argument format very weird.
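The outcome of the self-argument above, sketched: forum category gets its own context table even though it holds a single string, and the cost of that consistency is one JOIN whenever the name is needed. (All names here are illustrative `sqlite3`, not the real schema.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- All the category table stores is the category name
    CREATE TABLE context_forum_category (
        category_id TEXT PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE notifiable_post (
        id TEXT PRIMARY KEY,
        context_forum_category_id TEXT
            REFERENCES context_forum_category(category_id)
    );
""")
conn.execute(
    "INSERT INTO context_forum_category VALUES ('c-1', 'General Discussion')"
)
conn.execute("INSERT INTO notifiable_post VALUES ('p-1', 'c-1')")

# The cost of consistency: one JOIN each time the category name is needed
name = conn.execute("""
    SELECT c.name FROM notifiable_post p
    JOIN context_forum_category c
        ON c.category_id = p.context_forum_category_id
    WHERE p.id = 'p-1'
""").fetchone()[0]
```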

rossjrw (Member, Author) commented Apr 20, 2024

I keep coming back to read the comments here to answer the question, 'Why define tables for contexts? Why not just use columns of the main posts table?'

Two answers:

  • A table per context scales to highly-duplicated contexts that need lots of data to track. There are none at the time of writing, but it costs little to be prepared.
  • It's just what we're doing. We're committed now. Shut up and deal with it.
