Split post table into notifiable posts and context #85
Force-pushed from e8b2475 to bb29f26.
For testing purposes using the local dataset, here are the user IDs that would be notified per frequency channel (hourly, 8hourly, daily, weekly, monthly):

I need to write the new data structure and scripts to query it. I need to answer the following questions:
To run these tests I'll need to have both approaches available at the same time. That'll mean constructing the new tables without disrupting the old tables, for now, and I'll need to have both queries available too.

Measuring the old approach:

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(
    echo "$(cat notifier/database/queries/cache_post_context.sql | sed 's/%(post_lower_timestamp_limit)s/1627277777/');$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")"
  ) > $freq.txt && echo "done $freq"
done
```
First test, with amendments to the cache table but no amendments to the selection query: completed in less than one second in all trials. However, there are some missing users, which is not acceptable.
Force-pushed from 37199ea to bb29f26.
Force-pushed from bb29f26 to 341c070.
Updating the old queries

Going to record any notes about updating queries here. What I effectively need to do is, one by one for each query:
Once I'm done with that, maybe, this whole thing will be complete.
Force-pushed from fe12c79 to 6f22278.
When should posts be checked for notifiability?

The migration from the old data structure to the new will bin all posts that aren't notifiable. But, right now, all posts are downloaded and stored exactly as naively as they were before. The choice of when exactly to test for and purge posts that aren't notifiable is an important decision. There are some things to bear in mind:
The ideal scenario is to always be checking at all stages, as this will ensure that the database is always trimmed and no unnecessary requests are made. But is this always possible?
I like the idea of having one single method of establishing notifiability, and I like the idea of it being an idempotent operation which, if aborted, can be resumed by a later run with only a temporary loss of efficiency. To that end, the best way to go about this is to naively download and store all new posts, and then, at the end of the run as part of the cleanup phase, remove any non-notifiable posts. This will be performed in SQL using the algorithm outlined in the data structure migration script.
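The real check lives in SQL in the migration script; as a rough pure-Python sketch of the same single notifiability rule (all field and helper names here are hypothetical, not the project's actual schema), the cleanup might look like:

```python
# Hedged sketch: a post is treated as notifiable if it replies to another
# user's post, or if it sits in a thread some user is subscribed to.
# Field names (user_id, thread_id, parent_post_user_id) are illustrative.
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Post:
    id: str
    user_id: str
    thread_id: str
    parent_post_user_id: Optional[str]  # None if the post is not a reply


def is_notifiable(post: Post, subscribed_thread_ids: Set[str]) -> bool:
    # Reply to someone else's post (replying to yourself doesn't count).
    is_post_reply = (
        post.parent_post_user_id is not None
        and post.parent_post_user_id != post.user_id
    )
    # Post in a thread that at least one user is subscribed to.
    is_thread_reply = post.thread_id in subscribed_thread_ids
    return is_post_reply or is_thread_reply


def purge_non_notifiable(
    posts: List[Post], subscribed_thread_ids: Set[str]
) -> List[Post]:
    # Idempotent cleanup: running it again removes nothing new.
    return [p for p in posts if is_notifiable(p, subscribed_thread_ids)]
```

Because the filter is a pure predicate over current state, an aborted run leaves only extra rows behind, which the next run's cleanup sweeps up.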
Updating the digester

The digester fundamentally assumes that it's getting a list of thread replies and a list of post replies. Now it's just getting a list of posts, along with flags marking why each one is considered a notification. The digester is going to need to be made more generic, but I also want to change its overall structure as little as possible, because I don't want to deal with changing the translations. I think what I'm going to try is to implement this by only removing text from the English lexicon, where possible, then copy those changes to the other languages once I'm done.
Force-pushed from a3e9119 to 93ef5d2.
Speed tests

Running a speed comparison again: AMD 7840U, 2023-06-07 dataset. Reference: drop database wikidot_notifier, create database wikidot_notifier,
Test:
Compare results:
Test: baseline. Reference @ main: 1m 17s. Results: not accurate; the test was missing users. Figured that's probably because the cutoff dates were different - service start timestamp vs last post minus 1y.

Test: added. Reference @ main+1: 1m 1s. Results: fewer missing users than before. There are still 3 in the hourly channel and 7 in the daily channel.

This feels very familiar - I'm sure I've gone through these exact tests before. I should see if I can find those in order to work out where the missing users have gone. Failing that (which appears to be the case as of now), I should work it out from scratch.

I'm comparing the notifiable_post migration with the queries that pull notifiable posts/notifiable users from the database, which were more recently written, and there are a few WHERE conditions in the migration that differ. In particular, users are filtered to only those with frequencies in a known set (which excludes Test and Never, for which some users do exist). Additionally, the old query attempts to avoid notifying users about posts they've already responded to - but because the test didn't elicit any additional notifications, I don't believe this is a concern. I need to:
Then amend the test procedure:
New test @ 6f11aad: 1m 50s. Very surprised it actually took longer! Added an index to speed it up. Accuracy @ 7327e8e: pretty good! Just 2 extra results in the daily channel. I need to investigate why that is, but before I do, here's my guess: the new query resolves conflicting manual subscriptions differently. The old one gave weight to unsubscriptions; the new one just picks one in whatever order they come, and its behaviour on conflicts is not defined. I think that's fine, but let's see if that's the case here.
Oops! It's because I didn't re-run the reference without the year filter. There are actually no regressions from the migration alone. Yay! Looks like the only 'regressions' caused by the delete script are the exact same two users as well.
Force-pushed from 8c42862 to 8404045.
Deleting posts

Checking for post deletion state is still something I need to do, to remove from my database posts that have been deleted in the remote. But it's no longer enough to check a subset of posts for deletion. Previously I was iterating all posts that a weekly or monthly user would be notified about, which minimises the amount of work to do. The new priority is to minimise the absolute size of the database, and that means I need to check all posts for deletion. There are far, far fewer stored posts in the new structure, but still far too many to check on every run - even limited to infrequent runs. So I need some way to limit the number of posts that I check for deletion.

I figure I probably want to limit the check to some number of posts per run. The current deletion check takes a decent amount of time on average; I should check how many posts that is. But I would also like to stagger when to check a post for deletion. I'd like to check it frequently initially, as that's when a post is most likely to be deleted - probably trailing off after about 6 hours. I want to keep checking it after that, with the frequency of checks trailing off according to the post's age.

Initially I thought I'd need to keep track of how many times a post has been checked, store it along with the rest of the post, increment it over time, and increase the interval between checks based on that. But I don't need to do that. If I know the post's age, I can define a formula to select the ages at which a post should be checked. E.g. I can define a list: check posts aged 0-6 hours old, 12 hours old, 24 hours, 48 hours, etc. Provided that I round each age to an integer number of hours, and assuming the deletion run succeeds every hour, this should catch all posts with no stragglers.

There's still a risk that too many posts could be captured in one of these bands. It's not really a problem, because I can just slap a limit on the query, but I don't much like the idea of some posts being masked and slipping through permanently. So it might actually help to track the number of times a post has been checked and order by that. Still, I'd really like to avoid needing to do that if possible. It could be possible to define some ordering system that changes each time the deletion check happens. Honestly though, a random order would probably be fine.

It's also worth noting that I now need to check for context deletion too. The only contexts right now to which this applies are thread and parent post, which can both be acquired by a single request to the post.

Context deletion is actually a little more complex than it appears at first glance. If I want to delete a context, I also need to delete all posts associated with it. I then need to follow that up by deleting all contexts that now have zero posts associated with them. Simplifying that, I can (unintuitively) skip the first step: I can just delete all posts with that context's ID registered, and then, when I delete contexts that no longer have any posts associated with them, the context I wanted to delete in the first place will be wiped out too. I like this approach because it reduces duplication and effects an idempotent action (deleting unused contexts).

I don't know what the best way to delete unused contexts is, though. I figure I'm going to need a query for each context, even if they're real similar. But is there anything I can do in the setup stage that would help? E.g. I could create a trigger that deletes contexts when a post is deleted. But I'd need to edit that trigger whenever I add a new context, and that seems like something I don't need to learn how to do. I'll brute-force it using existing solutions in whatever is the easiest way.
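The age-band schedule can be sketched as a pure function. The specific bands here (hourly for the first 6 hours, then doubling from 12 hours) are my illustrative reading of the idea above, not settled values, and the function names are hypothetical:

```python
# Hedged sketch of the age-band deletion schedule. Assumes the deletion
# task runs successfully once per hour, so every post's rounded age hits
# each band exactly once.
from typing import Set


def ages_to_check_hours(max_age_hours: int = 24 * 30) -> Set[int]:
    # Check every hour for the first 6 hours of a post's life...
    ages = set(range(0, 7))
    # ...then at exponentially spaced ages: 12h, 24h, 48h, 96h, ...
    age = 12
    while age <= max_age_hours:
        ages.add(age)
        age *= 2
    return ages


def due_for_deletion_check(post_age_hours: float) -> bool:
    # Round down to whole hours so each band is hit exactly once.
    return int(post_age_hours) in ages_to_check_hours()
```

No per-post check counter is needed: the post's age alone determines whether it is due.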
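The context-deletion cascade can be shown in miniature with in-memory stand-ins for the real tables (the dict shapes and function name are hypothetical; in practice this would be SQL):

```python
# Hedged sketch of the cascade: deleting a context means deleting all
# posts tagged with it, then sweeping away every context that no post
# references - which removes the doomed context itself, so the explicit
# context delete can be skipped.
from typing import Dict, List, Tuple


def delete_context(
    context_id: str,
    posts: List[Dict],
    contexts: List[Dict],
) -> Tuple[List[Dict], List[Dict]]:
    # Step 1: drop every post associated with the context.
    posts = [p for p in posts if p["context_id"] != context_id]
    # Step 2 (idempotent sweep): drop any context with zero posts left.
    live_context_ids = {p["context_id"] for p in posts}
    contexts = [c for c in contexts if c["id"] in live_context_ids]
    return posts, contexts
```

Because step 2 is a generic "delete unused contexts" sweep, running it again, or after an aborted run, is harmless.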
I've decided that this is pointless behaviour - personally I'd much rather have the guarantee that I'm going to be notified of a post. Otherwise a vigilant user might fear that the notifications system doesn't actually work. There may well be users who depend on this behaviour, and I'm sure they'll pipe up. If in the future I choose to reimplement it, I should run a migration to remove posts that are no longer relevant because of it.
Now also soft-deletes wikis instead of deleting their data on every run, preserving their contextual information for posts that request it. Unsure exactly where wiki_service_configured will come in handy.
The post existence check is only used when acquiring new posts, to determine which ones are already known so they can be skipped. This isn't necessary, because it's 60n times cheaper, where n is the number of configured wikis, to just get the latest stored timestamp and ignore anything that came earlier than it.
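As a minimal sketch of that replacement, assuming each downloaded post carries a posted timestamp (the field and function names here are illustrative, not the project's real API):

```python
# Hedged sketch: instead of querying the database per post to ask
# "do I already know this one?", keep only posts strictly newer than
# the latest timestamp already stored.
from typing import Dict, List


def posts_newer_than(downloaded: List[Dict], latest_stored_timestamp: int) -> List[Dict]:
    return [
        p for p in downloaded
        if p["posted_timestamp"] > latest_stored_timestamp
    ]
```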
This means that there is no longer a page-by-page thread iterator in the codebase. That's truly a shame because honestly this was one of my favourite functions I've ever written. But it is time for it to go.
Force-pushed from 8404045 to 43f89fa.
I've come up against a wall about the interaction between post replies and thread unsubscriptions. The current behaviour is that a post reply will bypass a thread unsubscription, and will go directly to the user. I don't know if this is the correct approach or not. Instinctively I'd like the new behaviour to be that a thread unsubscription blocks any notifications from the thread, post replies included. I think I should probably check the FAQ to see how I describe the current behaviour, and that might give me some indication as to whether it's intentional...
The documentation isn't specific about how the two kinds of manual unsubscription interact with subscriptions; in fact, throughout the documentation on the site, the ability to subscribe/unsubscribe to/from a post is at best strongly implied but never explicitly explained. That being said, the quoted line implies that it's more in line with the notifier's philosophy to notify of replies even in unsubscribed threads. So let's consider the use case for this feature. The following conditions are in play:
The distinction in that third point is significant. If I made a post later in the thread, it implies some level of engagement; to have unsubscribed after that implies the unsubscription is very much intentional. In that case it would be better not to notify of a post reply. If I made only the OP and did not engage further, and I have unsubscribed, the thread probably isn't actually relevant to me at all, so I should not be notified of a post reply either. Therefore I can justify removing this undocumented feature entirely.
Ran an integration test and no notifications were selected for me, despite setting my timestamp to 0. Right now I suspect an error during the migration (are posts discarded for users on the test frequency?). Reran the integration test and it looks pretty much fine :)
I reason that it is not sensible to notify about post replies to threads from which the user is unsubscribed.
Because the correct HTTP response for a failed login is 200
Force-pushed from 05809f9 to e7dc8cd.
Resolves #65.
Requirements deemed out of scope for now: