-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve opds2_feed_reaper
performance for large feeds. (PP-1756)
#2089
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2089 +/- ##
==========================================
+ Coverage 90.67% 90.78% +0.10%
==========================================
Files 344 344
Lines 40585 40580 -5
Branches 6583 8819 +2236
==========================================
+ Hits 36801 36839 +38
+ Misses 2506 2484 -22
+ Partials 1278 1257 -21 ☔ View full report in Codecov by Sentry. |
to_be_reaped_qu = unlimited_access_license_pools_qu.join(Identifier).filter( | ||
~Identifier.id.in_(identifier_ids) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key change starts here. We remove this join, which scaled poorly because the SELECT
grew linearly with the number of identifiers in the feed.
The other part of this change is below, where we iterate over all eligible license pools locally and reap the ones that are not mentioned in the feed.
bin/opds2_reaper_monitor
Outdated
for pool in eligible_license_pools_qu.options(raiseload("*")).yield_per( | ||
query_batch_size | ||
): | ||
if pool.identifier_id not in identifier_ids: | ||
reap_count += 1 | ||
# Don't actually reap, unless this is explicitly NOT a dry run. | ||
if self.dry_run is False: | ||
pool.unlimited_access = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the second part of the main change here, where we iterate over all eligible license pools locally and reap the ones that are not mentioned in the feed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
feed_id_generator = (id_ for id_ in self.seen_identifiers) | ||
while _feed_id_batch := list( | ||
itertools.islice(feed_id_generator, query_batch_size) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needed to batch the identifier lookups.
Description
Makes memory use and database query size more predictable for the
opds2_feed_reaper
script by batching identifier lookups and license pool fetching.Motivation and Context
The previous approach could not support very large feeds (somewhere over 150K items) and performed poorly on even smaller feeds than that.
[Jira PP-1756]
How Has This Been Tested?
Checklist