#9211 Retry on failed to read large indexeddb value #9229

Merged: 13 commits merged into main on Oct 4, 2024

Conversation

@grahamlangford (Collaborator) commented Oct 3, 2024

What does this PR do?

Remaining work

  • Identify the IDB calls we want to enable retries for
  • Add retries (and onRetry where it makes sense) for each

For more information on our expectations for the PR process, see the
code review principles doc

@grahamlangford grahamlangford marked this pull request as draft October 3, 2024 18:02
@grahamlangford grahamlangford self-assigned this Oct 3, 2024
@grahamlangford grahamlangford added this to the 2.1.3 milestone Oct 3, 2024

codecov bot commented Oct 3, 2024

Codecov Report

Attention: Patch coverage is 69.09091% with 51 lines in your changes missing coverage. Please review.

Project coverage is 74.82%. Comparing base (8318d74) to head (45968ea).
Report is 335 commits behind head on main.

Files with missing lines Patch % Lines
src/telemetry/trace.ts 33.33% 24 Missing ⚠️
src/telemetry/logging.ts 75.00% 17 Missing ⚠️
src/registry/packageRegistry.ts 80.00% 5 Missing ⚠️
src/background/telemetry.ts 71.42% 4 Missing ⚠️
src/utils/idbUtils.ts 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9229      +/-   ##
==========================================
+ Coverage   74.24%   74.82%   +0.58%     
==========================================
  Files        1332     1367      +35     
  Lines       40817    42232    +1415     
  Branches     7634     7897     +263     
==========================================
+ Hits        30306    31602    +1296     
- Misses      10511    10630     +119     

☔ View full report in Codecov by Sentry.

{
  retries: retry ? MAX_RETRIES : 0,
  shouldRetry: (error) =>
    isIDBConnectionError(error) || isIDBLargeValueError(error),
@twschiller (Contributor) commented Oct 3, 2024

It'd be good to document here for future context why we think "Large value error" is recoverable. Are you suspecting it might be a race condition with when the blob is written to disk?

Collaborator:

We can link this chromium discussion where devs have reported retries working for them in the case where this error is thrown:
https://issues.chromium.org/issues/342779913#comment8

@grahamlangford (Collaborator, Author) commented Oct 3, 2024

This needs to be passed in. It's not necessarily recoverable without some kind of onRetry, though we're going to attempt it in a couple of areas as a test. We also want to be flexible enough to change the retry conditions to cover other errors in the future.

Contributor:

> We can link this chromium discussion where devs have reported retries working for them in the case where this error is thrown:

Agree that's extremely helpful for explaining the "why" behind the code
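
For future context, a minimal sketch of the kind of inline documentation being discussed here. The option shape (retries/shouldRetry) comes from the diff above; the import path and the local constant are assumptions.

// Sketch only; assumes the error predicates live in idbUtils.
import { isIDBConnectionError, isIDBLargeValueError } from "@/utils/idbUtils";

const MAX_RETRIES = 3;

const retryOptions = {
  retries: MAX_RETRIES,
  // Connection errors are transient by nature. The large-value error has been
  // reported as recoverable on retry by other developers:
  // https://issues.chromium.org/issues/342779913#comment8
  shouldRetry: (error: unknown) =>
    isIDBConnectionError(error) || isIDBLargeValueError(error),
};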


github-actions bot commented Oct 3, 2024

Playwright test results

130 passed, 4 flaky, 4 skipped

Details

Report: Open report ↗︎
Stats: 138 tests across 45 suites
Duration: 13 minutes, 23 seconds
Commit: 45968ea
Info: For more information on how to debug and view this report, see our readme

Flaky tests

chrome › tests/modLifecycle.spec.ts › create, run, package, and update mod
chrome › tests/pageEditor/addStarterBrick.spec.ts › Add starter brick to mod
msedge › tests/pageEditor/addStarterBrick.spec.ts › Add starter brick to mod
chrome › tests/pageEditor/copyMod.spec.ts › run a copied mod with a built-in integration

Skipped tests

chrome › tests/regressions/doNotCloseSidebarOnPageEditorSave.spec.ts › #8104: Do not automatically close the sidebar when saving in the Page Editor
msedge › tests/regressions/doNotCloseSidebarOnPageEditorSave.spec.ts › #8104: Do not automatically close the sidebar when saving in the Page Editor
chrome › tests/runtime/googleSheetsIntegration.spec.ts › can activate a google spreadsheet mod with config options
msedge › tests/runtime/googleSheetsIntegration.spec.ts › can activate a google spreadsheet mod with config options

@@ -98,6 +100,11 @@ export function isIDBQuotaError(error: unknown): boolean {
return QUOTA_ERRORS.some((quotaError) => message.includes(quotaError));
}

export function isIDBLargeValueError(error: unknown): boolean {
@twschiller (Contributor) commented Oct 3, 2024

It would be good to link to the Chrome Roadmap context on what this error means: https://chromestatus.com/feature/5140210640486400

I.e., that it could be a large value, or a missing blob. Also, this check might break in Chrome 130 if the error messages change?

Chrome commits corresponding to the roadmap might have the text of the new messages

Contributor:

As per comment here: https://github.com/pixiebrix/pixiebrix-extension/pull/9229/files#r1786666504. Which of those 2 underlying causes do we think might be recoverable?

Collaborator Author:

> I.e., that it could be a large value, or a missing blob. Also, this check might break in Chrome 130 if the error messages change?

I tested on Chrome 131 and it works, so it should be fine on Chrome 130.

Collaborator Author:

The only change to the error is that it can be a NotFoundError now, but it still has the same error.message.

Contributor:

Thanks for investigating. Nit: I think it'd be valuable to link to that Chrome Roadmap entry that explains what that error means in Chrome.

Collaborator Author:

The log in the latest version already has a relevant link
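
For readers following along, a rough sketch of a message-based check. The exact strings the PR matches on are not visible in this excerpt; the one below is inferred from the PR title, and the Chrome roadmap entry (https://chromestatus.com/feature/5140210640486400) explains when the error is raised.

// Hypothetical sketch, not the PR's exact implementation.
function isIDBLargeValueErrorSketch(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  // Chrome raises this for a genuinely large value or for a missing blob file,
  // and newer versions may surface it as a NotFoundError with the same message.
  return message.includes("Failed to read large IndexedDB value");
}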

},
{
  operationName: IDB_OPERATION[DATABASE_NAME.PACKAGE_REGISTRY].GET_BY_KINDS,
  // The large value error is fixable by re-syncing the packages
Collaborator:

Nit: this comment is not exactly true, since the large value error could currently be one of several issues, some of which are retryable and some of which are not.


github-actions bot commented Oct 3, 2024

No loom links were found in the first post. Please add one there if you'd like it to appear on Slack.


shouldRetry: (error) =>
  isIDBConnectionError(error) || isIDBLargeValueError(error),
async onRetry(error) {
  if (isIDBLargeValueError(error)) {
@twschiller (Contributor) commented Oct 3, 2024

I think I'm missing the explanation of why getByKinds is the only operation that re-runs syncPackages in this PR. (E.g., why not the "find" operation). Is it because it's a bulk operation that needs to read multiple bricks (i.e., we think the find method is less risky because it would only fail on specific bricks?)

Is that in Slack/Notion somewhere?

Collaborator Author:

We were able to trigger a getByKinds error locally and the potential impact is higher. We could implement it in a lot more places, but we decided to keep the change limited until we had the additional logging information from this release.

If you think it's valuable to implement it more widely, I can definitely see to that

@twschiller (Contributor) commented Oct 4, 2024

> If you think it's valuable to implement it more widely, I can definitely see to that

Up to you/Eduardo, I don't have a strong opinion. Just wanted to ensure we had the context captured
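
To make the call-site wiring concrete, a sketch under stated assumptions: withRetry, readPackagesByKinds, and kinds are placeholder names (the real helper names are not visible in this excerpt), while operationName, shouldRetry, onRetry, and syncPackages come from the snippets and discussion above.

// Hypothetical call site for the package registry's getByKinds operation.
const packages = await withRetry(async () => readPackagesByKinds(kinds), {
  operationName: IDB_OPERATION[DATABASE_NAME.PACKAGE_REGISTRY].GET_BY_KINDS,
  shouldRetry: (error) =>
    isIDBConnectionError(error) || isIDBLargeValueError(error),
  async onRetry(error) {
    // getByKinds is a bulk read, so one unreadable record can fail the whole
    // operation; re-syncing packages before the next attempt gives the retry a
    // chance to succeed.
    if (isIDBLargeValueError(error)) {
      await syncPackages();
    }
  },
});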

message: `${operationName} failed for IDB database ${databaseName}`,
});

if (isIDBLargeValueError(error)) {
@twschiller (Contributor) commented Oct 3, 2024

Adding a "why" here would be helpful. I.e., that 1/ an error for a single value can break bulk operations on the whole DB, and 2/ we don't know of a way to drop the single bad record even if we know which one it is

Collaborator:

Yes, adding a comment here would be useful to explain, along with a log statement
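
A sketch of how this branch might read with the requested comment and log statement; the recovery step reflects the later commit "clear db on large value error", and deleteDatabase is a placeholder for whatever clearing mechanism the PR actually uses.

if (isIDBLargeValueError(error)) {
  // A single unreadable value can break bulk operations against the whole
  // database, and there is no known way to drop just the bad record, so clear
  // the database and let it be repopulated on the next sync.
  console.warn(
    `Clearing IDB database ${databaseName} after a large-value read error`,
    { operationName, error },
  );
  await deleteDatabase(databaseName); // placeholder for the actual clearing step
}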


// IDB Connection Error message strings
const CONNECTION_ERRORS = ["Error Opening IndexedDB"] as const;
const MAX_RETRIES = 3;
Contributor:

Any context on how we picked 3 here?

Collaborator Author:

No context; it just seemed like a sane default. We typically use 3 retries for sandbox messaging, for example.

operationName: IDB_OPERATION[DATABASE_NAME.PACKAGE_REGISTRY].GET_BY_KINDS,
// The large value error is fixable by re-syncing the packages
shouldRetry: (error) =>
  isIDBConnectionError(error) || isIDBLargeValueError(error),
Contributor:

Given that the isIDBConnectionError(error) || isIDBLargeValueError(error) check is repeated, you might consider introducing a helper isMaybeTemporaryIDBError or a similar name to more directly reflect the meaning
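
The suggested helper might look roughly like this (whether it was ultimately adopted is not shown in this conversation):

// Hypothetical helper consolidating the repeated retryability check.
export function isMaybeTemporaryIDBError(error: unknown): boolean {
  return isIDBConnectionError(error) || isIDBLargeValueError(error);
}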

@grahamlangford grahamlangford merged commit 3ba3f57 into main Oct 4, 2024
26 checks passed
@grahamlangford grahamlangford deleted the retry-on-failed-to-read-large-indexeddb-value branch October 4, 2024 14:20
grahamlangford added a commit that referenced this pull request Oct 8, 2024
* make operationName part of a config object

* implement first retry

* fix errors from type change

* update error messaging; add handling

* move shouldRetry to the callsites

* document isIDBLargeValueError

* improve comment

* handle retry in test

* remove retry flag and depend on shouldRetry instead

* clear db on large value error

* adds requested comments

---------

Co-authored-by: Eduardo <eduardo@pixiebrix.com>
Labels: bug (Something isn't working)
3 participants