Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown Doc worker when it is not considered as available in Redis #831 #856

Conversation

fflorent
Copy link
Collaborator

Context

In #831, we asked whether we could make a doc worker available again when they can be reached again.

Instead of that, we suggest to just shutdown the worker and assume that another worker would be created instead.

Proposed solution

This PR proposes that the doc worker shutdowns itself in two cases:

  1. the worker is not marked as available anymore in Redis;
  2. the worker fails to join Redis to check its presence more than 3 times;

The check is at most every 30 seconds (30 seconds + the duration of requesting Redis).

@fflorent fflorent marked this pull request as draft February 15, 2024 13:11
@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch from c0429b3 to 6778e82 Compare February 15, 2024 17:41
Comment on lines 320 to 326
if (nbFailures >= 3) {
log.error(
'DocWorkerMap: Presence checker failed 3 times, considering the worker "%s" is not available anymore',
workerId
);
return cb();
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this is useful or whether that's what we would want:

  • The only case I think we could ultimately enter in this condition is because of a failure when calling sismemberAsync();
  • I tested simulating network issues (in my case, I simulated some ECONNREFUSED), no exception has been raised, and I never entered in this condition;
  • I feel like it is not the role of this method to handle connection issues with Redis, I see that errors are already logged above in this class;

I tend to think that we may just simplify everything:

  • not count the number of failure in this recursive setTimeout;
  • just log for each exception;
  • If we want to shutdown the worker because it cannot join the redis server, it's another subject so another issue/PR (that I can take care of);

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I chose to do as I planed above.

app/gen-server/lib/DocWorkerMap.ts Outdated Show resolved Hide resolved
@fflorent fflorent marked this pull request as ready for review February 16, 2024 20:05
test: /\.m?js/,
resolve: {
fullySpecified: false
}
Copy link
Collaborator Author

@fflorent fflorent Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit this is a piece of code I found on the internet to fix issues while building after I upgraded Sinon, and that I don't understand fully how it works neither the potential consequences.

I can try to know more about that next week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to have a log of what those issues were?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here are the logs when I run yarn build:prod and without the above patch. Removing the node_modules/ folder doesn't solve anything.

yarn run v1.22.21
$ buildtools/build.sh
+ tsc --build
+ buildtools/update_type_info.sh app
+ webpack --config buildtools/webpack.config.js --mode production
assets by status 8.11 MiB [cached] 21 assets
orphan modules 1.13 MiB [orphan] 236 modules
runtime modules 20.6 KiB 46 modules
modules by path ./node_modules/ 8.18 MiB
  javascript modules 7.86 MiB 652 modules
  json modules 327 KiB 3 modules
modules by path ./app/ 3.87 MiB 279 modules
modules by path ./test/ 219 KiB 28 modules
modules by path ./stubs/app/ 671 bytes
  modules by path ./stubs/app/client/ui/*.ts 386 bytes 3 modules
  ./stubs/app/common/version.ts 156 bytes [built] [code generated]
  ./stubs/app/client/components/Banners.ts 129 bytes [built] [code generated]
external "jQuery" 42 bytes [built] [code generated]
external "alert" 42 bytes [built] [code generated]
./package.json 6.45 KiB [built] [code generated]

WARNING in ./stubs/app/common/version.ts 2:23-42
Should not import the named export 'version' (imported as 'packageJson') from default-exporting module (only default export is available soon)
 @ ./app/client/ui/AppHeader.ts 5:0-46 40:170-185 40:191-208 40:230-247
 @ ./app/client/ui/errorPages.ts 3:0-52 90:25-34
 @ ./app/client/errorMain.ts 1:0-56 2:0-12

ERROR in ./node_modules/sinon/pkg/sinon-esm.js 5166:15-22
Module not found: Error: Can't resolve 'process/browser' in '/home/florentpro/projects/grist-core/node_modules/sinon/pkg'
Did you mean 'browser.js'?
BREAKING CHANGE: The request 'process/browser' failed to resolve only because it was resolved as fully specified
(probably because the origin is strict EcmaScript Module, e. g. a module with javascript mimetype, a '*.mjs' file, or a '*.js' file where the package.json contains '"type": "module"').
The extension in the request is mandatory for it to be fully specified.
Add the extension to the request.
resolve 'process/browser' in '/home/florentpro/projects/grist-core/node_modules/sinon/pkg'
  Parsed request is a module
  using description file: /home/florentpro/projects/grist-core/node_modules/sinon/package.json (relative path: ./pkg)
    Field 'browser' doesn't contain a valid alias configuration
    resolve as module
      looking for modules in /home/florentpro/projects/grist-core
        /home/florentpro/projects/grist-core/process doesn't exist
      looking for modules in /home/florentpro/projects/grist-core/ext
        /home/florentpro/projects/grist-core/ext/process doesn't exist
      looking for modules in /home/florentpro/projects/grist-core/stubs
        /home/florentpro/projects/grist-core/stubs/process doesn't exist
      looking for modules in /home/florentpro/projects/grist-core/node_modules
        existing directory /home/florentpro/projects/grist-core/node_modules/process
          using description file: /home/florentpro/projects/grist-core/node_modules/process/package.json (relative path: .)
            using description file: /home/florentpro/projects/grist-core/node_modules/process/package.json (relative path: ./browser)
              Field 'browser' doesn't contain a valid alias configuration
              /home/florentpro/projects/grist-core/node_modules/process/browser doesn't exist
 @ ./test/common/InactivityTimer.ts 4:0-31 8:10-19
 @ ./test/client-harness/client.js

webpack 5.73.0 compiled with 1 error and 1 warning in 37312 ms
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have fixed properly the problem here: 5e874f7

Though let's see what the CI tells.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch 4 times, most recently from 8eb4e75 to a68d2f1 Compare February 21, 2024 09:34
@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch 2 times, most recently from 2e86558 to db2e03a Compare February 27, 2024 07:34
@fflorent
Copy link
Collaborator Author

The test failure is not due to the instabilities. I convert this PR into draft until it is fixed.

@fflorent fflorent marked this pull request as draft February 27, 2024 13:23
@fflorent fflorent marked this pull request as ready for review February 27, 2024 14:22
@fflorent
Copy link
Collaborator Author

The failing test has been fixed: 997ee86
Also fixed an issue while running the Smoke tests when the environment is in French: fe3ec42

@paulfitz
Copy link
Member

paulfitz commented Mar 6, 2024

Overall, the logic here feels similar to a container health check. It seems fairly reasonable to expect that a Grist installation with multiple workers is containerized, and probably has health checks running on each Grist instance? In our own deployments we use the /status endpoint as a health check, specifically /status?db=1&redis=1 which checks also for db and redis connectivity. I wonder if it might make sense to add an extra parameter to this endpoint to allow also for making the basic check you have here: if the Grist instance is a doc worker, and that worker has been registered in redis, and that registration is no longer present, then the health check should fail. That means the operator is free to decide how often to run the check, and whether to trust a failure immediately or do retries (though I don't think retries make sense for this particular check).

Not against this change, just proposing an alternative to see what you think @fflorent.

Skimmed the PR, and it looks generally in good shape except it could be worth clearing any pending timeout in the close() method for any IDocWorkerMap implementations so that FlexServers will shut down cleanly every time.

@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch from fe3ec42 to f14a0ad Compare March 12, 2024 10:25
@fflorent
Copy link
Collaborator Author

fflorent commented Mar 12, 2024

@paulfitz Thanks for your remark! Sounds wise and much simpler, indeed. After discussing with ops, that seems to answer the need.

I applied your requested change, I just kept the upgrade of the Sinon dependency, as it was a painful move and it would remove a barrier to use features of newer versions in the future. If that's OK for you?

@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch from f14a0ad to 4d0c602 Compare March 12, 2024 10:41
@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch from 2502232 to 0720b4c Compare March 20, 2024 15:40
@paulfitz
Copy link
Member

@fflorent any thoughts on the LanguageSettings test failure?

@fflorent
Copy link
Collaborator Author

@paulfitz sould be fixed now.

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, looks like I never submitted this small comment :(

app/gen-server/lib/DocWorkerMap.ts Outdated Show resolved Hide resolved
@fflorent fflorent force-pushed the issue-831-unregister-doc-worker-on-redis branch from 8cc5b3d to 6955609 Compare April 3, 2024 21:13
@fflorent fflorent requested a review from paulfitz April 3, 2024 21:14
@fflorent
Copy link
Collaborator Author

fflorent commented Apr 3, 2024

@paulfitz Right, fixed, rebased and pushed!

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fflorent !

@paulfitz paulfitz merged commit 4a9b6fe into gristlabs:main Apr 4, 2024
13 checks passed
@fflorent fflorent deleted the issue-831-unregister-doc-worker-on-redis branch April 4, 2024 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants