Sidecar crashes when using Lighthouse with checkpoint sync enabled #5

LuisNaldo7 opened this issue Nov 23, 2022 · 3 comments

Issue

According to the Beacon API, consensus clients that are not fully synced should return

{
  "data": {
    "head_slot": "1",
    "sync_distance": "3803456",
    "is_syncing": true,
    "is_optimistic": true
  }
}

when calling the "/eth/v1/node/syncing" endpoint.
This is not the case for Lighthouse when it is running with checkpoint sync enabled; instead, the result is

{
   "data":{
      "BackFillSyncing":{
         "completed":353024,
         "remaining":3803456
      }
   }
}

which leads to the sidecar crashing.

syncDistance, _ := strconv.Atoi(data.Data.SyncDistance)

In "eth2.go" the client tries to receive the value from "data.sync_distance" which is not part of the object.

Solution

Even though the error is caused by Lighthouse returning an object that is not defined in the official API, error handling and logging would be highly appreciated.
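
As a minimal sketch (using hypothetical struct and function names, not the sidecar's actual types in "eth2.go"), the parsing could surface the unexpected response shape instead of silently producing 0:

package eth2

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// syncingResponse mirrors the standard /eth/v1/node/syncing payload.
// Field and type names here are assumptions for illustration only.
type syncingResponse struct {
	Data struct {
		HeadSlot     string `json:"head_slot"`
		SyncDistance string `json:"sync_distance"`
		IsSyncing    bool   `json:"is_syncing"`
	} `json:"data"`
}

// parseSyncDistance returns an error instead of silently yielding 0 when
// the response does not match the standard schema (e.g. Lighthouse's
// BackFillSyncing object during checkpoint sync).
func parseSyncDistance(body []byte) (int, error) {
	var resp syncingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("unexpected syncing response: %w", err)
	}
	if resp.Data.SyncDistance == "" {
		return 0, fmt.Errorf("sync_distance missing from response: %s", body)
	}
	syncDistance, err := strconv.Atoi(resp.Data.SyncDistance)
	if err != nil {
		return 0, fmt.Errorf("invalid sync_distance %q: %w", resp.Data.SyncDistance, err)
	}
	return syncDistance, nil
}

The caller could then log the error and report the node as not ready instead of crashing.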


samlaf commented Mar 10, 2023

There's a bug in the fix.

fmt.Println(err)

This is printing <nil> for us, because it prints err even in the happy case (no error).
These are the logs:

Starting server: 0.0.0.0:3000
2023/03/10 13:06:20 GET /eth2/readiness
2023/03/10 13:06:50 GET /eth2/readiness
2023/03/10 13:07:20 GET /eth2/readiness
2023/03/10 13:07:50 GET /eth2/readiness
2023/03/10 13:08:20 GET /eth2/readiness
2023/03/10 13:08:50 GET /eth2/readiness
2023/03/10 13:09:20 GET /eth2/readiness
2023/03/10 13:09:50 GET /eth2/readiness
2023/03/10 13:10:20 GET /eth2/readiness
2023/03/10 13:10:50 GET /eth2/readiness
2023/03/10 13:11:20 GET /eth2/readiness
2023/03/10 13:11:50 GET /eth2/readiness
2023/03/10 13:12:20 GET /eth2/readiness
2023/03/10 13:12:50 GET /eth2/readiness
2023/03/10 13:13:20 GET /eth2/readiness
2023/03/10 13:13:50 GET /eth2/readiness
2023/03/10 13:14:20 GET /eth2/readiness
2023/03/10 13:14:50 GET /eth2/readiness
2023/03/10 13:15:20 GET /eth2/readiness
2023/03/10 13:15:50 GET /eth2/readiness
2023/03/10 13:16:20 GET /eth2/liveness
2023/03/10 13:16:20 GET /eth2/readiness
<nil>
2023/03/10 13:16:50 GET /eth2/liveness
2023/03/10 13:16:50 GET /eth2/readiness
<nil>
2023/03/10 13:17:20 GET /eth2/readiness
2023/03/10 13:17:20 GET /eth2/liveness
<nil>

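A minimal sketch of logging only actual failures, assuming hypothetical handler and helper names rather than the sidecar's real API:

package eth2

import (
	"log"
	"net/http"
)

// getSyncDistance is a hypothetical stand-in for the sidecar's syncing query;
// it returns an error when the beacon node response cannot be parsed.
func getSyncDistance(beaconURL string) (int, error) {
	// ... query /eth/v1/node/syncing and parse the result ...
	return 0, nil
}

// readinessHandler reports unready on errors or while the node is still
// syncing, and only logs when err is non-nil, avoiding the "<nil>" lines.
func readinessHandler(beaconURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		syncDistance, err := getSyncDistance(beaconURL)
		if err != nil {
			log.Printf("readiness check failed: %v", err)
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		if syncDistance > 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
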

samlaf commented Mar 10, 2023

Also, what's the point of the liveness probe?
It seems like all it's doing is restarting the sidecar after 3 failed checks.
Since the sidecar is a stateless container, what's the point of restarting it? Giving 900 more seconds for the client to sync during the initialDelay period?

This is what we're seeing with p2pNodePort turned on. There are lots of restarts, which is puzzling because we thought having inbound connections would actually help with syncing.

NAME                      READY   STATUS    RESTARTS         AGE
pod/lighthouse-goerli-0   1/2     Running   69 (3m9s ago)    18h
pod/lighthouse-goerli-1   2/2     Running   44 (6h42m ago)   18h
pod/lighthouse-goerli-2   1/2     Running   69 (69s ago)     18h

And with the p2pNodePort turned off:

NAME               READY   STATUS    RESTARTS       AGE
pod/debug          1/1     Running   0              12m
pod/lighthouse-0   2/2     Running   39 (11h ago)   21h
pod/lighthouse-1   2/2     Running   35 (11h ago)   21h
pod/lighthouse-2   2/2     Running   80 (15m ago)   21h

antares-sw (Contributor) commented

Hey @LuisNaldo7, as mentioned in many of our charts (e.g. https://github.com/stakewise/helm-charts/blob/main/charts/geth/values.yaml#L367), liveness/readiness probes should be disabled until the node is fully synced. Even if checkpoint sync support is implemented, the liveness probe will still be meaningless: the pod will keep restarting because the node is not synchronized. Since restarting nodes during synchronization is undesirable, it is better to disable these checks completely until the node has synced; after that you can enable them again and everything should work as expected.
