Sidecar crashes when using Lighthouse with checkpoint sync enabled #5

LuisNaldo7 opened this issue Nov 23, 2022 · 3 comments

Issue

According to the Beacon API, consensus clients that are not fully synced should return

{
  "data": {
    "head_slot": "1",
    "sync_distance": "3803456",
    "is_syncing": true,
    "is_optimistic": true
  }
}

when calling the "/eth/v1/node/syncing" endpoint.
This is not the case for Lighthouse when it is running with checkpoint sync enabled; instead, the result is

{
   "data":{
      "BackFillSyncing":{
         "completed":353024,
         "remaining":3803456
      }
   }
}

which leads to the sidecar crashing.

syncDistance, _ := strconv.Atoi(data.Data.SyncDistance)

In "eth2.go" the client tries to receive the value from "data.sync_distance" which is not part of the object.

Solution

Even though the error is caused by Lighthouse returning an object that is not defined in the official API, error handling and logging would be highly appreciated.
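
As a minimal sketch (using hypothetical struct and function names, not the sidecar's actual types in "eth2.go"), the parsing could surface the unexpected response shape instead of silently producing 0:

package eth2

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// syncingResponse mirrors the standard /eth/v1/node/syncing payload.
// Field and type names here are assumptions for illustration only.
type syncingResponse struct {
	Data struct {
		HeadSlot     string `json:"head_slot"`
		SyncDistance string `json:"sync_distance"`
		IsSyncing    bool   `json:"is_syncing"`
	} `json:"data"`
}

// parseSyncDistance returns an error instead of silently yielding 0 when
// the response does not match the standard schema (e.g. Lighthouse's
// BackFillSyncing object during checkpoint sync).
func parseSyncDistance(body []byte) (int, error) {
	var resp syncingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("unexpected syncing response: %w", err)
	}
	if resp.Data.SyncDistance == "" {
		return 0, fmt.Errorf("sync_distance missing from response: %s", body)
	}
	syncDistance, err := strconv.Atoi(resp.Data.SyncDistance)
	if err != nil {
		return 0, fmt.Errorf("invalid sync_distance %q: %w", resp.Data.SyncDistance, err)
	}
	return syncDistance, nil
}

The caller could then log the error and report the node as not ready instead of crashing.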


samlaf commented Mar 10, 2023

There's a bug in the fix.

fmt.Println(err)

This is printing <nil> for us, because it prints err even in the happy case (no error).
These are the logs:

Starting server: 0.0.0.0:3000
2023/03/10 13:06:20 GET /eth2/readiness
2023/03/10 13:06:50 GET /eth2/readiness
2023/03/10 13:07:20 GET /eth2/readiness
2023/03/10 13:07:50 GET /eth2/readiness
2023/03/10 13:08:20 GET /eth2/readiness
2023/03/10 13:08:50 GET /eth2/readiness
2023/03/10 13:09:20 GET /eth2/readiness
2023/03/10 13:09:50 GET /eth2/readiness
2023/03/10 13:10:20 GET /eth2/readiness
2023/03/10 13:10:50 GET /eth2/readiness
2023/03/10 13:11:20 GET /eth2/readiness
2023/03/10 13:11:50 GET /eth2/readiness
2023/03/10 13:12:20 GET /eth2/readiness
2023/03/10 13:12:50 GET /eth2/readiness
2023/03/10 13:13:20 GET /eth2/readiness
2023/03/10 13:13:50 GET /eth2/readiness
2023/03/10 13:14:20 GET /eth2/readiness
2023/03/10 13:14:50 GET /eth2/readiness
2023/03/10 13:15:20 GET /eth2/readiness
2023/03/10 13:15:50 GET /eth2/readiness
2023/03/10 13:16:20 GET /eth2/liveness
2023/03/10 13:16:20 GET /eth2/readiness
<nil>
2023/03/10 13:16:50 GET /eth2/liveness
2023/03/10 13:16:50 GET /eth2/readiness
<nil>
2023/03/10 13:17:20 GET /eth2/readiness
2023/03/10 13:17:20 GET /eth2/liveness
<nil>

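A minimal sketch of logging only actual failures, assuming hypothetical handler and helper names rather than the sidecar's real API:

package eth2

import (
	"log"
	"net/http"
)

// getSyncDistance is a hypothetical stand-in for the sidecar's syncing query;
// it returns an error when the beacon node response cannot be parsed.
func getSyncDistance(beaconURL string) (int, error) {
	// ... query /eth/v1/node/syncing and parse the result ...
	return 0, nil
}

// readinessHandler reports unready on errors or while the node is still
// syncing, and only logs when err is non-nil, avoiding the "<nil>" lines.
func readinessHandler(beaconURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		syncDistance, err := getSyncDistance(beaconURL)
		if err != nil {
			log.Printf("readiness check failed: %v", err)
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		if syncDistance > 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
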

samlaf commented Mar 10, 2023

Also, what's the point of the liveness probe?
It seems like all it's doing is restarting the sidecar after 3 failed checks.
Since the sidecar is a stateless container, what's the point of restarting it? Giving 900 more seconds for the client to sync during the initialDelay period?

This is what we're seeing with p2pNodePort turned on. There are lots of restarts, which is puzzling because we thought having inbound connections would actually help with syncing.

NAME                      READY   STATUS    RESTARTS         AGE
pod/lighthouse-goerli-0   1/2     Running   69 (3m9s ago)    18h
pod/lighthouse-goerli-1   2/2     Running   44 (6h42m ago)   18h
pod/lighthouse-goerli-2   1/2     Running   69 (69s ago)     18h

And with the p2pNodePort turned off:

NAME               READY   STATUS    RESTARTS       AGE
pod/debug          1/1     Running   0              12m
pod/lighthouse-0   2/2     Running   39 (11h ago)   21h
pod/lighthouse-1   2/2     Running   35 (11h ago)   21h
pod/lighthouse-2   2/2     Running   80 (15m ago)   21h

antares-sw (Contributor) commented

Hey @LuisNaldo7, as mentioned in many of our charts (e.g. https://github.com/stakewise/helm-charts/blob/main/charts/geth/values.yaml#L367), liveness/readiness probes should be disabled until the node is fully synced. Even if checkpoint sync support is implemented, the liveness probe will still be meaningless: the pod will keep restarting because the node is not synchronized. Since restarting nodes during synchronization is undesirable, it is better to disable these checks completely until the node has synced; after that you can enable them again and everything should work as expected.
