Restarting pod causes "panic: assertion failed: Page expected to be: 4190, but self identifies as 0" #28626

Open
WoodyWoodsta opened this issue Oct 8, 2024 · 0 comments

WoodyWoodsta commented Oct 8, 2024

Describe the bug
I have a 3-node Vault cluster using Raft storage, running in Kubernetes. If I restart one of the pods, it fails immediately and continuously with the following error:

panic: assertion failed: Page expected to be: 4190, but self identifies as 0

goroutine 1 [running]:
github.com/hashicorp-forge/bbolt._assert(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:1387
github.com/hashicorp-forge/bbolt.(*page).fastCheck(0x79acc0b2e000, 0x105e)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/page.go:57 +0x1d9
github.com/hashicorp-forge/bbolt.(*Tx).page(0x79acbfc2a000?, 0x88b5d80?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:534 +0x7b
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x4, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:546 +0x5d
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x3, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x2, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x1, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPage(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:542
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00339bf00, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:83 +0x114
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket.func2({0x79acbfb54140?, 0xc0033a25a0?, 0xc003381108?})
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:110 +0x90
github.com/hashicorp-forge/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc0037fe498)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/bucket.go:403 +0x96
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00389a018, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:108 +0x255
github.com/hashicorp-forge/bbolt.(*DB).freepages(0xc003392908)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:1205 +0x225
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist.func1()
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:417 +0xc5
sync.(*Once).doSlow(0x1dea4e0?, 0xc003392ad0?)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:65
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist(0xc003392908?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:413 +0x45
github.com/hashicorp-forge/bbolt.Open({0xc0033ba3a8, 0x14}, 0x180, 0xc0033ee9c0)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:295 +0x430
github.com/hashicorp/vault/physical/raft.(*FSM).openDBFile(0xc0033ea500, {0xc0033ba3a8, 0x14})
/home/runner/work/vault/vault/physical/raft/fsm.go:264 +0x266
github.com/hashicorp/vault/physical/raft.NewFSM({0xc003380241, 0xb}, {0xc0000ba133, 0x7}, {0xd096f48, 0xc003398c90})
/home/runner/work/vault/vault/physical/raft/fsm.go:218 +0x433
github.com/hashicorp/vault/physical/raft.NewRaftBackend(0xc003398a80, {0xd096f48, 0xc003398c60})
/home/runner/work/vault/vault/physical/raft/raft.go:439 +0xed
github.com/hashicorp/vault/command.(*ServerCommand).setupStorage(0xc003392008, 0xc0033da008)
/home/runner/work/vault/vault/command/server.go:811 +0x319
github.com/hashicorp/vault/command.(*ServerCommand).Run(0xc003392008, {0xc0000b4860, 0x1, 0x1})
/home/runner/work/vault/vault/command/server.go:1188 +0x10e6
github.com/hashicorp/cli.(*CLI).Run(0xc003814f00)
/home/runner/go/pkg/mod/github.com/hashicorp/cli@v1.1.6/cli.go:265 +0x5b8
github.com/hashicorp/vault/command.RunCustom({0xc0000b4850?, 0x2?, 0x2?}, 0xc0000061c0?)
/home/runner/work/vault/vault/command/main.go:243 +0x9a6
github.com/hashicorp/vault/command.Run(...)
/home/runner/work/vault/vault/command/main.go:147
main.main()
/home/runner/work/vault/vault/main.go:13 +0x47

The only way to recover right now is to completely remove the pod's persistent volume and restart it. This means it's impossible to update the Vault cluster without doing a full restore.
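
For anyone wanting to confirm that the file itself is what bbolt objects to (rather than anything Vault does at runtime), here is a minimal sketch that opens a copy of the bolt file (vault.db under the configured raft path) with the upstream go.etcd.io/bbolt library and runs its consistency check. This assumes the upstream library behaves like the hashicorp-forge fork in the trace above; the copy path is hypothetical.

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open a copy of the file read-only so the live data is never touched.
	db, err := bolt.Open("/tmp/vault.db.copy", 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err) // on a file like mine this may already hit the same assertion
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		// tx.Check walks every bucket and page and reports inconsistencies.
		for cerr := range tx.Check() {
			fmt.Println(cerr)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}

On a corrupted file this either prints a list of inconsistencies or hits the same "self identifies as 0" assertion, which at least rules out Vault's runtime behaviour. I believe the upstream bbolt CLI (bbolt check <file>) does the same thing.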

To Reproduce
Steps to reproduce the behavior:

  1. Run a 3-node cluster
  2. Restart one of the nodes

Expected behavior
Vault is able to recover from restarts.

Environment:

  • Vault Server Version (retrieve with vault status): 1.17.5
  • Vault CLI Version (retrieve with vault version): Vault v1.17.5 (4d0c53e), built 2024-08-30T15:54:57Z
  • Server Operating System/Architecture: Kubernetes, bare metal

Vault server configuration file(s):

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  # Enable unauthenticated metrics access (necessary for Prometheus Operator)
  telemetry {
    unauthenticated_metrics_access = "true"
  }
}

storage "raft" {
  path = "/vault/data"
  raft_wal = "true"
  raft_log_verifier_enabled = "true"

  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

service_registration "kubernetes" {}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

Additional context

  • I understand the panic comes from boltdb, but I'm wondering if Vault is doing anything specific that causes this issue (see the sketch after this list for what the failing assertion checks).
  • I attempted to switch to raft_wal to work around the issue, but it looks as though boltdb is still used; the issue occurs regardless of the raft_wal setting.
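
To make the assertion concrete: bbolt stores the page id in the first 8 bytes of every page header, and fastCheck asserts that the id stored on disk matches the id the caller expected when following a reference to that page. "Expected to be: 4190, but self identifies as 0" therefore means page 4190 reads back with an id of 0. Below is a rough, standalone sketch (not Vault or bbolt code; it assumes the default 4096-byte page size, a little-endian machine, and vault.db under my configured raft path) that dumps the raw header of that page:

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	const (
		pageSize = 4096 // assumed default bbolt page size
		pageID   = 4190 // page id from the panic message
	)

	f, err := os.Open("/vault/data/vault.db") // vault.db under the raft `path` setting
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Each bbolt page starts with a 16-byte header:
	// id (uint64), flags (uint16), count (uint16), overflow (uint32).
	hdr := make([]byte, 16)
	if _, err := f.ReadAt(hdr, int64(pageID)*pageSize); err != nil {
		panic(err)
	}

	storedID := binary.LittleEndian.Uint64(hdr[0:8]) // assumes a little-endian host wrote the file
	flags := binary.LittleEndian.Uint16(hdr[8:10])
	fmt.Printf("expected page id %d, on-disk header says id=%d flags=0x%x\n",
		pageID, storedID, flags)
	// id=0 and flags=0 here would mean the page is (at least partly) zero-filled,
	// even though another page still references it.
}

If the header really is all zeroes, that would suggest a zeroed or torn write on the underlying volume rather than a logical error inside Vault, though I can't rule either out from my side.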
@heatherezell added the k8s, bug (Used to indicate a potential bug), and storage/raft labels on Oct 8, 2024