Restarting pod causes "panic: assertion failed: Page expected to be: 4190, but self identifies as 0" #28626

Open
WoodyWoodsta opened this issue Oct 8, 2024 · 0 comments

WoodyWoodsta commented Oct 8, 2024

Describe the bug
I have a 3-node Vault cluster using Raft storage, running in Kubernetes. If I restart one of the pods, it fails immediately and continuously with the following error:

panic: assertion failed: Page expected to be: 4190, but self identifies as 0

goroutine 1 [running]:
github.com/hashicorp-forge/bbolt._assert(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:1387
github.com/hashicorp-forge/bbolt.(*page).fastCheck(0x79acc0b2e000, 0x105e)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/page.go:57 +0x1d9
github.com/hashicorp-forge/bbolt.(*Tx).page(0x79acbfc2a000?, 0x88b5d80?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:534 +0x7b
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x4, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:546 +0x5d
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x3, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x2, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x1, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPage(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx.go:542
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00339bf00, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:83 +0x114
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket.func2({0x79acbfb54140?, 0xc0033a25a0?, 0xc003381108?})
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:110 +0x90
github.com/hashicorp-forge/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc0037fe498)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/bucket.go:403 +0x96
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00389a018, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/tx_check.go:108 +0x255
github.com/hashicorp-forge/bbolt.(*DB).freepages(0xc003392908)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:1205 +0x225
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist.func1()
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:417 +0xc5
sync.(*Once).doSlow(0x1dea4e0?, 0xc003392ad0?)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:65
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist(0xc003392908?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:413 +0x45
github.com/hashicorp-forge/bbolt.Open({0xc0033ba3a8, 0x14}, 0x180, 0xc0033ee9c0)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/bbolt@v1.3.8-hc3/db.go:295 +0x430
github.com/hashicorp/vault/physical/raft.(*FSM).openDBFile(0xc0033ea500, {0xc0033ba3a8, 0x14})
/home/runner/work/vault/vault/physical/raft/fsm.go:264 +0x266
github.com/hashicorp/vault/physical/raft.NewFSM({0xc003380241, 0xb}, {0xc0000ba133, 0x7}, {0xd096f48, 0xc003398c90})
/home/runner/work/vault/vault/physical/raft/fsm.go:218 +0x433
github.com/hashicorp/vault/physical/raft.NewRaftBackend(0xc003398a80, {0xd096f48, 0xc003398c60})
/home/runner/work/vault/vault/physical/raft/raft.go:439 +0xed
github.com/hashicorp/vault/command.(*ServerCommand).setupStorage(0xc003392008, 0xc0033da008)
/home/runner/work/vault/vault/command/server.go:811 +0x319
github.com/hashicorp/vault/command.(*ServerCommand).Run(0xc003392008, {0xc0000b4860, 0x1, 0x1})
/home/runner/work/vault/vault/command/server.go:1188 +0x10e6
github.com/hashicorp/cli.(*CLI).Run(0xc003814f00)
/home/runner/go/pkg/mod/github.com/hashicorp/cli@v1.1.6/cli.go:265 +0x5b8
github.com/hashicorp/vault/command.RunCustom({0xc0000b4850?, 0x2?, 0x2?}, 0xc0000061c0?)
/home/runner/work/vault/vault/command/main.go:243 +0x9a6
github.com/hashicorp/vault/command.Run(...)
/home/runner/work/vault/vault/command/main.go:147
main.main()
/home/runner/work/vault/vault/main.go:13 +0x47

The only way to recover right now is to completely remove the pod's persistent volume and restart it. This means it's impossible to update the Vault cluster without doing a full restore.
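
For anyone wanting to confirm that the file itself is what bbolt objects to (rather than anything Vault does at runtime), here is a minimal sketch that opens a copy of the bolt file (vault.db under the configured raft path) with the upstream go.etcd.io/bbolt library and runs its consistency check. This assumes the upstream library behaves like the hashicorp-forge fork in the trace above; the copy path is hypothetical.

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open a copy of the file read-only so the live data is never touched.
	db, err := bolt.Open("/tmp/vault.db.copy", 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err) // on a file like mine this may already hit the same assertion
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		// tx.Check walks every bucket and page and reports inconsistencies.
		for cerr := range tx.Check() {
			fmt.Println(cerr)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}

On a corrupted file this either prints a list of inconsistencies or hits the same "self identifies as 0" assertion, which at least rules out Vault's runtime behaviour. I believe the upstream bbolt CLI (bbolt check <file>) does the same thing.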

To Reproduce
Steps to reproduce the behavior:

  1. Run a 3-node cluster
  2. Restart one of the nodes

Expected behavior
Vault is able to recover from restarts.

Environment:

  • Vault Server Version (retrieve with vault status): 1.17.5
  • Vault CLI Version (retrieve with vault version): Vault v1.17.5 (4d0c53e), built 2024-08-30T15:54:57Z
  • Server Operating System/Architecture: Kubernetes, bare metal

Vault server configuration file(s):

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  # Enable unauthenticated metrics access (necessary for Prometheus Operator)
  telemetry {
    unauthenticated_metrics_access = "true"
  }
}

storage "raft" {
  path = "/vault/data"
  raft_wal = "true"
  raft_log_verifier_enabled = "true"

  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

service_registration "kubernetes" {}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

Additional context

  • I understand the panic comes from boltdb, but I'm wondering if Vault is doing anything specific that causes this issue (see the sketch after this list for what the failing assertion checks).
  • I attempted to switch to raft_wal to work around the issue, but it looks as though boltdb is still used; the issue occurs regardless of the raft_wal setting.
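
To make the assertion concrete: bbolt stores the page id in the first 8 bytes of every page header, and fastCheck asserts that the id stored on disk matches the id the caller expected when following a reference to that page. "Expected to be: 4190, but self identifies as 0" therefore means page 4190 reads back with an id of 0. Below is a rough, standalone sketch (not Vault or bbolt code; it assumes the default 4096-byte page size, a little-endian machine, and vault.db under my configured raft path) that dumps the raw header of that page:

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	const (
		pageSize = 4096 // assumed default bbolt page size
		pageID   = 4190 // page id from the panic message
	)

	f, err := os.Open("/vault/data/vault.db") // vault.db under the raft `path` setting
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Each bbolt page starts with a 16-byte header:
	// id (uint64), flags (uint16), count (uint16), overflow (uint32).
	hdr := make([]byte, 16)
	if _, err := f.ReadAt(hdr, int64(pageID)*pageSize); err != nil {
		panic(err)
	}

	storedID := binary.LittleEndian.Uint64(hdr[0:8]) // assumes a little-endian host wrote the file
	flags := binary.LittleEndian.Uint16(hdr[8:10])
	fmt.Printf("expected page id %d, on-disk header says id=%d flags=0x%x\n",
		pageID, storedID, flags)
	// id=0 and flags=0 here would mean the page is (at least partly) zero-filled,
	// even though another page still references it.
}

If the header really is all zeroes, that would suggest a zeroed or torn write on the underlying volume rather than a logical error inside Vault, though I can't rule either out from my side.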
@heatherezell added the k8s, bug (Used to indicate a potential bug), and storage/raft labels on Oct 8, 2024