Potential index corruption on exit #198

jasl · 2023-04-07T08:37:38Z

I've been encouraging our users (Phala) to switch to ParityDB, and I've received 3 separate reports that after a reboot (a normal exit followed by a restart, I believe) the node gets stuck at "Is collating: No". According to htop there is no IO while one CPU core sits at 100%, and enabling RUST_LOG="DEBUG" gives no extra information.

One reporter tried deleting all the index_ files; the indexes appear to have been rebuilt, and the node went back to normal.


arkpar · 2023-04-07T08:48:15Z

What version of the parity-db crate is used in the project? Is there a way to reproduce this?

> A reporter just tries to delete all index_ files, it seems those indexes are rebuilt, and the node is back to normal.

This is basically equivalent to deleting the whole database.

jasl · 2023-04-07T09:09:43Z

> What version of the parity-db crate is used in the project? Is there a way to reproduce this?
>
> > A reporter just tries to delete all index_ files, it seems those indexes are rebuilt, and the node is back to normal.
>
> This is basically equivalent to deleting the whole database.

The latest version, 0.4.6.

TBH, I've tested ParityDB on dozens of machines with no issues, so I don't know how these users ran into the problem.
All I know is that they're running the node in Docker on Ubuntu 20.04 or 22.04, with ext4.

> This is basically equivalent to deleting the whole database.

The indexes don't seem very large: only about 2 GB, while the whole DB is 47 GB.

One of the reporters gave me his bad indexes. I'm not sure whether they will help, but could you take a look at them?

https://storage.googleapis.com/phala-misc/phala-node-db-index.zip

jasl · 2023-04-07T09:20:42Z

they're using

jasl123/phala-node-with-launcher:v0.1.23-dev.5
DIGEST:sha256:21e8b239dd12f9287832e135dae1a2262c9dbd83d026e9617d60d80a372b3e9d

The binary is based on Phala-Network/khala-parachain#263

I've been testing ParityDB for at least 6 months on dozens of machines with no issues, so I'm not sure this can be reproduced reliably...

wowvwow · 2023-04-07T14:45:02Z

After the node has been running for a while (tens of minutes up to several hours), I reboot the system. When the phala-node service starts again and I check the logs, it is always stuck with no log output. At the same time, a single CPU core is at 100% and there is no IO.

Then I stopped the phala-node service, renamed (mv) the index files, lock file, and metadata file under paritydb to bak backups, and restarted phala-node; after that everything worked again. What causes this?

The problem occurred this morning, and tonight, after I rebooted the server, the bug reappeared.

arkpar · 2023-04-15T11:33:49Z

It's hard to say what's going on from looking at screenshots.
You have a few "Address space overflow" errors as a result of deleting index files.
In general, deleting index files won't do you any good, as it simply breaks the internal database structure. If you have a way to reproduce the issue with a clean database, or a copy of the database that demonstrates the issue before any messing around with database files, that would be helpful. Otherwise there's not much we can do.
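
For anyone who hits this again, one way to preserve "a copy of the database ... before any messing around" is to snapshot the whole directory while the node is stopped. The following is only a minimal sketch: the function name `snapshot_database`, its arguments, and the example paths are hypothetical and not part of parity-db or the Phala tooling.

```python
import shutil
import time
from pathlib import Path


def snapshot_database(db_dir: str, backup_root: str) -> Path:
    """Copy the whole database directory, untouched, into a timestamped folder.

    Hypothetical helper: run it only while the node is stopped, so the
    copy is not taken in the middle of a write.
    """
    source = Path(db_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")

    # A timestamp in the name keeps snapshots from different incidents apart.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"

    # copytree copies every file (index, lock, metadata, value tables) and
    # refuses to overwrite an existing target directory.
    shutil.copytree(source, target)
    return target
```

Usage would look like `snapshot_database("/data/paritydb", "/backup")` (both paths are placeholders). The copy can then be shared with the maintainers while the original directory stays as it was.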

jasl · 2023-04-19T07:30:39Z

So the fix is paritytech/cumulus#2461.
So it's not a ParityDB issue, but do you know why it affects ParityDB so much? When people switch back to RocksDB, everything seems to work fine.

I'll try to backport that fix and have users retry; if there are no further problems, I'll close this issue.

arkpar · 2023-04-19T08:20:16Z

We (maintainers of parity-db) are not aware of any serious issues. We would appreciate a bug report with logs and samples of broken databases.

This particular issue seems to be caused by some inefficiencies in cumulus.
If this happens again please get a stack trace for the stuck process with gdb, eu-stack or similar tool.
See here:
https://stackoverflow.com/questions/12394935/getting-stacktrace-of-all-threads-without-attaching-gdb
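
As a rough illustration of that request, the sketch below wraps the two tools named above so a reporter can capture stack traces of all threads in one step. The helper names (`gdb_backtrace_command`, `eu_stack_command`, `dump_stacks`) are hypothetical. It assumes gdb or eu-stack is installed and that the caller is allowed to attach to the process, which usually means root, and inside Docker may also require the ptrace capability.

```python
import shutil
import subprocess


def gdb_backtrace_command(pid: int) -> list[str]:
    """Build a non-interactive gdb invocation that prints every thread's backtrace."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def eu_stack_command(pid: int) -> list[str]:
    """Build the equivalent eu-stack (elfutils) invocation."""
    return ["eu-stack", "-p", str(pid)]


def dump_stacks(pid: int) -> str:
    """Return stack traces for all threads of `pid`.

    Tries gdb first and falls back to eu-stack, using whichever is on PATH.
    """
    for command in (gdb_backtrace_command(pid), eu_stack_command(pid)):
        if shutil.which(command[0]) is not None:
            result = subprocess.run(
                command, capture_output=True, text=True, check=False
            )
            # Keep stderr too: permission errors from attaching show up there.
            return result.stdout + result.stderr
    raise RuntimeError("neither gdb nor eu-stack is installed")
```

With the PID of the stuck node (for example from `pgrep`), `print(dump_stacks(pid))` produces text that can be pasted into the bug report.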

jasl · 2023-04-20T18:46:37Z

I just backported paritytech/cumulus#2461 and can confirm the node no longer gets stuck on boot.

Sorry for wasting your time, and thank you for your patience!

jasl mentioned this issue Apr 9, 2023, in paritytech/substrate#13864 (closed): Externalities not allowed to fail within runtime: "Trie lookup error: Database missing expected key

jasl closed this as completed Apr 20, 2023
