Potential index corruption on exit #198

Closed
jasl opened this issue Apr 7, 2023 · 8 comments

Comments

@jasl

jasl commented Apr 7, 2023

I've been encouraging our users (Phala) to switch to ParityDB, and I have received 3 separate reports that after a reboot (a normal exit followed by a restart, I believe) the node is stuck on Is collating: No. According to htop there is no IO and one CPU core is at 100%. Enabling RUST_LOG="DEBUG" gives no extra info.

One reporter tried deleting all index_ files; it seems the indexes were rebuilt, and the node is back to normal.

@arkpar
Member

arkpar commented Apr 7, 2023

What version of the parity-db crate is used in the project? Is there a way to reproduce this?

A reporter just tries to delete all index_ files, it seems those indexes are rebuilt, and the node is back to normal.

This is basically equivalent to deleting the whole database.

@jasl
Author

jasl commented Apr 7, 2023

What version of the parity-db crate is used in the project? Is there a way to reproduce this?

A reporter just tries to delete all index_ files, it seems those indexes are rebuilt, and the node is back to normal.

This is basically equivalent to deleting the whole database.

The latest version, 0.4.6.

TBH, I have tested ParityDB on dozens of machines with no issues, so I don't know how these users ran into the problem.
I only know they're running the node in Docker on Ubuntu 20.04 or 22.04, with ext4.

This is basically equivalent to deleting the whole database.

Those indexes don't seem very big: only ~2G, while the total DB is 47G.

One of the reporters gave me his bad indexes. I'm not sure they will help, but could you examine them?

https://storage.googleapis.com/phala-misc/phala-node-db-index.zip

@jasl
Author

jasl commented Apr 7, 2023

They're using:

jasl123/phala-node-with-launcher:v0.1.23-dev.5
DIGEST:sha256:21e8b239dd12f9287832e135dae1a2262c9dbd83d026e9617d60d80a372b3e9d

The binary is based on Phala-Network/khala-parachain#263

I have already tested ParityDB for at least 6 months on dozens of machines with no issues, so I'm not sure it can be reproduced reliably...

@wowvwow

wowvwow commented Apr 7, 2023

(four screenshots attached)

After some interval, anywhere from tens of minutes to several hours, I reboot the system and the phala-node service starts. When I check the logs, the node is always stuck with no log output. At the same time, a single CPU core sits at 100% and there is no IO.

Then I stopped the phala-node service, used mv to rename the index files, lock file, and metadata file under paritydb to bak, and restarted phala-node. After that everything was fine again. What could have caused this?

This problem occurred this morning, and tonight, after I rebooted the server, the bug reappeared.

@arkpar
Member

arkpar commented Apr 15, 2023

It's hard to say what's going on from looking at screenshots.
You have a few "Address space overflow" errors as a result of deleting index files.
In general, deleting index files won't do you any good, as it simply breaks the internal database structure. If you have a way to reproduce the issue with a clean database, or a copy of the database that demonstrates the issue before any messing around with database files, that would be helpful. Otherwise there's not much we can do.
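
For anyone who wants to preserve such a copy before touching any files, here is a minimal Python sketch. It is not part of parity-db; the function name and the paths in the usage note are hypothetical, and it assumes the node has been stopped first so that the files are not changing during the copy.

```python
import shutil
import time
from pathlib import Path


def snapshot_db(db_dir: Path, backup_root: Path) -> Path:
    """Copy the whole database directory into a timestamped folder.

    Hypothetical helper: stop the node before calling it, and do not
    rename or delete anything inside db_dir beforehand.
    """
    if not db_dir.is_dir():
        raise FileNotFoundError(f"database directory not found: {db_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{db_dir.name}-{stamp}"
    # copytree refuses to write into an existing directory,
    # so an earlier snapshot is never overwritten.
    shutil.copytree(db_dir, target)
    return target
```

For example, snapshot_db(Path("/data/paritydb"), Path("/backup")) would leave the original directory untouched and return the path of the copy, which can then be archived and attached to a bug report. Both paths here are placeholders.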

@jasl
Author

jasl commented Apr 19, 2023

So the fix is paritytech/cumulus#2461.
So it's not a ParityDB issue, but do you know why ParityDB is affected so much? When people switch back to RocksDB, everything seems to work well.

I'll try to backport that and let users retry. If there are no issues, I'll close this issue.

@arkpar
Member

arkpar commented Apr 19, 2023

We (maintainers of parity-db) are not aware of any serious issues. We would appreciate a bug report with logs and samples of broken databases.

This particular issue seems to be caused by some inefficiencies in cumulus.
If this happens again, please get a stack trace for the stuck process with gdb, eu-stack, or a similar tool.
See here:
https://stackoverflow.com/questions/12394935/getting-stacktrace-of-all-threads-without-attaching-gdb
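
As a rough illustration of the gdb route, the sketch below wraps gdb's batch mode in Python to dump a backtrace for every thread of a running process. The helper names are made up for this example; it assumes gdb is installed and that the caller has permission to attach to the process (inside Docker this may require extra privileges).

```python
import subprocess
from typing import List


def gdb_backtrace_cmd(pid: int) -> List[str]:
    """Build a gdb command that attaches to pid, prints all thread
    backtraces, and exits without entering the interactive prompt."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_stacks(pid: int) -> str:
    """Run gdb against the given process and return whatever it printed.

    Hypothetical helper: the process is paused only while gdb is attached.
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero if it cannot attach
    )
    return result.stdout + result.stderr
```

Calling dump_stacks with the PID of the stuck node and saving the returned text gives a trace that can be attached to a report. Taking two or three dumps a few seconds apart makes it easier to see which thread is spinning.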

@jasl
Author

jasl commented Apr 20, 2023

I just backported paritytech/cumulus#2461 and can confirm the node is no longer stuck on boot.

Sorry for wasting your time, and thank you for your patience!

@jasl jasl closed this as completed Apr 20, 2023