Proposal for a CacheKV v2 store #14990
An important need motivating this cache store refactoring (#14444) is removing the overhead of nesting cache stores, which is pretty significant currently.
This is useful for existing use cases, since we already have 2-3 layers in the SDK, and it'll be even more important for EVM integration.
My impression for how to mitigate nesting overhead:
The downside of this is that closing a nested store must copy all its elements into the parent, which is a real cost. (But that is already the case today, and every other overhead becomes much lower than it is now.) There are strategies for not requiring a copy on close, but they come with more trade-offs around concurrency guarantees, and I personally think maintaining optionality there is worth it.
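For concreteness, here's a rough sketch of what that copy-on-close looks like; all names here are made up for illustration, not the actual SDK code:

```go
// Illustrative sketch only: a nested write buffer that, on Write, copies
// its dirty entries into the parent in sorted order.
package writebuffer

import "sort"

type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
	Delete(key []byte)
}

type Buffer struct {
	parent KVStore
	dirty  map[string][]byte // nil value marks a delete
}

func New(parent KVStore) *Buffer {
	return &Buffer{parent: parent, dirty: make(map[string][]byte)}
}

func (b *Buffer) Set(key, value []byte) { b.dirty[string(key)] = value }
func (b *Buffer) Delete(key []byte)     { b.dirty[string(key)] = nil }

func (b *Buffer) Get(key []byte) []byte {
	if v, ok := b.dirty[string(key)]; ok {
		return v
	}
	return b.parent.Get(key)
}

// Write flushes the buffer into the parent. This is the copy-on-close cost
// referred to above: keys are sorted so writes hit the parent
// deterministically, then every entry is copied up one level.
func (b *Buffer) Write() {
	keys := make([]string, 0, len(b.dirty))
	for k := range b.dirty {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if v := b.dirty[k]; v == nil {
			b.parent.Delete([]byte(k))
		} else {
			b.parent.Set([]byte(k), v)
		}
	}
	b.dirty = make(map[string][]byte)
}
```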
What do you think about the copy-on-write btree solution?
Feels like it should be a type of write buffer. It's an important type of write buffer, and a nesting strategy for write buffers! It solves a lot of things well :) It doesn't solve the problem of a lot of this having no abstraction, and thus no iterability. I claim there are a lot of ways our caching model should vary, and we have several concurrency designs that should be explored, so we need to move away from being stuck with one "thing" as our caching strategy.
I hope we can put up the list of problems we are trying to solve first, before we compare different solutions.
I totally agree. The gist I get is that it's mainly three things:
The claim is that buffering writes is very orthogonal to caching reads from the parent database, hence a separation in data types is valuable. There are multiple things that go into how you optimally cache writes (or whether you even want them cached), depending on your use case. For reads, this is not complex. Nesting/multi-layer caching is the problem of concern in #14444. It's one of many problems with the existing CacheKV store.
One concern with separating the clean cache and the dirty cache (write buffer) is that it creates more indirection on the read side: the reader now needs to check the dirty cache first, then the clean cache, then the parent's dirty cache and clean cache, and so on.
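To make the indirection concrete, here's roughly what the read path looks like with the two caches separated; the type and field names are hypothetical:

```go
// Illustrative only: the read path once dirty cache (write buffer) and
// clean cache (read cache) are separate components.
package readpath

type cache interface {
	Get(key []byte) (value []byte, found bool)
	Set(key, value []byte)
}

type Store struct {
	writeBuffer cache // dirty writes staged at this layer
	readCache   cache // clean values already read from below
	parent      interface{ Get(key []byte) []byte }
}

func (s *Store) Get(key []byte) []byte {
	// 1. this layer's dirty cache
	if v, ok := s.writeBuffer.Get(key); ok {
		return v
	}
	// 2. this layer's clean cache
	if v, ok := s.readCache.Get(key); ok {
		return v
	}
	// 3. fall through to the parent, which may repeat the same two checks
	//    all the way down -- the extra indirection raised above.
	v := s.parent.Get(key)
	s.readCache.Set(key, v) // populate the clean cache on the way back up
	return v
}
```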
There's no need to assume that your parent maintains separate dirty and clean caches! You really only need this read cache for disk I/O in most cases. Otherwise, agreed that a read-side indirection is faced, but:
I mean the nested cache store case: solving nesting by itself places some requirements on the implementation, like supporting cheap snapshots in the data structure. Also, "clean" cache may have different meanings at different layers; for example, a dirty value at a lower level may be treated as clean cache in an upper layer. Could you elaborate on these issues if I misunderstood?
I'm a big fan of the "journal" approach used in Geth, though it can get memory-intensive. The whole wrapping thing we have in Cosmos is kind of weird and feels overly complex.
Geth is a closed system; they only need to handle a dozen types of state changes. For us, either you maintain many types of logical journal entries and carefully implement the undo logic, or you record the low-level set/remove operations as journal logs, in which case there will be lots of logs and revert is linear in their number. With the copy-on-write btree approach, there's no wrapping anymore. But I do like Geth's interface: just snapshot and revert, with no need to worry about merging two branched stores.
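As a rough illustration of that snapshot/revert interface on a copy-on-write btree (assuming tidwall/btree's `Copy`, the library already used inside the cache store; the surrounding `Store` type is made up):

```go
// Sketch of geth-style Snapshot/RevertToSnapshot on a copy-on-write btree.
package cowstore

import "github.com/tidwall/btree"

type item struct {
	key, value []byte
}

func byKeys(a, b item) bool { return string(a.key) < string(b.key) }

type Store struct {
	// snapshots[i] is a cheap copy of the tree taken at snapshot i;
	// Copy is copy-on-write, so untouched nodes stay shared between versions.
	snapshots []*btree.BTreeG[item]
	current   *btree.BTreeG[item]
}

func New() *Store {
	return &Store{current: btree.NewBTreeG[item](byKeys)}
}

func (s *Store) Set(key, value []byte) {
	s.current.Set(item{key: key, value: value})
}

func (s *Store) Get(key []byte) ([]byte, bool) {
	it, ok := s.current.Get(item{key: key})
	return it.value, ok
}

// Snapshot records the current version and returns its id.
func (s *Store) Snapshot() int {
	s.snapshots = append(s.snapshots, s.current.Copy())
	return len(s.snapshots) - 1
}

// RevertToSnapshot discards everything written after snapshot id:
// no merging of branched stores, just drop the newer tree.
func (s *Store) RevertToSnapshot(id int) {
	s.current = s.snapshots[id]
	s.snapshots = s.snapshots[:id]
}
```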
Depending on the implementation you can make revert O(1); also, I'd rather optimize for the success case with a cheaper insert.
We need two concrete proposals to compare complexity, though.
@yihuang totally agree; we have something we've been working on, and will share when it's more concrete.
I think the write complexity of hashmap-based and btree-based cache stores is similar:
Doesn't the log approach with O(1) complexity use copy-on-write? I can't see any other way this would work in O(1). Another approach I was thinking of is a versioned data structure, which is in fact copy-on-write:
I guess you mean a hashmap with a log for revert: when you write the dirty cache into the backing store, the sorting is O(n*log(n)), and the revert cost is linear in the number of write operations.
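To spell out that variant, a rough sketch of the "hashmap + log for revert" idea (all names are made up for illustration):

```go
// Hashmap of dirty values plus an undo log; revert walks the log backwards,
// so its cost is linear in the number of writes since the snapshot.
package journaled

type entry struct {
	key     string
	prev    []byte // value before this write (nil if the key was absent)
	existed bool
}

type Store struct {
	dirty map[string][]byte
	log   []entry // one entry per low-level set
}

func New() *Store {
	return &Store{dirty: make(map[string][]byte)}
}

func (s *Store) Set(key, value []byte) {
	k := string(key)
	prev, existed := s.dirty[k]
	s.log = append(s.log, entry{key: k, prev: prev, existed: existed})
	s.dirty[k] = value
}

// Snapshot is just the current log length.
func (s *Store) Snapshot() int { return len(s.log) }

// Revert undoes entries after the snapshot, in reverse order.
func (s *Store) Revert(snapshot int) {
	for i := len(s.log) - 1; i >= snapshot; i-- {
		e := s.log[i]
		if e.existed {
			s.dirty[e.key] = e.prev
		} else {
			delete(s.dirty, e.key)
		}
	}
	s.log = s.log[:snapshot]
}
```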
Just an idea to throw in here: we're already using an immutable data structure (an AVL tree under the hood), so why don't we just branch off of that for caching writes? It has always struck me as strange that we put another, different tree data structure on top of it and do all this merge iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references: https://clojure.org/reference/transients
The current iavl implementation is for an on-disk db; if we implement an in-memory AVL, it'll be similar to the existing btree implementation we are using for the cache store.
Just did a benchmark to compare our in-memory IAVL implementation against the btree.
What's interesting is that it performs very close to the btree with degree 1, but the btree with degree 32 outperforms it.
Another benchmark I did about
So when iavl nodes are read from disk, they build an in-memory tree, if I understand correctly. Using an in-memory iavl tree for write caching would basically mean a combination of nodes that were read from disk and new in-memory nodes, which only get written if there is a commit. A rollback should be trivial if needed, because you just hang onto the old root.
That's the idea with transients: there's mutability while writing is happening, but the resulting tree is still immutable. It's a way of starting from an immutable version of a tree, say version N on disk, efficiently adding a bunch of nodes to that tree already ordered and balanced, and either writing them as version N+1 or discarding them. It would seem that with a btree cache layer we have more overhead, because we build a different kind of tree and then need to iterate over it to build an immutable iavl tree anyway, without getting the advantage of transients when doing so. That's not to mention the overhead of merge iterators.
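To sketch the "hang onto the old root" idea with a hypothetical persistent-tree interface (none of these are real SDK types):

```go
// Branching off an immutable tree: commit adopts the new root, rollback
// simply keeps pointing at the old one.
package branching

// Tree is an immutable search tree: Set returns a new root that shares
// unmodified nodes with the receiver.
type Tree interface {
	Set(key, value []byte) Tree
	Get(key []byte) ([]byte, bool)
}

type Branch struct {
	base    Tree // version N, e.g. loaded from disk
	working Tree // version N+1 in progress, sharing nodes with base
}

func NewBranch(base Tree) *Branch {
	return &Branch{base: base, working: base}
}

func (b *Branch) Set(key, value []byte) {
	b.working = b.working.Set(key, value)
}

// Commit adopts the working root as the new base (write version N+1).
func (b *Branch) Commit() Tree {
	b.base = b.working
	return b.base
}

// Rollback is trivial: drop the working root and keep the old base.
func (b *Branch) Rollback() {
	b.working = b.base
}
```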
In the current db-backed iavl tree, nodes reference each other with many more indirections, but that's not to say the IAVL implementation itself can't be redesigned if we are going that far.
Benchmarks look promising @yihuang. Is degree 32 the default? This means we could get rid of the CacheKV almost entirely.
Yeah, one thing that can be significantly faster is the Go map, which can help if there are repeated get operations on the same keys; I don't know how many cases of that there are in our code, though.
It's only this fast because memiavl nodes are referenced by offset in an mmap-ed file, without going through a kv-db at all, which removes a lot of indirection; it works almost the same as an in-memory data structure.
I'd like to share the work that's been going on in the store v2 development efforts. One of our main goals is to keep abstractions to a minimum and keep each component relatively isolated and simple. In the context of CacheKV, we actually have no such abstraction or notion in store v2. Instead, we simply have a store that stages dirty writes. We currently do not have plans to introduce an explicit cache layer, but if we were to, we'd do it as a separate store to keep components minimal in nature. This way, reads would flow through the staged writes first, then the cache, then the backend. Ideally, we rely on the SS backend to give us good caching, but if benchmarks show that reads are not where we'd like, we'd then evaluate introducing a cache store/layer.
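Illustratively, that read flow would look something like this (placeholder names, not the actual store v2 API):

```go
// Sketch of the store v2 read flow: staged dirty writes first, then an
// optional cache layer, then the state storage (SS) backend.
package storev2sketch

type kvReader interface {
	Get(key []byte) ([]byte, bool)
}

type Store struct {
	staged  kvReader // dirty writes staged for the working block
	cache   kvReader // optional read cache; nil unless benchmarks justify it
	backend kvReader // state storage (SS) backend
}

func (s *Store) Get(key []byte) ([]byte, bool) {
	// 1. staged (dirty) writes win
	if v, ok := s.staged.Get(key); ok {
		return v, true
	}
	// 2. optional cache layer, kept as a separate component if ever added
	if s.cache != nil {
		if v, ok := s.cache.Get(key); ok {
			return v, true
		}
	}
	// 3. otherwise rely on the SS backend (and its own caching)
	return s.backend.Get(key)
}
```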
Closing in light of a different branch store design that I don't have context on. |
Summary
This details a proposal for how to make a CacheKV store v2.
Right now it is hard to iterate on the CacheKV store. It mixes a lot of roles into the same data structure, is rather complex, and is under-specified. This has led to a large number of performance- and liveness-critical bugs that would be vastly reduced by a redesign.
I claim that the best way to simplify this is to do an API-compatible rewrite that simplifies the roles of the differing components, and then drop it in as a replacement.
The cache KV store has two roles, each of which has roadmaps for improvement, and potential for tighter integration with gas counting. These are:
The proposal is to make a directory structure as follows:
Basically: separate packages for ReadBuffer and WriteBuffer that allow us to flexibly iterate going forward, plus a, for now, "merged CacheKV store API" that has the exact same API guarantees (aside from bug/perf improvements) as the existing one, so there's no break for users. As time goes on, this merged CacheKV store API gets used less.
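As a sketch of what the split could look like (hypothetical package and interface names, not a committed API):

```go
// Hypothetical shape of the split: two buffer interfaces plus a merged,
// API-compatible facade that delegates to them.
package cachekv2

// ReadBuffer caches clean values already fetched from the parent.
type ReadBuffer interface {
	Get(key []byte) (value []byte, found bool)
	Add(key, value []byte)
}

// WriteBuffer stages dirty writes until they are flushed to the parent.
type WriteBuffer interface {
	Get(key []byte) (value []byte, found bool)
	Set(key, value []byte)
	Delete(key []byte)
	Flush() // copy dirty entries into the parent
}

// Store is the "merged CacheKV store API": externally it behaves like the
// existing CacheKV store, internally it just delegates to the two buffers.
type Store struct {
	reads  ReadBuffer
	writes WriteBuffer
	parent interface{ Get(key []byte) []byte }
}

func (s *Store) Get(key []byte) []byte {
	if v, ok := s.writes.Get(key); ok {
		return v
	}
	if v, ok := s.reads.Get(key); ok {
		return v
	}
	v := s.parent.Get(key)
	s.reads.Add(key, v)
	return v
}

func (s *Store) Set(key, value []byte) { s.writes.Set(key, value) }
func (s *Store) Write()                { s.writes.Flush() }
```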
The short-term roadmap proposed:
Then there are internal API roadmaps that can progress far more simply after a basic split:
What's nice about having these split out is that we can also make them re-usable components as we explore more distinct concurrency designs. Furthermore, we should expect different read buffer and write buffer designs to emerge. A lot of performance is sacrificed to support iteration, when really there could be flexibility in making only the things that need iteration pay that cost.
If folks like this direction, I can get started on it.