
Proposal for a CacheKV v2 store #14990

Closed
ValarDragon opened this issue Feb 10, 2023 · 29 comments

@ValarDragon
Contributor

ValarDragon commented Feb 10, 2023

Summary

This details a proposal for how to make a CacheKV store v2.

Right now it is hard to iterate on the CacheKV store. It mixes a lot of roles into the same data structure, is rather complex, and is under-specified. This has led to a large number of performance- and liveness-critical bugs that a redesign would vastly reduce.

I claim that the best way to simplify this is an API-compatible rewrite that separates the roles of the differing components, which can then be dropped in as a replacement.

The CacheKV store has two roles, each of which has a roadmap for improvement and potential for tighter integration with gas counting. These are:

  • ReadBuffers
  • WriteBuffers

The proposal is to make a directory structure as follows:

store/cachekv2
store/cachekv2/readbuffer
store/cachekv2/writebuffer

Basically, separate packages for ReadBuffer and WriteBuffer that allow us to iterate flexibly going forward, plus a (for now) "merged CacheKV store API" that has the exact API guarantees of the existing one (aside from bug/perf improvements), so there's no break for users. As time goes on, this merged CacheKV store API gets used less.

The short-term roadmap proposed:

  • Make a store/cachekv2 with identical external API guarantees as the existing CacheKV
    • The semantic will always be, for reads and iterates: "attempt to iterate over the write buffer; if not in the write buffer, go to the read buffer"
  • Create a ReadBuffer and WriteBuffer interface (a rough sketch follows this list)
  • Make the first objects behind these interfaces the equivalent objects in CacheKV v1
    • The ReadBuffer only holds data that is in the parent; it has no "dirty" data.
  • Swap cachekv usage with cachekv2
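
To make the split concrete, here is a minimal sketch of what the two interfaces and the merged read path could look like. This is an illustrative assumption, not the actual cachekv2 API; the names (KVStore, ReadBuffer, WriteBuffer, Store) and signatures are placeholders.

// Hypothetical sketch only: names and signatures are illustrative, not the actual cachekv2 API.
package cachekv2

// KVStore stands in for the SDK's parent store interface in this sketch.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// ReadBuffer caches clean values read from the parent; it never holds dirty data.
type ReadBuffer interface {
	Get(key []byte) (value []byte, found bool)
	Cache(key, value []byte)
}

// WriteBuffer stages dirty writes until they are flushed into the parent.
type WriteBuffer interface {
	Get(key []byte) (value []byte, dirty bool)
	Set(key, value []byte)
	Write(parent KVStore) // flush staged writes in sorted key order
}

// Store composes the two buffers while keeping the existing CacheKV semantics.
type Store struct {
	parent KVStore
	reads  ReadBuffer
	writes WriteBuffer
}

// Get follows the proposed semantic: write buffer first, then read buffer,
// then the parent (populating the read buffer on a miss).
func (s *Store) Get(key []byte) []byte {
	if v, dirty := s.writes.Get(key); dirty {
		return v
	}
	if v, found := s.reads.Get(key); found {
		return v
	}
	v := s.parent.Get(key)
	s.reads.Cache(key, v)
	return v
}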

Then there are internal API roadmaps that can progress far more simply after a basic split:

  • Amend the ReadBuffer to have an interval tree, so it can know during repeated iterations whether the iteration is already cached. (Today we always hit IAVL on repeat iterations; a rough sketch of the idea follows this list.)
    • Ensure single-threaded concurrency safety
  • Make a new API for Gets that is also aware of whether the data was in cache. Cached gets should be much cheaper than uncached ones.
    • This will be unused in the short term, but will get used when we are willing to do a larger API break with the gas store
  • Make the write buffer single-threaded concurrency safe. (As far as I'm aware, a fix for this was merged into the SDK, so maybe we just reuse that)
  • Consider lifting concurrency safety to multi-threaded safety
  • Make CacheWrap only layer a write buffer on top
  • Add RAM bounds to the ReadBuffer
  • Create safer iter_mut APIs
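
For the interval-tracking item, a very rough sketch of the idea (all names are hypothetical, and a real implementation would use an interval tree and merge overlapping ranges rather than a flat slice):

// Hypothetical sketch: track which key ranges have already been fully iterated
// from the parent, so a repeat iteration can be served from the read buffer
// instead of hitting IAVL again.
package readbuffer

import "bytes"

type interval struct{ start, end []byte } // half-open range [start, end)

type intervalSet struct{ covered []interval }

// MarkIterated records that [start, end) has been fully read from the parent.
func (s *intervalSet) MarkIterated(start, end []byte) {
	s.covered = append(s.covered, interval{start: start, end: end})
}

// Cached reports whether [start, end) is contained in a previously recorded
// range, meaning the iteration is already cached.
func (s *intervalSet) Cached(start, end []byte) bool {
	for _, iv := range s.covered {
		if bytes.Compare(iv.start, start) <= 0 && bytes.Compare(end, iv.end) <= 0 {
			return true
		}
	}
	return false
}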

What's nice about having these split out is that we can also make them reusable components as we explore more distinct concurrency designs. Furthermore, we should expect different read buffer and write buffer designs to emerge. A lot of performance is sacrificed to support iteration, when really there could be flexibility in making only the things that need iteration pay that cost.

If folks like this direction, I can get started on it.

@github-actions github-actions bot added the needs-triage Issue that needs to be triaged label Feb 10, 2023
@yihuang
Collaborator

yihuang commented Feb 10, 2023

An important need that motivates me to do this cache store refactoring (#14444) is removing the overhead of nested cache stores, which is pretty significant currently.

  • we commit the dirty stuff layer by layer
    • in each layer's write, it sorts the keys and sets them into the parent in order
  • we duplicate the cached stuff in each layer
  • get/iterate also traverse the whole nesting depth.

This is useful for existing use cases since we already have 2-3 layers in the SDK, and it'll be even more important for EVM integration.
And a copy-on-write btree can fix them all ;D

@ValarDragon
Contributor Author

ValarDragon commented Feb 10, 2023

My impression for how to mitigate nesting overhead:

  • Make nesting by default create a new write buffer, no new read buffers (therefore no extra read caching)
    • Could switch this later to a design with a single tree that versions multiple layers, at a cost to concurrency
  • Keep write buffers that get iterated on at least once, entirely in sorted trees, so iterating over them is always cheap.
    • Can allow more registration in the future, to better optimize this
  • If you do heavy iteration workloads, insert some "interval tracking only" read buffer layers in the nesting.

The downside of this is that a nested store close must then copy all of its elements into the parent, which is a real cost. (But that is already the case, and every other overhead is much lower than today.) There are strategies for not requiring a copy on close, but these come with more trade-offs around concurrency guarantees, and I personally think maintaining more optionality there is worth it.
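
A minimal sketch of that nesting shape, with hypothetical names (Layer, dirty) and the write-buffer-only design described above:

// Hypothetical sketch: nesting creates a new write buffer only (no extra read
// caching); reads walk up the nesting, and closing a layer copies its dirty
// entries into the parent in sorted key order, which is the copy cost noted above.
package writebuffer

import "sort"

type Layer struct {
	parent *Layer            // nil for the outermost layer
	dirty  map[string][]byte // staged writes for this layer only
}

func NewLayer(parent *Layer) *Layer {
	return &Layer{parent: parent, dirty: map[string][]byte{}}
}

func (l *Layer) Set(key string, value []byte) { l.dirty[key] = value }

// Get checks this layer's write buffer, then walks up the nesting.
func (l *Layer) Get(key string) ([]byte, bool) {
	for cur := l; cur != nil; cur = cur.parent {
		if v, ok := cur.dirty[key]; ok {
			return v, true
		}
	}
	return nil, false
}

// Close flushes this layer's entries into the parent in sorted key order.
// (Assumes a non-nil parent; the outermost layer would flush to the real store.)
func (l *Layer) Close() {
	keys := make([]string, 0, len(l.dirty))
	for k := range l.dirty {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		l.parent.Set(k, l.dirty[k])
	}
}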

@yihuang
Collaborator

yihuang commented Feb 10, 2023

What do you think about the copy-on-write btree solution?

@ValarDragon
Contributor Author

Feels like it should be a type of write buffer. It's an important type of write buffer, and a nesting strategy for write buffers! It solves a lot of things well :)

It doesn't solve the problem of a lot of this having no abstraction, and thus no way to iterate on it. I claim that there are a lot of ways our caching model should vary, and we have several concurrency designs that should be explored, so we need to move away from being stuck with one "thing" as our caching strategy.

@yihuang
Collaborator

yihuang commented Feb 10, 2023

I hope we can put up the list of problems we are trying to solve first, before we can compare different solutions.

@alexanderbez
Contributor

I hope we can put up the list of problems we are trying to solve first, before we can compare different solutions.

I totally agree. The gist I get is that it's mainly three things:

  • CacheKV does too many things. It has too many responsibilities. Regardless of write buffers or BTrees, this can be solved independently.
  • Nesting/multi-layer-caching is slow (the main crux of the problem)
  • DevUX around nesting/multi-layer-caching

@tac0turtle tac0turtle removed the needs-triage Issue that needs to be triaged label Feb 12, 2023
@ValarDragon
Contributor Author

ValarDragon commented Feb 12, 2023

The claim is that buffering writes is very orthogonal to caching reads from the parent database, hence a separation in data types is valuable.

There are multiple things that go into how you optimally cache writes (or whether you even want them cached), depending on your use case. For reads, this is not complex.

Nesting/multi-layer caching is the problem of concern in #14444. It's one of many problems with the existing CacheKV store.

@yihuang
Collaborator

yihuang commented Feb 13, 2023

There are multiple things that go into how you optimally cache writes (or whether you even want them cached), depending on your use case. For reads, this is not complex.

One concern with separating the clean cache and dirty cache (write buffer) is that it creates more indirection on the read side: now the reader needs to check the dirty cache first, then the clean cache, then the parent's dirty cache and clean cache, and so on.

@ValarDragon
Contributor Author

There's no need to assume that your parent maintains separate dirty and clean caches! You really only need this read cache for disk I/O in most cases.

Otherwise, agreed that there is read-side indirection, but:

  • It's pretty unclear to me that this actually matters. It should be dwarfed in magnitude by many other perf concerns we face (even the I/O itself).
  • It has benefits when we want things to start being concurrent, with no locks.

@yihuang
Collaborator

yihuang commented Feb 13, 2023

There's no need to assume that your parent maintains separate dirty and clean caches! You really only need this read cache for disk I/O in most cases.

I mean the nested cache store case: solving the nesting case itself puts some requirements on the implementation, like supporting cheap snapshots in the data structure. Also, the "clean" cache may have different meanings at different layers; for example, a dirty value at a lower level may be treated as clean cache in the upper layer. Could you elaborate on these issues if I've misunderstood?

@itsdevbear
Contributor

itsdevbear commented Feb 16, 2023

I'm a big fan of the "journal" approach that is done in Geth, though it can get memory intensive.

The whole wrapping thing we have in Cosmos is kind of weird and feels overly complex.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

I'm a big fan of the "journal" approach that is done in Geth, though it can get memory intensive.

The whole wrapping thing we have in Cosmos is kind of weird and feels overly complex.

Geth is a closed system; they only need to handle a dozen types of state changes. For us, either you maintain lots of types of logical journal entries and carefully implement the undo logic, or you record the low-level set/remove operations as journal logs, in which case there will be lots of logs and revert is an O(N) operation.

With the copy-on-write btree approach, there's no wrapping thing anymore.

But I do like Geth's interface: just snapshot and revert, with no need to worry about merging two branched stores.
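
For reference, a minimal sketch of the low-level journal variant being described, with illustrative names only (not Geth's actual implementation): every set records an undo entry, and revert replays them backwards, which is linear in the number of writes since the snapshot.

// Hypothetical sketch of a low-level journal: Set records an undo entry,
// RevertTo replays them in reverse, so revert is O(N) in the writes since the snapshot.
package journal

type undo struct {
	key     string
	prev    []byte
	existed bool
}

type Store struct {
	data    map[string][]byte
	journal []undo
}

func New() *Store { return &Store{data: map[string][]byte{}} }

// Snapshot returns the current journal position to revert to later.
func (s *Store) Snapshot() int { return len(s.journal) }

// Set records the previous value before overwriting, so the write can be undone.
func (s *Store) Set(key string, value []byte) {
	prev, existed := s.data[key]
	s.journal = append(s.journal, undo{key: key, prev: prev, existed: existed})
	s.data[key] = value
}

// RevertTo rolls back to an earlier snapshot by undoing entries in reverse order.
func (s *Store) RevertTo(snapshot int) {
	for i := len(s.journal) - 1; i >= snapshot; i-- {
		u := s.journal[i]
		if u.existed {
			s.data[u.key] = u.prev
		} else {
			delete(s.data, u.key)
		}
	}
	s.journal = s.journal[:snapshot]
}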

@itsdevbear
Contributor

Depending on the implementation, you can make revert O(1); also, I'd rather optimize for the success case with a cheaper insert.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

Depending on the implementation, you can make revert O(1); also, I'd rather optimize for the success case with a cheaper insert.

Need to have two concrete proposals to compare complexity though.

@itsdevbear
Contributor

@yihuang totally agree, we have something we have been working on, will share when it's more concrete.

@yihuang
Collaborator

yihuang commented Feb 17, 2023

I think the write complexity of hashmap- and btree-based cache stores is similar (see the sketch after these bullets):

  • a single insertion into a hashmap has constant complexity, but to commit it into the parent store in order, you need to sort the keys, which is at least O(n*log(n)).
  • a btree has O(log(n)) complexity for a single insertion, but ordered iteration is only O(n).
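
Spelled out for n writes followed by one ordered commit: the hashmap costs n·O(1) inserts plus an O(n*log(n)) sort, while the btree costs n·O(log(n)) inserts plus an O(n) ordered scan, so both end up O(n*log(n)) overall. A minimal sketch of the hashmap flush path, with illustrative names:

// Hypothetical sketch of the hashmap-based write buffer flush: collect keys,
// sort them (the O(n*log(n)) step), then write to the parent in key order.
package flush

import "sort"

type writeBuffer map[string][]byte

func (w writeBuffer) flushTo(set func(key, value []byte)) {
	keys := make([]string, 0, len(w))
	for k := range w {
		keys = append(keys, k) // O(n) to collect
	}
	sort.Strings(keys) // O(n*log(n)) to order the commit
	for _, k := range keys {
		set([]byte(k), w[k]) // ordered writes into the parent
	}
}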

@robert-zaremba
Collaborator

Doesn't the log approach with O(1) complexity use copy-on-write? I can't see any other way this would work in O(1).

Another approach I was thinking of is a versioned data structure, which in fact is copy-on-write (a rough sketch follows the list):

  1. All data is in a balanced tree.
  2. The tree has a global version counter and a current version index.
  3. Each node key is a record key, and its value is a hashmap of version -> record value.
  4. When we create a new snapshot / cachewrapper, we increment the global version counter and set the current version index to the new value of the global version counter. We also need another "global" map to the previous version (so when we revert, we update the current version index).
  5. When we write data, we write it at version = the global current version index.
  6. Whenever we read data which doesn't exist yet in the tree, we set it at version 1.
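
A very rough sketch of how this could look, using a plain map in place of the balanced tree for brevity; all names (versionedStore, Snapshot, Revert) are illustrative assumptions about the idea, not an existing implementation:

// Hypothetical sketch of the versioned-value idea: each key maps versions to
// values, Snapshot bumps the global version and remembers its parent version,
// and Revert simply moves the current index back.
package versioned

type versionedStore struct {
	current  uint64            // current version index
	previous map[uint64]uint64 // version -> version it was branched from
	data     map[string]map[uint64][]byte
}

func newVersionedStore() *versionedStore {
	return &versionedStore{
		current:  1,
		previous: map[uint64]uint64{},
		data:     map[string]map[uint64][]byte{},
	}
}

// Snapshot increments the global version counter and records the previous version.
func (s *versionedStore) Snapshot() {
	next := s.current + 1
	s.previous[next] = s.current
	s.current = next
}

// Revert drops back to the version the current one was branched from.
// (Root version 1 has no parent in this sketch.)
func (s *versionedStore) Revert() {
	s.current = s.previous[s.current]
}

// Set writes a value at the current version.
func (s *versionedStore) Set(key string, value []byte) {
	if s.data[key] == nil {
		s.data[key] = map[uint64][]byte{}
	}
	s.data[key][s.current] = value
}

// Get walks back through ancestor versions until a value is found.
func (s *versionedStore) Get(key string) ([]byte, bool) {
	versions, ok := s.data[key]
	if !ok {
		return nil, false
	}
	v := s.current
	for {
		if val, ok := versions[v]; ok {
			return val, true
		}
		prev, ok := s.previous[v]
		if !ok {
			return nil, false
		}
		v = prev
	}
}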

@yihuang
Collaborator

yihuang commented Feb 17, 2023

Doesn't the log approach with O(1) complexity use copy-on-write? I can't see any other way this would work in O(1).

I guess you mean a hashmap with a log used for revert: when you write the dirty cache into the backing store, the sorting is O(n*log(n)), and the revert cost is linear in the number of write operations.

@aaronc
Member

aaronc commented Feb 17, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients
https://github.com/clojure/data.avl

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients
https://github.com/clojure/data.avl

The current iavl implementation is for an on-disk db; if we implement an in-memory avl, it'll be similar to the existing btree implementation we are using for the cache store.
We don't really need a fully immutable data structure; normally a mutable data structure is more performant than an immutable one.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Just did a benchmark to compare our in-memory IAVL implementation against the tidwall/btree library. The benchmark inserts 1m random key-value pairs into the tree, without updating hashes in the IAVL case:

go test -run=^$ -bench=. -benchmem ./ -count=4
goos: darwin
goarch: amd64
pkg: github.com/crypto-org-chain/cronos/memiavl
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRandomSet/memiavl-12         	       1	2605531081 ns/op	256009792 B/op	 2000014 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2488336328 ns/op	255999904 B/op	 2000003 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2515380451 ns/op	255999888 B/op	 2000001 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2541959269 ns/op	255999872 B/op	 1999999 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2631245985 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2624319698 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2627311580 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2604990695 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1259560359 ns/op	138291648 B/op	   68968 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1236588308 ns/op	138291656 B/op	   68969 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1229279040 ns/op	138291648 B/op	   68968 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1266665535 ns/op	138291648 B/op	   68968 allocs/op
PASS
ok  	github.com/crypto-org-chain/cronos/memiavl	27.651s

What's interesting is that it performs very close to the btree with degree 1, but the btree with degree 32 outperforms it.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Another benchmark I did is about the Get operation; the "memiavl-disk" case is an on-disk mmap-ed IAVL tree. The takeaway is that the on-disk tree is almost as fast as an in-memory data structure when the OS page cache is warm. If we can make the IAVL tree itself almost as fast as the cache store, there'll be no point in caching the clean values at all.

goos: darwin
goarch: amd64
pkg: github.com/crypto-org-chain/cronos/memiavl
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRandomGet/memiavl-12         	11986951	        98.66 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	12390306	        97.24 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	12287302	        97.96 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	11695172	        98.87 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 5495362	       194.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 6187528	       190.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 5406562	       272.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 6560026	       190.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32179621	        35.80 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32729978	        35.80 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32721939	        35.85 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	31634030	        36.03 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 5755815	       180.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6836102	       177.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6832458	       178.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6862957	       179.8 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/crypto-org-chain/cronos/memiavl	35.029s

@aaronc
Member

aaronc commented Feb 18, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients

https://github.com/clojure/data.avl

The current iavl implementation is for an on-disk db; if we implement an in-memory avl, it'll be similar to the existing btree implementation we are using for the cache store.

So when iavl nodes are read from disk, they build an in-memory tree, if I understand correctly. Using an in-memory iavl tree for write caching would basically mean a combination of nodes that were read from disk and new in-memory nodes, which only get written if there is a commit. A rollback should be trivial if needed, because you just hang onto the old root.

We don't really need a fully immutable data structure; normally a mutable data structure is more performant than an immutable one.

That's the idea with transients. There's mutability while writing is happening, but the resulting tree is still immutable. It's a way of starting from an immutable version of a tree, say version N on disk, efficiently adding a bunch of nodes to that tree already ordered and balanced, and either writing them as version N+1 or discarding them. It would seem that with a btree cache layer we have more overhead, because we build a different kind of tree and then need to iterate over that tree to build an immutable iavl tree anyway, without getting the advantage of transients when doing so. That's not to mention the overhead of merge iterators.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

So when iavl nodes are read from disk, they build an in-memory tree, if I understand correctly. Using an in-memory iavl tree for write caching would basically mean a combination of nodes that were read from disk and new in-memory nodes, which only get written if there is a commit. A rollback should be trivial if needed, because you just hang onto the old root.

In the current db-backed iavl tree, nodes reference each other with many more indirections, but that's not to say the IAVL implementation itself can't be redesigned if we are going that far.
The more critical issue with writing into the IAVL tree directly is that write ordering is more complex, because the resulting tree structure depends on the write order. The current implementation has a very useful property that we can replay the change sets extracted from existing tree versions to rebuild an identical tree, and that relies on a particular order of the changes.

@alexanderbez
Contributor

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

@yihuang
Collaborator

yihuang commented Feb 22, 2023

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

Yeah, one thing that can be significantly faster is the golang map, which can help if there are repeated get operations on the same keys; I don't know how many cases of that we have in our code though.
If we do the read/write (clean/dirty) cache separation, I think the golang map would be perfect for the read (clean) cache, and a sorted tree is good for the write (dirty) cache.
The other thing to consider is that the size of the cache is much smaller than the size of the iavl tree, so the height of the tree is smaller and query performance can be better.

@yihuang
Collaborator

yihuang commented Mar 2, 2023

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

It's only this fast because memiavl nodes are referenced by offset in an mmap-ed file, without going through a kv-db at all, which removes a lot of indirection; it works almost the same as an in-memory data structure.
The main uncertainty with mmap is that I/O performs less than ideally when the db size is larger than the available RAM. There is some research on that [1], but I think the advantages still outweigh the disadvantages, and we can even develop custom I/O and cache management if we are willing to go that far ;D

Footnotes

  1. https://cs.brown.edu/people/acrotty/pubs/p13-crotty.pdf

@alexanderbez
Contributor

alexanderbez commented Oct 25, 2023

I'd like to share the work that's been going on in the store v2 development efforts. One of our main goals is to keep abstractions to a minimum and keep each component relatively isolated and simple.

In the context of CacheKV, we actually have no such abstraction or notion in store v2. Instead, we simply have a store that stages dirty writes (we call this branch.Store atm). There's no explicit caching at all in this store.

We currently do not have plans to introduce an explicit cache layer, but if we were, we'd do it as a separate store to keep components minimal in nature. This way, reads would flow as follows: DirtySet -> Cache -> SS.

Ideally, we rely on the SS backend to give us good caching, but if benchmarks show that reads are not where we'd like, then we can evaluate introducing a cache store/layer.

@ValarDragon
Contributor Author

Closing in light of a different branch store design that I don't have context on.
