
Proposal for a CacheKV v2 store #14990

Closed
ValarDragon opened this issue Feb 10, 2023 · 29 comments

@ValarDragon
Contributor

ValarDragon commented Feb 10, 2023

Summary

This details a proposal for how to make a CacheKV store v2.

Right now it is hard to iterate on the CacheKV store. It mixes a lot of roles into the same data structure, is rather complex, and is under-specified. This has led to a large number of performance- and liveness-critical bugs that a redesign would vastly reduce.

I claim that the best way to simplify this is an API-compatible rewrite that separates the roles of the differing components, which can then be dropped in as a replacement.

The CacheKV store has two roles, each of which has a roadmap for improvement and potential for tighter integration with gas counting. These are:

  • ReadBuffers
  • WriteBuffers

The proposal is to make a directory structure as follows:

store/cachekv2
store/cachekv2/readbuffer
store/cachekv2/writebuffer

Basically, separate packages for ReadBuffer and WriteBuffer that allow us to iterate flexibly going forward, plus a (for now) "merged CacheKV store API" that has the exact API guarantees of the existing one (aside from bug/perf improvements), so there's no break for users. As time goes on, this merged CacheKV store API gets used less.

The short-term roadmap proposed:

  • Make a store/cachekv2 with identical external API guarantees as the existing CacheKV
    • The semantic will always be, for reads and iterates: "attempt to iterate over the write buffer; if not in the write buffer, go to the read buffer"
  • Create a ReadBuffer and WriteBuffer interface (a rough sketch follows this list)
  • Make the first objects behind these interfaces the equivalent objects in CacheKV v1
    • The ReadBuffer only holds data that is in the parent; it has no "dirty" data.
  • Swap cachekv usage with cachekv2
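
To make the split concrete, here is a minimal sketch of what the two interfaces and the merged read path could look like. This is an illustrative assumption, not the actual cachekv2 API; the names (KVStore, ReadBuffer, WriteBuffer, Store) and signatures are placeholders.

// Hypothetical sketch only: names and signatures are illustrative, not the actual cachekv2 API.
package cachekv2

// KVStore stands in for the SDK's parent store interface in this sketch.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// ReadBuffer caches clean values read from the parent; it never holds dirty data.
type ReadBuffer interface {
	Get(key []byte) (value []byte, found bool)
	Cache(key, value []byte)
}

// WriteBuffer stages dirty writes until they are flushed into the parent.
type WriteBuffer interface {
	Get(key []byte) (value []byte, dirty bool)
	Set(key, value []byte)
	Write(parent KVStore) // flush staged writes in sorted key order
}

// Store composes the two buffers while keeping the existing CacheKV semantics.
type Store struct {
	parent KVStore
	reads  ReadBuffer
	writes WriteBuffer
}

// Get follows the proposed semantic: write buffer first, then read buffer,
// then the parent (populating the read buffer on a miss).
func (s *Store) Get(key []byte) []byte {
	if v, dirty := s.writes.Get(key); dirty {
		return v
	}
	if v, found := s.reads.Get(key); found {
		return v
	}
	v := s.parent.Get(key)
	s.reads.Cache(key, v)
	return v
}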

Then there are internal API roadmaps that can progress far more simply after a basic split:

  • Amend the ReadBuffer to have an interval tree, so it can know during repeated iterations whether the iteration is already cached. (Today we always hit IAVL on repeat iterations; a rough sketch of the idea follows this list.)
    • Ensure single-threaded concurrency safety
  • Make a new API for Gets that is also aware of whether the data was in cache. Cached gets should be much cheaper than uncached ones.
    • This will be unused in the short term, but will get used when we are willing to do a larger API break with the gas store
  • Make the write buffer single-threaded concurrency safe. (As far as I'm aware, a fix for this was merged into the SDK, so maybe we just reuse that)
  • Consider lifting concurrency safety to multi-threaded safety
  • Make CacheWrap only layer a write buffer on top
  • Add RAM bounds to the ReadBuffer
  • Create safer iter_mut APIs
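
For the interval-tracking item, a very rough sketch of the idea (all names are hypothetical, and a real implementation would use an interval tree and merge overlapping ranges rather than a flat slice):

// Hypothetical sketch: track which key ranges have already been fully iterated
// from the parent, so a repeat iteration can be served from the read buffer
// instead of hitting IAVL again.
package readbuffer

import "bytes"

type interval struct{ start, end []byte } // half-open range [start, end)

type intervalSet struct{ covered []interval }

// MarkIterated records that [start, end) has been fully read from the parent.
func (s *intervalSet) MarkIterated(start, end []byte) {
	s.covered = append(s.covered, interval{start: start, end: end})
}

// Cached reports whether [start, end) is contained in a previously recorded
// range, meaning the iteration is already cached.
func (s *intervalSet) Cached(start, end []byte) bool {
	for _, iv := range s.covered {
		if bytes.Compare(iv.start, start) <= 0 && bytes.Compare(end, iv.end) <= 0 {
			return true
		}
	}
	return false
}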

What's nice about having these split out is that we can also make them reusable components as we explore more distinct concurrency designs. Furthermore, we should expect different read buffer and write buffer designs to emerge. A lot of performance is sacrificed to support iteration, when really there could be flexibility in making only the things that need iteration pay that cost.

If folks like this direction, I can get started on it.

@github-actions github-actions bot added the needs-triage Issue that needs to be triaged label Feb 10, 2023
@yihuang
Collaborator

yihuang commented Feb 10, 2023

An important need that motivates me to do this cache store refactoring (#14444) is removing the overhead of nested cache stores, which is pretty significant currently.

  • we commit the dirty stuff layer by layer
    • in each layer's write, it sorts the keys and sets them into the parent in order
  • we duplicate the cached stuff in each layer
  • get/iterate also traverse the whole nesting depth.

This is useful for existing use cases since we already have 2-3 layers in the SDK, and it'll be even more important for EVM integration.
And a copy-on-write btree can fix them all ;D

@ValarDragon
Contributor Author

ValarDragon commented Feb 10, 2023

My impression for how to mitigate nesting overhead:

  • Make nesting by default create a new write buffer, no new read buffers (therefore no extra read caching)
    • Could switch this later to a design with a single tree that versions multiple layers, at a cost to concurrency
  • Keep write buffers that get iterated on at least once, entirely in sorted trees, so iterating over them is always cheap.
    • Can allow more registration in the future, to better optimize this
  • If you do heavy iteration workloads, insert some "interval tracking only" read buffer layers in the nesting.

The downside of this is that a nested store close must then copy all of its elements into the parent, which is a real cost. (But that is already the case, and every other overhead is much lower than today.) There are strategies for not requiring a copy on close, but these come with more trade-offs around concurrency guarantees, and I personally think maintaining more optionality there is worth it.
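
A minimal sketch of that nesting shape, with hypothetical names (Layer, dirty) and the write-buffer-only design described above:

// Hypothetical sketch: nesting creates a new write buffer only (no extra read
// caching); reads walk up the nesting, and closing a layer copies its dirty
// entries into the parent in sorted key order, which is the copy cost noted above.
package writebuffer

import "sort"

type Layer struct {
	parent *Layer            // nil for the outermost layer
	dirty  map[string][]byte // staged writes for this layer only
}

func NewLayer(parent *Layer) *Layer {
	return &Layer{parent: parent, dirty: map[string][]byte{}}
}

func (l *Layer) Set(key string, value []byte) { l.dirty[key] = value }

// Get checks this layer's write buffer, then walks up the nesting.
func (l *Layer) Get(key string) ([]byte, bool) {
	for cur := l; cur != nil; cur = cur.parent {
		if v, ok := cur.dirty[key]; ok {
			return v, true
		}
	}
	return nil, false
}

// Close flushes this layer's entries into the parent in sorted key order.
// (Assumes a non-nil parent; the outermost layer would flush to the real store.)
func (l *Layer) Close() {
	keys := make([]string, 0, len(l.dirty))
	for k := range l.dirty {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		l.parent.Set(k, l.dirty[k])
	}
}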

@yihuang
Collaborator

yihuang commented Feb 10, 2023

What do you think about the copy-on-write btree solution?

@ValarDragon
Contributor Author

Feels like it should be a type of write buffer. It's an important type of write buffer, and a nesting strategy for write buffers! It solves a lot of things well :)

It doesn't solve the problem of a lot of this having no abstraction, and thus no way to iterate on it. I claim that there are a lot of ways our caching model should vary, and we have several concurrency designs that should be explored, so we need to move away from being stuck with one "thing" as our caching strategy.

@yihuang
Collaborator

yihuang commented Feb 10, 2023

I hope we can put up the list of problems we are trying to solve first, before we can compare different solutions.

@alexanderbez
Contributor

I hope we can put up the list of problems we are trying to solve first, before we can compare different solutions.

I totally agree. The gist I get is that it's mainly three things:

  • CacheKV does too many things. It has too many responsibilities. Regardless of write buffers or BTrees, this can be solved independently.
  • Nesting/multi-layer-caching is slow (the main crux of the problem)
  • DevUX around nesting/multi-layer-caching

@tac0turtle tac0turtle removed the needs-triage Issue that needs to be triaged label Feb 12, 2023
@ValarDragon
Contributor Author

ValarDragon commented Feb 12, 2023

The claim is that buffering writes is very orthogonal to caching reads from the parent database, hence a separation in data types is valuable.

There are multiple things that go into how you optimally cache writes (or whether you even want them cached), depending on your use case. For reads, this is not complex.

Nesting/multi-layer caching is the problem of concern in #14444. It's one of many problems with the existing CacheKV store.

@yihuang
Collaborator

yihuang commented Feb 13, 2023

There are multiple things that go into how you optimally cache writes (or whether you even want them cached), depending on your use case. For reads, this is not complex.

One concern with separating the clean cache and dirty cache (write buffer) is that it creates more indirection on the read side: now the reader needs to check the dirty cache first, then the clean cache, then the parent's dirty cache and clean cache, and so on.

@ValarDragon
Contributor Author

There's no need to assume that your parent maintains separate dirty and clean caches! You really only need this read cache for disk I/O in most cases.

Otherwise, agreed that there is read-side indirection, but:

  • It's pretty unclear to me that this actually matters. It should be dwarfed in magnitude by many other perf concerns we face (even the I/O itself).
  • It has benefits when we want things to start being concurrent, with no locks.

@yihuang
Collaborator

yihuang commented Feb 13, 2023

There's no need to assume that your parent maintains separate dirty and clean caches! You really only need this read cache for disk I/O in most cases.

I mean the nested cache store case: solving the nesting case itself puts some requirements on the implementation, like supporting cheap snapshots in the data structure. Also, the "clean" cache may have different meanings at different layers; for example, a dirty value at a lower level may be treated as clean cache in the upper layer. Could you elaborate on these issues if I've misunderstood?

@itsdevbear
Contributor

itsdevbear commented Feb 16, 2023

I'm a big fan of the "journal" approach that is done in Geth, though it can get memory intensive.

The whole wrapping thing we have in Cosmos is kind of weird and feels overly complex.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

I'm a big fan of the "journal" approach that is done in Geth, though it can get memory intensive.

The whole wrapping thing we have in Cosmos is kind of weird and feels overly complex.

Geth is a closed system; they only need to handle a dozen types of state changes. For us, either you maintain lots of types of logical journal entries and carefully implement the undo logic, or you record the low-level set/remove operations as journal logs, in which case there will be lots of logs and revert is an O(N) operation.

With the copy-on-write btree approach, there's no wrapping thing anymore.

But I do like Geth's interface: just snapshot and revert, with no need to worry about merging two branched stores.
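
For reference, a minimal sketch of the low-level journal variant being described, with illustrative names only (not Geth's actual implementation): every set records an undo entry, and revert replays them backwards, which is linear in the number of writes since the snapshot.

// Hypothetical sketch of a low-level journal: Set records an undo entry,
// RevertTo replays them in reverse, so revert is O(N) in the writes since the snapshot.
package journal

type undo struct {
	key     string
	prev    []byte
	existed bool
}

type Store struct {
	data    map[string][]byte
	journal []undo
}

func New() *Store { return &Store{data: map[string][]byte{}} }

// Snapshot returns the current journal position to revert to later.
func (s *Store) Snapshot() int { return len(s.journal) }

// Set records the previous value before overwriting, so the write can be undone.
func (s *Store) Set(key string, value []byte) {
	prev, existed := s.data[key]
	s.journal = append(s.journal, undo{key: key, prev: prev, existed: existed})
	s.data[key] = value
}

// RevertTo rolls back to an earlier snapshot by undoing entries in reverse order.
func (s *Store) RevertTo(snapshot int) {
	for i := len(s.journal) - 1; i >= snapshot; i-- {
		u := s.journal[i]
		if u.existed {
			s.data[u.key] = u.prev
		} else {
			delete(s.data, u.key)
		}
	}
	s.journal = s.journal[:snapshot]
}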

@itsdevbear
Contributor

Depending on the implementation, you can make revert O(1); also, I'd rather optimize for the success case with a cheaper insert.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

Depending on the implementation, you can make revert O(1); also, I'd rather optimize for the success case with a cheaper insert.

Need to have two concrete proposals to compare complexity though.

@itsdevbear
Contributor

@yihuang totally agree, we have something we have been working on, will share when it's more concrete.

@yihuang
Collaborator

yihuang commented Feb 17, 2023

I think the write complexity of hashmap- and btree-based cache stores is similar (see the sketch after these bullets):

  • a single insertion into a hashmap has constant complexity, but to commit it into the parent store in order, you need to sort the keys, which is at least O(n*log(n)).
  • a btree has O(log(n)) complexity for a single insertion, but ordered iteration is only O(n).
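
Spelled out for n writes followed by one ordered commit: the hashmap costs n·O(1) inserts plus an O(n*log(n)) sort, while the btree costs n·O(log(n)) inserts plus an O(n) ordered scan, so both end up O(n*log(n)) overall. A minimal sketch of the hashmap flush path, with illustrative names:

// Hypothetical sketch of the hashmap-based write buffer flush: collect keys,
// sort them (the O(n*log(n)) step), then write to the parent in key order.
package flush

import "sort"

type writeBuffer map[string][]byte

func (w writeBuffer) flushTo(set func(key, value []byte)) {
	keys := make([]string, 0, len(w))
	for k := range w {
		keys = append(keys, k) // O(n) to collect
	}
	sort.Strings(keys) // O(n*log(n)) to order the commit
	for _, k := range keys {
		set([]byte(k), w[k]) // ordered writes into the parent
	}
}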

@robert-zaremba
Collaborator

Doesn't the log approach with O(1) complexity use copy-on-write? I can't see any other way this would work in O(1).

Another approach I was thinking of is a versioned data structure, which in fact is copy-on-write (a rough sketch follows the list):

  1. All data is in a balanced tree.
  2. The tree has a global version counter and a current version index.
  3. Each node key is a record key, and its value is a hashmap of version -> record value.
  4. When we create a new snapshot / cachewrapper, we increment the global version counter and set the current version index to the new value of the global version counter. We also need another "global" map to the previous version (so when we revert, we update the current version index).
  5. When we write data, we write it at version = the global current version index.
  6. Whenever we read data which doesn't exist yet in the tree, we set it at version 1.
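
A very rough sketch of how this could look, using a plain map in place of the balanced tree for brevity; all names (versionedStore, Snapshot, Revert) are illustrative assumptions about the idea, not an existing implementation:

// Hypothetical sketch of the versioned-value idea: each key maps versions to
// values, Snapshot bumps the global version and remembers its parent version,
// and Revert simply moves the current index back.
package versioned

type versionedStore struct {
	current  uint64            // current version index
	previous map[uint64]uint64 // version -> version it was branched from
	data     map[string]map[uint64][]byte
}

func newVersionedStore() *versionedStore {
	return &versionedStore{
		current:  1,
		previous: map[uint64]uint64{},
		data:     map[string]map[uint64][]byte{},
	}
}

// Snapshot increments the global version counter and records the previous version.
func (s *versionedStore) Snapshot() {
	next := s.current + 1
	s.previous[next] = s.current
	s.current = next
}

// Revert drops back to the version the current one was branched from.
// (Root version 1 has no parent in this sketch.)
func (s *versionedStore) Revert() {
	s.current = s.previous[s.current]
}

// Set writes a value at the current version.
func (s *versionedStore) Set(key string, value []byte) {
	if s.data[key] == nil {
		s.data[key] = map[uint64][]byte{}
	}
	s.data[key][s.current] = value
}

// Get walks back through ancestor versions until a value is found.
func (s *versionedStore) Get(key string) ([]byte, bool) {
	versions, ok := s.data[key]
	if !ok {
		return nil, false
	}
	v := s.current
	for {
		if val, ok := versions[v]; ok {
			return val, true
		}
		prev, ok := s.previous[v]
		if !ok {
			return nil, false
		}
		v = prev
	}
}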

@yihuang
Collaborator

yihuang commented Feb 17, 2023

Doesn't the log approach with O(1) complexity use copy-on-write? I can't see any other way this would work in O(1).

I guess you mean a hashmap with a log used for revert: when you write the dirty cache into the backing store, the sorting is O(n*log(n)), and the revert cost is linear in the number of write operations.

@aaronc
Member

aaronc commented Feb 17, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients
https://github.com/clojure/data.avl

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients
https://github.com/clojure/data.avl

The current iavl implementation is for an on-disk db; if we implement an in-memory avl, it'll be similar to the existing btree implementation we are using for the cache store.
We don't really need a fully immutable data structure; normally a mutable data structure is more performant than an immutable one.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Just did a benchmark to compare our in-memory IAVL implementation against the tidwall/btree library. The benchmark inserts 1m random key-value pairs into the tree, without updating hashes in the IAVL case:

go test -run=^$ -bench=. -benchmem ./ -count=4
goos: darwin
goarch: amd64
pkg: github.com/crypto-org-chain/cronos/memiavl
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRandomSet/memiavl-12         	       1	2605531081 ns/op	256009792 B/op	 2000014 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2488336328 ns/op	255999904 B/op	 2000003 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2515380451 ns/op	255999888 B/op	 2000001 allocs/op
BenchmarkRandomSet/memiavl-12         	       1	2541959269 ns/op	255999872 B/op	 1999999 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2631245985 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2624319698 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2627311580 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-1-12  	       1	2604990695 ns/op	204001352 B/op	 2060575 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1259560359 ns/op	138291648 B/op	   68968 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1236588308 ns/op	138291656 B/op	   68969 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1229279040 ns/op	138291648 B/op	   68968 allocs/op
BenchmarkRandomSet/btree-degree-32-12 	       1	1266665535 ns/op	138291648 B/op	   68968 allocs/op
PASS
ok  	github.com/crypto-org-chain/cronos/memiavl	27.651s

What's interesting is that it performs very close to the btree with degree 1, but the btree with degree 32 outperforms it.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

Another benchmark I did is about the Get operation; the "memiavl-disk" case is an on-disk mmap-ed IAVL tree. The takeaway is that the on-disk tree is almost as fast as an in-memory data structure when the OS page cache is warm. If we can make the IAVL tree itself almost as fast as the cache store, there'll be no point in caching the clean values at all.

goos: darwin
goarch: amd64
pkg: github.com/crypto-org-chain/cronos/memiavl
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRandomGet/memiavl-12         	11986951	        98.66 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	12390306	        97.24 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	12287302	        97.96 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-12         	11695172	        98.87 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 5495362	       194.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 6187528	       190.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 5406562	       272.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/memiavl-disk-12    	 6560026	       190.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32179621	        35.80 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32729978	        35.80 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	32721939	        35.85 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-1-12  	31634030	        36.03 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 5755815	       180.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6836102	       177.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6832458	       178.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkRandomGet/btree-degree-32-12 	 6862957	       179.8 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/crypto-org-chain/cronos/memiavl	35.029s

@aaronc
Member

aaronc commented Feb 18, 2023

Just an idea to throw in here: we're using an immutable data structure (an AVL tree under the hood), so why don't we just branch off that for caching writes? It's always struck me as strange that we put another, different tree data structure on top of it and do all this merge-iterator stuff, when immutable/persistent data structures give you branching for free. In particular, I think we might want to explore a thing called a transient data structure, built on top of iavl in memory. Some references:

https://clojure.org/reference/transients

https://github.com/clojure/data.avl

The current iavl implementation is for an on-disk db; if we implement an in-memory avl, it'll be similar to the existing btree implementation we are using for the cache store.

So when iavl nodes are read from disk, they build an in-memory tree, if I understand correctly. Using an in-memory iavl tree for write caching would basically mean a combination of nodes that were read from disk and new in-memory nodes, which only get written if there is a commit. A rollback should be trivial if needed, because you just hang onto the old root.

We don't really need a fully immutable data structure; normally a mutable data structure is more performant than an immutable one.

That's the idea with transients. There's mutability while writing is happening, but the resulting tree is still immutable. It's a way of starting from an immutable version of a tree, say version N on disk, efficiently adding a bunch of nodes to that tree already ordered and balanced, and either writing them as version N+1 or discarding them. It would seem that with a btree cache layer we have more overhead, because we build a different kind of tree and then need to iterate over that tree to build an immutable iavl tree anyway, without getting the advantage of transients when doing so. That's not to mention the overhead of merge iterators.

@yihuang
Collaborator

yihuang commented Feb 18, 2023

So when iavl nodes are read from disk, they build an in-memory tree, if I understand correctly. Using an in-memory iavl tree for write caching would basically mean a combination of nodes that were read from disk and new in-memory nodes, which only get written if there is a commit. A rollback should be trivial if needed, because you just hang onto the old root.

In the current db-backed iavl tree, nodes reference each other with many more indirections, but that's not to say the IAVL implementation itself can't be redesigned if we are going that far.
The more critical issue with writing into the IAVL tree directly is that write ordering is more complex, because the resulting tree structure depends on the write order. The current implementation has a very useful property that we can replay the change sets extracted from existing tree versions to rebuild an identical tree, and that relies on a particular order of the changes.

@alexanderbez
Contributor

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

@yihuang
Collaborator

yihuang commented Feb 22, 2023

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

Yeah, one thing that can be significantly faster is the golang map, which can help if there are repeated get operations on the same keys; I don't know how many cases of that we have in our code though.
If we do the read/write (clean/dirty) cache separation, I think the golang map would be perfect for the read (clean) cache, and a sorted tree is good for the write (dirty) cache.
The other thing to consider is that the size of the cache is much smaller than the size of the iavl tree, so the height of the tree is smaller and query performance can be better.

@yihuang
Collaborator

yihuang commented Mar 2, 2023

Benchmarks look promising @yihuang. Is 32 degrees the default? This means we could get rid of the CacheKV almost entirely.

It's only this fast because memiavl nodes are referenced by offset in an mmap-ed file, without going through a kv-db at all, which removes a lot of indirection; it works almost the same as an in-memory data structure.
The main uncertainty with mmap is that I/O performs less than ideally when the db size is larger than the available RAM. There is some research on that [1], but I think the advantages still outweigh the disadvantages, and we can even develop custom I/O and cache management if we are willing to go that far ;D

Footnotes

  1. https://cs.brown.edu/people/acrotty/pubs/p13-crotty.pdf

@alexanderbez
Contributor

alexanderbez commented Oct 25, 2023

I'd like to share the work that's been going on in the store v2 development efforts. One of our main goals is to keep abstractions to a minimum and keep each component relatively isolated and simple.

In the context of CacheKV, we actually have no such abstraction or notion in store v2. Instead, we simply have a store that stages dirty writes (we call this branch.Store atm). There's no explicit caching at all in this store.

We currently do not have plans to introduce an explicit cache layer, but if we were, we'd do it as a separate store to keep components minimal in nature. This way, reads would flow as follows: DirtySet -> Cache -> SS.

Ideally, we rely on the SS backend to give us good caching, but if benchmarks show that reads are not where we'd like, then we can evaluate introducing a cache store/layer.

@ValarDragon
Contributor Author

Closing in light of a different branch store design that I don't have context on.
