Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CacheKV has no size constraint which would cause memory leak #13834

Closed
cryptocifer opened this issue Nov 10, 2022 · 19 comments
Closed

CacheKV has no size constraint which would cause memory leak #13834

cryptocifer opened this issue Nov 10, 2022 · 19 comments

Comments

@cryptocifer
Copy link

cryptocifer commented Nov 10, 2022

Summary of Bug

We have an archive node, it's memory usage is constantly rising and never fall. The memory rises a lot especially when we make rpc calls to it for history state.

we made heap profile and looked into the code and found there's a cachekv layer which hold the pointer to the underlying iavl node. the cachekv uses a map to hold the pointer and the issue here is there's no other constraints on the map to limit how many iavl nodes it can hold.

For a pruned node, iavl node keeps being pruned as the block number grows, while for an archive node, the iavl nodes are never pruned, so as the block grows, the iavl nodes loaded into memory(loaded by block sync, rpc calls etc...) will stay there permanently, they won't be GCed because cachekv also holds pointers to them permanently.

Version

v0.8.2

@barryz
Copy link

barryz commented Nov 11, 2022

Same issue.

@tac0turtle
Copy link
Member

Thanks for opening the issue. We will look into this.

Cc @alexanderbez

@yihuang
Copy link
Collaborator

yihuang commented Nov 11, 2022

normally, the CacheKV should be resetted at commit event?

@cryptocifer
Copy link
Author

cryptocifer commented Nov 11, 2022

shares the heap profile we catched before and after the memory jump if it helps. By substracting the two profiles we can see the large part is on iavl.MakeNode, I think it ought to be garbage collected but not quite sure why it isn't. My rough triaging came out that it's the cachekv that holds the reference to the node.

cronos_heap_diff.zip (see the updated file below)

@yihuang
Copy link
Collaborator

yihuang commented Nov 11, 2022

cachekv caches the key values directly, iavl.MakeNode is cached by the iavl's nodedb, which is supposed to be a lru cache with configurable cache size. what value do you set as the iavl-cache-size

@cryptocifer
Copy link
Author

@yihuang we set it to 15625000, I got that the default capacity 781250 occupies around 50mb in memory, so by calculation 15625000 corresponds 1GB I think

@yihuang
Copy link
Collaborator

yihuang commented Nov 11, 2022

One case comes to mind where cachekv cache won't be cleared is grpc-only, where there's no abci event to reset the cache.

@alexanderbez
Copy link
Contributor

Are you running grpc-only?

@cryptocifer
Copy link
Author

cryptocifer commented Nov 12, 2022

We're not running with grpc-only, our node is with tendermint and keeps syncing up. Oh btw, it's a cronos node in our case, is there any customization with it?

@yihuang
Copy link
Collaborator

yihuang commented Nov 12, 2022

By substracting the two profiles we can see the large part is on iavl.MakeNode

How did you subtract the profiles? And how do you open the heap file, I tried to open in browser but can't display the complete picture.

@cryptocifer
Copy link
Author

cryptocifer commented Nov 12, 2022

How did you subtract the profiles? And how do you open the heap file, I tried to open in browser but can't display the complete picture.

After extracting the zip file, can run this command

go tool pprof -http :8883 -base cronos1_base.heap cronos1_current.heap

to get the heap diff

@yihuang
Copy link
Collaborator

yihuang commented Nov 12, 2022

How did you subtract the profiles? And how do you open the heap file, I tried to open in browser but can't display the complete picture.

After extracting the zip file, can run this command

go tool pprof -http :8883 -base cronos1_base.heap cronos1_current.heap

to get the heap diff

cronos_d1_current.heap: parsing profile: unrecognized profile format
cronos_d1_base.heap: parsing profile: unrecognized profile format

@cryptocifer
Copy link
Author

Oh seems I uploaded an wrong file, please refer to this one:
cronos_heap_diff2.zip

@yihuang
Copy link
Collaborator

yihuang commented Nov 12, 2022

there's only 100M created by iavl.MakeNode, it seems align with your iavl cache size setting? Maybe it just handled some EthCall requests for a old block height, results in loading lots of cold data.

@cryptocifer
Copy link
Author

cryptocifer commented Nov 12, 2022

The 100M+ is a sampled result of golang's heap profiler, we can look at the proportions.

Maybe it just handled some EthCall requests for a old block height, results in loading lots of cold data.

yes it is, the point I think is there's no size limit on the loaded cold data, there're two layers of cache, iavl layer and the cachekv layer, iavl-cache-size controls the cache size in the iavl layer, while cachekv seems has no limitation now

@yihuang
Copy link
Collaborator

yihuang commented Nov 12, 2022

yes it is, the point I think is there's no size limit on the loaded cold data, there're two layers of cache, iavl layer and the cachekv layer, iavl-cache-size controls the cache size in the iavl layer, while cachekv seems has no limitation now

Agree, the only limit is it's reset at each block cycle, but if there are a surge of query requests happens between that, it could cause a surge of memory usage.

@cryptocifer
Copy link
Author

cryptocifer commented Nov 14, 2022

is the cachekv reset guranteed to run? if so it should be frequent since new block comes every several seconds, while it looks doesn't reset in our case since the memory just rises without fall.

@alexanderbez
Copy link
Contributor

Yes, the cache is cleared on Write() (https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L141-L153)

@tac0turtle
Copy link
Member

@cifer76 are you still running into this issue? the last comment eludes it could be something else

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants