Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: mismatch in rpc-metrics all versus individual #28619

Closed
Aracki opened this issue Nov 28, 2023 · 8 comments · Fixed by #28649
Closed

rpc: mismatch in rpc-metrics all versus individual #28619

Aracki opened this issue Nov 28, 2023 · 8 comments · Fixed by #28649

Comments

@Aracki
Copy link
Contributor

Aracki commented Nov 28, 2023

Something we don't get regarding Summary metric rpc_duration_all and the same metric per method.

Just for easier glance, we created 2 recording rules so we can get percentiles per method by searching specific label:

- expr: label_replace({__name__=~"rpc_duration_.*_success"}, "method", "$1", "__name__", "rpc_duration_(.+)_success")
  record: geth_rpc_requests_success
- expr: label_replace({__name__=~"rpc_duration_.*_failure"}, "method", "$1", "__name__", "rpc_duration_(.+)_failure")
  record: geth_rpc_requests_failure

So now when we compare P95 for all methods:

rpc_duration_all{pod='node-polygon-0', quantile='0.95'}    316677.89999999973

and P95 for top 5 methods:

topk(5, (sum(rate(geth_rpc_requests_success{quantile='0.95', pod="node-polygon-0"}[5m])) by (pod, method)))

{method="eth_getLogs", pod="node-polygon-0"} | 182.76462962962964
{method="eth_call", pod="node-polygon-0"} | 176.64814814814804
{method="eth_getTransactionReceipt", pod="node-polygon-0"} | 145.22407407407408
{method="eth_gasPrice", pod="node-polygon-0"} | 92.9074074074074
{method="eth_getTransactionByHash", pod="node-polygon-0"} | 20.548703703703705

See the results. The difference is huge. Thats not possible if summaries work correctly, those "top 5 P95 values" should be much much closer to rpc_duration_all.

Can someone explain this behaviour?

@Rene27132
Copy link

Let's go

@holiman
Copy link
Contributor

holiman commented Dec 1, 2023

NewTimer (used by the summary)

	return &StandardTimer{
		histogram: NewHistogram(NewExpDecaySample(1028, 0.015)),
		meter:     NewMeter(),
	}

The individual ones:

	sampler := func() metrics.Sample {
		return metrics.ResettingSample(
			metrics.NewExpDecaySample(1028, 0.015),
		)
	}

So it seems that they are using the same basic building blocks.
They are both updated at the same time:

		rpcServingTimer.UpdateSince(start)
		updateServeTimeHistogram(msg.Method, answer.Error == nil, time.Since(start))

I can't say what the difference comes from. I think you need to investigate your data more, for example, aside from those top 5, are there any extreme other outliers which affect the "all" meter?

@Aracki
Copy link
Contributor Author

Aracki commented Dec 1, 2023

I have only ~10 eth methods which has P95 > 0, and also no extreme outliers as all others are lower values.

My guess is that one of these two metrics not showing correct results. I checked different networks with this metrics and all has the same behaviour: Value of "all metric P95" being 100-1000x higher than highest P95 per method. (e.g. ethereum mainnet, sepolia, polygon,...)

@MariusVanDerWijden
Copy link
Member

MariusVanDerWijden commented Dec 5, 2023

I think you're only showing the top 5 methods that succeed. Could you do topk(5, (sum(rate(geth_rpc_requests_failure{quantile='0.95', pod="node-polygon-0"}[5m])) by (pod, method))) as well?

There will always be a small mismatch between the two numbers since we do the time.Since computation twice at different points, but the difference should not be as big

@Aracki
Copy link
Contributor Author

Aracki commented Dec 5, 2023

I thought that as well, but most of the time (checked now) the geth_rpc_requests_failure (recording rule) returns only 0 for all methods.

Exactly as you said. Difference is ok to exist, but this big difference means something is not working correctly.

Here are the current results for my Ethereum Mainnet node (just to be sure it's not something related to Polygon fork.

rpc_duration_all{pod='node-ethereum-mainnet-0', quantile='0.95'} 184336.69999999998

topk(5, (sum(rate(geth_rpc_requests_success{quantile='0.95', pod="node-ethereum-mainnet-0"}[5m])) by (pod, method)))

{method="engine_newPayloadV2", pod="node-ethereum-mainnet-0"} | 4137.368677037038
{method="eth_getTransactionReceipt", pod="node-ethereum-mainnet-0"} | 137.3148148148148
{method="engine_forkchoiceUpdatedV2", pod="node-ethereum-mainnet-0"} | 125.63281106172843
{method="eth_call", pod="node-ethereum-mainnet-0"} | 80.05578186172828
{method="eth_getBlockByNumber", pod="node-ethereum-mainnet-0"} | 70.96648148148127

topk(5, (sum(rate(geth_rpc_requests_failure{quantile='0.95', pod="node-ethereum-mainnet-0"}[5m])) by (pod, method)))

{method="eth_call", pod="node-ethereum-mainnet-0"} | 5575.1866666666665
{method="eth_coinbase", pod="node-ethereum-mainnet-0"} | 0
{method="eth_getBalance", pod="node-ethereum-mainnet-0"} | 0
{method="eth_estimateGas", pod="node-ethereum-mainnet-0"} | 0
{method="eth_getCode", pod="node-ethereum-mainnet-0"} | 0

Difference is ~30x. 🤔

@holiman
Copy link
Contributor

holiman commented Dec 5, 2023

Could you get the raw metrics report from geth? I don't know what happens there, with the topk stuff for example
Also, instead of top 5, could you check all of them? Or , top 100?

@Aracki
Copy link
Contributor Author

Aracki commented Dec 5, 2023

100% sure everything is ok on Prometheus side. top 100 will just extend the list and all the other values are lower than one presented.

Here are the raw metrics from very same node. I did grep on all P95 rpc_duration metrics:

rpc_duration_all {quantile="0.95"} 205882.49999999997
rpc_duration_engine_exchangeCapabilities_success {quantile="0.95"} 0
rpc_duration_engine_exchangeTransitionConfigurationV1_success {quantile="0.95"} 0
rpc_duration_engine_forkchoiceUpdatedV2_success {quantile="0.95"} 0
rpc_duration_engine_getPayloadBodiesByRangeV1_success {quantile="0.95"} 0
rpc_duration_engine_newPayloadV2_success {quantile="0.95"} 0
rpc_duration_eth_blockNumber_success {quantile="0.95"} 33
rpc_duration_eth_call_failure {quantile="0.95"} 0
rpc_duration_eth_call_success {quantile="0.95"} 1364
rpc_duration_eth_chainId_success {quantile="0.95"} 0
rpc_duration_eth_coinbase_failure {quantile="0.95"} 0
rpc_duration_eth_estimateGas_failure {quantile="0.95"} 0
rpc_duration_eth_estimateGas_success {quantile="0.95"} 0
rpc_duration_eth_feeHistory_success {quantile="0.95"} 0
rpc_duration_eth_gasPrice_success {quantile="0.95"} 0
rpc_duration_eth_getBalance_failure {quantile="0.95"} 0
rpc_duration_eth_getBalance_success {quantile="0.95"} 102
rpc_duration_eth_getBlockByHash_success {quantile="0.95"} 307
rpc_duration_eth_getBlockByNumber_success {quantile="0.95"} 2884.9999999999977
rpc_duration_eth_getBlockReceipts_success {quantile="0.95"} 0
rpc_duration_eth_getBlockTransactionCountByNumber_success {quantile="0.95"} 0
rpc_duration_eth_getCode_failure {quantile="0.95"} 0
rpc_duration_eth_getCode_success {quantile="0.95"} 205.19999999999982
rpc_duration_eth_getFilterChanges_failure {quantile="0.95"} 0
rpc_duration_eth_getFilterChanges_success {quantile="0.95"} 21
rpc_duration_eth_getLogs_failure {quantile="0.95"} 0
rpc_duration_eth_getLogs_success {quantile="0.95"} 0
rpc_duration_eth_getProof_success {quantile="0.95"} 0
rpc_duration_eth_getStorageAt_success {quantile="0.95"} 140.14999999999998
rpc_duration_eth_getTransactionByBlockHashAndIndex_success {quantile="0.95"} 0
rpc_duration_eth_getTransactionByBlockNumberAndIndex_success {quantile="0.95"} 59
rpc_duration_eth_getTransactionByHash_success {quantile="0.95"} 1093.9999999999986
rpc_duration_eth_getTransactionCount_failure {quantile="0.95"} 0
rpc_duration_eth_getTransactionCount_success {quantile="0.95"} 108
rpc_duration_eth_getTransactionReceipt_success {quantile="0.95"} 0
rpc_duration_eth_hashrate_success {quantile="0.95"} 0
rpc_duration_eth_newPendingTransactionFilter_success {quantile="0.95"} 0
rpc_duration_eth_sendRawTransaction_failure {quantile="0.95"} 0
rpc_duration_eth_sendRawTransaction_success {quantile="0.95"} 0
rpc_duration_eth_syncing_success {quantile="0.95"} 0
rpc_duration_eth_uninstallFilter_success {quantile="0.95"} 0
rpc_duration_net_peerCount_success {quantile="0.95"} 0

@holiman
Copy link
Contributor

holiman commented Dec 6, 2023

Thanks for this, you are right, and I've found the bug now

@holiman holiman changed the title Mismatch regarding percentiles for rpc_duration_all and rpc_duration.*_success rpc: mismatch in rpc-metrics all versus individual Dec 6, 2023
@fjl fjl closed this as completed in #28649 Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants