DO NOT MERGE 2xTx: Double transaction size on top of #29055 #29310
Conversation
Bad news is that the base commit for these changes had a problem which doesn't allow me to run it.
force-pushed from a2b7765 to 821da80
This has been rebased correctly, let me know if you still run into anything.
The following query examines the number of packets sent by the QUIC server, and then takes the max across all metric reporting nodes in the cluster. For a given time window, the maximum value should correspond to the leader.
Here is the screen grab from the past 2 days, with the absolute peak reaching close to 400k packets but the overall average looking much lower. Any load-shedding (discarding random packets, de-duping, etc.) happens further down the pipeline. Also, the metric is reported every 5 seconds, so these numbers are over that 5-second period. Recall that our QUIC server currently creates and sends a single-element batch (solana/streamer/src/nonblocking/quic.rs, line 670 at 9a82368).
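For readers skimming, here is a minimal sketch of what that linked line does, assuming the solana_perf PacketBatch API and a crossbeam channel; the names and signatures are illustrative, not the actual quic.rs ones:

```rust
use crossbeam_channel::Sender;
use solana_perf::packet::{Packet, PacketBatch};

// Illustrative sketch, not the actual quic.rs code: each packet received off
// a QUIC stream is wrapped in its own one-element PacketBatch and sent down
// the channel, so one incoming packet == one batch on the pipeline.
fn forward_packet(packet_sender: &Sender<PacketBatch>, packet: Packet) {
    let mut batch = PacketBatch::with_capacity(1);
    batch.push(packet);
    // Ignore send errors; a disconnected receiver means we are shutting down.
    let _ = packet_sender.send(batch);
}
```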
Granted this isn't super-comprehensive, but given the above, we're looking at, say, a ~400k-packet temporary burst for incoming transactions (both tpu and tpu_forwards). Now, suppose we blanket double the buffer for all transactions. Assuming those 400k packets all came in within one second (they didn't, but we'll run with it as the worst case), that's roughly 0.5 GB of extra memory for the burst. A fully saturated 1 GBps link doing nothing else except receiving transactions could theoretically see ~800k packets per second, and by the same line of thinking, that would only be a ~1 GB additional memory spike from doubling the buffer transactions are put into. All this said, I don't expect to see a massive effect on RAM usage, unless it's late enough that I missed something simple in my run-through here. But we should definitely run the tests.
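To make that arithmetic reproducible, here is the back-of-envelope calculation as a small Rust program, assuming doubling the buffer costs one extra PACKET_DATA_SIZE (1232 bytes) per buffered packet; the packet rates are the ones from the discussion above:

```rust
// Back-of-envelope check of the numbers above. Assumption: doubling the
// per-transaction buffer costs one extra PACKET_DATA_SIZE per packet held.
const PACKET_DATA_SIZE: u64 = 1232; // solana_sdk::packet::PACKET_DATA_SIZE

fn main() {
    let observed_burst = 400_000; // peak packets seen in a 5s metric window
    let saturated_pps = 800_000; // ~1 GBps link / 1232-byte packets

    let extra_bytes = |packets: u64| packets * PACKET_DATA_SIZE;
    // ~493 MB for the observed burst, ~986 MB (~1 GB) for a saturated link.
    println!("burst:     ~{} MB extra", extra_bytes(observed_burst) / 1_000_000);
    println!("saturated: ~{} MB extra", extra_bytes(saturated_pps) / 1_000_000);
}
```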
packet_batches_sent is the accumulated number of packets the quic streamer sends down the pipeline within the last 5 seconds. Depending on how quickly we are receiving from the pipe and processing it, that number can be over/under the actual peak usage.
Yep, I saw/noted that the metric covers 5 seconds of runtime. It's hard to know the distribution over those 5 seconds (likely mostly within the window that node is leader), so I leaned towards the worst case with the quick math.
To confirm, your point here is that the pipeline might back up if the downstream stages are unable to keep up with what the QUIC server is pushing in?
Yes. I think we need to look at the max element count in the channel and that of the downstream.
The methods that pull packets from the channels between quic/FindSenderStake and FindSenderStake/SigVerify drain the channels when they receive (see solana/streamer/src/streamer.rs, lines 344 to 358 at 926debd).
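The shape of that drain loop, as a hedged sketch against crossbeam channels rather than a copy of the linked lines (`T` stands in for PacketBatch):

```rust
use std::time::Duration;
use crossbeam_channel::{Receiver, RecvTimeoutError};

// Sketch of the drain pattern at the linked lines: block for the first item,
// then non-blockingly pull everything else already queued, emptying the
// channel in one pass.
fn recv_and_drain<T>(recvr: &Receiver<T>) -> Result<Vec<T>, RecvTimeoutError> {
    let first = recvr.recv_timeout(Duration::from_secs(1))?;
    let mut batches = vec![first];
    while let Ok(batch) = recvr.try_recv() {
        batches.push(batch);
    }
    Ok(batches)
}
```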
After draining the channels, both stages then do their load-shedding with discard_batches_randomly(), which drops (thus immediately freeing the memory of) discarded batches. I believe these stages have a target "loop duration", and we shave off packets in excess of some N such that we can process N packets within that target loop duration.
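A sketch of that shedding idea (not the actual discard_batches_randomly() source), assuming whole batches are dropped at random until the packet count fits the per-loop budget N:

```rust
use rand::{thread_rng, Rng};

// Illustrative only: remove randomly chosen batches until the total packet
// count fits the budget. swap_remove drops each removed batch on the spot,
// which is what frees its memory immediately. `T` stands in for PacketBatch.
fn shed_to_budget<T>(batches: &mut Vec<T>, len_of: impl Fn(&T) -> usize, budget: usize) {
    let mut total: usize = batches.iter().map(&len_of).sum();
    let mut rng = thread_rng();
    while total > budget && !batches.is_empty() {
        let removed = batches.swap_remove(rng.gen_range(0..batches.len()));
        total -= len_of(&removed);
    }
}
```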
force-pushed from 821da80 to 06b3202
Codecov Report
@@            Coverage Diff            @@
##           master   #29310     +/-   ##
=========================================
  Coverage    76.5%    76.5%
=========================================
  Files          54       54
  Lines        3129     3130       +1
  Branches      470      470
=========================================
+ Hits         2396     2397       +1
  Misses        568      568
  Partials      165      165
Problem
We need a branch that has doubled transaction sizes on top of #29055
Summary of Changes
Add one more commit on top which just doubles transaction sizes. This can be used with bench-tps against a normal cluster to see the impact of larger packet batches. In the best case, there will only be an effect on RAM usage.
Fixes #