
DO NOT MERGE 2xTx: Double transaction size on top of #29055 #29310

Closed
wants to merge 12 commits

Conversation

joncinque
Contributor

Problem

We need a branch that has doubled transaction sizes on top of #29055

Summary of Changes

Add one more commit on top which just doubles transaction sizes. This can be used with bench-tps against a normal cluster to see the impact of larger packet batches. In the best case, there will only be an effect on RAM usage.

Fixes #

@github-actions github-actions bot added the web3.js Related to the JavaScript client label Dec 17, 2022
@KirillLykov
Contributor

The bad news is that the base commit for these changes had a problem that didn't allow me to run the ./net.sh script. Rebasing on master helped.

@joncinque
Contributor Author

This has been rebased correctly; let me know if you still run into anything.

@steviez
Contributor

steviez commented Dec 20, 2022

In the best case, there will only be an effect on RAM usage.

The following query examines the number of packets sent by the QUIC server, and then takes the max across all metric reporting nodes in the cluster. For a given time window, the maximum value should correspond to the leader.

SELECT max("packet_batches_sent") AS "max_packet_batches_sent"
FROM "mainnet-beta"."autogen"."quic-connections"
WHERE time > :dashboardTime: AND time < :upperDashboardTime: GROUP BY time(5s) FILL(null)

Here is the screen grab from the past 2 days, with the absolute peak reaching close to 400k packets, but the overall average looking much lower. Any load-shedding (discarding random packets, de-duping, etc.) is done later down the pipeline. Also, the metric is reported every 5 seconds, so these numbers are over that 5-second period.
[screenshot: max packet_batches_sent over the past 2 days]

Recall that our QUIC server currently creates and sends a single-element PacketBatch at a time.

let mut batch = PacketBatch::with_capacity(1);

Granted, this isn't super-comprehensive, but given the above, we're looking at, say, a ~400k packet temporary burst for incoming transactions (both tpu and tpu_forwards). Now, supposing that we blanket double the buffer for all transactions to 2 * PACKET_DATA_SIZE, we're only allocating an extra num_packets * PACKET_DATA_SIZE bytes (since we would have had to allocate PACKET_DATA_SIZE anyway).

Assuming those 400k packets all came in within one second (they didn't, but we'll run with it as the worst case), 400k * 1232 additional bytes per packet = ~500 MB of additional memory spike. But these packets will either be quickly processed or shed, so the uptick will indeed be a momentary spike.

A fully saturated 1 GB/s link that is doing nothing else except receiving transactions could theoretically see ~800k packets per second, and using the above line of thinking, this would only be a 1 GB additional memory spike for doubling the buffer that transactions are put into.
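As a quick sanity check on those two figures, here is a back-of-envelope sketch (the constants just restate the numbers above; this is not a measurement):

fn main() {
    const PACKET_DATA_SIZE: usize = 1232; // current fixed packet buffer size, in bytes

    // ~400k-packet burst, each packet needing PACKET_DATA_SIZE extra bytes once buffers are doubled.
    let burst_packets: usize = 400_000;
    let burst_extra_bytes = burst_packets * PACKET_DATA_SIZE;
    println!("burst spike: ~{} MB", burst_extra_bytes / 1_000_000); // ~492 MB

    // A fully saturated 1 GB/s link carrying nothing but max-size transactions.
    let link_bytes_per_sec: usize = 1_000_000_000;
    let packets_per_sec = link_bytes_per_sec / PACKET_DATA_SIZE; // ~811k packets/s
    let sustained_extra_bytes = packets_per_sec * PACKET_DATA_SIZE;
    println!(
        "saturated link: ~{}k packets/s, ~{} MB/s of extra buffer", // ~999 MB/s, i.e. roughly 1 GB
        packets_per_sec / 1_000,
        sustained_extra_bytes / 1_000_000,
    );
}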

All this said, I guess I don't expect to see a massive effect on RAM usage, unless it's late enough that I missed something simple in my run-through here. But we should definitely run the tests.

@lijunwangs
Contributor

The following query examines the number of packets sent by the QUIC server, and then takes the max across all metric reporting nodes in the cluster. For a given time window, the maximum value should correspond to the leader.

packet_batches_sent is the accumulated number of packets quic streamer sends down the pipeline within the last 5 seconds. Depending on how quickly we are receiving from the pipe and processing it, that number can be over or under the actual peak usage.

@steviez
Contributor

steviez commented Dec 20, 2022

packet_batches_sent is the accumulated number of packets quic streamer sends down the pipeline within the last 5 seconds

Yep, I saw/noted that the metric covers 5 seconds of runtime. It's hard to know the distribution over those 5 seconds (likely mostly within the duration that the node is leader), so I leaned towards the worst case with the quick math.

Depends on how quickly we are receiving from the pipe and processing it -- that number can be over/under the actual peak usage.

To confirm, your point here is that the pipeline might back up if the downstream stages are unable to keep up with what the QUIC server is pushing in?

@lijunwangs
Contributor

To confirm, your point here is that the pipeline might back up if the downstream stages are unable to keep up with what the QUIC server is pushing in?

Yes. I think we need to look at the max element count in the channel and that of the downstream.
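One low-effort way to get that visibility, assuming the stages keep using crossbeam channels, is to sample the channel length when reporting metrics. This is a hypothetical sketch (report_backlog and the stage name are made up), but Receiver::len() is a real crossbeam_channel API:

use crossbeam_channel::{unbounded, Receiver};

// Sketch: sample the current backlog of a stage's input channel so it can be
// reported alongside the existing metrics.
fn report_backlog<T>(stage_name: &str, receiver: &Receiver<T>) {
    let backlog = receiver.len(); // number of items currently queued in the channel
    println!("{stage_name} channel backlog: {backlog}");
}

fn main() {
    let (sender, receiver) = unbounded::<u64>();
    for i in 0..3 {
        sender.send(i).unwrap();
    }
    report_backlog("sigverify", &receiver);
}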

@steviez
Contributor

steviez commented Dec 20, 2022

Yes. I think we need to look at the max element count in the channel and that of the downstream.

The methods that pull packets from the channels between quic/FindSenderStake and FindSenderStake/SigVerify drain the channels when they receive:

pub fn recv_packet_batches(
    recvr: &PacketBatchReceiver,
) -> Result<(Vec<PacketBatch>, usize, Duration)> {
    let timer = Duration::new(1, 0);
    let packet_batch = recvr.recv_timeout(timer)?;
    let recv_start = Instant::now();
    trace!("got packets");
    let mut num_packets = packet_batch.len();
    let mut packet_batches = vec![packet_batch];
    // Keep pulling until the channel is empty, i.e. drain everything queued so far.
    while let Ok(packet_batch) = recvr.try_recv() {
        trace!("got more packets");
        num_packets += packet_batch.len();
        packet_batches.push(packet_batch);
    }
    let recv_duration = recv_start.elapsed();
    Ok((packet_batches, num_packets, recv_duration))
}

After draining the channels, both stages then do their load-shedding with discard_batches_randomly(), which drops (and thus immediately frees the memory of) discarded batches. I believe these stages have a target "loop duration", and we shave off packets in excess of some N such that we can process N packets within that target loop duration.
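For illustration, here is a rough sketch of that random load-shedding idea; it approximates the approach rather than reproducing the actual discard_batches_randomly implementation, and PacketBatch is stood in for by a plain Vec:

use rand::seq::SliceRandom;
use rand::thread_rng;

// Illustrative stand-in for PacketBatch: each inner Vec<u8> is one packet's buffer.
type Batch = Vec<Vec<u8>>;

// Shuffle the batches, keep whole batches until the packet budget is reached,
// and drop the rest so their memory is freed immediately.
fn shed_batches_randomly(batches: &mut Vec<Batch>, max_packets: usize) {
    batches.shuffle(&mut thread_rng());
    let mut kept_packets = 0;
    batches.retain(|batch| {
        if kept_packets + batch.len() <= max_packets {
            kept_packets += batch.len();
            true
        } else {
            false
        }
    });
}

fn main() {
    // 10 batches of 64 max-size packets; a budget of 128 packets keeps 2 batches.
    let mut batches: Vec<Batch> = (0..10).map(|_| vec![vec![0u8; 1232]; 64]).collect();
    shed_batches_randomly(&mut batches, 128);
    assert!(batches.iter().map(|b| b.len()).sum::<usize>() <= 128);
    println!("kept {} batches", batches.len());
}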

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 4, 2023
@github-actions github-actions bot closed this Jan 12, 2023
@steviez steviez reopened this Jan 12, 2023
@steviez steviez removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 12, 2023
@codecov

codecov bot commented Jan 12, 2023

Codecov Report

Merging #29310 (06b3202) into master (d392378) will increase coverage by 0.0%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #29310   +/-   ##
=======================================
  Coverage    76.5%    76.5%           
=======================================
  Files          54       54           
  Lines        3129     3130    +1     
  Branches      470      470           
=======================================
+ Hits         2396     2397    +1     
  Misses        568      568           
  Partials      165      165           

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 27, 2023
@github-actions github-actions bot closed this Feb 6, 2023