
MSM improvements #372

Merged: 14 commits into dev, Feb 15, 2024
Conversation

DmytroTym (Contributor):

Describe the changes

MSM can now handle zero base points. They are represented as affine points with both x and y coordinates equal to zero, which is (as far as I know) consistent with gnark and rapidsnark, but not with arkworks. The Rust tests are changed accordingly.
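
To make the representation concrete, here is a minimal sketch of the convention; `field_t`/`affine_t` and the limb layout are hypothetical stand-ins, not the library's actual types:

```cpp
#include <cstdint>

// Hypothetical stand-ins for the library's field/point types.
struct field_t {
  uint64_t limbs[4];
  bool is_zero() const { return limbs[0] == 0 && limbs[1] == 0 && limbs[2] == 0 && limbs[3] == 0; }
};
struct affine_t { field_t x, y; };

// Per the convention above, the zero (identity) base point is the affine pair (0, 0).
bool is_point_at_infinity(const affine_t& p) { return p.x.is_zero() && p.y.is_zero(); }
```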

A number of performance and memory improvements have been made:

  1. Kernel launches, allocations and frees are moved around to minimise the memory footprint and to better parallelise copying bases from host to device with sorting scalars. Still, compute often has to wait for the bases to finish copying, because the copy takes significantly longer than scalar sorting. The likely future fix is to do what Matter Labs does: compute the MSM in chunks, so that uploading the next chunk of base points is masked by bucket accumulation over the previous chunk (see the sketch after this list).
  2. Speaking of sorting scalars: instead of sorting indices for each bucket module individually, it's now done for all indices at once. While this requires a bit more memory and should, in theory, take more operations than the one-by-one approach, it turns out to be faster in practice, not just for smaller MSMs but for large ones as well. It also lets us remove zero buckets automatically.
  3. Removing zero buckets allows us to avoid using them in reduction. Though this makes the code more convoluted and doesn't help much for large MSMs, it boosts small MSMs quite a bit.
  4. For large buckets, I increased the number of threads per bucket by a factor of large_bucket_size, so that each thread in large-bucket accumulation does the same amount of work as expected in normal accumulation. On its own this worsens the potential memory bottleneck in large-bucket accumulation: in the old version, the memory allocated here is proportional to the number of threads in the largest bucket times the number of large buckets, so if the largest bucket is really large and there are lots of much smaller large buckets, we might run out of memory. I therefore allocated only as much memory as necessary for each large bucket, depending on its size. This complicates the code, but I think the speedup and memory savings are worth it.
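
To illustrate the chunking idea from item 1, here is a rough double-buffering sketch. This is not code from this PR: `affine_t`, `accumulate_chunk`, and the launch configuration are placeholders, and `h_bases` is assumed to be pinned memory so the copies are truly asynchronous.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Stand-in for the library's affine point type (two bn254 field elements).
struct affine_t { unsigned long long x[4], y[4]; };

// Placeholder for bucket accumulation over one chunk of bases.
__global__ void accumulate_chunk(const affine_t* bases, size_t len) { /* elided */ }

// Double-buffered chunking: while chunk k is accumulated on compute_stream,
// the bases of chunk k+1 are uploaded on copy_stream.
void msm_chunked(const affine_t* h_bases, size_t n, size_t chunk_size, affine_t* d_buf[2])
{
  cudaStream_t copy_stream, compute_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);
  cudaEvent_t uploaded[2], consumed[2];
  for (int k = 0; k < 2; k++) {
    cudaEventCreateWithFlags(&uploaded[k], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&consumed[k], cudaEventDisableTiming);
  }

  unsigned buf = 0;
  for (size_t i = 0; i < n; i += chunk_size, buf ^= 1) {
    size_t len = std::min(chunk_size, n - i);
    // Don't overwrite a buffer the compute stream may still be reading.
    cudaStreamWaitEvent(copy_stream, consumed[buf], 0);
    cudaMemcpyAsync(d_buf[buf], h_bases + i, len * sizeof(affine_t),
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(uploaded[buf], copy_stream);
    // Accumulation waits only for its own chunk; the next upload overlaps it.
    cudaStreamWaitEvent(compute_stream, uploaded[buf], 0);
    accumulate_chunk<<<256, 256, 0, compute_stream>>>(d_buf[buf], len);
    cudaEventRecord(consumed[buf], compute_stream);
  }
  cudaStreamSynchronize(compute_stream);
}
```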

Benchmarks

Measurements are made on an RTX 3090Ti card for the bn254 curve; H2D memory operations are not included.

In the first experiment, ~30% of the scalars are equal to 1, and there are also 10 random scalars, each with a frequency of around 1%. The rest of the scalars are chosen uniformly at random. In the second experiment, all scalars are chosen uniformly at random.
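
For reference, a small sketch of how such a skewed distribution can be generated. This is a hypothetical reconstruction that models scalars as `uint64_t` rather than bn254 field elements:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// ~30% ones, 10 "heavy" scalars at ~1% frequency each, the rest uniform.
std::vector<uint64_t> skewed_scalars(size_t n, uint64_t seed = 0)
{
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  uint64_t heavy[10];
  for (auto& h : heavy) h = rng();

  std::vector<uint64_t> scalars(n);
  for (auto& s : scalars) {
    double r = coin(rng);
    if (r < 0.30)      s = 1;                  // ~30% ones
    else if (r < 0.40) s = heavy[rng() % 10];  // 10 scalars, ~1% each
    else               s = rng();              // uniform remainder
  }
  return scalars;
}
```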

| Experiment | MSM size | Batch size | Old version, ms | New version, ms |
|---|---|---|---|---|
| 1 | 2^22 | 1 | 33.8 | 29.8 |
| 1 | 2^22 | 3 | 68.2 | 56.3 |
| 2 | 2^24 | 1 | 151.5 | 149.8 |
| 2 | 2^14 | 2^8 | 79.5 | 73.1 |

Failures and future work

I spent quite a bit of time trying to make the reduction and scan methods from CUB and thrust work with our point types. I thought it would be nice to let well-optimised CUDA libraries handle load balancing in bucket accumulation for us. As it turns out, this approach has a number of issues:

  • CUB and thrust seem to be optimised for minimising memory movement rather than compute, but our EC addition is very much compute-bound;
  • compile times grow a lot when our primitives are used inside CUB/thrust functions;
  • I wasn't even able to get correct results: they always seem to be zero or random, and I have no idea why.

Overall I think this is a dead end, and it's not worth trying to swap our custom bucket accumulation for anything CUB or thrust provide nowadays.
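
For reference, the kind of usage that was attempted looks roughly like this. This is a sketch only: `projective_t` and its addition are trivial stand-ins for our EC types (the real formulas are elided), and as noted above this path never produced correct results for us.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Trivial stand-in for our projective point type.
struct projective_t {
  unsigned long long x[4], y[4], z[4];
  __host__ __device__ static projective_t zero() { return projective_t{}; }
};

struct add_points {
  __host__ __device__ projective_t operator()(const projective_t& a, const projective_t& b) const
  {
    projective_t r{}; // real EC addition would go here; it is compute-bound
    (void)a; (void)b;
    return r;
  }
};

// The attempted pattern: let thrust handle load balancing of the point reduction.
projective_t sum_points(const thrust::device_vector<projective_t>& points)
{
  return thrust::reduce(points.begin(), points.end(), projective_t::zero(), add_points());
}
```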

In terms of future work, an arbitrary choice of c is still not supported as of this PR (though I think I can implement it pretty quickly), and a number of other improvements, such as xyzz accumulation, signed digits/scalars, and computing the MSM in chunks as described earlier, are still in TODO status.

```rust
// Review context, reconstructed from the diff fragment:
if *self == Self::zero() {
    Self::ArkEquivalent::zero()
} else {
    // ... coordinates converted via .to_ark();
    Self::ArkEquivalent::new_unchecked(proj_x, proj_y)
}
```
Collaborator:

is it more efficient than the else case?

DmytroTym (Author):

The else case just doesn't cover zero: new_unchecked assumes that its inputs represent a valid non-zero point.

```diff
@@ -168,14 +166,20 @@ where
 for batch_size in batch_sizes {
     let mut points = C::generate_random_affine_points(test_size * batch_size);
     let mut scalars = vec![C::ScalarField::zero(); test_size * batch_size];

     // add some zero points
     for _ in 0..100 {
```
@yshekel (Collaborator), Feb 12, 2024:

Consider moving this logic to generate_random_...()

DmytroTym (Author):

On the one hand, we can. On the other, I don't think users of a function called generate_random_affine_points expect zero points to be sprinkled in. We're not promising a cryptographic RNG here, and this function shouldn't be used to create secure randomness, but still...

@yshekel (Collaborator), Feb 13, 2024:


OK, I just thought that if you need it more than once, it's worth writing it once. You could wrap this function in a test util function that accepts the probability of each point being zeroed, but you should decide if that makes sense to you. If not, that's fine too.

DmytroTym (Author):

Yes, creating a separate test util function makes sense; will do.

```cuda
// Each large bucket `tid` gets ceil(size / nof_pts_per_thread) index slots;
// each slot packs (tid, slot index within the bucket) into one unsigned:
// tid in the low log_nof_large_buckets bits, the slot index in the rest.
unsigned start = (sorted_bucket_sizes_sum[tid] + nof_pts_per_thread - 1) / nof_pts_per_thread + tid;
unsigned end = (sorted_bucket_sizes_sum[tid + 1] + nof_pts_per_thread - 1) / nof_pts_per_thread + tid + 1;
for (unsigned i = start; i < end; i++) {
  bucket_indices[i] = tid | ((i - start) << log_nof_large_buckets);
```
@yshekel (Collaborator), Feb 12, 2024:


Are you sure this is correct when nof_buckets is not a power of two?
If you assume it is, please add a comment, and maybe also assert it where the kernel is called.

DmytroTym (Author):

It should be correct for non-powers-of-2, and in most cases nof_large_buckets is not a power of 2.
What this line does is pack two values into one number, bucket_indices[i]: tid goes into the lowest log_nof_large_buckets bits, and i - start goes into the rest. So log_nof_large_buckets here is just the number of bits needed to represent tid, which varies from 0 to nof_large_buckets - 1.
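
To make the packing concrete, a small illustrative sketch (these helpers are not the PR's code):

```cuda
#include <cassert>

// Pack (tid, run_index) into one unsigned: tid occupies the low
// log_nof_large_buckets bits, run_index the remaining high bits.
unsigned pack(unsigned tid, unsigned run_index, unsigned log_nof_large_buckets)
{
  assert(tid < (1u << log_nof_large_buckets)); // tid must fit its bit field
  return tid | (run_index << log_nof_large_buckets);
}

unsigned unpack_tid(unsigned packed, unsigned log_nof_large_buckets)
{
  return packed & ((1u << log_nof_large_buckets) - 1);
}

unsigned unpack_run(unsigned packed, unsigned log_nof_large_buckets)
{
  return packed >> log_nof_large_buckets;
}
```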

Collaborator:

Right, as long as log_nof_large_buckets = ceil(log2(nof_large_buckets)) this is correct. Since it's not verified inside the kernel, it could mix the two fields, which is why I suggested verifying it at kernel launch.

DmytroTym (Author):

In principle we can just take the log inside the kernel; I just wanted to avoid all threads doing identical work. Verification is definitely cheaper, though. I can do it inside the kernel if you want.

Collaborator:

My motivation is to avoid debugging cases like nof_buckets=100, (int)log2(nof_buckets)=6.

I would personally add a comment about this assumption next to the kernel param and make sure the host calls it correctly. If you prefer an assert inside the kernel, that's fine too.
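
A minimal host-side guard along those lines might look like this (a sketch, not the PR's code):

```cuda
#include <cassert>
#include <cmath>

// The bit width must be the *ceiling* log so every tid in
// [0, nof_large_buckets) fits: for 100 buckets, (int)log2(100) = 6
// is too small, while the ceiling gives 7.
unsigned checked_log_nof_large_buckets(unsigned nof_large_buckets)
{
  unsigned log_nof = (unsigned)std::ceil(std::log2((double)nof_large_buckets));
  assert(nof_large_buckets <= (1u << log_nof));
  return log_nof;
}
```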

DmytroTym (Author):

Added a comment, though maybe I should've written a fuller doc comment for each kernel (but they're internal, so I don't feel the need to spend too much time documenting them, tbh).

```cuda
// sort by bucket sizes
unsigned h_nof_buckets_to_compute;
CHK_IF_RETURN(cudaMemcpyAsync(
  &h_nof_buckets_to_compute, nof_buckets_to_compute, sizeof(unsigned), cudaMemcpyDeviceToHost, stream));

// if all points are 0 just return point 0
```
@DmytroTym (Author), Feb 14, 2024:


There was a question about removing this block. It turned out it was there for a good reason: if all the scalars are zero, there's a division-by-zero error inside. I will push a more elegant fix for this tomorrow.
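
For context, the shape of the guard under discussion might be roughly as follows. This is a sketch, not the PR's code: `final_result`, `projective_t`, and the `CHK_LAST` macro are assumed names following the snippet above.

```cuda
// After h_nof_buckets_to_compute has been copied back (and the stream
// synchronised so the value is valid), an all-zero input leaves nothing
// to accumulate, and the result is simply the point at infinity.
CHK_IF_RETURN(cudaStreamSynchronize(stream));
if (h_nof_buckets_to_compute == 0) {
  projective_t zero = projective_t::zero();
  CHK_IF_RETURN(cudaMemcpyAsync(final_result, &zero, sizeof(projective_t),
                                cudaMemcpyHostToDevice, stream));
  return CHK_LAST();
}
```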

@DmytroTym (Author):

I tried to use comments to explain the parts that raised questions during the live review, and also dealt with the case where all scalars are zero. See the diagram below for a visual explanation of the key changes in this PR:

(diagram: MSM_PR)

Benchmarks

We discussed that the change in large-bucket accumulation demands measuring how performance varies with large_bucket_factor, for both skewed and uniform distributions. For skewed distributions, I looked at Lurk MSM (cc: @omershlo) and tried to emulate the data in their test number 1 (their raw data is unavailable, plus afaik it's on the grumpkin curve, which we don't support yet, though it's in the works). For the second experiment, I just used a uniform distribution. The bn254 curve is used; it should have performance similar to grumpkin. An RTX A6000, the same GPU as in the Lurk MSM experiments, was used.

| | Lurk MSM test 1 (size 9699051) | Uniform MSM |
|---|---|---|
| pasta-msm on uniform distribution, ms | 119.79 | 144.74 |
| lurkrs, ms | 552.09 | - |
| lurkrs compressed (hypothetical), ms | 19.74 | - |
| ICICLE on dev branch, ms | 384.99 | 134.18 |
| ICICLE on dev with optimal large_bucket_factor, ms | 164.62 | 127.2 |
| Optimal large_bucket_factor for ICICLE on dev | 4 | 0 |
| ICICLE on this branch, ms | 133.4 | 174.85 |
| ICICLE on this branch with optimal large_bucket_factor, ms | 132.69 | 126.9 |
| Optimal large_bucket_factor for ICICLE on this branch | 15 | 0 |

One weird detail is that for uniform distributions, large_bucket_factor=0 is optimal for both the old and (especially) the new version. This was not the case on the GPUs I tested on before, only on the RTX A6000, and I didn't have access to thorough profiling because the card was rented in the cloud. I'd treat this as an oddity that needs further investigation in the future. Otherwise, the results for the new version vary very little across most positive values of large_bucket_factor, and 10 seems like a reasonable default choice to me; I don't see any reason to change it.

```cuda
const unsigned c,
const unsigned threads_per_bucket,
const unsigned max_run_length)
```

```cuda
const int points_per_thread,
```
Contributor:

why not unsigned?

@yshekel (Collaborator) left a comment:


Honestly, I don't fully understand the details of the large-bucket accumulation, but overall it looks good to me.
I approve, but you may want Hadar to review too.

@LeonHibnik (Contributor) left a comment:


lgtm

@LeonHibnik merged commit a91397e into dev on Feb 15, 2024
14 checks passed
@LeonHibnik deleted the develop/dima/msm_improvements branch on February 15, 2024 at 18:02
@DmytroTym mentioned this pull request on Feb 15, 2024
DmytroTym added a commit that referenced this pull request on Feb 15, 2024:
## Contents of this release

- [FEAT]: support for multi-device execution: #356
- [FEAT]: full support for the new mixed-radix NTT: #367, #368 and #371
- [FEAT]: examples for Poseidon hash and a tree builder based on it (currently only on the C++ side): #375
- [PERF]: MSM performance upgrades & zero point handling: #372