amd64 SIMD IP checksum #13

Merged 2 commits on Jul 10, 2023

Conversation

sailorfrag
Member

This adds AMD64 assembly implementations of IP checksum computation, one
for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2).

All performance numbers reported are from a Ryzen 7 4750U but similar
improvements are expected for a wide range of processors.

The generic IP checksum implementation has also been further improved to
be significantly faster using bits.Add64 (for a 64KiB buffer the
throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are
also reported on ARM64 but I do not have specific numbers).
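
To illustrate the idea behind the generic path, here is a minimal sketch of a ones' complement sum that consumes 8 bytes per step and carries with `bits.Add64`. It is not the code from this PR: the name `foldedSum` is hypothetical, and zero-padding the tail is an assumption chosen for simplicity.

```go
package main

import (
	"encoding/binary"
	"math/bits"
)

// foldedSum is a hypothetical helper, not the PR's implementation. It returns
// the 16-bit ones' complement sum of b; the IP checksum is its complement.
func foldedSum(b []byte) uint16 {
	var acc, carry uint64

	// Consume 8 bytes per iteration. bits.Add64 returns the carry out, which
	// is fed into the next addition (end-around carry: 2^64 counts as 1).
	for len(b) >= 8 {
		acc, carry = bits.Add64(acc, binary.BigEndian.Uint64(b), carry)
		b = b[8:]
	}

	// Zero-pad the remaining 0..7 bytes into one big-endian word. A trailing
	// odd byte lands in the high half of its 16-bit word, as RFC 1071 requires.
	if len(b) > 0 {
		var tail [8]byte
		copy(tail[:], b)
		acc, carry = bits.Add64(acc, binary.BigEndian.Uint64(tail[:]), carry)
	}

	// Add the last carry back in. That addition can itself wrap once
	// (all-ones + 1), so the second carry is added as well.
	acc, carry = bits.Add64(acc, 0, carry)
	acc += carry

	// Fold 64 bits down to 16, adding the high part into the low part.
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}
```

A caller would write `^foldedSum(header)` into the checksum field. Summing 64-bit words works because the ones' complement sum is the same regardless of word width, as long as every carry is added back.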

The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s
and the AVX2 implementation is slightly over 107,000MiB/s.

Unfortunately, for very small sizes (e.g. the expected size for an IPv4
header) setting up SIMD computation involves some overhead that makes
computing a checksum for small buffers slower than a non-SIMD
implementation. Even more unfortunately, testing for this at runtime in
Go and calling a func optimized for small buffers negates most of the
improvement due to call overhead. The break-even point is around 256-byte
buffers; IPv4 headers are no more than 60 bytes including
extensions. IPv6 headers do not have a checksum but are a fixed size of
40 bytes. As a result, the generated assembly code uses an alternate
approach for buffers of less than 256 bytes. Additionally, buffers of
less than 32 bytes need to be handled specially because the strategy for
reading buffers that are not a multiple of 8 bytes fails when the buffer
is too small.
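
The description does not spell out the tail-reading strategy. One common technique, assumed here purely for illustration, is to load the final word so that it overlaps bytes already summed and then shift the overlap away. The sketch below uses 8-byte loads and hypothetical names (`sumOverlapTail`, `fold64`); it shows why such a scheme needs a separate path once the buffer is shorter than one load. The PR's 32-byte limit may reflect wider loads, but that is not stated.

```go
package main

import (
	"encoding/binary"
	"math/bits"
)

// fold64 reduces a 64-bit ones' complement accumulator to 16 bits.
func fold64(acc uint64) uint16 {
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}

// sumOverlapTail is a hypothetical illustration, not the PR's assembly.
func sumOverlapTail(b []byte) uint16 {
	if len(b) < 8 {
		// Too short for an overlapping 8-byte load: b[len(b)-8:] would start
		// before the buffer. Fall back to a simple byte loop.
		var acc uint64
		for i, v := range b {
			if i%2 == 0 {
				acc += uint64(v) << 8 // high byte of a 16-bit word
			} else {
				acc += uint64(v) // low byte of a 16-bit word
			}
		}
		return fold64(acc)
	}

	var acc, carry uint64
	n := len(b) &^ 7 // largest multiple of 8 not exceeding len(b)
	for i := 0; i < n; i += 8 {
		acc, carry = bits.Add64(acc, binary.BigEndian.Uint64(b[i:]), carry)
	}

	if rem := len(b) - n; rem > 0 {
		// Load the last 8 bytes; the first 8-rem of them were already summed.
		last := binary.BigEndian.Uint64(b[len(b)-8:])
		// Shift the already-summed bytes out of the top. The rem new bytes
		// move to the top of the word, keeping their 16-bit alignment.
		last <<= uint(8 * (8 - rem))
		acc, carry = bits.Add64(acc, last, carry)
	}

	// Add the final carry back in; this may wrap once more.
	acc, carry = bits.Add64(acc, 0, carry)
	acc += carry
	return fold64(acc)
}
```

The overlapping load avoids a byte-by-byte tail loop for every buffer of at least one word, at the cost of a dedicated branch for shorter inputs.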

As suggested by additional benchmarking, pseudo header computation has
been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).
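
For context, the pseudo header covers the source address, destination address, protocol and upper-layer length. The sketch below shows one cheap way to sum those fields without building the header in memory. It is an assumption about the general approach, not the rewritten code from this PR, and `pseudoHeaderSum` is a hypothetical name.

```go
package main

import "encoding/binary"

// pseudoHeaderSum is a hypothetical sketch, not the PR's implementation.
// src and dst are assumed to be both 4 bytes (IPv4) or both 16 bytes (IPv6).
// It returns the folded 16-bit ones' complement sum of the pseudo header.
func pseudoHeaderSum(src, dst []byte, proto uint8, length uint16) uint16 {
	var acc uint64

	// Add the addresses as 32-bit big-endian words. At most eight such
	// words are added, so the 64-bit accumulator cannot overflow.
	for i := 0; i+4 <= len(src); i += 4 {
		acc += uint64(binary.BigEndian.Uint32(src[i:]))
	}
	for i := 0; i+4 <= len(dst); i += 4 {
		acc += uint64(binary.BigEndian.Uint32(dst[i:]))
	}

	// The zero padding contributes nothing; protocol and length sit in the
	// low bits of their words, so they can be added directly.
	acc += uint64(proto) + uint64(length)

	// Fold 64 bits down to 16.
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}
```

The result can seed the checksum over the TCP or UDP segment, since ones' complement addition is associative and commutative.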

This change is split into two commits: the first adds the tests (to verify that the existing code passes them), and the second introduces the new implementation.

Previously reviewed on our internal-only wireguard-go fork (PR 3 in that repo) with @jwhited

Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
@sailorfrag sailorfrag force-pushed the adrian/simd-checksum-tailscale branch from 343daf2 to dce3221 Compare July 10, 2023 17:51
@sailorfrag
Member Author

Updates tailscale/corp#9755

@sailorfrag sailorfrag merged commit bb2c8f2 into tailscale Jul 10, 2023