amd64 SIMD IP checksum #13

sailorfrag · 2023-07-10T17:49:25Z

This adds AMD64 assembly implementations of IP checksum computation, one
for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2).

All performance numbers reported are from a Ryzen 7 4750U but similar
improvements are expected for a wide range of processors.

The generic IP checksum implementation has also been further improved to
be significantly faster using bits.Add64 (for a 64KiB buffer the
throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are
also reported on ARM64 but I do not have specific numbers).
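
For readers unfamiliar with the technique, the following is a minimal sketch of a carry-chained checksum built on `bits.Add64` from `math/bits`. It is not the code in this PR: the function names `checksumSketch` and `fold64` are hypothetical, and the loop is deliberately unoptimized (no unrolling). It returns the folded ones'-complement sum without the final bitwise complement.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// fold64 reduces a 64-bit accumulator plus a pending carry to a 16-bit
// ones'-complement sum. (Hypothetical helper, not from the PR.)
func fold64(acc, carry uint64) uint16 {
	acc, carry = bits.Add64(acc, 0, carry)
	acc += carry // cannot overflow: acc is 0 whenever carry is still set
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}

// checksumSketch sums b as big-endian 16-bit words, 8 bytes at a time,
// feeding each carry-out into the next addition (end-around carry).
func checksumSketch(b []byte, initial uint16) uint16 {
	acc := uint64(initial)
	var carry uint64
	for len(b) >= 8 {
		acc, carry = bits.Add64(acc, binary.BigEndian.Uint64(b), carry)
		b = b[8:]
	}
	if len(b) >= 4 {
		acc, carry = bits.Add64(acc, uint64(binary.BigEndian.Uint32(b)), carry)
		b = b[4:]
	}
	if len(b) >= 2 {
		acc, carry = bits.Add64(acc, uint64(binary.BigEndian.Uint16(b)), carry)
		b = b[2:]
	}
	if len(b) == 1 {
		// A trailing odd byte is treated as the high byte of a zero-padded word.
		acc, carry = bits.Add64(acc, uint64(b[0])<<8, carry)
	}
	return fold64(acc, carry)
}

func main() {
	// Example bytes from RFC 1071, section 3.
	data := []byte{0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7}
	fmt.Printf("%#04x\n", checksumSketch(data, 0))
}
```

Summing 64 bits at a time works because 2^16 is congruent to 1 modulo 0xffff, so a wide word contributes the same amount as its four 16-bit halves once folded.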

The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s
and the AVX2 implementation is slightly over 107,000MiB/s.

Unfortunately, for very small sizes (e.g. the expected size for an IPv4
header) setting up SIMD computation involves some overhead that makes
computing a checksum for small buffers slower than a non-SIMD
implementation. Even more unfortunately, testing for this at runtime in
Go and calling a func optimized for small buffers negates most of the
improvement due to call overhead. The break-even point is around 256
bytes; IPv4 headers are no more than 60 bytes including options. IPv6
headers do not have a checksum but are a fixed size of 40 bytes. As a
result, the generated assembly code uses an alternate approach for
buffers of less than 256 bytes. Additionally, buffers of less than 32
bytes need to be handled specially because the strategy for reading
buffers that are not a multiple of 8 bytes fails when the buffer is too
small.
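
The size tiers described above can be summarized in a short sketch. This is illustrative only: in the PR the selection happens inside the generated assembly precisely because a Go-level branch plus call costs too much, and the names `choosePath`, `tinyLimit`, and `smallLimit` are hypothetical. The two thresholds are the ones quoted in the description.

```go
package main

import "fmt"

const (
	tinyLimit  = 32  // below this, the tail-read strategy does not apply
	smallLimit = 256 // approximate break-even point for the SIMD setup cost
)

// path names the code path a buffer of a given size would take.
type path int

const (
	pathTiny  path = iota // special handling for very short buffers
	pathSmall             // alternate approach without the SIMD setup cost
	pathBulk              // main SIMD loop
)

// choosePath maps a buffer length to one of the three tiers.
func choosePath(n int) path {
	switch {
	case n < tinyLimit:
		return pathTiny
	case n < smallLimit:
		return pathSmall
	default:
		return pathBulk
	}
}

func main() {
	// 20 and 60 are the minimum and maximum IPv4 header sizes; 40 is the
	// fixed IPv6 header size; 65536 is the benchmark buffer size.
	for _, n := range []int{20, 40, 60, 256, 65536} {
		fmt.Println(n, choosePath(n))
	}
}
```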

As suggested by additional benchmarking, pseudo header computation has
been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).
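
As a rough illustration of what a pseudo header sum involves, here is a minimal IPv4-only sketch. It is not the rewritten code from this PR; `pseudoHeaderSumV4` and its signature are assumptions made for the example. Because the pseudo header has a fixed, tiny layout (source address, destination address, a zero byte, the protocol number, and the transport length), the whole sum fits in a handful of integer additions with no loop.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pseudoHeaderSumV4 returns the folded ones'-complement sum (not
// complemented) of an IPv4 pseudo header. (Hypothetical helper.)
func pseudoHeaderSumV4(src, dst [4]byte, proto uint8, length uint16) uint16 {
	// Each 32-bit address contributes the same as its two 16-bit words
	// after folding, so the addresses can be added whole.
	acc := uint64(binary.BigEndian.Uint32(src[:])) +
		uint64(binary.BigEndian.Uint32(dst[:])) +
		uint64(proto) + // the word 0x00 followed by the protocol byte
		uint64(length)
	// acc is below 2^34, so three folds bring it down to 16 bits.
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}

func main() {
	src := [4]byte{192, 168, 0, 1}
	dst := [4]byte{192, 168, 0, 199}
	// Protocol 17 is UDP; 0x5f is an arbitrary example length.
	fmt.Printf("%#04x\n", pseudoHeaderSumV4(src, dst, 17, 0x5f))
}
```

The result would typically be passed as the initial value when summing the TCP or UDP segment itself.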

This change has been split into two commits: the first adds the tests (to verify that the existing code passes all of them), and the second introduces the new implementation.

Previously reviewed on our internal-only wireguard-go fork (PR 3 in that repo) with @jwhited

Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>


sailorfrag · 2023-07-10T17:52:16Z

Updates tailscale/corp#9755

sailorfrag added 2 commits July 10, 2023 13:38

tun: checksum tests and benchmarks

2999af5

Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>

sailorfrag force-pushed the adrian/simd-checksum-tailscale branch from 343daf2 to dce3221 Compare July 10, 2023 17:51

WalterHub approved these changes Jul 10, 2023


sailorfrag merged commit bb2c8f2 into tailscale Jul 10, 2023
