This adds AMD64 assembly implementations of IP checksum computation, one
for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2).
All performance numbers reported are from a Ryzen 7 4750U but similar
improvements are expected for a wide range of processors.
The generic IP checksum implementation has also been further improved to
be significantly faster using bits.AddUint64 (for a 64KiB buffer the
throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are
also reported on ARM64 but I do not have specific numbers).
The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s
and the AVX2 implementation is slightly over 107,000MiB/s.
Unfortunately, for very small sizes (e.g. the expected size for an IPv4
header) setting up SIMD computation involves some overhead that makes
computing a checksum for small buffers slower than a non-SIMD
implementation. Even more unfortunately, testing for this at runtime in
Go and calling a func optimized for small buffers negates most of the
improvement due to call overhead. The break-even point is around 256
byte buffers; IPv4 headers are no more than 60 bytes including
extensions. IPv6 headers do not have a checksum but are a fixed size of
40 bytes. As a result, the generated assembly code uses an alternate
approach for buffers of less than 256 bytes. Additionally, buffers of
less than 32 bytes need to be handled specially because the strategy for
reading buffers that are not a multiple of 8 bytes fails when the buffer
is too small.
As suggested by additional benchmarking, pseudo header computation has
been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).
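For context, one way a pseudo header sum can be made cheap is to add the fields directly instead of assembling a 12-byte buffer and checksumming it. The sketch below shows that idea for IPv4 only; it is an assumption for illustration and not necessarily how the rewritten code works. The names pseudoHeaderSum4, pseudoHeaderSlow and fold32 are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pseudoHeaderSum4 returns the unfolded sum of the IPv4 pseudo header
// fields: source address, destination address, protocol and length.
// The largest possible result fits comfortably in a uint32.
func pseudoHeaderSum4(src, dst [4]byte, proto uint8, length uint16) uint32 {
	s := binary.BigEndian.Uint32(src[:])
	d := binary.BigEndian.Uint32(dst[:])
	sum := (s >> 16) + (s & 0xffff) + (d >> 16) + (d & 0xffff)
	sum += uint32(proto) + uint32(length)
	return sum
}

// pseudoHeaderSlow builds the 12-byte pseudo header and sums it as
// 16-bit words. It exists only as a cross-check for the fast path.
func pseudoHeaderSlow(src, dst [4]byte, proto uint8, length uint16) uint32 {
	var buf [12]byte
	copy(buf[0:4], src[:])
	copy(buf[4:8], dst[:])
	buf[9] = proto // buf[8] is the zero byte
	binary.BigEndian.PutUint16(buf[10:], length)
	var sum uint32
	for i := 0; i < len(buf); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(buf[i:]))
	}
	return sum
}

// fold32 reduces a 32-bit accumulator to a 16-bit one's-complement sum.
func fold32(sum uint32) uint16 {
	for sum>>16 != 0 {
		sum = (sum >> 16) + (sum & 0xffff)
	}
	return uint16(sum)
}

func main() {
	src := [4]byte{192, 168, 0, 1}
	dst := [4]byte{192, 168, 0, 199}
	// Protocol 17 is UDP; 95 is an example segment length.
	fmt.Printf("%#x\n", fold32(pseudoHeaderSum4(src, dst, 17, 95)))
}
```

The unfolded result can be used as the initial value when summing the transport header and payload, so the pseudo header never has to be materialized in memory.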
This change has been split into two commits: the first adds the tests (to verify that the existing code passes them), and the second introduces the new implementation.
Previously reviewed on our internal-only wireguard-go fork (PR 3 in that repo) with @jwhited