Add faster bit unpacking #2352
Conversation
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The code looks good; not sure why the build is failing.
scripts/setup-helper-functions.sh (outdated)

@@ -100,7 +100,7 @@ function get_cxx_flags
       ;;
     "avx")
-      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17"
+      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -mbmi2 -g -std=c++17"
-g is not platform specific, so it does not need to be added here.
Force-pushed from eb9068d to 5f5f462
Thanks @kgpai for fixing the build!
Hi Orri, I think your last force push removed the changes I made. Please add these changes to your PR: https://github.com/facebookincubator/velox/compare/eb9068d5db07efec5a1c3f406b0b3c6b2385beb7..5f5f4628017a2d0c28fcbbcab36eb8867587046a
Force-pushed from 3935c67 to aa8f76d
Summary: same as the PR description below.

Pull Request resolved: facebookincubator#2352
Reviewed By: Yuhta
Differential Revision: D38907589
Pulled By: oerling
fbshipit-source-id: f2e42d6f7d334d7ceb945a63c4e9c3565be6897e
Uses BMI2 and AVX2 to unpack contiguous and non-contiguous accesses to
bit-packed data of width 24 or less.
Contiguous runs of fields up to 16 bits wide are loaded 64 bits at a
time and laid out in separate bytes or 16-bit words with pdep. The
bytes/shorts are then widened to a vector of 8x32 and stored. A 64-bit
target width widens the 8x32 to two 4x64 vectors.
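The pdep step can be sketched in scalar form. The helpers below are hypothetical names, not the PR's code: pdep64 emulates BMI2's _pdep_u64 in plain C++, and spreadToBytes shows how eight fields of up to 8 bits each, loaded as one 64-bit word, are spread into one byte per field before the bytes are widened to 8x32.

```cpp
#include <cassert>
#include <cstdint>

// Scalar emulation of BMI2 pdep: deposit the low bits of 'src', in order,
// into the bit positions where 'mask' has ones. (Illustration only; real
// hardware does this in one _pdep_u64 instruction.)
uint64_t pdep64(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  uint64_t bit = 1;
  while (mask) {
    uint64_t lowest = mask & (~mask + 1); // lowest set bit of mask
    if (src & bit) {
      result |= lowest;
    }
    mask &= mask - 1; // clear that bit of the mask
    bit <<= 1;
  }
  return result;
}

// Spread eight contiguous bitWidth-bit fields (bitWidth <= 8) from one
// 64-bit load into one byte per field; the vectorized code would then
// widen these bytes to an 8x32 vector.
void spreadToBytes(uint64_t word, int bitWidth, uint8_t out[8]) {
  // Mask with 'bitWidth' ones in the low bits of every output byte, so
  // pdep deposits consecutive fields into consecutive bytes.
  uint64_t mask = 0;
  for (int i = 0; i < 8; ++i) {
    mask |= ((1ull << bitWidth) - 1) << (8 * i);
  }
  uint64_t spread = pdep64(word, mask);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(spread >> (8 * i));
  }
}
```

For example, packing the values 0..7 as 5-bit fields into one 64-bit word and calling spreadToBytes(word, 5, out) leaves out[i] == i.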
If the positions to load are not contiguous, the byte offsets and bit
shifts are calculated as 8x32 vectors. The fields are read with an 8x32
gather. This puts the data in the lanes, but if the bit width is not a
multiple of 8, a different shift has to be applied to each lane. This
is done by multiplying the lanes by an 8x32 vector of power-of-two
multipliers, where each lane's multiplier is selected by permuting a
table of multipliers with the bit-shift vector. After the multiply all
the lanes are aligned, so they can be shifted down by 8 bits and the
extra bits masked off.
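The multiply trick can be illustrated with a scalar sketch (hypothetical name, assuming one field per 32-bit lane, starting at bit 0..7 of the lane after the gather). Multiplying lane i by 2^(8 - shift[i]) moves every field start to bit 8, so a single uniform right shift by 8 and one mask complete the alignment. It also suggests why the fast path stops at widths of 24 bits or less: after the multiply a field occupies bits [8, 8 + width) of its 32-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// After an 8x32 gather, lane i holds 32 bits loaded from a byte boundary
// and its field starts at bit shifts[i] (0..7). Multiplying by the
// per-lane power of two 2^(8 - shifts[i]) left-aligns every field start
// to bit 8; one uniform shift right by 8 plus a mask then extracts the
// field. (Scalar sketch of the vector multiply, not the PR's code.)
void alignLanes(uint32_t lanes[8], const int shifts[8], int bitWidth) {
  for (int i = 0; i < 8; ++i) {
    uint32_t multiplier = 1u << (8 - shifts[i]); // per-lane power of two
    lanes[i] = ((lanes[i] * multiplier) >> 8) & ((1u << bitWidth) - 1);
  }
}
```

In the vectorized version the multipliers live in an 8x32 vector chosen by permuting a small table with the bit-shift vector, so the whole alignment is a multiply, a shift, and an AND over all eight lanes at once.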
Contiguous fields of more than 16 bits are loaded with gather, since pdep would deliver only 2 or 3 values at a time.
A benchmark compares the fast path with a naive implementation. The
speedup is between 3x and 6x.
In TPC-H with Parquet, processing of bit fields drops from ~7% to
~2.5% of the profile in velox_tpch_benchmark at scale 10.
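For reference, a naive unpacker of the kind such a benchmark typically compares against can look like this (a sketch with a hypothetical name, not the PR's benchmark code): one value per iteration, each read with an unaligned 64-bit load, a shift, and a mask.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Naive scalar bit unpacking for widths up to 24: for each value, load
// 8 bytes at the field's byte offset, shift by the bit offset within
// that byte, and mask off the extra bits. Assumes the input buffer is
// padded so that reading 8 bytes past the last field start is safe.
void naiveUnpack(
    const uint8_t* input,
    int numValues,
    int bitWidth,
    uint32_t* out) {
  const uint32_t mask = (1u << bitWidth) - 1;
  uint64_t bitOffset = 0;
  for (int i = 0; i < numValues; ++i) {
    uint64_t word;
    std::memcpy(&word, input + bitOffset / 8, sizeof(word)); // unaligned load
    out[i] = static_cast<uint32_t>(word >> (bitOffset % 8)) & mask;
    bitOffset += bitWidth;
  }
}
```

The per-value loop with its data-dependent shifts is what the pdep and gather paths replace with a handful of wide operations per 8 values.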