Add faster bit unpacking #2352
Conversation
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The code looks good; not sure why the build is failing.
scripts/setup-helper-functions.sh (outdated)

@@ -100,7 +100,7 @@ function get_cxx_flags
       ;;
     "avx")
-      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17"
+      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -mbmi2 -g -std=c++17"
-g is not platform specific, so it does not need to be added here.
Force-pushed from eb9068d to 5f5f462
Thanks @kgpai for fixing the build!
Hi Orri, I think your last force push removed the changes I made. Please add these changes to your PR: https://github.com/facebookincubator/velox/compare/eb9068d5db07efec5a1c3f406b0b3c6b2385beb7..5f5f4628017a2d0c28fcbbcab36eb8867587046a
Force-pushed from 3935c67 to aa8f76d
Summary: same as the PR description below.

Pull Request resolved: facebookincubator#2352
Reviewed By: Yuhta
Differential Revision: D38907589
Pulled By: oerling
fbshipit-source-id: f2e42d6f7d334d7ceb945a63c4e9c3565be6897e
Uses BMI2 and AVX2 to unpack contiguous and non-contiguous accesses to
bit-packed data of width 24 or less.
Contiguous runs of fields up to 16 bits wide are loaded 64 bits at a
time and laid out in separate bytes or 16-bit words with pdep. The
bytes/shorts are then widened to a vector of 8x32 and stored. A 64-bit
target width widens the 8x32 to two 4x64 vectors.
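The pdep step can be sketched in scalar form. The helpers below are hypothetical names, not the PR's code: pdep64 emulates BMI2's _pdep_u64 in plain C++, and spreadToBytes shows how eight fields of up to 8 bits each, loaded as one 64-bit word, are spread into one byte per field before the bytes are widened to 8x32.

```cpp
#include <cassert>
#include <cstdint>

// Scalar emulation of BMI2 pdep: deposit the low bits of 'src', in order,
// into the bit positions where 'mask' has ones. (Illustration only; real
// hardware does this in one _pdep_u64 instruction.)
uint64_t pdep64(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  uint64_t bit = 1;
  while (mask) {
    uint64_t lowest = mask & (~mask + 1); // lowest set bit of mask
    if (src & bit) {
      result |= lowest;
    }
    mask &= mask - 1; // clear that bit of the mask
    bit <<= 1;
  }
  return result;
}

// Spread eight contiguous bitWidth-bit fields (bitWidth <= 8) from one
// 64-bit load into one byte per field; the vectorized code would then
// widen these bytes to an 8x32 vector.
void spreadToBytes(uint64_t word, int bitWidth, uint8_t out[8]) {
  // Mask with 'bitWidth' ones in the low bits of every output byte, so
  // pdep deposits consecutive fields into consecutive bytes.
  uint64_t mask = 0;
  for (int i = 0; i < 8; ++i) {
    mask |= ((1ull << bitWidth) - 1) << (8 * i);
  }
  uint64_t spread = pdep64(word, mask);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(spread >> (8 * i));
  }
}
```

For example, packing the values 0..7 as 5-bit fields into one 64-bit word and calling spreadToBytes(word, 5, out) leaves out[i] == i.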
If the positions to load are not contiguous, the byte offsets and bit
shifts are calculated as 8x32 vectors. The fields are read with an 8x32
gather. This puts the data in the lanes, but if the bit width is not a
multiple of 8, a different shift has to be applied to each lane. This
is done by multiplying the lanes by an 8x32 vector of power-of-two
multipliers, where each lane's multiplier is selected by permuting a
table of multipliers with the bit-shift vector. After the multiply all
the lanes are aligned, so they can be shifted down by 8 bits and the
extra bits masked off.
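The multiply trick can be illustrated with a scalar sketch (hypothetical name, assuming one field per 32-bit lane, starting at bit 0..7 of the lane after the gather). Multiplying lane i by 2^(8 - shift[i]) moves every field start to bit 8, so a single uniform right shift by 8 and one mask complete the alignment. It also suggests why the fast path stops at widths of 24 bits or less: after the multiply a field occupies bits [8, 8 + width) of its 32-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// After an 8x32 gather, lane i holds 32 bits loaded from a byte boundary
// and its field starts at bit shifts[i] (0..7). Multiplying by the
// per-lane power of two 2^(8 - shifts[i]) left-aligns every field start
// to bit 8; one uniform shift right by 8 plus a mask then extracts the
// field. (Scalar sketch of the vector multiply, not the PR's code.)
void alignLanes(uint32_t lanes[8], const int shifts[8], int bitWidth) {
  for (int i = 0; i < 8; ++i) {
    uint32_t multiplier = 1u << (8 - shifts[i]); // per-lane power of two
    lanes[i] = ((lanes[i] * multiplier) >> 8) & ((1u << bitWidth) - 1);
  }
}
```

In the vectorized version the multipliers live in an 8x32 vector chosen by permuting a small table with the bit-shift vector, so the whole alignment is a multiply, a shift, and an AND over all eight lanes at once.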
Contiguous fields of more than 16 bits are loaded with gather, since pdep would deliver only 2 or 3 values at a time.
A benchmark compares the fast path with a naive implementation. The
speedup is between 3x and 6x.
In TPC-H with Parquet, processing of bit fields drops from ~7% to
~2.5% of the profile in velox_tpch_benchmark at scale 10.
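For reference, a naive unpacker of the kind such a benchmark typically compares against can look like this (a sketch with a hypothetical name, not the PR's benchmark code): one value per iteration, each read with an unaligned 64-bit load, a shift, and a mask.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Naive scalar bit unpacking for widths up to 24: for each value, load
// 8 bytes at the field's byte offset, shift by the bit offset within
// that byte, and mask off the extra bits. Assumes the input buffer is
// padded so that reading 8 bytes past the last field start is safe.
void naiveUnpack(
    const uint8_t* input,
    int numValues,
    int bitWidth,
    uint32_t* out) {
  const uint32_t mask = (1u << bitWidth) - 1;
  uint64_t bitOffset = 0;
  for (int i = 0; i < numValues; ++i) {
    uint64_t word;
    std::memcpy(&word, input + bitOffset / 8, sizeof(word)); // unaligned load
    out[i] = static_cast<uint32_t>(word >> (bitOffset % 8)) & mask;
    bitOffset += bitWidth;
  }
}
```

The per-value loop with its data-dependent shifts is what the pdep and gather paths replace with a handful of wide operations per 8 values.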