
Add faster bit unpacking #2352

Closed
wants to merge 1 commit into from

Conversation

oerling
Contributor

@oerling oerling commented Aug 22, 2022

Uses BMI2 and AVX2 to unpack contiguous and non-contiguous accesses to
bit-packed data of width 24 bits or less.

Contiguous runs of fields up to 16 bits wide are loaded 64 bits at a
time and laid out in separate bytes or 16-bit words with pdep. The
bytes/shorts are then widened to a vector of 8x32 and stored. A 64-bit
target width widens the 8x32 into two 4x64 vectors.

If the positions to load are not contiguous, the byte offsets and bit
shifts are calculated as 8x32 vectors and the fields are read with an
8x32 gather. The data is then in the lanes, but if the bit width is
not a multiple of 8, a different shift has to be applied to each
lane. This is done by multiplying the lanes by an 8x32 vector of
multipliers, where the multiplier for each lane is selected by
permuting a table of multipliers with the bit-shift vector. After the
multiply, all the lanes are aligned: each can be shifted down by 8
bits and the extra bits masked off.

Contiguous fields wider than 16 bits are loaded with gather, since pdep would yield only 2 or 3 values at a time.

A benchmark compares the fast path with a naive implementation. The
speedup is between 3x and 6x.

In TPC-H with Parquet, processing of bit fields drops from ~7% to
~2.5% of the profile in velox_tpch_benchmark at scale factor 10.

@netlify

netlify bot commented Aug 22, 2022

Deploy Preview for meta-velox canceled.

Latest commit: b3f0492
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/630676bc671d82000827a9d0

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 22, 2022
@oerling oerling requested a review from Yuhta August 22, 2022 14:10
@facebook-github-bot
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@Yuhta Yuhta left a comment

The code looks good; not sure why the build is failing.

@@ -100,7 +100,7 @@ function get_cxx_flags {
;;

"avx")
-      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17"
+      echo -n "-mavx2 -mfma -mavx -mf16c -mlzcnt -mbmi2 -g -std=c++17"

-g is not platform-specific; it does not need to be added here.


@kgpai kgpai force-pushed the fast-bits-pr branch 4 times, most recently from eb9068d to 5f5f462 Compare August 24, 2022 00:16
@Yuhta
Contributor

Yuhta commented Aug 24, 2022

Thanks @kgpai for fixing the build!


@kgpai
Contributor

kgpai commented Aug 24, 2022

Hi Orri, I think your last force push removed the changes I made. Please add these changes to your PR: https://github.com/facebookincubator/velox/compare/eb9068d5db07efec5a1c3f406b0b3c6b2385beb7..5f5f4628017a2d0c28fcbbcab36eb8867587046a


@oerling oerling force-pushed the fast-bits-pr branch 2 times, most recently from 3935c67 to aa8f76d Compare August 24, 2022 18:50


mbasmanova pushed a commit to mbasmanova/velox-1 that referenced this pull request Aug 25, 2022

Pull Request resolved: facebookincubator#2352

Reviewed By: Yuhta

Differential Revision: D38907589

Pulled By: oerling

fbshipit-source-id: f2e42d6f7d334d7ceb945a63c4e9c3565be6897e
marin-ma pushed a commit to marin-ma/velox-oap that referenced this pull request Dec 15, 2023