
Add faster bit unpacking #2352

Closed · wants to merge 1 commit
Commits on Aug 24, 2022

  1. Add faster bit unpacking

    Uses BMI2 and AVX2 to unpack contiguous and non-contiguous accesses to
    bit-packed data with field widths of 24 bits or less.
    
    Contiguous runs of fields up to 16 bits wide are loaded 64 bits at a
    time and spread into separate bytes or 16-bit words with pdep. The
    bytes/shorts are then widened to a vector of 8x32 and stored. A 64-bit
    target width widens the 8x32 into two 4x64 vectors.
    
    If the positions to load are not contiguous, the byte offsets and bit
    shifts are calculated as 8x32 vectors and the fields are read with an
    8x32 gather. Each field is then in its own lane, but if the bit width
    is not a multiple of 8, a different shift has to be applied to each
    lane. This is done by multiplying the lanes by an 8x32 vector of
    multipliers, chosen per lane by permuting a table of multipliers with
    the bit-shift vector. After the multiply all lanes are aligned, so
    they can be shifted down by 8 bits and the extra bits ANDed off.
    
    Contiguous fields wider than 16 bits are loaded with gather, since pdep would yield only 2 or 3 values at a time.
    
    A benchmark compares the fast path with a naive implementation. The
    speedup is between 3x and 6x.
    
    In TPCH with Parquet, processing of bit fields goes down from ~7% to
    ~2.5% of profile in velox_tpch_benchmark at scale 10.
    Orri Erling committed Aug 24, 2022
    Commit b3f0492