
Add faster bit unpacking #2352

Closed · wants to merge 1 commit
Commits on Aug 24, 2022

  1. Add faster bit unpacking

    Uses BMI2 and AVX2 to unpack contiguous and non-contiguous accesses to
    bit-packed data with field widths of 24 bits or less.
    
    Contiguous runs of fields up to 16 bits wide are loaded 64 bits at a
    time and spread into separate bytes or 16-bit words with pdep. The
    bytes/shorts are then widened to a vector of 8x32 and stored. A 64-bit
    target width widens the 8x32 into two 4x64 vectors.
    
    If the positions to load are not contiguous, the byte offsets and bit
    shifts are calculated as 8x32 vectors and the fields are read with an
    8x32 gather. Each field is then in its own lane, but if the bit width
    is not a multiple of 8, a different shift has to be applied to each
    lane. This is done by multiplying the lanes by an 8x32 vector of
    multipliers, chosen per lane by permuting a table of multipliers with
    the bit-shift vector. After the multiply all lanes are aligned, so
    they can be shifted down by 8 bits and the extra bits ANDed off.
    
    Contiguous fields wider than 16 bits are loaded with gather, since pdep would yield only 2 or 3 values at a time.
    
    A benchmark compares the fast path with a naive implementation. The
    speedup is between 3x and 6x.
    
    In TPCH with Parquet, processing of bit fields goes down from ~7% to
    ~2.5% of profile in velox_tpch_benchmark at scale 10.
    Orri Erling committed Aug 24, 2022
    Commit b3f0492