Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Fully implement multiply and divide for decimal128 #3974

Closed
revans2 opened this issue Oct 30, 2021 · 4 comments
Closed

[FEA] Fully implement multiply and divide for decimal128 #3974

revans2 opened this issue Oct 30, 2021 · 4 comments
Assignees
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Oct 30, 2021

Is your feature request related to a problem? Please describe.
This is a bit of a crazy idea. I already had a crazy idea about SUM #3944. This is similar but for multiply and divide.

There are algorithms for both multiply and divide that work for large arbitrary precision.

https://en.wikipedia.org/wiki/Multiplication_algorithm

https://en.wikipedia.org/wiki/Division_algorithm

We also would need to write out own arbitrary precision rounding algorithm, which we could base off of what we and CUDF do already or try to come up with something new and more efficient.

I think we can adapt these to work with the operations we already have access to on the GPU. If we had to we could also write custom cuda kernels to take the inputs. Do the operations as 256-bit integer values, and then round the result back.

This is not as fully thought through as the SUM operation is. This is because there are a number of possible algorithms. I am no expert in this so I think we could start with something simpler that we know we can implement, even if it is not the most memory/compute efficient, and then we can refine it later if needed.

Lets take multiply as an example and to keep things simple in the example lets say we only can support decimal values up to a precision of 4. In this example we will do Decimal(4, 2) * Decimal(4, 2) and get an output of Decimal(4, 1) (trying to emulate what Spark does with overflow). We will do the math for 10.27 * 0.51.

The Spark way.

  1. Do the multiply with arbitrary precision.
    10.27 * 0.51 => 5.2377
  2. Round the result to the desired scale
    5.2377 => 5.2
  3. Check if the result fits in the desired precision.
    if (999.9 >= 5.2 >= -999.9) 5.2 else null

For us we don't have arbitrary precision so we can do a long multiply with some adds and existing multiply operations.

  1. Convert the numbers to "integers" (scale 0) and split them up into values we know we can multiply without overflowing.
    10.27 => 1027 => 10 and 27
    0.51 => 51 => 00 and 51
  2. Do the 4 multiplies with numbers we know that we can do without overflow, and keep track of the multiply digits we would need
    00 * 10 = 0000 * 10^(6 - 4) // the 6 comes from the fact it was the two higher order numbers we multiplied. The -4 is because that is the resulting scale we would see after doing the original multiply
    00 * 27 = 0000 * 10^(4 - 4)
    10 * 51 = 0510 * 10^(2 - 4)
    27 * 51 = 1377 * 10^(0 - 4)
  3. Do some initial overflow checks. Essentially do the same kind of overflow checks we do now for max and min values that can fit in the output, but we scale it to match the scale of the partial answer.
    999.9 >= 0000 * 10^2 >= -999.9
    999.9 >= 0000 >= -999.9
    999.9 >= 05.10 >= -999.9
    999.9 >= .1377 >= -999.9

If any of them would overflow we know that the final result would also overflow.

  1. add the results and round/truncate as we go. We start by adding all of the numbers together that have values below the desired output scale. I think with how Spark works we can just truncate the lower number to the precision we need before rounding, but I am not 100% sure on that. Will need to do some experimentation to see.
    .1377 truncate to scale 2 => .13
    .13 + 05.10 => 05.23 round to the final precision
    05.2

  2. Move the remaining values to the final output scale and add them together. The rounding could have added another value to the maximum precision for the result. But because we know that a decimal128 has enough space to hold this without really overflowing we don't need to worry about it because we already checked the ranges ahead of time.

0000 * 10^2 + 0000.0 + 05.2 => 5.2

  1. finally check the range of the result, because we did rounding.

This is already long enough, but I think we can do something similar for divide, and I think we can also reduce the complexity of multiply with a little work.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 30, 2021
@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Nov 2, 2021
@sameerz
Copy link
Collaborator

sameerz commented Jan 18, 2022

Removing from 22.02 until we have a more clear customer need.

@jlowe jlowe self-assigned this Feb 23, 2022
@sameerz sameerz assigned revans2 and unassigned jlowe Aug 2, 2022
@sameerz
Copy link
Collaborator

sameerz commented Aug 2, 2022

This blocks issues #6142 and #6143

@revans2
Copy link
Collaborator Author

revans2 commented Aug 3, 2022

Multiply looks relatively simple to implement. Divide however is much more difficult.

Sadly, it looks like in the worst case we are going to need more than 256-bits.

  • 192-bits can hold 57 singed digits (see below for average)
  • 256-bits can hold 76 signed digits
  • 384-bits can hold 115 signed digits (see below for divide in general)
  • 512-bits can hold 153 signed digits

For divide we need to have the large values because we have to shift the dividend/lhs over to make enough room so the output gets the precision/scale it needs + 1. The + 1 is so we can round the result afterwards to match Spark. In the worst case the output is (38, 38) the dividend/lhs is (38, 0) and the divisor/rhs is (38, 38). This is highly contrived and unrealistic but possible. So, if we can support it we should try to.This would make it, so we had to shift the lhs to have a scale of 77 = 38 + 38 + 1 and a precision of 115 = 77 + 38.

// For Spark the final desired output is
// new_scale = max(6, lhs.scale + rhs.precision + 1)
// new_precision = lhs.precision - lhs.scale + rhs.scale + new_scale
// But Spark will round the final result, so we need at least one more
// decimal place on the scale to be able to do the rounding too.
def lhsNeededScale(rhs: DecimalType, outputType: DecimalType): Int =
outputType.scale + rhs.scale + 1
def lhsNeededPrecision(lhs: DecimalType, rhs: DecimalType, outputType: DecimalType): Int = {
val neededLhsScale = lhsNeededScale(rhs, outputType)
(lhs.precision - lhs.scale) + neededLhsScale
}
).

We might be able to play some games if we can get the remainder out of the divide algorithm.

if (remainder >= dividend/2) {
  output += 1
} 

With that we would only need a precision of 114 and scale of 76. But 384-bits can support everything that we need so there really isn't a lot of reason to do it unless it makes the rounding work much much simpler. Not sure that what I am proposing actually will work.

For average #6142 we know that the divisor/rhs is a long so it would end up being a (20, 0) and the output is the input to average with 4 added to the scale, but bound to 38. So worst case we would end up with an output of (38, 38), a dividend/lhs of (38, 34) and a divisor/rhs of (20, 0). This means the lhs would need to be shifted to have a scale of 39 and a precision of 43.

Not sure if we want to have multiple kernels for different sizes of just one. We probably can start with the hardest one first and then see what kind of a performance boost we get for smaller sizes.

The good news is that in either case the rhs stays as a smaller value 128-bits at most, but could be 64-bits or even 32-bits. This allows us to potentially save a lot of register space when doing the divide. Still need to think about how this might impact the choice we have to algorithms. 384-bits is still a lot.

@revans2 revans2 added this to the Aug 22 - Sep 2 milestone Aug 30, 2022
@revans2
Copy link
Collaborator Author

revans2 commented Aug 30, 2022

The basics of multiply and divide are in place now.

@revans2 revans2 closed this as completed Aug 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants