This is an RFC on altering Numba's fundamental numerical functionality to better reflect what is present in Python/NumPy. The intended impact of this is to:
Ensure better reproducibility, i.e. Numba will more closely follow what would happen in the interpreter.
Alter performance characteristics with an anticipated net performance gain.
Create conceptually easier paths to follow to achieve higher performance.
Current state:
At present, the Numba code base implements a large number of the functions found in Python builtins, Python's operator, math and cmath modules, and NumPy. For historical reasons (largely the way in which the compiler was bootstrapped), many of the function implementations are reasonable approximations for the "common" input values most users supply, but they often fail to replicate the reference functionality exactly.
This "reasonable approximation" has served Numba well, but the current state has a few fundamental problems:
1. The functions often call into each other's implementations, for example NumPy ufuncs that "borrow" implementation details from the math module. The performance characteristics can be somewhat unpredictable as a result, because the implementations from Python modules often contain exception handling whereas NumPy's do not. Exceptions lead to branches, which lead to lost optimisation opportunities in e.g. vectorization.
2. The implementations are acceptable along "common" paths, but are often incorrect with respect to things like type promotion or error handling.
3. Whilst this is an implementation detail, the current state of the code base means that reusing the implementations across hardware targets involves a lot of boilerplate and copy-paste, e.g. on CUDA, swapping libm/intrinsic cos calls for CUDA libdevice cos calls.
4. Numba's type system aliases Python int with NumPy intp, float with NumPy float64 and complex with NumPy complex128. This makes it hard to be specific about what the behaviour should be when raw numerical values interact with e.g. NumPy functions.
Some examples of the above issues:
This example demonstrates the problems noted in item 2: for simple integer and float inputs to a "power"-like function, quite a large spectrum of result values, result types and error handling behaviours is produced.
The types of the outputs are not consistent across implementations (integer vs. floating point output type).
The behaviour with respect to the numerical "domains" is not consistent: some implementations raise, others do the "mathematically correct" thing, others return nan. That NumPy always returns a value without raising is good for performance, as there are no branches to check for exceptions; this leads to more vectorization opportunities.
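The original snippet for this example was not preserved here; the following is an illustrative sketch in plain CPython/NumPy (no Numba) of the kind of spread in result types and error handling described above:

```python
import math

import numpy as np

# Result types differ across "power"-like implementations:
print(type(2 ** 3))          # int
print(type(math.pow(2, 3)))  # float: math converts inputs via __float__

# Error handling outside the real domain differs too:
print((-1.0) ** 0.5)         # Python's ** returns a complex number
try:
    math.pow(-1.0, 0.5)      # math.pow raises ValueError (domain error)
except ValueError as exc:
    print("math.pow:", exc)
with np.errstate(invalid="ignore"):
    print(np.power(-1.0, 0.5))  # NumPy returns nan without raising
```

The NumPy branch-free "return nan" behaviour is the one that keeps loops vectorizable.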
This example shows that math functions call the __float__ method on their argument and actually operate in double precision, contrary to NumPy, which specialises:
That the math module converts input values to double precision via __float__ and then runs double precision numerical functions on these values.
NumPy ufuncs specialise on the type of the float, which gives better performance.
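The original snippet for this example was also not preserved; the behaviour can be seen directly in the interpreter, for instance:

```python
import math

import numpy as np

x = np.float32(0.5)

# math.sin calls __float__ on its argument and computes in double
# precision, so the result is a Python float (a 64-bit double):
print(type(math.sin(x)))

# The NumPy ufunc specialises on the input type and stays in float32:
print(np.sin(x).dtype)
```

Staying in float32 throughout is where the NumPy ufuncs gain their performance advantage.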
Thoughts on target specific behaviours:
Numba currently supports CPU and CUDA targets. Due to recent work it's now possible to write @overloads of functions that are "generic", i.e. should work on any hardware, and also to write specialised overloads targeting specific hardware. Having started to use this functionality/API more, it's become clear that the point of abstraction where "target specific" implementations must exist is the point at which there is some fundamental difference in the actual target, e.g. something only exists on a particular target, or some performance specialisation is available for a particular target. For mathematical functionality, the relevant point of abstraction is the call to the underlying maths library, as this is something that is solely called by Numba and differs specifically at the hardware target level.
The proposal:
Considering all of the above, it's proposed that the Numba internals for Python numerical builtins, operator, math/cmath and numpy (particularly the ufuncs) be altered to reflect the actual behaviours present when executing in the Python interpreter. The impact is anticipated to be as follows:
The new behaviour would be more correct in its replication of execution in the Python interpreter, and predictably so. This may have an impact on users relying on the implementation details of the current Numba implementations; it's expected that this is a small number of users.
The builtins, operator and math/cmath modules would all behave correctly with respect to numerical domains and raise exceptions appropriately. The performance impact of this is that use of these functions will be much harder for the underlying optimisers to work with. It's suggested that a NumbaPerformanceWarning encouraging the use of NumPy be issued if functions from these modules are used. The number of users impacted by this is not known, but is probably quite small, as most will be using NumPy already. The outlier is users of the CUDA target, for which there is no NumPy ufunc support at present; this is being implemented in CUDA: Add trig ufunc support #8294 to create an "upgrade" path.
The NumPy ufuncs would all be implemented in a largely exception-free manner, as they are in NumPy. This better replicates NumPy and also has the advantage that optimisation passes which attempt vectorization will be able to do more, as there are fewer branches. Further, encouraging users to use the overloaded NumPy functionality would also mean their code can be type specialised correctly, i.e. float32 inputs will (typically) produce float32 output. This makes it very easy to gain performance: simply "use NumPy".
Much of the new/updated numerical code can be implemented for the "generic" target. The only place where a target specific abstraction is needed is the point of calling the underlying target's maths library: in the case of the CPU this is a mix of LLVM intrinsics and libm calls; in the case of CUDA it's the use of libdevice. It should be possible to create a math_h module containing overloadable stubs based on the functions declared in C's math.h. All the rest of the Python and NumPy functionality can then be implemented on top of this math_h module, in a manner similar to the existing implementations. This provides maximum reuse of code and makes it very easy for "new" hardware targets to make use of this functionality: all that would be needed is the target specific overloads for the math_h stubs.
The above proposal intends to address items 1, 2 and 3 as listed in the "Current state" above. Item 4 will not be addressed; however, with this adjustment to the implementations it should be easier to make changes to the type system, as the implementations of CPython and NumPy functions will be correctly separated such that types won't "leak" between implementations.