RFC: Separate NumPy and CPython implementations/functionality. #8700

Open
stuartarchibald opened this issue Jan 10, 2023 · 0 comments
Labels: RFC (Request for Comments)

This is an RFC on altering Numba's fundamental numerical functionality to better reflect what is present in Python/NumPy. The intended impact of this is to:

  • Ensure better reproducibility, i.e. Numba will more closely follow what would happen in the interpreter.
  • Alter performance characteristics with an anticipated net performance gain.
  • Create conceptually easier paths to follow to achieve higher performance.

Current state:

At present, the Numba code base implements a large number of the functions found in the Python builtins, Python's operator, math and cmath modules, and NumPy. For historical reasons (largely the way in which the compiler was bootstrapped), a lot of the implementations are reasonable approximations for the "common" input values used by most users, but they often fail to replicate the reference functionality exactly.

This "reasonable approximation" has served Numba well, but the current state has a few fundamental problems:

  1. The functions often call into each other's implementations, for example NumPy ufuncs "borrow" implementation details from the math module. The performance characteristics can be somewhat unpredictable as a result, because the implementations from the Python modules often contain exception handling whereas NumPy's do not. Exceptions lead to branches, which lead to lost optimisation opportunities in e.g. vectorization.
  2. The implementations are acceptable along "common" paths, but are often not correctly implemented with respect to things like type promotion or error handling.
  3. Whilst this is an implementation detail, the state of the current code base means that reusing the implementations across hardware targets involves a lot of boilerplate and copy-and-paste to fix things, e.g. on CUDA, swapping libm/intrinsic cos calls for CUDA libdevice cos calls.
  4. Numba's type system aliases Python int with NumPy intp, float with NumPy float64 and complex with NumPy complex128. This makes it hard to be specific about what the behaviour should be when raw numerical values interact with e.g. NumPy functions.

Some examples of the above issues:

This example demonstrates the problems noted in item 2: for simple integer and float inputs to a "power"-like function, quite a large spectrum of result values, result types and error handling behaviours is produced.

import operator
import math
import numpy as np

def operator_pow(a, b):
    return operator.pow(a, b)

def math_pow(a, b):
    return math.pow(a, b)

def np_pow(a, b):
    return np.power(a, b)

for x, y in ((2, 2), (2.1, 2), (2.1, 1.3), (2.1, 0.3), (2, 0.3), (2.1, -1),
             (2, -1), (2.1, -1.2), (-2., -1), (-2, -2.1), (-2.1, -0.2)):

    buf = [f"x={x:<4}, y={y:<4}: "]
    for key, fn in (('o', operator_pow), ('m', math_pow), ('n', np_pow)):
        try:
            o = fn(x, y)
            if isinstance(o, (int, np.integer)):
                prec = '6'
            else:
                prec = '6.4'
            ofmt = f'{key}: {str(type(o)):<18}{o:<{prec}} '
        except Exception as e:
            ofmt = f'{key}: {e} '
        buf.append(ofmt)
    print(''.join(buf))

which produces:

x=2   , y=2   : o: <class 'int'>     4      m: <class 'float'>   4.0    n: <class 'numpy.int64'>4
x=2.1 , y=2   : o: <class 'float'>   4.41   m: <class 'float'>   4.41   n: <class 'numpy.float64'>4.41
x=2.1 , y=1.3 : o: <class 'float'>   2.624  m: <class 'float'>   2.624  n: <class 'numpy.float64'>2.624
x=2.1 , y=0.3 : o: <class 'float'>   1.249  m: <class 'float'>   1.249  n: <class 'numpy.float64'>1.249
x=2   , y=0.3 : o: <class 'float'>   1.231  m: <class 'float'>   1.231  n: <class 'numpy.float64'>1.231
x=2.1 , y=-1  : o: <class 'float'>   0.4762 m: <class 'float'>   0.4762 n: <class 'numpy.float64'>0.4762
x=2   , y=-1  : o: <class 'float'>   0.5    m: <class 'float'>   0.5    n: Integers to negative integer powers are not allowed.
x=2.1 , y=-1.2: o: <class 'float'>   0.4105 m: <class 'float'>   0.4105 n: <class 'numpy.float64'>0.4105
x=-2.0, y=-1  : o: <class 'float'>   -0.5   m: <class 'float'>   -0.5   n: <class 'numpy.float64'>-0.5
x=-2  , y=-2.1: o: <class 'complex'> (0.2218-0.07208j) m: math domain error n: <class 'numpy.float64'>nan
x=-2.1, y=-0.2: o: <class 'complex'> (0.6975-0.5067j) m: math domain error n: <class 'numpy.float64'>nan

Things to note:

  1. The types of the outputs are not consistent across implementations (integer vs. floating point output type).
  2. The behaviour with respect to the numerical "domains" is not consistent: some implementations raise, others do the "mathematically correct" thing (returning a complex result), and others return nan. That NumPy always returns a value without raising is good for performance, as there will be no branches to check for exceptions; this leads to more vectorization opportunities.

This example shows that math module functions call the __float__ method on the input type and so actually operate in double precision, unlike NumPy, which specialises on the input type:

import numpy as np
import math
import ctypes
from ctypes.util import find_library
libm_name = find_library('m')
libm = ctypes.CDLL(libm_name)
libm.sin.argtypes=(ctypes.c_double,)
libm.sin.restype=ctypes.c_double

x = np.float16(1.2)

m = math.sin(x)
print(type(m), m)

n = np.sin(x)
print(type(n), n)

def fake_math(x):
    # emulate what the math module does: convert to double precision via
    # __float__ and call libm's double precision sin
    return libm.sin(x.__float__())

f = fake_math(x)
print(type(f), f)

which produces:

<class 'float'> 0.9321098411884629
<class 'numpy.float16'> 0.932
<class 'float'> 0.9321098411884629

Things to note:

  1. The math module converts input values to double precision via __float__ and then runs double precision numerical functions on these values.
  2. NumPy ufuncs specialise on the type of the float, which gives better performance.
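
A further illustrative example (not one of the original examples above) of item 4 from the list of problems: because Numba aliases Python int with the machine-sized intp, the arbitrary-precision semantics of CPython integers are lost inside compiled code. A minimal sketch, assuming a 64-bit platform with numba installed:

from numba import njit

def cube(x):
    return x * x * x

x = 3_000_000_000  # fits in a 64-bit integer, but its cube does not

print(cube(x))        # CPython int is arbitrary precision: exact result
print(njit(cube)(x))  # Numba types the Python int as a 64-bit integer, so the result wraps

The interpreter produces the exact mathematical result, whereas the compiled version silently returns a wrapped 64-bit value.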

Thoughts on target specific behaviours:

Numba currently supports CPU and CUDA targets. Due to recent work it is now possible to write @overloads of functions that are "generic", i.e. should work on any hardware, and also to write specialised overloads that target specific hardware. Having started to use this functionality/API more, it has become clear that the point of abstraction at which "target specific" implementations must exist is the point where there is some fundamental difference in the actual target, e.g. something only exists on a particular target, or there are performance specialisations available for a particular target. For the implementation of mathematical functionality, the relevant point of abstraction is the call into the underlying maths library, as this is something that is solely called by Numba and differs specifically at the hardware target level.
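
As a purely illustrative sketch of the generic vs. target-specific @overload API described above (my_sin is a hypothetical function; the target keyword and the numba.cuda.libdevice bindings are assumed to be those available in recent Numba versions):

import math
from numba.core.extending import overload

def my_sin(x):  # hypothetical function that compiled code will call
    return math.sin(x)

# "Generic" overload: considered by any hardware target.
@overload(my_sin)
def ol_my_sin_generic(x):
    def impl(x):
        return math.sin(x)
    return impl

# Target-specific overload: only considered when compiling for CUDA, where it
# can take advantage of the libdevice maths library.
@overload(my_sin, target="cuda")
def ol_my_sin_cuda(x):
    from numba.cuda import libdevice
    def impl(x):
        return libdevice.sin(x)
    return impl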

The proposal:

Considering all of the above, it is proposed that the Numba internals for the Python numerical builtins, operator, math/cmath and numpy (particularly the ufuncs) be altered to reflect the actual behaviours present when executing in the Python interpreter. The impact is anticipated to be as follows:

  1. The new behaviour would be more correct in its replication of execution in the Python interpreter, and predictably so. This may have an impact on users that are relying on the implementation details of the current Numba implementations; it is expected that this is a small number of users.
  2. The builtins, operator and math/cmath modules would all behave correctly with respect to numerical domains and raise exceptions appropriately. The performance impact of this is that use of these functions will make it much harder for the underlying optimisers to e.g. vectorize the code. It is suggested that a NumbaPerformanceWarning encouraging the use of NumPy is issued if functions from these modules are used. The number of users impacted by this is not known, but is probably quite small as most will be using NumPy already. The outlier in this is users of the CUDA target, for which there is no NumPy ufunc support at present; this is being implemented in CUDA: Add trig ufunc support #8294 to create an "upgrade" path.
  3. The NumPy ufuncs would all be implemented in a largely exception-free manner, as they are in NumPy. This better replicates NumPy and also has the advantage that optimisation passes which attempt vectorization will be able to do more as there are fewer branches. Further, encouraging users to use the overloaded NumPy functionality would also mean their code can be type specialised correctly, i.e. float32 inputs will (typically) produce float32 output. This makes it very easy to gain performance: simply "use NumPy".
  4. Much of the new/updated numerical code can be implemented for the "generic" target. The only place where there needs to be a target specific abstraction is at the point of calling the underlying target's maths library. In the case of the CPU this is a mix of LLVM intrinsics and libm calls; in the case of CUDA it is the use of libdevice. It should be possible to create a math_h module which contains overloadable stubs based on the functions declared in C's math.h. All the rest of the Python and NumPy functionality can then be implemented on top of this math_h module, in a manner similar to how it is written in the existing implementations. This provides maximum reuse of code and makes it very easy for "new" hardware targets to make use of this functionality: all that would be needed is to create the target specific overloads for the math_h stubs (a sketch of this layering is given after this list).
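
The following is a rough, purely illustrative sketch of what such a math_h stub layer could look like. None of these names exist in Numba today (math_h and its cos stub are hypothetical); it assumes the target keyword to @overload and the numba.cuda.libdevice bindings available in recent Numba versions, it ignores per-dtype dispatch (e.g. cosf vs cos), and it ignores that Numba already ships an np.cos implementation:

import math
import numpy as np
from numba.core.extending import overload

# math_h: target-agnostic stubs mirroring the functions declared in C's math.h.
def cos(x):
    raise NotImplementedError("math_h.cos is a stub, resolved per-target at compile time")

# CPU target: defer to the existing lowering (LLVM intrinsic / libm call).
@overload(cos, target="cpu")
def ol_cos_cpu(x):
    def impl(x):
        return math.cos(x)
    return impl

# CUDA target: defer to libdevice.
@overload(cos, target="cuda")
def ol_cos_cuda(x):
    from numba.cuda import libdevice
    def impl(x):
        return libdevice.cos(x)
    return impl

# The Python/NumPy level functionality is then written once against the stubs,
# for the "generic" target, and works on any hardware that overloads math_h.
@overload(np.cos)
def ol_np_cos(x):
    def impl(x):
        return cos(x)  # resolves to libm/intrinsics on CPU, libdevice on CUDA
    return impl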

The above proposal intends to address items 1, 2 and 3 as listed in the "Current state" above. Item 4 will not be addressed; however, with this adjustment to the implementations it should be easier to make changes to the type system, as the implementations of CPython and NumPy functions will be correctly separated such that types won't "leak" between implementations.
