A bit of backstory for the discussion, some of this is opinion, but hopefully at least somewhat helpful.
I think it is useful to think about the operations as belonging to two categories: one dealing with floating point semantics and the other with other platform specifics (mostly integer). What this allows is separating questions regarding acceptable floating point output from other, arguably less tricky ones, like encoding invalid values when converting floats to ints. This division is somewhat subjective, but might become clearer with more concrete examples below.
Relaxed versions of existing 'integer' SIMD operations
i8x16.swizzle, different treatment of out-of-bounds lane indices
float to integer conversions, different treatment of NaN values and overflow
laneselect, different lane encoding in the mask
Swizzle, laneselect, and float-to-int conversions in the existing SIMD spec have Arm semantics, and the new operations match them on Arm while having different output on x86. Unlike floating point, the differences are much more subjective (for example, should the invalid value be all zeros or all ones?). It might even be possible to imagine a world where both flavors coexist. Emulating such operations is likely to be less tedious than trying to emulate an operation with better FP accuracy, plus they generally don't deviate from semantics already established for scalar operations.
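To make the platform differences concrete, here is a scalar sketch (function names are mine; Python ints and floats stand in for individual lanes) of how the two instruction sets treat out-of-bounds swizzle indices and invalid float-to-int conversions:

```python
import math

def swizzle_arm(lanes, indices):
    # Arm TBL: any lane index >= 16 selects zero
    return [lanes[i] if 0 <= i < 16 else 0 for i in indices]

def swizzle_x86(lanes, indices):
    # x86 PSHUFB: a set top bit zeroes the lane; otherwise only the
    # low 4 bits of the index are used
    return [0 if i & 0x80 else lanes[i & 0x0F] for i in indices]

def trunc_sat_arm(x):
    # Arm FCVTZS (what i32x4.trunc_sat_f32x4_s specifies): NaN -> 0,
    # out-of-range values saturate
    if math.isnan(x):
        return 0
    return max(-2**31, min(2**31 - 1, int(x)))

def trunc_x86(x):
    # x86 CVTTPS2DQ: NaN and anything unrepresentable after truncation
    # all collapse to INT32_MIN (0x80000000)
    if math.isnan(x):
        return -2**31
    t = int(x)  # int() truncates toward zero, like the instruction
    return t if -2**31 <= t <= 2**31 - 1 else -2**31

# index 17 is out of bounds: Arm yields 0, x86 wraps to lane 1
lanes = list(range(16))
assert swizzle_arm(lanes, [17] * 16) == [0] * 16
assert swizzle_x86(lanes, [17] * 16) == [1] * 16
```

In-bounds indices and in-range conversions agree everywhere; the divergence is confined to the "invalid" cases, which is what makes these differences feel more like an encoding choice than an accuracy question.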
Relaxed versions of existing floating point SIMD operations
fmin, different treatment of +/- 0.0, NaN inputs
fmax, different treatment of +/- 0.0, NaN inputs
The gist is that x86 operations, unlike Arm operations, "short circuit" on NaN and disregard the sign of zero.
Code that cannot rule out NaN inputs would likely expect more symmetric variants than what x86 provides natively, and there are well-known instruction sequences that bring the behavior up to, say, the C++ spec or one of the IEEE standards. Obviously, the proposed operations have vastly better performance on x86 than the strict ones, but code that doesn't rule out NaNs needs some mitigation (along the lines of what native libraries do), which still might be worth it from a performance point of view.
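A scalar model of the asymmetry and one shape the mitigation can take (names are mine; real lowerings do the NaN patch-up branch-free with cmpunordps/blends rather than the Python branch below):

```python
import math

def min_x86(a, b):
    # x86 MINPS lane semantics: a < b ? a : b; any comparison involving
    # NaN is false, so the second operand is returned unchanged
    return a if a < b else b

def min_ieee_like(a, b):
    # mitigation sketch: run MINPS in both operand orders, merge the zero
    # signs, and patch up NaN propagation with an explicit unordered check
    if math.isnan(a) or math.isnan(b):
        return float('nan')
    lo1 = min_x86(a, b)
    lo2 = min_x86(b, a)
    # taking both orders makes min(+0.0, -0.0) symmetric: prefer -0.0
    # whenever either order produced it
    if lo1 == 0.0 and lo2 == 0.0:
        if math.copysign(1.0, lo1) < 0 or math.copysign(1.0, lo2) < 0:
            return -0.0
    return lo1
```

Note how `min_x86(nan, x)` returns `x` while `min_x86(x, nan)` returns NaN; that operand-order dependence is exactly what the extra instructions paper over.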
New operations
Just to summarize:
Integer and fixed point
Q format multiplication, different output in case of overflow
Integer dot product, different behavior w.r.t signed/unsigned values
Floating point
FMA, different accuracy (fused vs not fused)
bfloat16 dot product, different accuracy (fused vs not fused), new number encoding
I think in general those have the same FP vs non-FP considerations as above, with a few extras (like single rounding FMA). The fact that those are new may not be an advantage.
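As one concrete instance of "different output in case of overflow" for Q-format multiplication, a scalar sketch of the usual x86 (PMULHRSW) versus Arm (SQRDMULH) lane behavior (function names are mine):

```python
def q15_mulr_x86(a, b):
    # x86 PMULHRSW: rounding Q15 multiply with no saturation; the single
    # overflow case (-32768 * -32768) wraps around to -32768
    r = ((a * b >> 14) + 1) >> 1
    return (r & 0xFFFF) - 0x10000 if r & 0x8000 else r & 0xFFFF

def q15_mulr_arm(a, b):
    # Arm SQRDMULH: the same rounding multiply, but saturating, so the
    # overflow case yields +32767
    r = ((a * b >> 14) + 1) >> 1
    return min(r, 0x7FFF)
```

The two agree on every input pair except (-32768, -32768), which is precisely the overflow case the relaxed operation leaves implementation-defined.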
This is a partial answer to @titzer's question about what the alternatives for the "union" approach are. I haven't looked into the newer operations as closely as the old ones.
Looked into this as a side effect of a different project.
FMA
True FMA can only be emulated via integer ops - the inputs need to be broken up into components, the partial products computed, and the result rounded and stored back into a float. It should take about 5 additions and 5 multiplications to get the result. This is expensive, though some existing SIMD instructions have even worse lowerings (unsigned int conversions, for example).
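To show what the fused/unfused accuracy difference actually looks like, here is a scalar float32 model (names are mine; float64 serves as the wide arithmetic, which is the widening an f32 SIMD lowering cannot do in-register and why the integer splitting above is needed):

```python
import struct

def f32(x):
    # round to float32 precision; scalar stand-in for one SIMD lane
    return struct.unpack('<f', struct.pack('<f', x))[0]

def fma_f32(a, b, c):
    # fused: the product of two float32 values is exact in float64
    # (24 + 24 <= 53 significand bits), so only the add rounds; the final
    # f64 -> f32 step can double-round in rare cases, fine for a sketch
    return f32(a * b + c)

def mul_add_f32(a, b, c):
    # unfused: round after the multiply and again after the add
    return f32(f32(a * b) + c)

# a case where the two results differ:
a = 1.0 + 2.0**-12        # exactly representable in float32
c = -(1.0 + 2.0**-11)     # cancels the *rounded* product exactly
# fma_f32(a, a, c) keeps the 2**-24 tail of the product;
# mul_add_f32(a, a, c) has already rounded it away and returns 0.0
```

The fused result is 2^-24 and the unfused result is exactly zero, so code comparing results across platforms can see not just a different last bit but a completely different magnitude.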
Floating-point min and max
Edit: removed a couple paragraphs describing emulation of x86 floating-point min and max, since we already have those in the standard. Thanks to @abrown for pointing this out.
We have both deterministic variants in the spec already:
f32x4.relaxed_min is either f32x4.min or f32x4.pmin
f32x4.relaxed_max is either f32x4.max or f32x4.pmax
f64x2.relaxed_min is either f64x2.min or f64x2.pmin
f64x2.relaxed_max is either f64x2.max or f64x2.pmax
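The mapping above, modeled per lane (a sketch assuming Python floats approximate f32 lane behavior; names are mine):

```python
import math

def f32x4_min(a, b):
    # deterministic f32x4.min lane: NaN-propagating, -0.0 ordered below +0.0
    if math.isnan(a) or math.isnan(b):
        return float('nan')
    if a == 0.0 and b == 0.0:
        neg = math.copysign(1.0, a) < 0 or math.copysign(1.0, b) < 0
        return -0.0 if neg else 0.0
    return a if a < b else b

def f32x4_pmin(a, b):
    # deterministic f32x4.pmin lane: b < a ? b : a, which maps directly
    # onto x86 MINPS with the operands swapped
    return b if b < a else a

def relaxed_min(a, b, impl_uses_pmin):
    # f32x4.relaxed_min may behave as either deterministic variant,
    # chosen per implementation (modeled here as an explicit flag)
    return f32x4_pmin(a, b) if impl_uses_pmin else f32x4_min(a, b)
```

So the relaxed min/max operations don't introduce new behaviors; they only let the engine pick whichever of the two already-specified variants is cheap on the host.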