Floating point vs integer and fixed point #104

Open · penzn opened this issue Nov 4, 2022 · 2 comments

penzn (Contributor) commented Nov 4, 2022

A bit of backstory for the discussion; some of this is opinion, but hopefully it is at least somewhat helpful.

I think it is useful to think of the operations as belonging to two categories: one dealing with floating-point semantics and the other with other platform specifics (mostly integer). This allows separating questions about acceptable floating-point output from other, arguably less tricky ones, such as how to encode invalid values when converting floats to ints. The division is somewhat subjective, but it should become clearer with the concrete examples below.

Relaxed versions of existing 'integer' SIMD operations

  • i8x16.swizzle, different treatment of out-of-bounds lane indices
  • float to integer conversions, different treatment of NaN values and overflow
  • laneselect, different encoding of the selection mask

Swizzle, laneselect, and float-to-int conversions in the existing SIMD spec have Arm semantics, and the new operations match them on Arm while having different output on x86. Unlike floating point, the differences here are much more subjective (for example, should an invalid value be all zeros or all ones?). It might even be possible to imagine a world where both flavors coexist. Emulating such operations is likely to be less tedious than trying to emulate an operation with better FP accuracy, and they generally don't deviate from the semantics already established for scalar operations.
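To make the swizzle bullet concrete, here is a scalar model of the two per-lane index rules (a sketch based on how Arm TBL and x86 PSHUFB treat indices; the function names are mine):

```c
#include <stdint.h>

/* Arm TBL / wasm i8x16.swizzle: any out-of-range index yields 0. */
uint8_t swizzle_lane_arm(const uint8_t table[16], uint8_t idx) {
    return idx < 16 ? table[idx] : 0;
}

/* x86 PSHUFB: only the top bit of the index forces 0; otherwise the
   low 4 bits select a lane, so indices 16..127 still select a lane. */
uint8_t swizzle_lane_x86(const uint8_t table[16], uint8_t idx) {
    return (idx & 0x80) ? 0 : table[idx & 0x0F];
}
```

The two rules agree for indices below 16 and at or above 128; the 16..127 range is exactly where a relaxed swizzle leaves the result up to the platform.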

Relaxed versions of existing floating point SIMD operations

  • fmin, different treatment of +/-0.0 and NaN inputs
  • fmax, different treatment of +/-0.0 and NaN inputs

The gist is that the x86 operations, unlike the Arm ones, "short circuit" on NaN and disregard the sign of zero.

Code that cannot rule out NaN inputs would likely expect more symmetric variants than what x86 provides natively, and there are well-known instruction sequences that bring the behavior up to, say, the C++ spec or one of the IEEE variants. Obviously, the proposed operations have vastly better performance on x86 than the strict ones, but code that doesn't rule out NaNs needs some mitigation (along the lines of what native libraries do), which still might be worth it from a performance point of view.
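As a scalar, per-lane illustration (the function names are mine; this models the MINPS lane rule rather than quoting the manual):

```c
#include <math.h>

/* x86 MINPS lane rule: the comparison is a < b, and any NaN makes it
   false, so the second operand wins on NaN and on equal inputs
   (which includes +/-0.0). */
static float minps_lane(float a, float b) {
    return a < b ? a : b;
}

/* (b < a) ? b : a -- i.e. C++ std::min semantics -- is MINPS with the
   operands swapped, a single instruction; this is what f32x4.pmin encodes. */
static float pmin_lane(float a, float b) {
    return minps_lane(b, a);
}

/* Sketch of a fix-up sequence: a NaN-propagating min costs roughly an
   extra compare and blend on top of MINPS. (The sign of zero is still
   handled the MINPS way here; fixing that costs more instructions.) */
static float min_nan_propagating(float a, float b) {
    float m = minps_lane(a, b);  /* a NaN in b already propagates */
    return isnan(a) ? a : m;     /* blend in a when a is NaN */
}
```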

New operations

Just to summarize:

  • Integer and fixed point
    • Q format multiplication, different output in case of overflow (see the sketch below)
    • Integer dot product, different behavior w.r.t. signed/unsigned values
  • Floating point
    • FMA, different accuracy (fused vs not fused)
    • bfloat16 dot product, different accuracy (fused vs not fused), new number encoding

I think these generally have the same FP vs non-FP considerations as above, with a few extras (like single-rounding FMA). The fact that these operations are new may not be an advantage.
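A scalar sketch of the Q-format case (assuming the usual PMULHRSW and SQRDMULH lane rules; a = b = -32768 is the only input pair that overflows):

```c
#include <stdint.h>

/* Q15 rounding multiply: (a * b * 2 + 0x8000) >> 16. For
   a = b = -32768 the mathematical result is +32768, which does not
   fit in int16_t. */

/* x86 PMULHRSW: the result wraps, so the overflow case gives -32768
   (on the usual two's-complement targets). */
int16_t q15mulr_x86(int16_t a, int16_t b) {
    int64_t r = (((int64_t)a * b) * 2 + 0x8000) >> 16;
    return (int16_t)r;
}

/* Arm SQRDMULH: the result saturates, so the overflow case gives 32767. */
int16_t q15mulr_arm(int16_t a, int16_t b) {
    int64_t r = (((int64_t)a * b) * 2 + 0x8000) >> 16;
    return r > INT16_MAX ? INT16_MAX : (int16_t)r;
}
```

The two models agree on every input except the single overflowing pair, which is exactly the freedom the relaxed Q15 multiply leaves open.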

penzn (Contributor, Author) commented Nov 4, 2022

This is a partial answer to @titzer's question about what the alternatives to the "union" approach are. I haven't looked into the newer operations as closely as the old ones.

penzn (Contributor, Author) commented Dec 6, 2022

Looked into this as a side effect of a different project.

FMA

True FMA can only be emulated via integer ops: the inputs need to be broken up into components, both operations performed, and the result rounded and stored back into a float. It should take about 5 additions and 5 multiplications to get the result. This is expensive, though some existing SIMD instructions have even worse lowerings (unsigned int conversions, for example).
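For a flavor of the splitting involved, here is Dekker's two_product built on the Veltkamp split, written as scalar doubles; the final addition step is simplified and is not always correctly rounded, so this is a sketch of the technique rather than a complete emulation:

```c
/* Veltkamp split: x == hi + lo exactly, with hi holding the top half
   of the significand. 134217729.0 is 2^27 + 1, the splitting constant
   for IEEE double. */
static void split(double x, double *hi, double *lo) {
    double t = x * 134217729.0;
    *hi = t - (t - x);
    *lo = x - *hi;
}

/* Dekker two_product: a * b == p + e exactly (barring overflow and
   underflow), using only ordinary multiplications and additions. */
static void two_product(double a, double b, double *p, double *e) {
    double ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    *p = a * b;
    *e = ((ah * bh - *p) + ah * bl + al * bh) + al * bl;
}

/* Sketch of an emulated fma: p + e is the exact product; folding in c
   and then e approximates the single-rounded result. A fully correct
   emulation also needs an exact-addition (2Sum) step here. */
double fma_emulated(double a, double b, double c) {
    double p, e;
    two_product(a, b, &p, &e);
    return (p + c) + e;
}
```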

Floating-point min and max

Edit: removed a couple paragraphs describing emulation of x86 floating-point min and max, since we already have those in the standard. Thanks to @abrown for pointing this out.

We have both deterministic variants in the spec already:

  • f32x4.relaxed_min is either f32x4.min or f32x4.pmin
  • f32x4.relaxed_max is either f32x4.max or f32x4.pmax
  • f64x2.relaxed_min is either f64x2.min or f64x2.pmin
  • f64x2.relaxed_max is either f64x2.max or f64x2.pmax
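As a scalar model of the two lane rules relaxed_min can pick between, and the inputs on which they diverge (a per-lane sketch; names are mine):

```c
#include <math.h>
#include <stdio.h>

/* f32x4.min lane rule: NaN-propagating, and -0.0 orders below +0.0. */
static float min_lane(float a, float b) {
    if (isnan(a) || isnan(b)) return NAN;
    if (a == b) return signbit(a) ? a : b;  /* picks -0.0 over +0.0 */
    return a < b ? a : b;
}

/* f32x4.pmin lane rule: (b < a) ? b : a, so NaN comparisons fall
   through to a, and signed zeros are not distinguished. */
static float pmin_lane(float a, float b) {
    return b < a ? b : a;
}

int main(void) {
    /* The rules agree everywhere except on NaN and signed-zero inputs: */
    printf("%f vs %f\n", min_lane(1.0f, NAN), pmin_lane(1.0f, NAN));
    /* -> nan vs 1.000000 */
    printf("%f vs %f\n", min_lane(0.0f, -0.0f), pmin_lane(0.0f, -0.0f));
    /* -> -0.000000 vs 0.000000 */
    return 0;
}
```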
