This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

move{32,64}_zero_{r,v} instructions #374

Closed

Conversation

@omnisip commented Oct 7, 2020

Introduction

@Maratyszcza has done a wonderful job describing the use cases and functionality of load64_zero and load32_zero in #237. This proposal seeks to extend load64_zero and load32_zero to be functionally complete with the underlying architectures by adding their sister variants, which have identical lowerings. These variants initialize a vector from a 32-bit or 64-bit general-purpose register, or from the low 32 or 64 bits of another vector, zeroing the remaining bits. The proposed instructions are move32_zero_r, move64_zero_r, move32_zero_v, and move64_zero_v respectively. Since these are sister instructions, the applications and use cases are identical to those of the original proposal.
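
For clarity, here is a minimal sketch of the intended semantics in plain C. The struct and function names are illustrative only (they are not proposed intrinsics); each operation writes its source into the low bits of a fresh v128 and zeroes everything above it.

        #include <stdint.h>

        typedef struct { uint64_t lo, hi; } v128_model;   /* stand-in for a v128 value */

        /* move32_zero_r: low 32 bits come from a general-purpose register,
           the remaining 96 bits are zero. */
        static v128_model move32_zero_r(uint32_t r) {
            v128_model v = { (uint64_t)r, 0 };
            return v;
        }

        /* move64_zero_v: low 64 bits are copied from another vector,
           the high 64 bits are zero. */
        static v128_model move64_zero_v(v128_model src) {
            v128_model v = { src.lo, 0 };
            return v;
        }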

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to VMOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to VMOVQ xmm_v, r64
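
For reference, the same MOVD/MOVQ lowerings are what compilers emit for the existing SSE2 intrinsics _mm_cvtsi32_si128 and _mm_cvtsi64_si128, so the register variants could be expressed in C along these lines (the wrapper names are illustrative, not proposed API):

        #include <emmintrin.h>
        #include <stdint.h>

        /* low 32 bits = r, upper 96 bits zeroed (MOVD/VMOVD) */
        static __m128i move32_zero_r_x86(uint32_t r) { return _mm_cvtsi32_si128((int)r); }

        /* low 64 bits = r, upper 64 bits zeroed (MOVQ/VMOVQ); x86-64 only */
        static __m128i move64_zero_r_x86(uint64_t r) { return _mm_cvtsi64_si128((long long)r); }

The same intrinsics cover the MOVD/MOVQ lowerings in the SSE2 section further down.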

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        vxorps  xmm1, xmm1, xmm1
        vblendps        xmm0, xmm1, xmm0, 1             # xmm0 = xmm0[0],xmm1[1,2,3]
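
This blend-with-zero sequence corresponds to the SSE4.1 intrinsic _mm_blend_ps applied against a zeroed register; a hedged sketch (treating the v128 as four floats, wrapper name illustrative):

        #include <smmintrin.h>

        /* keep lane 0 of v, zero lanes 1-3 */
        static __m128 move32_zero_v_avx(__m128 v) {
            return _mm_blend_ps(_mm_setzero_ps(), v, 0x1);
        }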

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to VMOVQ xmm_v, xmm
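
MOVQ xmm, xmm (copy the low 64 bits, zero the upper 64) is exactly what the SSE2 intrinsic _mm_move_epi64 does, so a C equivalent for this variant on both the AVX and SSE2 paths could be (wrapper name illustrative):

        #include <emmintrin.h>

        /* low 64 bits preserved, high 64 bits zeroed (MOVQ/VMOVQ xmm, xmm) */
        static __m128i move64_zero_v_x86(__m128i v) {
            return _mm_move_epi64(v);
        }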

x86/x86-64 processors with SSE2 instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to MOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to MOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        xorps   xmm1, xmm1
        movss   xmm1, xmm0                      # xmm1 = xmm0[0],xmm1[1,2,3]
        movaps  xmm0, xmm1
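
The xorps/movss pattern matches the SSE intrinsic _mm_move_ss with a zero vector as the first operand; a sketch of the equivalent C (again treating the v128 as four floats, wrapper name illustrative):

        #include <xmmintrin.h>

        /* lane 0 taken from v, lanes 1-3 zeroed */
        static __m128 move32_zero_v_sse(__m128 v) {
            return _mm_move_ss(_mm_setzero_ps(), v);
        }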

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to MOVQ xmm_v, xmm

ARM64 Processors

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to fmov s0, w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to fmov d0, x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to fmov s0, s1

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to fmov d0, d1
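
On AArch64 the same semantics are reachable today through ACLE NEON intrinsics by inserting into lane 0 of a zeroed vector. This is a sketch of the intent only; a compiler may emit movi+ins rather than the single fmov shown above, and the wrapper names are illustrative:

        #include <arm_neon.h>
        #include <stdint.h>

        /* low 32 bits = r, upper 96 bits zero */
        static uint32x4_t move32_zero_r_neon(uint32_t r) {
            return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
        }

        /* low 64 bits = r, upper 64 bits zero */
        static uint64x2_t move64_zero_r_neon(uint64_t r) {
            return vsetq_lane_u64(r, vdupq_n_u64(0), 0);
        }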

ARMv7 with Neon

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to

        movi    v0.2d, #0000000000000000
        mov     v0.s[0], w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to

        movi    v0.2d, #0000000000000000
        mov     v0.d[0], x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        movi    v1.2d, #0000000000000000
        mov     v1.s[0], v0.s[0]
        mov     v0.16b, v1.16b

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to

        movi    v1.2d, #0000000000000000
        mov     v1.d[0], v0.d[0]
        mov     v0.16b, v1.16b

@omnisip (Author) commented Nov 18, 2020

Copied from #373 for easy reading:

This proposal was originally put together for completeness along with load32_zero/load64_zero. Its functionality is equivalent to an AND with 0xffffffffffffffff to keep the lower 64 bits (as movq does) or with 0xffffffff to keep the lower 32 bits (as movd does). Since there is an efficient alternative (albeit one that uses an extra register), I don't think this should hold up any standardization effort. If someone has a need for this to be included before we finalize the instruction set, please comment here.
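
For illustration, the mask-based alternative mentioned above can already be written against clang's wasm_simd128.h intrinsics (assuming -msimd128; the wrapper names are illustrative):

        #include <wasm_simd128.h>

        /* keep the low 64 bits of v, zero the high 64 bits */
        static v128_t keep_low64(v128_t v) {
            return wasm_v128_and(v, wasm_i64x2_const(-1LL, 0));
        }

        /* keep the low 32 bits of v, zero the rest */
        static v128_t keep_low32(v128_t v) {
            return wasm_v128_and(v, wasm_i32x4_const(-1, 0, 0, 0));
        }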

@dtig (Member) commented Feb 2, 2021

@omnisip As there have been no further comments, would you object to this PR being closed?

@dtig added the needs discussion label (Proposal with an unclear resolution) and removed the post SIMD MVP label on Feb 2, 2021
@Maratyszcza (Contributor) commented:

The lowering of these instructions looks good, and if there are important use cases it makes sense to expose the underlying native instructions to WAsm SIMD. However, I personally don't have a use-case for these instructions.

@dtig (Member) commented Feb 2, 2021

Agree that the lowering on different architectures is good. I read this PR as a call for use cases, and since there have been none so far, I'm pinging this thread to see if @omnisip has reasons we should include this. We can also punt this to post-MVP, but I'd be inclined to close if there are no concrete use cases.

@omnisip (Author) commented Feb 3, 2021

If we include these, we would probably only include move32_zero_r and move64_zero_r as top-level instructions. They are special in that they make it possible to initialize a vector from a register without splatting the value, which no single instruction in our current instruction set does.
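
For context, the closest equivalent in the current instruction set is a two-instruction sequence: materialize a zero vector, then replace lane 0. A hedged sketch with clang's wasm_simd128.h intrinsics (wrapper name illustrative):

        #include <wasm_simd128.h>
        #include <stdint.h>

        /* v128.const 0 followed by i32x4.replace_lane 0 */
        static v128_t move32_zero_r_emulated(uint32_t r) {
            return wasm_i32x4_replace_lane(wasm_i64x2_const(0, 0), 0, (int32_t)r);
        }

move32_zero_r would fold this into a single instruction.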

@Maratyszcza (Contributor) commented:

@omnisip We need use-cases where this is useful, and they are not obvious: e.g. AFAIK NEON intrinsics on ARM don't expose this instruction.

@omnisip (Author) commented Feb 4, 2021

Yeah. That's what I've been running across in my research. When I looked for the intrinsic versions of the moves on x64, I found that in most cases, they were immediately followed by a broadcast or a shuffle to replicate the same behavior as dup. If ARM doesn't expose these, I don't see any pertinent reason to keep this PR open -- I just wanted to check loose ends before we closed this out.

@dtig (Member) commented Mar 5, 2021

Closing as per #436.

@dtig closed this on Mar 5, 2021