This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

move{32,64}_zero_{r,v} instructions #374

Closed

Conversation

@omnisip commented Oct 7, 2020

Introduction

@Maratyszcza has done a wonderful job describing the use cases and functionality of load64_zero and load32_zero in #237. This proposal seeks to extend load64_zero and load32_zero to be functionally complete with the underlying architectures by adding their sister variants, which have identical lowerings. These variants initialize a vector from a 32-bit or 64-bit general-purpose register, or from the low 32 or 64 bits of another vector, zeroing the remaining bits. The proposed instructions are move32_zero_r, move64_zero_r, move32_zero_v, and move64_zero_v respectively. Since these are sister instructions, the applications and use cases are identical to those of the original proposal.
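
For clarity, here is a minimal sketch of the intended semantics in plain C. The struct and function names are illustrative only (they are not proposed intrinsics); each operation writes its source into the low bits of a fresh v128 and zeroes everything above it.

        #include <stdint.h>

        typedef struct { uint64_t lo, hi; } v128_model;   /* stand-in for a v128 value */

        /* move32_zero_r: low 32 bits come from a general-purpose register,
           the remaining 96 bits are zero. */
        static v128_model move32_zero_r(uint32_t r) {
            v128_model v = { (uint64_t)r, 0 };
            return v;
        }

        /* move64_zero_v: low 64 bits are copied from another vector,
           the high 64 bits are zero. */
        static v128_model move64_zero_v(v128_model src) {
            v128_model v = { src.lo, 0 };
            return v;
        }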

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to VMOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to VMOVQ xmm_v, r64
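
For reference, the same MOVD/MOVQ lowerings are what compilers emit for the existing SSE2 intrinsics _mm_cvtsi32_si128 and _mm_cvtsi64_si128, so the register variants could be expressed in C along these lines (the wrapper names are illustrative, not proposed API):

        #include <emmintrin.h>
        #include <stdint.h>

        /* low 32 bits = r, upper 96 bits zeroed (MOVD/VMOVD) */
        static __m128i move32_zero_r_x86(uint32_t r) { return _mm_cvtsi32_si128((int)r); }

        /* low 64 bits = r, upper 64 bits zeroed (MOVQ/VMOVQ); x86-64 only */
        static __m128i move64_zero_r_x86(uint64_t r) { return _mm_cvtsi64_si128((long long)r); }

The same intrinsics cover the MOVD/MOVQ lowerings in the SSE2 section further down.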

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        vxorps  xmm1, xmm1, xmm1
        vblendps        xmm0, xmm1, xmm0, 1             # xmm0 = xmm0[0],xmm1[1,2,3]
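
This blend-with-zero sequence corresponds to the SSE4.1 intrinsic _mm_blend_ps applied against a zeroed register; a hedged sketch (treating the v128 as four floats, wrapper name illustrative):

        #include <smmintrin.h>

        /* keep lane 0 of v, zero lanes 1-3 */
        static __m128 move32_zero_v_avx(__m128 v) {
            return _mm_blend_ps(_mm_setzero_ps(), v, 0x1);
        }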

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to VMOVQ xmm_v, xmm
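
MOVQ xmm, xmm (copy the low 64 bits, zero the upper 64) is exactly what the SSE2 intrinsic _mm_move_epi64 does, so a C equivalent for this variant on both the AVX and SSE2 paths could be (wrapper name illustrative):

        #include <emmintrin.h>

        /* low 64 bits preserved, high 64 bits zeroed (MOVQ/VMOVQ xmm, xmm) */
        static __m128i move64_zero_v_x86(__m128i v) {
            return _mm_move_epi64(v);
        }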

x86/x86-64 processors with SSE2 instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to MOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to MOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        xorps   xmm1, xmm1
        movss   xmm1, xmm0                      # xmm1 = xmm0[0],xmm1[1,2,3]
        movaps  xmm0, xmm1
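
The xorps/movss pattern matches the SSE intrinsic _mm_move_ss with a zero vector as the first operand; a sketch of the equivalent C (again treating the v128 as four floats, wrapper name illustrative):

        #include <xmmintrin.h>

        /* lane 0 taken from v, lanes 1-3 zeroed */
        static __m128 move32_zero_v_sse(__m128 v) {
            return _mm_move_ss(_mm_setzero_ps(), v);
        }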

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to MOVQ xmm_v, xmm

ARM64 Processors

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to fmov s0, w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to fmov d0, x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to fmov s0, s1

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to fmov d0, d1
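
On AArch64 the same semantics are reachable today through ACLE NEON intrinsics by inserting into lane 0 of a zeroed vector. This is a sketch of the intent only; a compiler may emit movi+ins rather than the single fmov shown above, and the wrapper names are illustrative:

        #include <arm_neon.h>
        #include <stdint.h>

        /* low 32 bits = r, upper 96 bits zero */
        static uint32x4_t move32_zero_r_neon(uint32_t r) {
            return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
        }

        /* low 64 bits = r, upper 64 bits zero */
        static uint64x2_t move64_zero_r_neon(uint64_t r) {
            return vsetq_lane_u64(r, vdupq_n_u64(0), 0);
        }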

ARMv7 with Neon

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to

        movi    v0.2d, #0000000000000000
        mov     v0.s[0], w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to

        movi    v0.2d, #0000000000000000
        mov     v0.d[0], x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to

        movi    v1.2d, #0000000000000000
        mov     v1.s[0], v0.s[0]
        mov     v0.16b, v1.16b

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to

        movi    v1.2d, #0000000000000000
        mov     v1.d[0], v0.d[0]
        mov     v0.16b, v1.16b

@omnisip (Author) commented Nov 18, 2020

Copied from #373 for easy reading:

This proposal was originally put together for completeness along with load32_zero/load64_zero. Its functionality is equivalent to an AND with 0xffffffffffffffff to keep the lower 64 bits (as movq does) or with 0xffffffff to keep the lower 32 bits (as movd does). Since there is an efficient alternative (albeit one that uses an extra register), I don't think this should hold up any standardization effort. If someone has a need for this to be included before we finalize the instruction set, please comment here.
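
For illustration, the mask-based alternative mentioned above can already be written against clang's wasm_simd128.h intrinsics (assuming -msimd128; the wrapper names are illustrative):

        #include <wasm_simd128.h>

        /* keep the low 64 bits of v, zero the high 64 bits */
        static v128_t keep_low64(v128_t v) {
            return wasm_v128_and(v, wasm_i64x2_const(-1LL, 0));
        }

        /* keep the low 32 bits of v, zero the rest */
        static v128_t keep_low32(v128_t v) {
            return wasm_v128_and(v, wasm_i32x4_const(-1, 0, 0, 0));
        }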

@dtig (Member) commented Feb 2, 2021

@omnisip As there have been no further comments, would you object to this PR being closed?

@dtig added the needs discussion label (Proposal with an unclear resolution) and removed the post SIMD MVP label on Feb 2, 2021
@Maratyszcza (Contributor) commented:

The lowering of these instructions looks good, and if there are important use cases it makes sense to expose the underlying native instructions to WAsm SIMD. However, I personally don't have a use-case for these instructions.

@dtig (Member) commented Feb 2, 2021

Agree that the lowering on different architectures is good. I read this PR as a call for use cases, and since there have been none so far, I'm pinging this thread to see if @omnisip has reasons we should include this. We can also punt this to post-MVP, but I'd be inclined to close if there are no concrete use cases.

@omnisip (Author) commented Feb 3, 2021

If we include these, we would probably only include move32_zero_r and move64_zero_r as top-level instructions. They are special in that they make it possible to initialize a vector from a register without splatting the value, which no single instruction in our current instruction set does.
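
For context, the closest equivalent in the current instruction set is a two-instruction sequence: materialize a zero vector, then replace lane 0. A hedged sketch with clang's wasm_simd128.h intrinsics (wrapper name illustrative):

        #include <wasm_simd128.h>
        #include <stdint.h>

        /* v128.const 0 followed by i32x4.replace_lane 0 */
        static v128_t move32_zero_r_emulated(uint32_t r) {
            return wasm_i32x4_replace_lane(wasm_i64x2_const(0, 0), 0, (int32_t)r);
        }

move32_zero_r would fold this into a single instruction.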

@Maratyszcza (Contributor) commented:

@omnisip We need use-cases where this is useful, and they are not obvious: e.g. AFAIK NEON intrinsics on ARM don't expose this instruction.

@omnisip (Author) commented Feb 4, 2021

Yeah. That's what I've been running across in my research. When I looked for the intrinsic versions of the moves on x64, I found that in most cases, they were immediately followed by a broadcast or a shuffle to replicate the same behavior as dup. If ARM doesn't expose these, I don't see any pertinent reason to keep this PR open -- I just wanted to check loose ends before we closed this out.

@dtig (Member) commented Mar 5, 2021

Closing as per #436.

@dtig closed this on Mar 5, 2021