This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

v128.move32_zero and v128.move64_zero instructions extensions to #237 #373

Open
omnisip opened this issue Oct 6, 2020 · 11 comments

@omnisip commented Oct 6, 2020

Introduction

@Maratyszcza has done a wonderful job describing the use cases and functionality of load64_zero and load32_zero in #237. This proposal seeks to extend that functionality so it is complete with respect to the underlying architectures, by adding sister variants with identical semantics: initializing a vector from a 32-bit or 64-bit scalar register, or from the low 32 or 64 bits of another vector, with the remaining bits zeroed. The proposed instructions are move32_zero_r, move64_zero_r, move32_zero_v, and move64_zero_v, respectively. Since these are sister instructions, the applications, use cases, and lowerings are essentially identical to those in the original proposal. This ticket will serve as a placeholder for the upcoming PR and will be updated in tandem.
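For concreteness, the intended semantics can be sketched in C roughly as follows. This is purely illustrative and non-normative; the v128_bytes_t struct and the function names are placeholders, not proposed API.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative reference semantics (little-endian, as in Wasm).
   v128_bytes_t and the function names are placeholders, not proposed API. */
typedef struct { uint8_t bytes[16]; } v128_bytes_t;

static v128_bytes_t move32_zero_r(uint32_t r) {   /* from a 32-bit scalar */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, &r, 4);                     /* low 32 bits set, rest zero */
    return out;
}

static v128_bytes_t move64_zero_r(uint64_t r) {   /* from a 64-bit scalar */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, &r, 8);                     /* low 64 bits set, rest zero */
    return out;
}

static v128_bytes_t move32_zero_v(v128_bytes_t v) { /* low 32 bits of another vector */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, v.bytes, 4);
    return out;
}

static v128_bytes_t move64_zero_v(v128_bytes_t v) { /* low 64 bits of another vector */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, v.bytes, 8);
    return out;
}
```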

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to VMOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to VMOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to VMOVD xmm_v, xmm

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to VMOVQ xmm_v, xmm

x86/x86-64 processors with SSE2 instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to MOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to MOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to MOVD xmm_v, xmm

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to MOVQ xmm_v, xmm
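For reference, here is a rough sketch of how these might map onto the standard SSE2 intrinsics in C. The _mm_* names are the real <emmintrin.h> intrinsics; the wrapper names are only illustrative, and the AVX lowerings above are the analogous VEX-encoded forms. As noted further down in this thread, there is no MOVD xmm, xmm form, so the 32-bit vector-to-vector case is shown as a GPR round-trip instead.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

static __m128i move32_zero_r_sse2(uint32_t r) {
    return _mm_cvtsi32_si128((int32_t)r);        /* MOVD xmm, r32 */
}

static __m128i move64_zero_r_sse2(uint64_t r) {
    return _mm_cvtsi64_si128((int64_t)r);        /* MOVQ xmm, r64 (x86-64 only) */
}

static __m128i move64_zero_v_sse2(__m128i v) {
    return _mm_move_epi64(v);                    /* MOVQ xmm, xmm; zeroes the upper 64 bits */
}

static __m128i move32_zero_v_sse2(__m128i v) {
    /* No single MOVD xmm, xmm exists; bounce through a GPR
       (MOVD r32, xmm + MOVD xmm, r32), or AND with a mask. */
    return _mm_cvtsi32_si128(_mm_cvtsi128_si32(v));
}
```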

ARM64 Processors

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to fmov s0, w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to fmov d0, x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to mov s0, v1.s[0] OR fmov s0, s1

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to mov d0, v1.d[0] OR fmov d0, d1
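Written against the standard arm_neon.h intrinsics in C, the same operations would look roughly like the sketch below; whether a compiler actually folds the zero vector and the lane move into the single fmov/mov forms above is implementation-dependent, so this is only illustrative.

```c
#include <arm_neon.h>
#include <stdint.h>

static uint32x4_t move32_zero_r_neon(uint32_t r) {
    /* Ideally a single "fmov s0, w0"; written as zero + lane insert here. */
    return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
}

static uint64x2_t move64_zero_r_neon(uint64_t r) {
    /* Ideally a single "fmov d0, x0". */
    return vsetq_lane_u64(r, vdupq_n_u64(0), 0);
}

static uint32x4_t move32_zero_v_neon(uint32x4_t v) {
    /* Ideally "mov s0, v1.s[0]" (or "fmov s0, s1"). */
    return vsetq_lane_u32(vgetq_lane_u32(v, 0), vdupq_n_u32(0), 0);
}

static uint64x2_t move64_zero_v_neon(uint64x2_t v) {
    /* Ideally "mov d0, v1.d[0]" (or "fmov d0, d1"). */
    return vsetq_lane_u64(vgetq_lane_u64(v, 0), vdupq_n_u64(0), 0);
}
```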

ARMv7 with Neon

Lowerings for this architecture are still to be filled in (see the request for help with ARMv7 Neon further down in this thread).

v128.move32_zero_r

v = v128.move32_zero_r(r32)

v128.move64_zero_r

v = v128.move64_zero_r(r64)

v128.move32_zero_v

v = v128.move32_zero_v(v128)

v128.move64_zero_v

v = v128.move64_zero_v(v128)

@penzn (Contributor) commented Oct 6, 2020

Memory operations in WebAssembly can do only one of the following:

  • Load: pop a memory index off the stack, load the value from the specified memory location, and push it onto the stack
  • Store: pop a value and an index off the stack, and store the value into the specified memory location

@omnisip (Author) commented Oct 6, 2020

Is this a labeling issue?

I've updated this to use the word 'move' instead of 'load'. I didn't see any other terminology in there that matched this specific case. If there is, please let me know and I'll update promptly.

omnisip changed the title from "v128.load32_zero and v128.load64_zero instructions extensions to #237" to "v128.move32_zero and v128.move64_zero instructions extensions to #237" on Oct 6, 2020
@penzn (Contributor) commented Oct 6, 2020

Setting a lane is done using the replace lane operations. Please try to understand the spec before proposing changes to it, and if you have questions, please ask; we'll be happy to answer.

@omnisip (Author) commented Oct 6, 2020

@penzn This isn't the same as replace_lane. replace_lane replaces a single lane and returns the updated vector, leaving the other lanes intact. These instructions initialize a vector from a scalar or from the low bits of another vector, zeroing the upper bits.

@Maratyszcza (Contributor) commented

Looks reasonable. I suggest @omnisip open a Pull Request with the proposed changes to the specification, because it is more actionable for V8/SpiderMonkey/LLVM devs.

@penzn (Contributor) commented Oct 6, 2020

Do we expect this operation to be "on critical path", and if so, what kind of gains are we going to get by going from two wasm instructions (neither of which touches memory) to one?

@omnisip (Author) commented Oct 6, 2020

@Maratyszcza Thanks for the feedback. If anyone can help with the ARMv7 with Neon intrinsics, I'll generate the PR today.

@penzn Your question has two parts and is very interesting. One thing that may or may not be obvious is that the logic @Maratyszcza and I are proposing is much older, technologically speaking, than any of the shuffle, insert (replace_lane), or extract (extract_lane) instructions. MOVD and MOVQ are the original instructions for initializing a vector on x86, and their support goes all the way back to MMX. Their use is evident in just about every application that has ever had to load a vector on this architecture. Prior to AVX and AVX2, the only ways to initialize a vector were XORs, MOVs, compares, and loads.

With respect to your second question -- what's the benefit over two ops -- let's compare against what we're replacing. If I'm thinking about this correctly (and please correct me if I'm wrong), to get to your two-op solution we'd have to zero the vector and use an insert. In an ideal configuration, that's 3 uops with a throughput of 2.33 on Skylake, whereas MOVQ between vectors is 1 uop with a throughput of 0.33 on the same architecture; a sketch of the two sequences is below. That's a pretty significant difference in performance, and it also doesn't require the use of a shuffle port.

The question that hasn't been asked, but is also relevant to the topic, is whether we should consider adding the two corresponding conversion ops going from XMM/YMM back to general-purpose registers. I think the answer is no -- provided that implementations optimize the extract_lane 0 cases for 32-bit and 64-bit operations to use this under the hood.
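To make that comparison concrete, here is the zero-then-insert sequence next to the single-instruction path, written with standard x86 intrinsics (SSE4.1 is needed for _mm_insert_epi64). The wrapper names are only illustrative, and the uop/throughput figures referenced in the comments are the estimates given above.

```c
#include <smmintrin.h>  /* SSE4.1, also pulls in SSE2 */
#include <stdint.h>

/* Two-op route available today: zero the vector, then insert into lane 0
   (roughly PXOR + PINSRQ -- the ~3 uop sequence estimated above). */
static __m128i move64_zero_two_op(uint64_t x) {
    return _mm_insert_epi64(_mm_setzero_si128(), (int64_t)x, 0);
}

/* Proposed route: a single MOVQ, which already zeroes the upper lanes
   and does not need the shuffle port. */
static __m128i move64_zero_one_op(uint64_t x) {
    return _mm_cvtsi64_si128((int64_t)x);
}
```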

@Maratyszcza (Contributor) commented

VMOVD xmm, xmm and VMOVQ xmm, xmm forms don't exist. You could use [V]MOVSS and [V]MOVSD, but they don't set non-copied lanes to zero.

@omnisip (Author) commented Oct 7, 2020 via email

@omnisip (Author) commented Nov 18, 2020

This proposal was originally put together for completeness alongside load32_zero/load64_zero. Its functionality is equivalent to an AND with a mask whose low 64 bits are all ones (0xFFFFFFFFFFFFFFFF) in place of MOVQ, or whose low 32 bits are all ones (0xFFFFFFFF) in place of MOVD. Since there is an efficient alternative (albeit one that ties up a register for the mask), I don't think this should hold up any standardization effort. If someone has a need for this to be included before we finalize the instruction set, please comment here.
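For anyone who wants the mask-based alternative spelled out, with the wasm_simd128.h intrinsics it would look roughly like this (wasm_v128_and, wasm_i64x2_make, and wasm_i32x4_make are the standard intrinsic names; the wrapper names are only illustrative):

```c
#include <wasm_simd128.h>

/* Keep the low lanes, zero the rest: one constant plus one v128.and. */

static v128_t move64_zero_v_alt(v128_t v) {
    return wasm_v128_and(v, wasm_i64x2_make(-1, 0));        /* low 64 bits survive */
}

static v128_t move32_zero_v_alt(v128_t v) {
    return wasm_v128_and(v, wasm_i32x4_make(-1, 0, 0, 0));  /* low 32 bits survive */
}
```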
