This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

v128.move32_zero and v128.move64_zero instructions extensions to #237 #373

Open
omnisip opened this issue Oct 6, 2020 · 11 comments

@omnisip commented Oct 6, 2020

Introduction

@Maratyszcza has done a wonderful job describing the use cases and functionality of load64_zero and load32_zero in #237. This proposal seeks to extend that functionality so it is complete with respect to the underlying architectures, by adding sister variants with identical semantics: initializing a vector from a 32-bit or 64-bit scalar register, or from the low 32 or 64 bits of another vector, with the remaining bits zeroed. The proposed instructions are move32_zero_r, move64_zero_r, move32_zero_v, and move64_zero_v, respectively. Since these are sister instructions, the applications, use cases, and lowerings are essentially identical to those in the original proposal. This ticket will serve as a placeholder for the upcoming PR and will be updated in tandem.
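For concreteness, the intended semantics can be sketched in C roughly as follows. This is purely illustrative and non-normative; the v128_bytes_t struct and the function names are placeholders, not proposed API.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative reference semantics (little-endian, as in Wasm).
   v128_bytes_t and the function names are placeholders, not proposed API. */
typedef struct { uint8_t bytes[16]; } v128_bytes_t;

static v128_bytes_t move32_zero_r(uint32_t r) {   /* from a 32-bit scalar */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, &r, 4);                     /* low 32 bits set, rest zero */
    return out;
}

static v128_bytes_t move64_zero_r(uint64_t r) {   /* from a 64-bit scalar */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, &r, 8);                     /* low 64 bits set, rest zero */
    return out;
}

static v128_bytes_t move32_zero_v(v128_bytes_t v) { /* low 32 bits of another vector */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, v.bytes, 4);
    return out;
}

static v128_bytes_t move64_zero_v(v128_bytes_t v) { /* low 64 bits of another vector */
    v128_bytes_t out = {{0}};
    memcpy(out.bytes, v.bytes, 8);
    return out;
}
```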

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to VMOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to VMOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to VMOVD xmm_v, xmm

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to VMOVQ xmm_v, xmm

x86/x86-64 processors with SSE2 instruction set

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to MOVD xmm_v, r32

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to MOVQ xmm_v, r64

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to MOVD xmm_v, xmm

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to MOVQ xmm_v, xmm
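For reference, here is a rough sketch of how these might map onto the standard SSE2 intrinsics in C. The _mm_* names are the real <emmintrin.h> intrinsics; the wrapper names are only illustrative, and the AVX lowerings above are the analogous VEX-encoded forms. As noted further down in this thread, there is no MOVD xmm, xmm form, so the 32-bit vector-to-vector case is shown as a GPR round-trip instead.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

static __m128i move32_zero_r_sse2(uint32_t r) {
    return _mm_cvtsi32_si128((int32_t)r);        /* MOVD xmm, r32 */
}

static __m128i move64_zero_r_sse2(uint64_t r) {
    return _mm_cvtsi64_si128((int64_t)r);        /* MOVQ xmm, r64 (x86-64 only) */
}

static __m128i move64_zero_v_sse2(__m128i v) {
    return _mm_move_epi64(v);                    /* MOVQ xmm, xmm; zeroes the upper 64 bits */
}

static __m128i move32_zero_v_sse2(__m128i v) {
    /* No single MOVD xmm, xmm exists; bounce through a GPR
       (MOVD r32, xmm + MOVD xmm, r32), or AND with a mask. */
    return _mm_cvtsi32_si128(_mm_cvtsi128_si32(v));
}
```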

ARM64 Processors

v128.move32_zero_r

v = v128.move32_zero_r(r32) is lowered to fmov s0, w0

v128.move64_zero_r

v = v128.move64_zero_r(r64) is lowered to fmov d0, x0

v128.move32_zero_v

v = v128.move32_zero_v(v128) is lowered to mov s0, v1.s[0] OR fmov s0, s1

v128.move64_zero_v

v = v128.move64_zero_v(v128) is lowered to mov d0, v1.d[0] OR fmov d0, d1
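Written against the standard arm_neon.h intrinsics in C, the same operations would look roughly like the sketch below; whether a compiler actually folds the zero vector and the lane move into the single fmov/mov forms above is implementation-dependent, so this is only illustrative.

```c
#include <arm_neon.h>
#include <stdint.h>

static uint32x4_t move32_zero_r_neon(uint32_t r) {
    /* Ideally a single "fmov s0, w0"; written as zero + lane insert here. */
    return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
}

static uint64x2_t move64_zero_r_neon(uint64_t r) {
    /* Ideally a single "fmov d0, x0". */
    return vsetq_lane_u64(r, vdupq_n_u64(0), 0);
}

static uint32x4_t move32_zero_v_neon(uint32x4_t v) {
    /* Ideally "mov s0, v1.s[0]" (or "fmov s0, s1"). */
    return vsetq_lane_u32(vgetq_lane_u32(v, 0), vdupq_n_u32(0), 0);
}

static uint64x2_t move64_zero_v_neon(uint64x2_t v) {
    /* Ideally "mov d0, v1.d[0]" (or "fmov d0, d1"). */
    return vsetq_lane_u64(vgetq_lane_u64(v, 0), vdupq_n_u64(0), 0);
}
```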

ARMv7 with Neon

Lowerings for this architecture are still to be filled in (see the request for help with ARMv7 Neon further down in this thread).

v128.move32_zero_r

v = v128.move32_zero_r(r32)

v128.move64_zero_r

v = v128.move64_zero_r(r64)

v128.move32_zero_v

v = v128.move32_zero_v(v128)

v128.move64_zero_v

v = v128.move64_zero_v(v128)

@penzn (Contributor) commented Oct 6, 2020

Memory operations in WebAssembly can do only one of the following:

  • Load: pop a memory index off the stack, load the value from the specified memory location, and push it onto the stack
  • Store: pop a value and an index off the stack, and store the value into the specified memory location

@omnisip (Author) commented Oct 6, 2020

Is this a labeling issue?

I've updated this to use the word 'move' instead of 'load'. I didn't see any other terminology in there that matched this specific case. If there is, please let me know and I'll update promptly.

omnisip changed the title from "v128.load32_zero and v128.load64_zero instructions extensions to #237" to "v128.move32_zero and v128.move64_zero instructions extensions to #237" on Oct 6, 2020
@penzn (Contributor) commented Oct 6, 2020

Setting a lane is done using the replace lane operations. Please try to understand the spec before proposing changes to it, and if you have questions, please ask; we'll be happy to answer.

@omnisip (Author) commented Oct 6, 2020

@penzn This isn't the same as replace_lane. replace_lane replaces a single lane and returns the updated vector, leaving the other lanes intact. These instructions initialize a vector from a scalar or from the low bits of another vector, zeroing the upper bits.

@Maratyszcza (Contributor) commented

Looks reasonable. I suggest @omnisip open a Pull Request with the proposed changes to the specification, because it is more actionable for V8/SpiderMonkey/LLVM devs.

@penzn (Contributor) commented Oct 6, 2020

Do we expect this operation to be "on critical path", and if so, what kind of gains are we going to get by going from two wasm instructions (neither of which touches memory) to one?

@omnisip (Author) commented Oct 6, 2020

@Maratyszcza Thanks for the feedback. If anyone can help with the ARMv7 with Neon intrinsics, I'll generate the PR today.

@penzn Your question has two parts and is very interesting. One thing that may or may not be obvious is that the logic @Maratyszcza and I are proposing is much older, technologically speaking, than any of the shuffle, insert (replace_lane), or extract (extract_lane) instructions. MOVD and MOVQ are the original instructions for initializing a vector on x86, and their support goes all the way back to MMX. Their use is evident in just about every application that has ever had to load a vector on this architecture. Prior to AVX and AVX2, the only ways to initialize a vector were XORs, MOVs, compares, and loads.

With respect to your second question -- what's the benefit over two ops -- let's compare against what we're replacing. If I'm thinking about this correctly (and please correct me if I'm wrong), to get to your two-op solution we'd have to zero the vector and use an insert. In an ideal configuration, that's 3 uops with a throughput of 2.33 on Skylake, whereas MOVQ between vectors is 1 uop with a throughput of 0.33 on the same architecture; a sketch of the two sequences is below. That's a pretty significant difference in performance, and it also doesn't require the use of a shuffle port.

The question that hasn't been asked, but is also relevant to the topic, is whether we should consider adding the two corresponding conversion ops going from XMM/YMM back to general-purpose registers. I think the answer is no -- provided that implementations optimize the extract_lane 0 cases for 32-bit and 64-bit operations to use this under the hood.
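To make that comparison concrete, here is the zero-then-insert sequence next to the single-instruction path, written with standard x86 intrinsics (SSE4.1 is needed for _mm_insert_epi64). The wrapper names are only illustrative, and the uop/throughput figures referenced in the comments are the estimates given above.

```c
#include <smmintrin.h>  /* SSE4.1, also pulls in SSE2 */
#include <stdint.h>

/* Two-op route available today: zero the vector, then insert into lane 0
   (roughly PXOR + PINSRQ -- the ~3 uop sequence estimated above). */
static __m128i move64_zero_two_op(uint64_t x) {
    return _mm_insert_epi64(_mm_setzero_si128(), (int64_t)x, 0);
}

/* Proposed route: a single MOVQ, which already zeroes the upper lanes
   and does not need the shuffle port. */
static __m128i move64_zero_one_op(uint64_t x) {
    return _mm_cvtsi64_si128((int64_t)x);
}
```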

@Maratyszcza (Contributor) commented

VMOVD xmm, xmm and VMOVQ xmm, xmm forms don't exist. You could use [V]MOVSS and [V]MOVSD, but they don't set non-copied lanes to zero.

@omnisip (Author) commented Oct 7, 2020 via email

@omnisip (Author) commented Nov 18, 2020

This proposal was originally put together for completeness alongside load32_zero/load64_zero. Its functionality is equivalent to an AND with a mask whose low 64 bits are all ones (0xFFFFFFFFFFFFFFFF) in place of MOVQ, or whose low 32 bits are all ones (0xFFFFFFFF) in place of MOVD. Since there is an efficient alternative (albeit one that ties up a register for the mask), I don't think this should hold up any standardization effort. If someone has a need for this to be included before we finalize the instruction set, please comment here.
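For anyone who wants the mask-based alternative spelled out, with the wasm_simd128.h intrinsics it would look roughly like this (wasm_v128_and, wasm_i64x2_make, and wasm_i32x4_make are the standard intrinsic names; the wrapper names are only illustrative):

```c
#include <wasm_simd128.h>

/* Keep the low lanes, zero the rest: one constant plus one v128.and. */

static v128_t move64_zero_v_alt(v128_t v) {
    return wasm_v128_and(v, wasm_i64x2_make(-1, 0));        /* low 64 bits survive */
}

static v128_t move32_zero_v_alt(v128_t v) {
    return wasm_v128_and(v, wasm_i32x4_make(-1, 0, 0, 0));  /* low 32 bits survive */
}
```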
