Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Arm64 encodings for IF_SVE_AA_3A to IF_SVE_HL_3A #95127

Merged
merged 33 commits into from
Nov 27, 2023

Conversation

a74nh
Copy link
Contributor

@a74nh a74nh commented Nov 22, 2023

Add the first set of 17 SVE groups.

Contributes towards #94549

Based on top of #95105
Code starts from 55731b4

Change-Id: I9d32d52c8a54e4f001a8af5ff556e747087c9149
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 22, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 22, 2023
@ghost
Copy link

ghost commented Nov 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Add all the groups IF_SVE_**_3A.

Based on top of #95105
Code starts from

Author: a74nh
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@a74nh
Copy link
Contributor Author

a74nh commented Nov 22, 2023

IF_SVE_AA_3A, IF_SVE_AB_3A, IF_SVE_AC_3A are done and fully working.
All other groups are WIP.

Output in jitdump:

IN0c89: 003234      fnmsub  d7, d15, d23, d31
IN0c8a: 003238      and     z0.b, p1/m, z0.b, z2.b
IN0c8b: 00323C      bic     z3.h, p4/m, z3.h, z5.h
IN0c8c: 003240      eor     z14.s, p5/m, z14.s, z16.s
IN0c8d: 003244      orr     z29.d, p7/m, z29.d, z31.d
IN0c8e: 003248      add     z5.b, p6/m, z5.b, z7.b
IN0c8f: 00324C      sub     z15.h, p7/m, z15.h, z29.h
IN0c90: 003250      subr    z2.s, p0/m, z2.s, z13.s
IN0c91: 003254      sdiv    z3.s, p2/m, z3.s, z9.s
IN0c92: 003258      sdivr   z31.d, p3/m, z31.d, z29.d
IN0c93: 00325C      udiv    z1.s, p0/m, z1.s, z0.s
IN0c94: 003260      udivr   z13.d, p7/m, z13.d, z15.d

Output in gdb:

(gdb) disassemble 0xffffa9864360+0x3234,0xffffa9864360+0x3264
Dump of assembler code from 0xffffa9867594 to 0xffffa98675c4:
   0x0000ffffa9867594:	fnmsub	d7, d15, d23, d31
   0x0000ffffa9867598:	and	z0.b, p1/m, z0.b, z2.b
   0x0000ffffa986759c:	bic	z3.h, p4/m, z3.h, z5.h
   0x0000ffffa98675a0:	eor	z14.s, p5/m, z14.s, z16.s
   0x0000ffffa98675a4:	orr	z29.d, p7/m, z29.d, z31.d
   0x0000ffffa98675a8:	add	z5.b, p6/m, z5.b, z7.b
   0x0000ffffa98675ac:	sub	z15.h, p7/m, z15.h, z29.h
   0x0000ffffa98675b0:	subr	z2.s, p0/m, z2.s, z13.s
   0x0000ffffa98675b4:	sdiv	z3.s, p2/m, z3.s, z9.s
   0x0000ffffa98675b8:	sdivr	z31.d, p3/m, z31.d, z29.d
   0x0000ffffa98675bc:	udiv	z1.s, p0/m, z1.s, z0.s
   0x0000ffffa98675c0:	udivr	z13.d, p7/m, z13.d, z15.d

@a74nh
Copy link
Contributor Author

a74nh commented Nov 22, 2023

@kunalspathak
Copy link
Member

cc: @TIHan @amanasifkhalid

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Nov 22, 2023
kunalspathak added a commit to kunalspathak/runtime that referenced this pull request Nov 22, 2023
@@ -505,6 +505,9 @@ class emitter
OPSZ32 = 5,
OPSZ64 = 6,
OPSZ_COUNT = 7,
#elif defined(TARGET_ARM64)
OPSZ_SCALABLE = 5,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are using it anywhere. can we revert this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used indirectly. When setting idOpSize, the emitattr is converted into opSize:

        void idOpSize(emitAttr opsz)
        {
            _idOpSize = emitEncodeSize(opsz);
        }

/* static */ inline emitter::opSize emitter::emitEncodeSize(emitAttr size)
{
    assert((size != EA_UNKNOWN) && ((size & EA_SIZE_MASK) == size));
    return static_cast<emitter::opSize>(genLog2(size));
}

/* static */ inline emitAttr emitter::emitDecodeSize(emitter::opSize ensz)
{
    assert(static_cast<unsigned>(ensz) < OPSZ_COUNT);
    return emitSizeDecode[ensz];
}

Without the changes, a value of EA_SCALABLE becomes OPSZ_COUNT, and then asserts inside emitDecodeSize()

case IF_SVE_HJ_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point serial reduction (predicated)
case IF_SVE_HL_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point arithmetic (predicated)
emitDispSveReg(id->idReg1(), id->idInsOpt(), true); // ddddd
emitDispPredicateReg(id->idReg2(), true, true); // ggg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am actually wondering if the merge == true is correct for all the formats it shares?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is one example I found, but there are just handful of them, so perhaps for them, we can split the case.

case IF_SVE_AR_4A:   // ........xx.mmmmm ...gggnnnnnddddd -- SVE integer multiply-accumulate writing addend (predicated)
case IF_SVE_GI_4A:   // ........xx.mmmmm ...gggnnnnnddddd -- SVE2 histogram generation (vector)
case IF_SVE_HU_4A:   // ........xx.mmmmm ...gggnnnnnddddd -- SVE floating-point multiply-accumulate writing addend
   emitDispSveReg(id->idReg10(), id->idInsOpt(), true); // ddddd
   emitDispSveReg(id->idReg20(), id->idInsOpt(), true); // nnnnn
   emitDispPredicateReg(id->idReg30(), id->idInsOpt(), true); // ggg
   emitDispSveReg(id->idReg40(), id->idInsOpt(), true); // mmmmm
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should take it group by group. Hardcoding with different cases should be ok for most things. But, will need a little more thinking when we get to the instructions that can take /M or /Z

src/coreclr/jit/targetarm64.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitarm64.cpp Outdated Show resolved Hide resolved
@kunalspathak
Copy link
Member

IMO, I like this change to use a separate name, so debugging is easier (and make it const, too)

Sure. I have put it in #95105 because that will get merge sooner, unblocking others.

kunalspathak added a commit that referenced this pull request Nov 23, 2023
* Add insEncodeReg* methods

* Adjust the `ins` for sve instruction

Picked up the change in #95127 (comment)

* Update the comments

* Use sve_ins_offset
@a74nh a74nh marked this pull request as ready for review November 24, 2023 14:52
Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good progress.

cstool output (using #94896 changes):

It seems the tool needs a fixup for formatting the output aligned. @TIHan .

Also, since you did the encoding comparison using both gdb and cstool. I am wondering which option was easier to you?

theEmitter->emitIns_R_R_R(INS_sve_subr, EA_SCALABLE, REG_V2, REG_P0, REG_V13,
INS_OPTS_SCALABLE_S); // SUBR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>

theEmitter->emitIns_R_R_R(INS_sve_sdiv, EA_SCALABLE, REG_V3, REG_P2, REG_V9,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tool generated just 1 line per instruction, but we should add some more permutation for registers, emitAttr and instOpts to make sure emitIns_* methods works as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each small group I've made sure to try to go across as many permutations as possible. (ie: INS_sve_and, INS_sve_bic, INS_sve_eor, INS_sve_orr are all IF_SVE_AA_3A and go through the same code so will have the same boundaries). The number of instructions here meant I felt there was decent coverage. I also made sure to hit value boundaries where possible (0, 31 etc).

The only instruction I felt I needed to add more permutations on were the EA_4BYTE versions of clasta/clastb.

What I found limiting is we can't test for error cases. Eg: I can't check that I can't generate a B version of INS_sve_sdiv.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eg: I can't check that I can't generate a B version of INS_sve_sdiv.

Yes, I don't think we have code that verifies that something is not possible to get generated.

@@ -8048,6 +8114,216 @@ void emitter::emitIns_R_R_R(
fmt = IF_DV_3A;
break;

case INS_sve_and:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps I should also let the tool generate this code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be useful, even if it's just some isVectorRegister() asserts

@BruceForstall
Copy link
Member

@a74nh I see one set of discrepancies in your example above:

JitDump:

290 fadda h21, p6/m, h21, z14.h
294 fadda s22, p5/m, s22, z13.s
298 fadda d23, p4/m, d23, z12.d

cstool:

290 fadda h21, p6, h21, z14.h
294 fadda s22, p5, s22, z13.s
298 fadda d23, p4, d23, z12.d

(so, the JIT is generating /m on output when it shouldn't). Looks like this is a case where there is no /m or /z option in the ISA.

src/coreclr/jit/emitarm64.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitarm64.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitarm64.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitarm64.h Outdated Show resolved Hide resolved
Comment on lines +5437 to +5439
// #define ALL_ARM64_EMITTER_UNIT_TESTS_GENERAL
// #define ALL_ARM64_EMITTER_UNIT_TESTS_ADVSIMD
// #define ALL_ARM64_EMITTER_UNIT_TESTS_SVE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding that categories; hopefully that helps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost put this in another PR.

Also, the INS_brk test fails to compile because it should be INS_brk_windows and INS_brk_linux.

It be good if we could use the cstool output to automatically have this tested and checked.

@a74nh
Copy link
Contributor Author

a74nh commented Nov 27, 2023

@a74nh I see one set of discrepancies in your example above:

JitDump:

290 fadda h21, p6/m, h21, z14.h
294 fadda s22, p5/m, s22, z13.s
298 fadda d23, p4/m, d23, z12.d

cstool:

290 fadda h21, p6, h21, z14.h
294 fadda s22, p5, s22, z13.s
298 fadda d23, p4, d23, z12.d

(so, the JIT is generating /m on output when it shouldn't). Looks like this is a case where there is no /m or /z option in the ISA.

Done.

This is purely a printing issue, the encoding of the instruction is correct. For every instruction I've done so far, there are no bits to indicate the type (/m etc) of the predicate, it's just fixed for each instruction.

@a74nh
Copy link
Contributor Author

a74nh commented Nov 27, 2023

This PR has 82 instructions out of the 889 generated by the tool. That's 9.2%. It took me a week to write the code.

@kunalspathak
Copy link
Member

kunalspathak commented Nov 27, 2023

This PR has 82 instructions out of the 889 generated by the tool. That's 9.2%. It took me a week to write the code.

That's almost 1 hour per instruction. I was hoping it would be faster given the tool generated most of the code, but I guess most of the time taken was also writing some new methods that were needed and stitching everything together. Hoping next batch would be faster and if not, let me know if things need to be added in the tool to reduce the burden. Here are few things that I have:

  • Generate emitIns* code
  • Move the comments containing group name from individual unit test to the top (your changes made in 96271ae codegenarm64.cpp )
  • Use the information in Arm64: Implement SVE encodings #94549 (comment) to generate the latency and throughput information.
  • Pass PredicateType to the display methods.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks great to me. Added some minor comments.

src/coreclr/jit/emitarm64.cpp Show resolved Hide resolved
src/coreclr/jit/emitarm64.cpp Show resolved Hide resolved
src/coreclr/jit/emitarm64.cpp Show resolved Hide resolved
src/coreclr/jit/emitarm64.cpp Outdated Show resolved Hide resolved
case IF_SVE_GR_3A: // ........xx...... ...gggmmmmmddddd -- SVE2 floating-point pairwise operations
case IF_SVE_HJ_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point serial reduction (predicated)
case IF_SVE_HL_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point arithmetic (predicated)
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if this should be fixed to the real values of latency and throughput. can you point to a document that has these details?

I didn't find anything on internet except this kind of questions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky one, as the latency and throughput will change depending on the microarchitecture. For example, neon instructions won't be the same on Altra and N2.

For now, maybe we should just pick one (N2 sounds a reasonable choice for SVE).

There is definitely a future task to add other microarchitecture latencies and throughputs into coreclr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, maybe we should just pick one (N2 sounds a reasonable choice for SVE).

If there is a parsable data, I can probably generate it via the tool with right numbers too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, only the pdf.

I've manually done it for this PR. It wasn't fun to do. Thankfully, most of your groups are the same as the groups in the guide. But not all.

Not sure how easy it will be to parse as you'll need to turn instructions back into the groups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, only the pdf.

Can you share the link for pdf that has SVE instructions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted that information and pasted in #94549 (comment). I will see if I can make the tool parse that and generate the right latencies and throughput.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kunalspathak kunalspathak merged commit a26802a into dotnet:main Nov 27, 2023
129 checks passed
@kunalspathak
Copy link
Member

This PR has 82 instructions out of the 889 generated by the tool. That's 9.2%. It took me a week to write the code.

That's almost 1 hour per instruction. I was hoping it would be faster given the tool generated most of the code, but I guess most of the time taken was also writing some new methods that were needed and stitching everything together. Hoping next batch would be faster and if not, let me know if things need to be added in the tool to reduce the burden. Here are few things that I have:

  • Generate emitIns* code
  • Move the comments containing group name from individual unit test to the top (your changes made in 96271ae codegenarm64.cpp )
  • Use the information in Arm64: Implement SVE encodings #94549 (comment) to generate the latency and throughput information.
  • Pass PredicateType to the display methods.

Fixed all of these and uploaded latest files in #94549.

@a74nh a74nh deleted the SVE_AA_3A_4_github branch November 28, 2023 09:01
@github-actions github-actions bot locked and limited conversation to collaborators Dec 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants