Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose AVX512F embedded rounding intrinsics. #97415

Merged
merged 28 commits into from
Feb 15, 2024

Conversation

Ruihan-Yin
Copy link
Contributor

@Ruihan-Yin Ruihan-Yin commented Jan 23, 2024

Follow-up of #94684.

Updates on 02/14/24:

This PR exposes embedded rounding enabled APIs in AVX512:

The followings are from #73604

public partial class Avx512F
{
    public static Vector512<double> Add(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);
    public static Vector512<float> Add(Vector512<float> left, Vector512<float> right, FloatRoundingMode mode);

    public static Vector128<double> AddScalar(Vector128<double> left, Vector128<double> right, FloatRoundingMode mode);
    public static Vector128<float> AddScalar(Vector128<float> left, Vector128<float> right, FloatRoundingMode mode);

    public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, Vector128<double> value, FloatRoundingMode mode);
    public static Vector128<double> ConvertScalarToVector128Double(Vector128<double> upper, Vector128<float> value, FloatRoundingMode mode);

    public static int ConvertToInt32(Vector128<double> value, FloatRoundingMode mode);
    public static int ConvertToInt32(Vector128<float> value, FloatRoundingMode mode);

    public static Vector256<int> ConvertToVector256Int32(Vector512<double> value, FloatRoundingMode mode);
    public static Vector512<float> ConvertToVector512Single(Vector512<int> value, FloatRoundingMode mode);
    public static Vector256<float> ConvertToVector256Single(Vector512<double> value, FloatRoundingMode mode);

    public static Vector512<int> ConvertToVector512Int32(Vector512<float> value, FloatRoundingMode mode);

    public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, int value, FloatRoundingMode mode);

    public static Vector512<double> Divide(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);
    public static Vector512<float> Divide(Vector512<float> left, Vector512<float> right, FloatRoundingMode mode);

    public static Vector128<double> DivideScalar(Vector128<double> left, Vector128<double> right, FloatRoundingMode mode);
    public static Vector128<float> DivideScalar(Vector128<float> left, Vector128<float> right, FloatRoundingMode mode);


    public static Vector512<double> Multiply(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);
    public static Vector512<float> Multiply(Vector512<float> left, Vector512<float> right, FloatRoundingMode mode);

    public static Vector128<double> MultiplyScalar(Vector128<double> left, Vector128<double> right, FloatRoundingMode mode);
    public static Vector128<float> MultiplyScalar(Vector128<float> left, Vector128<float> right, FloatRoundingMode mode);

    public static Vector512<double> Sqrt(Vector512<double> value, FloatRoundingMode mode);
    public static Vector512<float> Sqrt(Vector512<float> value, FloatRoundingMode mode);

    public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value, FloatRoundingMode mode);
    public static Vector128<float> SqrtScalar(Vector128<float> upper, Vector128<float> value, FloatRoundingMode mode);

    public static Vector512<double> Subtract(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);
    public static Vector512<float> Subtract(Vector512<float> left, Vector512<float> right, FloatRoundingMode mode);

    public static Vector128<double> SubtractScalar(Vector128<double> left, Vector128<double> right, FloatRoundingMode mode);
    public static Vector128<float> SubtractScalar(Vector128<float> left, Vector128<float> right, FloatRoundingMode mode);

    // AVX512

    public static uint ConvertToUInt32(Vector128<double> value, FloatRoundingMode mode);
    public static uint ConvertToUInt32(Vector128<float> value, FloatRoundingMode mode);

    public static Vector256<uint> ConvertToVector256UInt32(Vector512<double> value, FloatRoundingMode mode);
    public static Vector512<uint> ConvertToVector512UInt32(Vector512<float> value, FloatRoundingMode mode);

    public partial class X64
    {
        public static long ConvertToInt64(Vector128<double> value, FloatRoundingMode mode);
        public static long ConvertToInt64(Vector128<float> value, FloatRoundingMode mode);

        public static Vector128<double> ConvertScalarToVector128Double(Vector128<double> upper, long value, FloatRoundingMode mode);
        public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, long value, FloatRoundingMode mode);

        // AVX512

        public static ulong ConvertToUInt64(Vector128<double> value, FloatRoundingMode mode);
        public static ulong ConvertToUInt64(Vector128<float> value, FloatRoundingMode mode);

        public static Vector128<double> ConvertScalarToVector128Double(Vector128<double> upper, ulong value, FloatRoundingMode mode);
        public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, ulong value, FloatRoundingMode mode);
    }
}

The followings are suggested by the reviewer: from AVX512 surfaces tracked by other issues.

public partial class Avx512F
{
    public static Vector512<double> FusedMultiplyAdd (Vector512<double> a, Vector512<double> b, Vector512<double> c, FloatRoundingMode mode);
    public static Vector512<float> FusedMultiplyAdd (Vector512<float> a, Vector512<float> b, Vector512<float> c, FloatRoundingMode mode);
    public static Vector512<double> FusedMultiplyAddNegated (Vector512<double> a, Vector512<double> b, Vector512<double> c, FloatRoundingMode mode);
    public static Vector512<float> FusedMultiplyAddNegated (Vector512<float> a, Vector512<float> b, Vector512<float> c, FloatRoundingMode mode);
    public static Vector512<double> FusedMultiplySubtract (Vector512<double> a, Vector512<double> b, Vector512<double> c, FloatRoundingMode mode);
    public static Vector512<float> FusedMultiplySubtract (Vector512<float> a, Vector512<float> b, Vector512<float> c, FloatRoundingMode mode);
    public static Vector512<double> FusedMultiplySubtractAdd (Vector512<double> a, Vector512<double> b, Vector512<double> c, FloatRoundingMode mode);
    public static Vector512<float> FusedMultiplySubtractAdd (Vector512<float> a, Vector512<float> b, Vector512<float> c, FloatRoundingMode mode);
    public static Vector512<double> FusedMultiplySubtractNegated (Vector512<double> a, Vector512<double> b, Vector512<double> c, FloatRoundingMode mode);
    public static Vector512<float> FusedMultiplySubtractNegated (Vector512<float> a, Vector512<float> b, Vector512<float> c, FloatRoundingMode mode);
    public static Vector128<double> FusedMultiplyAddScalar (Vector128<double> a, Vector128<double> b, Vector128<double> c, FloatRoundingMode mode);
    public static Vector128<float> FusedMultiplyAddScalar (Vector128<float> a, Vector128<float> b, Vector128<float> c, FloatRoundingMode mode);
    public static Vector128<double> FusedMultiplyAddNegatedScalar (Vector128<double> a, Vector128<double> b, Vector128<double> c, FloatRoundingMode mode);
    public static Vector128<float> FusedMultiplyAddNegatedScalar (Vector128<float> a, Vector128<float> b, Vector128<float> c, FloatRoundingMode mode);
    public static Vector128<double> FusedMultiplySubtractScalar (Vector128<double> a, Vector128<double> b, Vector128<double> c, FloatRoundingMode mode);
    public static Vector128<float> FusedMultiplySubtractScalar (Vector128<float> a, Vector128<float> b, Vector128<float> c, FloatRoundingMode mode);
    public static Vector128<double> FusedMultiplySubtractNegatedScalar (Vector128<double> a, Vector128<double> b, Vector128<double> c, FloatRoundingMode mode);
    public static Vector128<float> FusedMultiplySubtractNegatedScalar (Vector128<float> a, Vector128<float> b, Vector128<float> c, FloatRoundingMode mode);
    public static Vector128<double> ScaleScalar (Vector128<double> left, Vector128<double> right, FloatRoundingMode mode);
    public static Vector128<float> ScaleScalar (Vector128<float> left, Vector128<float> right, FloatRoundingMode mode);
}
public partial class Avx512DQ
{
    public static Vector256<float> ConvertToVector256Single (Vector512<long> value, FloatRoundingMode mode);
    public static Vector256<float> ConvertToVector256Single (Vector512<ulong> value, FloatRoundingMode mode);
    public static Vector512<double> ConvertToVector512Double (Vector512<long> value, FloatRoundingMode mode);
    public static Vector512<double> ConvertToVector512Double (Vector512<ulong> value, FloatRoundingMode mode);
    public static Vector512<long> ConvertToVector512Int64 (Vector512<double> value, FloatRoundingMode mode);
    public static Vector512<long> ConvertToVector512Int64 (Vector256<float> value, FloatRoundingMode mode);
    public static Vector512<ulong> ConvertToVector512UInt64 (Vector512<double> value, FloatRoundingMode mode);
    public static Vector512<ulong> ConvertToVector512UInt64 (Vector256<float> value, FloatRoundingMode mode);
}

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jan 23, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Jan 23, 2024
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Jan 23, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Follow-up of #94684, still WIP, no need to review at the moment.

Author: Ruihan-Yin
Assignees: -
Labels:

area-CodeGen-coreclr, new-api-needs-documentation, community-contribution

Milestone: -

@Ruihan-Yin Ruihan-Yin force-pushed the EmbRoundingAPIs branch 2 times, most recently from dfee205 to 5c44b41 Compare January 23, 2024 21:41
@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
libraries.crossgen2.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
libraries.crossgen2.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%

Details here


@Ruihan-Yin
Copy link
Contributor Author

Failures look unrelated, PR is ready for review.

@Ruihan-Yin Ruihan-Yin marked this pull request as ready for review January 25, 2024 01:09
@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch -0.01%

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%

Details here


@Ruihan-Yin
Copy link
Contributor Author

Applied the format patch. All failures are known.

@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch -0.01%

Throughput diffs for linux/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%

Details here


@tannergooding
Copy link
Member

Updated the branch to pick up some CI fixes that have gone in.

@Ruihan-Yin
Copy link
Contributor Author

Looks like there are a few new updates, I will rebase my local branch and resolve the build error.

I think I forgot to mention there were a few APIs that the underlying intrinisics only support Suppress All Exceptions (sae)

For example, I provided one in the commit:

/// __m128d _mm_cvt_roundss_sd (__m128d a, __m128 b, int sae)
///   VCVTSS2SD xmm1, xmm2, xmm3 {sae}
public static Vector128<double> ConvertScalarToVector128Double(Vector128<double> upper, Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertScalarToVector128Double(upper, value, mode);

There are some more I haven't exposed, because the sae parameter acts more like a on-off switch. I am not very sure how we gonna deal with them.

1. remove some redundent checks on embedded rounding intrinsics
pass the correct operand GenTree node, when emitting the fallback for embedded rounding intrinsics.
2. Added FMA intrinsics with embedded rounding and unit tests.
@ryujit-bot
Copy link

Diff results for #97415

Throughput diffs

Throughput diffs for osx/arm64 ran on linux/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.osx.arm64.checked.mch -0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts).

MISSED contexts: 1 (0.00%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 406,618,438 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 127,431,919 +8

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,804,171 contexts (1,155,877 MinOpts, 1,648,294 FullOpts).

MISSED contexts: 3,198 (0.11%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 439,649,126 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 135,055,052 +8

Details here


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,599,926 contexts (1,005,474 MinOpts, 1,594,452 FullOpts).

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 348,964,431 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 113,859,455 +8

Details here


Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch -0.01%

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for osx/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.osx.arm64.checked.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts).

MISSED contexts: 1 (0.00%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 406,618,438 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 127,431,919 +8

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,804,171 contexts (1,155,877 MinOpts, 1,648,294 FullOpts).

MISSED contexts: 3,198 (0.11%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 439,649,126 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 135,055,052 +8

Details here


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,599,926 contexts (1,005,474 MinOpts, 1,594,452 FullOpts).

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 348,964,431 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 113,859,455 +8

Details here


Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch -0.01%

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for osx/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.osx.arm64.checked.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%

Details here


@Ruihan-Yin
Copy link
Contributor Author

Failures look unrelated, will apply the format patch, ready to continue the review.

@ryujit-bot
Copy link

Diff results for #97415

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts).

MISSED contexts: 1 (0.00%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 406,611,323 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 127,424,804 +8

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts).

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 439,646,022 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 135,051,948 +8

Details here


Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%
realworld.run.windows.arm64.checked.mch -0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%

Details here


@ryujit-bot
Copy link

Diff results for #97415

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts).

MISSED contexts: 1 (0.00%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 406,611,323 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 127,424,804 +8

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts).

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 439,646,022 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 135,051,948 +8

Details here


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,599,259 contexts (1,005,474 MinOpts, 1,593,785 FullOpts).

MISSED contexts: 667 (0.03%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 348,866,853 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 113,761,877 +8

Details here


Comment on lines +1424 to +1425
// For FMA intrinsics, since it is not possible to get any contained operand in this case: embedded rounding
// is limited in register-to-register form, and the control byte is dynamic, we don't need to do any swap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure I follow this.

I'd have expected we still want to do instruction selection based on the target register to avoid having to insert the additional movaps. This ensures a last use parameter can still be choosen, regardless of whether its op1, op2, or op3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this being handled in a follow up PR, so we can get this merged.

But I do think we need to ensure its handled so we get optimal codegen.

Copy link
Member

@tannergooding tannergooding Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may actually be sufficient to just call genFMAIntrinsic here instead of genHWIntrinsic_R_R_R_RM (or rather to split out the core of it so we don't call genProduceReg too many times).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might misunderstand the original emit path for FMA, let me check if this would be a quick fix, thanks for pointing out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same general thing might apply to the other paths, where it'd be "more ideal" if we could call genAvxFamily for things like NI_AVX512F_ConvertToVector256Int32, to ensure we aren't missing handling.

The only reason we "can't" today is because they force the genProduceReg call, but that could be extracted or conditioned in a way to make that safe (maybe just if (instOptions == NONE) { genProduceReg(..); } or similar).

Still fine with it being handled in a separate PR, as what's here isn't incorrect, just potentially less optimal long term

@@ -2506,6 +2509,11 @@ int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree, int* pDstCou
srcCount += BuildDelayFreeUses(emitOp2, emitOp1);
srcCount += emitOp3->isContained() ? BuildOperandUses(emitOp3) : BuildDelayFreeUses(emitOp3, emitOp1);

if (intrinsicTree->OperIsEmbRoundingEnabled() && !intrinsicTree->Op(4)->IsCnsIntOrI())
{
srcCount += BuildDelayFreeUses(intrinsicTree->Op(4), emitOp1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need BuildDelayFreeUses here. op(4) is an integer and so will go in a GPR. op1 is a vector and so will go in a SIMD register

Copy link
Contributor Author

@Ruihan-Yin Ruihan-Yin Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have limited understand within LSRA domain, it would be much appreciated if you could point me to the reasonable function to use....

Like BuildOperandUse might be the correct one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case you should just need BuildOperandUses(intrinsicTree->Op(4))

The general premise is:

  • BuildOperandUses -- safe to use when there aren't any restrictions
  • BuildDelayFreeUses -- safe to use when you need to restrict the register set, typically used for RMW nodes

In this case, FMA is an RMW node. However, the register sets used by op4 (simd) and op1 (general-purpose) don't overlap, so we don't need to concern ourselves with op4 being delay free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs secondary sign-off from someone from @dotnet/jit-contrib

@BruceForstall
Copy link
Member

@Ruihan-Yin The top comment here says "Follow-up of #94684, still WIP, no need to review at the moment.". Can you please update this to describe the contents of the PR?

@Ruihan-Yin
Copy link
Contributor Author

@Ruihan-Yin The top comment here says "Follow-up of #94684, still WIP, no need to review at the moment.". Can you please update this to describe the contents of the PR?

Done, thanks for the notice.

@ryujit-bot
Copy link

Diff results for #97415

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts).

MISSED contexts: 1 (0.00%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 406,611,323 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.linux.x64.checked.mch 127,424,804 +8

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts).

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 439,646,022 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x64.checked.mch 135,051,948 +8

Details here


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,599,259 contexts (1,005,474 MinOpts, 1,593,785 FullOpts).

MISSED contexts: 667 (0.03%)

Overall (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 348,866,853 +8
FullOpts (+8 bytes)
Collection Base size (bytes) Diff size (bytes)
coreclr_tests.run.windows.x86.checked.mch 113,761,877 +8

Details here


Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
realworld.run.linux.arm64.checked.mch -0.01%

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.01%
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
realworld.run.windows.arm64.checked.mch -0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
MinOpts (-0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.01%
FullOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.windows.x64.checked.mch +0.01%

Details here


@BruceForstall BruceForstall merged commit aeecdb8 into dotnet:main Feb 15, 2024
206 of 208 checks passed
@Ruihan-Yin
Copy link
Contributor Author

Thanks for all the reviews and help!

@github-actions github-actions bot locked and limited conversation to collaborators Mar 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants