Conversation
@noahfalk Could you please open issues for all the xUnit Performance API related comments?
tests/scripts/run-xunit-perf.py (outdated)

```diff
@@ -71,7 +71,7 @@ def validate_arg(arg, check):
         raise ValueError('Argument: %s is not valid.' % (arg))

     coreclrPerf = os.path.join(os.getcwd(), args.coreclrPerf)
-    validate_arg(coreclrPerf, lambda item: os.path.isdir(item))
+    validate_arg(coreclrPerf, lambda item: os.path.isdir(item) or os.path.isfile(item))
```
This is rather error prone. It would be nicer if you add a parameter for the assembly file name case and handle it separately. #Closed
No problem #Closed
```diff
-<?xml version="1.0" encoding="utf-8"?>
-<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <Import Project="$([MSBuild]::GetDirectoryNameOfFileAbove($(MSBuildThisFileDirectory), dir.props))\dir.props" />
+<Project Sdk="Microsoft.NET.Sdk">
```
.NET Core csproj! I like it! :)
Ideally, all projects should move to it! #Closed
Cool, I was worried this might be a controversial move ; ) #Closed
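For context on why the reviewers are pleased: an SDK-style project collapses the old csproj boilerplate to a handful of lines. A minimal sketch for illustration only (the target framework and properties here are assumptions, not taken from the PR):

```xml
<!-- Minimal SDK-style project sketch; property values are illustrative. -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
</Project>
```

Restore, build, run, and IDE integration all work from this one file via the standard `dotnet` CLI.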
```diff
   </ItemGroup>
   <ItemGroup>
     <Service Include="{82A7F48D-3B50-4B1E-B82E-3ADA8210C358}" />
+    <PackageReference Include="CommandLineParser" Version="2.2.1" />
```
I would add this package/version to the dependencies.props file in the repo root directory. #Closed
Np #Closed
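Centralizing the version as suggested typically means lifting it into an MSBuild property in the root props file. A hypothetical sketch (the property name and file layout are assumptions about this repo's conventions):

```xml
<!-- dependencies.props (hypothetical property name): centralize the version -->
<PropertyGroup>
  <CommandLineParserVersion>2.2.1</CommandLineParserVersion>
</PropertyGroup>
```

The project file would then reference `Version="$(CommandLineParserVersion)"` so all projects stay on one version.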
I like the direction you're going here. It has been painful to do "innerloop" type stuff with these tests, as the scripts often presume they are starting from scratch and won't need to leave anything useful behind other than data. So yeah, we need to make this easier. I think we should be using ETW, at least as one of the ways we run the CI perf jobs, in part as a way of validating that we're running the tests the way we expect them to be run. But having ETW be opt-in rather than opt-out seems perfectly reasonable. I'll hold off on detailed comments for now.
@noahfalk Are you planning to make the JitBench solution/scenario more than just JitBench?
Thanks for taking a look at this! @AndyAyersMS - On ETW, agreed. I certainly still want to keep ETW as an option and retain its usage in CI, my quip was solely about choosing the default.
@jorive - Yes, though if you foresee issues with that approach please let me know. The way I've been thinking of it is that JitBench refers to a set of scenarios we can measure, of which MusicStore is one. I am ignoring the fact that the ASP.NET repo is literally called JitBench - there is some potential for confusion there, but I'm hoping it's not a big deal. From a purely naming perspective, the alternative is to claim that JitBench = MusicStore and nothing else, but then I'd still need a name for the whole set of scenarios, and JitBench felt like a pretty good one.
Force-pushed from d847333 to 331eb83
@jorive @AndyAyersMS @kouvel @brianrob - OK, I think things are looking better now for anyone interested in taking a closer look. Unfortunately I was unable to find a way to get the SDK-style projects through the CI build successfully, so I had to retreat a bit. The change now includes both the original csproj, which is used for all repo and CI builds, and an unofficial_dotnet sub-folder .csproj which will only be used when a dev manually invokes it. The README has also been updated with some info about the test that may be useful.
@noahfalk Should we set this environment variable as default too:
I like the new harness :)
Not sure I'll have a chance to go through all this in detail in a timely manner, but it looks reasonable. At some point I'd really like to see us get away from using mean and standard deviation and consistently use median and interquartile (or 5-95 quantile) ranges to summarize results -- these metrics are more relevant for non-normal distributions. For instance, response time distribution is quite likely to be very left-leaning and have a long right tail of stragglers. So average and standard deviation are misleading and noise-prone, as they pay too much attention to bad outliers. But we can sort that out later.
@AndyAyersMS Should I update the tools to those values?
We're already recording median in the official runs, so I'd hold off. My comments were for the extra reporting code that Noah is adding for end user consumption.
@jorive - I'll get that env var added, thanks for spotting it. @AndyAyersMS - Are you specifically referring to the query responses within MusicStore, or more generally that you hope all metrics from all benchmarks will shift to median + interquartile ranges? Just FYI, the +- stat being reported in the console output isn't a standard deviation, it's a 95% confidence interval for the statistic being measured. The goal in that case is not to describe the shape of the distribution but rather to give error bars. If other metrics such as interquartile ranges would be useful to you, I don't see any reason they couldn't be added.
a) JitBench includes several more benchmarks and a refactored runner to make it easier to add more
b) A new JitBench/unofficial_dotnet SDK-style build project offers a simpler developer workflow at the command line and in VS when working with these benchmarks
c) Misc build issues with local dotnet build in the test tree have workarounds now, but official CI builds still don't support it
d) run-xunit-perf.py has support to run a specific assembly from the testBinLoc folder, intended to handle selecting a binary from the folder of binaries produced in the .NET Core SDK project build process. However, until the CI build can support building projects that way, the xunit-perf support is unused.
Generally I have two pet peeves when it comes to summarizing data: assuming normality and using L2 (squared) distance metrics.

If you think L2 error bands are interesting, you can use bootstrap resampling or some other non-parametric approach to compute them and not make any assumption at all about the distribution of the data. For response time this would give you asymmetric bands -- much larger on the right side than the left. Or you could try to parametrically fit some long-tailed, left-leaning distribution. But that seems unnecessary, as we don't have any particular theory as to what the exact shape should be.

Fundamentally, I think L2 metrics are the wrong way to measure this stuff. IQR or quantiles use an L1 (absolute value) metric and so don't get as distracted by infrequent and unpredictable slow response times that are also likely to be uninteresting (e.g. GC kicked in, or the machine decided it was time to do random thing X). Since we are going to use the results of these measurements to draw inferences, it seems important that they reflect what we care about. There's no formula for L1 metrics like there is for L2, so you have to compute it by brute force, but it's not hard. Here you might still want to use bootstrap resampling to get some idea how stable the quantiles are, but generally quantiles are pretty stable things, provided you have a reasonable number of samples and don't have a pathologically bad distribution.

All that being said, I'd much rather see us get a broader set of tests in place first, and only then refine and improve how we measure. So all this can wait.
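The bootstrap resampling Andy mentions can be sketched briefly. This is an illustrative implementation, not code from the PR; the function name and the sample data are invented for the example:

```python
import random

def bootstrap_quantile_band(samples, q=0.5, resamples=2000, alpha=0.05, seed=42):
    """Bootstrap (non-parametric) confidence band for a quantile.

    No normality assumption: resample with replacement and read the empirical
    spread of the quantile estimate itself.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(resamples):
        resample = sorted(rng.choice(samples) for _ in range(n))
        estimates.append(resample[int(q * (n - 1))])  # simple quantile estimator
    estimates.sort()
    lo = estimates[int((alpha / 2) * (resamples - 1))]
    hi = estimates[int((1 - alpha / 2) * (resamples - 1))]
    return lo, hi

# A long-right-tail "response time" sample: one straggler at 60.
times = [10, 11, 11, 12, 12, 13, 13, 14, 15, 60]
lo, hi = bootstrap_quantile_band(times)  # band around the sample median
```

Because the band comes from the data rather than a normal fit, it can be asymmetric around the point estimate, which is exactly what a long right tail produces.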
Adding UseSharedCompilation environment variable to steps that build
Switch from mean and 95% confidence interval to median and quartile range
Force-pushed from 331eb83 to 7591f5e
It is pretty simple to switch out mean and confidence interval for median and interquartile range so I went ahead and did that. Now the output of each test looks like this:
As anticipated, you can see some right skew in some of the distributions. Other distributions, like the MusicStore Median Response, are probably going to look normal because each sample is already itself a median taken from a large sample of observations. Your post encouraged me to have a little fun and go back to refresh my memory on several statistics topics I hadn't looked at in many years. Indeed, the normality assumptions I thought I could make via the central limit theorem weren't there, because the default sample sizes (10) weren't big enough. Hopefully this is an acceptable and more unassuming presentation of the data.
Thanks for changing it.
Just curious: although the true value of a sample may not be normally distributed, assuming ideally that all samples give a true value (minus all error), it seems like mean or median would not make a difference. However, for the error that exists, isn't it likely that the error is normally distributed? Since the magnitude of error would (usually) dominate natural changes in the true value, wouldn't the mean/stderr be a representative summary of the true value (assuming the true value doesn't change much)? I have previously dealt with outliers by discarding samples outside a set confidence interval, and that seemed to work quite well for a variety of similar applications (better than median/ranges). Perhaps there is something better that I'm not understanding; any pointers?
If you know the thing you are measuring is normally distributed in your population, then mean and standard deviation are fine. But much of the time the things we measure in the performance space are not distributed this way. Run almost any benchmark and plot the resulting execution times and you'll see that there is a left-leaning distribution with a long tail to the right. And for those kinds of distributions the mean and standard deviation are not great summary metrics.

I think you would find using median and IQR or quartiles does more or less what you have been doing by manual inspection. They are naturally resistant to outliers in a way that mean/standard deviation are not, and are well suited for skewed distributions. I can't find a great writeup on this but will keep looking and update if I find one.
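The outlier-resistance point is easy to see numerically. A small sketch (the sample values are invented to mimic the skewed shape described above):

```python
import statistics

# A skewed "benchmark time" sample: a tight cluster plus one slow outlier,
# roughly the long-right-tail shape discussed above.
times = [100, 101, 102, 103, 104, 105, 106, 107, 108, 500]

mean = statistics.mean(times)        # 143.6: dragged far from the cluster
stdev = statistics.stdev(times)      # ~125: dominated by the one outlier
median = statistics.median(times)    # 104.5: stays inside the cluster
q1, _, q3 = statistics.quantiles(times, n=4)  # IQR = q3 - q1 ignores the tail
```

One outlier moves the mean by ~40 units and inflates the standard deviation past 100, while the median and IQR barely notice it.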
Pure speculation on my part, but I would guess many of the behaviors which cause noise in the performance results add an error with a binomial distribution and a low value of p. For example, maybe there is a 10% chance every unit of time that a cache needs to be flushed, or the virus checker engages, or network activity needs to be handled.
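That contamination model can be simulated directly. A sketch with invented parameters (p, spike size, and base time are all assumptions for illustration) showing why the median shrugs off rare binomial noise while the mean absorbs it:

```python
import random
import statistics

def simulate(p=0.1, n=1000, base=100.0, spike=400.0, seed=1):
    """Each measurement has probability p of a large extra delay, standing in
    for rare events like cache flushes, virus scans, or network interrupts."""
    rng = random.Random(seed)
    return [base + rng.gauss(0, 1) + (spike if rng.random() < p else 0.0)
            for _ in range(n)]

times = simulate()
# The mean absorbs roughly p * spike (~40 units here) of bias from the rare
# events; the median stays near the uncontaminated base time.
```

With only 10% contamination the mean is biased by about 40%, while the median moves by a fraction of a unit.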
@kouvel A different way to state the question is: how can you be qualitatively certain your changes did not make the code slower?

The simplest way is to take all runs from one program t and all runs from one program t+1, and ensure that every execution in t+1 is strictly faster than any execution in t. However, this is a very high bar to set, and it definitely does not account for errors in measurement. In this scenario, you're trying to demonstrate superiority. In most cases, though, you're trying to demonstrate non-inferiority: you're generally in the same polynomial time for an operation, but you've figured out some clever improvement nobody has thought of before. For example, if you were to code a QuickSort algorithm and use the legendary "dual pivot" technique, you might see a small trade-off at various points in the performance curve, due to the algorithm converging on a fixed point and the overhead associated with an extra pivot operation.

An alternative to strictly beating an algorithm is to ask, for each run, how many runs of the other algorithm did it beat? This classical tortoise-hare race is what motivates the Mann-Whitney U test, and the U-value is exactly how many times t+1 beats t.

Discarding outliers can get you in trouble! Consider a bug I solved within the last year. An engineer had implemented a round-robin system for talking to servers: he would ping each server to see if the service was up or down, and if it was up, he would add it to a list of "up" servers. The server would get taken out of the "up" hotlist only when the service returned an unexpected fault. Well, in his testing, he always tested scenarios where the server was always up! As you can imagine, in production this was not always the case!

@AndyAyersMS Thanks for the nice discussion of quartiles.
I found this piece of text which indicates that about 3 years ago (dotnet/coreclr#16353) the infrastructure did not have .NET Core 2.0 on the build agents, so the tests in this folder could not be compiled with the dotnet SDK. This is no longer true. Moreover, this project has been disabled (https://github.com/dotnet/runtime/blob/main/src/tests/Common/dirs.proj#L17) for the same amount of time (3 years). With this PR I want to raise the question: does somebody need this folder, so that it can be re-enabled, or can it be removed?
So I've been working on this a little while and the scope snowballed from my initial plan, but I wanted to let you know where things are now and get any high-level feedback you guys would like to provide. You are welcome to provide more detailed feedback right now if you want, but I imagine it will need at least one more pass before it's ready to go, and I'll ping again then.
I had two main goals:
At the moment my ad-hoc usage looks about like this:
a) Clone the coreclr repo
b) Build coreclr
c) cd tests/src/performance/scenario/JitBench
d) dotnet run -- --configs:Default,Tiering,Minopts --coreclr-bin-dir F:\github\coreclr\bin\Product\Windows_NT.x64.Release\
There are no required steps to build the test tree, restore anything, generate a layout, invoke xunit perf python scripts, set env vars or collate logs, though (bugs aside) it would still work roughly as before if you did all those things.
Here are the major changes + a few outstanding questions:
It felt like I was cutting against the grain on this, but it also feels pretty useful to be able to use VS F5 and the standard dotnet SDK tools on our own projects. Let me know if you don’t like it though.
a. Xunit performance expects to be given the raw command-line args so that it can parse them, but I find the choice to use ETW by default when no option has been specified aggravating here. The ETW events slow these tests down, and I doubt the majority of our test scenarios make use of the data. Additionally, it means the default case of launching from VS or running at the command line will just give errors saying to relaunch with admin permissions. However, making this not the default behavior meant I had to parse the command-line arguments myself in advance, modify the default case, and then reformat a new fake command line. (It's actually not a lot of code, but it's a weird interaction across components.)
b. The tests do a bunch of setup at the beginning, and if you request ETW but don't give admin permissions, I wanted to fail fast rather than fail when the runner eventually launches the first benchmark. Again this meant I had to pre-parse the command line and reason independently about whether XUnitPerformanceHarness will want to enable ETW.
c. The benchmarks need a working directory for various files they write and XunitPerformanceHarness expects specific command line formatting for how the output directory is passed. This was a third occasion in which I needed to understand and pre-parse the command line the harness wants in order to align the output directories.
d. The tests don’t use XUnitPerformance’s ScenarioBenchmark object model as their primary one. Instead I store the results in a similar OM that fit into the code a bit more smoothly and then spend ~100 lines of code transforming to the ScenarioBenchmark OM if/when I need to serialize the output. I am also not serializing the output at the same time I run the individual benchmarks. I create one XUnitPerformanceHarness instance each time I need to run a test, then optionally create a final one at the end just to serialize output.
e. XunitPerformanceHarness has hard-coded Console.WriteLine usage inside it, and I wanted all that output spew in a file to keep the console output clearer. I wound up hijacking Console.Out whenever using XunitPerformanceHarness in order to capture and redirect the logging.