Conversation
@noahfalk Could you please open issues for all the xUnit Performance API related comments?
tests/scripts/run-xunit-perf.py (outdated)

```diff
@@ -71,7 +71,7 @@ def validate_arg(arg, check):
         raise ValueError('Argument: %s is not valid.' % (arg))

     coreclrPerf = os.path.join(os.getcwd(), args.coreclrPerf)
-    validate_arg(coreclrPerf, lambda item: os.path.isdir(item))
+    validate_arg(coreclrPerf, lambda item: os.path.isdir(item) or os.path.isfile(item))
```
This is rather error prone. It would be nicer if you add a parameter for the assembly file name case and handle it separately. #Closed
No problem #Closed
```diff
-<?xml version="1.0" encoding="utf-8"?>
-<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <Import Project="$([MSBuild]::GetDirectoryNameOfFileAbove($(MSBuildThisFileDirectory), dir.props))\dir.props" />
+<Project Sdk="Microsoft.NET.Sdk">
```
.NET Core csproj! I like it! :)
Ideally, all projects should move to it! #Closed
Cool, I was worried this might be a controversial move ; ) #Closed
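For context on why the reviewers are pleased: an SDK-style project collapses the old csproj boilerplate to a handful of lines. A minimal sketch for illustration only (the target framework and properties here are assumptions, not taken from the PR):

```xml
<!-- Minimal SDK-style project sketch; property values are illustrative. -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
</Project>
```

Restore, build, run, and IDE integration all work from this one file via the standard `dotnet` CLI.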
```diff
   </ItemGroup>
   <ItemGroup>
     <Service Include="{82A7F48D-3B50-4B1E-B82E-3ADA8210C358}" />
+    <PackageReference Include="CommandLineParser" Version="2.2.1" />
```
I would add this package/version to the dependencies.props file in the repo root directory. #Closed
Np #Closed
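Centralizing the version as suggested typically means lifting it into an MSBuild property in the root props file. A hypothetical sketch (the property name and file layout are assumptions about this repo's conventions):

```xml
<!-- dependencies.props (hypothetical property name): centralize the version -->
<PropertyGroup>
  <CommandLineParserVersion>2.2.1</CommandLineParserVersion>
</PropertyGroup>
```

The project file would then reference `Version="$(CommandLineParserVersion)"` so all projects stay on one version.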
I like the direction you're going here. It has been painful to do "innerloop" type stuff with these tests, as the scripts often presume they are starting from scratch and won't need to leave anything useful behind other than data. So yeah, we need to make this easier. I think we should be using ETW, at least as one of the ways we run the CI perf jobs, in part as a way of validating that we're running the tests the way we expect them to be run. But having ETW be opt-in rather than opt-out seems perfectly reasonable. I'll hold off on detailed comments for now.
@noahfalk Are you planning to make the JitBench solution/scenario more than just JitBench?
Thanks for taking a look at this! @AndyAyersMS - On ETW, agreed. I certainly still want to keep ETW as an option and retain its usage in CI, my quip was solely about choosing the default.
@jorive - Yes, though if you foresee issues with that approach please let me know. The way I've been thinking of it is that JitBench refers to a set of scenarios we can measure, of which MusicStore is one. I am ignoring the fact that the ASP.NET repo is literally called JitBench - there is some potential for confusion there, but I'm hoping it's not a big deal. From a purely naming perspective, the alternative is to claim that JitBench = MusicStore and nothing else, but then I'd still need a name for the whole set of scenarios, and JitBench felt like a pretty good one.
Force-pushed from d847333 to 331eb83
@jorive @AndyAyersMS @kouvel @brianrob - OK, I think things are looking better now for anyone interested in taking a closer look. Unfortunately I was unable to find a way to get the SDK-style projects through the CI build successfully, so I had to retreat a bit. The change now includes both the original csproj, which is used for all repo and CI builds, and an unofficial_dotnet sub-folder .csproj which will only be used when a dev manually invokes it. The README has also been updated with some info about the test that may be useful.
@noahfalk Should we set this environment variable as default too:
I like the new harness :)
Not sure I'll have a chance to go through all this in detail in a timely manner, but it looks reasonable. At some point I'd really like to see us get away from using mean and standard deviation and consistently use median and interquartile (or 5-95 quantile) ranges to summarize results -- these metrics are more relevant for non-normal distributions. For instance, response time distribution is quite likely to be very left-leaning and have a long right tail of stragglers. So average and standard deviation are misleading and noise-prone, as they pay too much attention to bad outliers. But we can sort that out later.
@AndyAyersMS Should I update the tools to those values?
We're already recording median in the official runs, so I'd hold off. My comments were for the extra reporting code that Noah is adding for end user consumption.
@jorive - I'll get that env var added, thanks for spotting it. @AndyAyersMS - Are you specifically referring to the query responses within MusicStore, or more generally that you hope all metrics from all benchmarks will shift to median + interquartile ranges? Just FYI, the +- stat being reported in the console output isn't a standard deviation, it's a 95% confidence interval for the statistic being measured. The goal in that case is not to describe the shape of the distribution but rather to give error bars. If other metrics such as interquartile ranges would be useful to you, I don't see any reason they couldn't be added.
a) JitBench includes several more benchmarks and a refactored runner to make it easier to add more
b) A new JitBench/unofficial_dotnet SDK-style build project offers a simpler developer workflow at the command line and in VS when working with these benchmarks
c) Misc build issues with local dotnet build in the test tree have workarounds now, but official CI builds still don't support it
d) run-xunit-perf.py has support to run a specific assembly from the testBinLoc folder, intended to handle selecting a binary from the folder of binaries produced in the .NET Core SDK project build process. However, until the CI build can support building projects that way, the xunit-perf support is unused.
Generally I have two pet peeves when it comes to summarizing data: assuming normality and using L2 (squared) distance metrics.

If you think L2 error bands are interesting, you can use bootstrap resampling or some other non-parametric approach to compute them and not make any assumption at all about the distribution of the data. For response time this would give you asymmetric bands -- much larger on the right side than the left. Or you could try to parametrically fit some long-tailed, left-leaning distribution. But that seems unnecessary, as we don't have any particular theory as to what the exact shape should be.

Fundamentally, I think L2 metrics are the wrong way to measure this stuff. IQR or quantiles use an L1 (absolute value) metric and so don't get as distracted by infrequent and unpredictable slow response times that are also likely to be uninteresting (e.g. GC kicked in, or the machine decided it was time to do random thing X). Since we are going to use the results of these measurements to draw inferences, it seems important that they reflect what we care about. There's no formula for L1 metrics like there is for L2, so you have to compute it by brute force, but it's not hard. Here you might still want to use bootstrap resampling to get some idea how stable the quantiles are, but generally quantiles are pretty stable things, provided you have a reasonable number of samples and don't have a pathologically bad distribution.

All that being said, I'd much rather see us get a broader set of tests in place first, and only then refine and improve how we measure. So all this can wait.
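The bootstrap resampling Andy mentions can be sketched briefly. This is an illustrative implementation, not code from the PR; the function name and the sample data are invented for the example:

```python
import random

def bootstrap_quantile_band(samples, q=0.5, resamples=2000, alpha=0.05, seed=42):
    """Bootstrap (non-parametric) confidence band for a quantile.

    No normality assumption: resample with replacement and read the empirical
    spread of the quantile estimate itself.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(resamples):
        resample = sorted(rng.choice(samples) for _ in range(n))
        estimates.append(resample[int(q * (n - 1))])  # simple quantile estimator
    estimates.sort()
    lo = estimates[int((alpha / 2) * (resamples - 1))]
    hi = estimates[int((1 - alpha / 2) * (resamples - 1))]
    return lo, hi

# A long-right-tail "response time" sample: one straggler at 60.
times = [10, 11, 11, 12, 12, 13, 13, 14, 15, 60]
lo, hi = bootstrap_quantile_band(times)  # band around the sample median
```

Because the band comes from the data rather than a normal fit, it can be asymmetric around the point estimate, which is exactly what a long right tail produces.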
Adding UseSharedCompilation environment variable to steps that build
Switch from mean and 95% confidence interval to median and quartile range
Force-pushed from 331eb83 to 7591f5e
It is pretty simple to switch out mean and confidence interval for median and interquartile range so I went ahead and did that. Now the output of each test looks like this:
As anticipated, you can see some right skew in some of the distributions. Other distributions, like the MusicStore Median Response, are probably going to look normal because each sample is already itself a median taken from a large sample of observations. Your post encouraged me to have a little fun and go back to refresh my memory on several statistics topics I hadn't looked at in many years. Indeed, the normality assumptions I thought I could make via the central limit theorem weren't there, because the default sample sizes (10) weren't big enough. Hopefully this is an acceptable and more unassuming presentation of the data.
Thanks for changing it.
Just curious: although the true value of a sample may not be normally distributed, assuming ideally that all samples give a true value (minus all error), it seems like mean or median would not make a difference. However, for the error that exists, isn't it likely that the error is normally distributed? Since the magnitude of error would (usually) dominate natural changes in the true value, wouldn't the mean/stderr be a representative summary of the true value (assuming the true value doesn't change much)? I have previously dealt with outliers by discarding samples outside a set confidence interval, and that seemed to work quite well for a variety of similar applications (better than median/ranges). Perhaps there is something better that I'm not understanding; any pointers?
If you know the thing you are measuring is normally distributed in your population, then mean and standard deviation are fine. But much of the time the things we measure in the performance space are not distributed this way. Run almost any benchmark and plot the resulting execution times and you'll see that there is a left-leaning distribution with a long tail to the right. And for those kinds of distributions the mean and standard deviation are not great summary metrics.

I think you would find using median and IQR or quartiles does more or less what you have been doing by manual inspection. They are naturally resistant to outliers in a way that mean/standard deviation are not, and are well suited for skewed distributions. I can't find a great writeup on this but will keep looking and update if I find one.
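The outlier-resistance point is easy to see numerically. A small sketch (the sample values are invented to mimic the skewed shape described above):

```python
import statistics

# A skewed "benchmark time" sample: a tight cluster plus one slow outlier,
# roughly the long-right-tail shape discussed above.
times = [100, 101, 102, 103, 104, 105, 106, 107, 108, 500]

mean = statistics.mean(times)        # 143.6: dragged far from the cluster
stdev = statistics.stdev(times)      # ~125: dominated by the one outlier
median = statistics.median(times)    # 104.5: stays inside the cluster
q1, _, q3 = statistics.quantiles(times, n=4)  # IQR = q3 - q1 ignores the tail
```

One outlier moves the mean by ~40 units and inflates the standard deviation past 100, while the median and IQR barely notice it.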
Pure speculation on my part, but I would guess many of the behaviors which cause noise in the performance results add an error with a binomial distribution and a low value of p. For example, maybe there is a 10% chance every unit of time that a cache needs to be flushed, or the virus checker engages, or network activity needs to be handled.
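That contamination model can be simulated directly. A sketch with invented parameters (p, spike size, and base time are all assumptions for illustration) showing why the median shrugs off rare binomial noise while the mean absorbs it:

```python
import random
import statistics

def simulate(p=0.1, n=1000, base=100.0, spike=400.0, seed=1):
    """Each measurement has probability p of a large extra delay, standing in
    for rare events like cache flushes, virus scans, or network interrupts."""
    rng = random.Random(seed)
    return [base + rng.gauss(0, 1) + (spike if rng.random() < p else 0.0)
            for _ in range(n)]

times = simulate()
# The mean absorbs roughly p * spike (~40 units here) of bias from the rare
# events; the median stays near the uncontaminated base time.
```

With only 10% contamination the mean is biased by about 40%, while the median moves by a fraction of a unit.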
@kouvel A different way to state the question is: how can you be qualitatively certain your changes did not make the code slower?

The simplest way is to take all runs from one program t and all runs from one program t+1, and ensure that every execution in t+1 is strictly faster than any execution in t. However, this is a very high bar to set, and it definitely does not account for errors in measurement. In this scenario, you're trying to demonstrate superiority. In most cases, though, you're trying to demonstrate non-inferiority: you're generally in the same polynomial time for an operation, but you've figured out some clever improvement nobody has thought of before. For example, if you were to code a QuickSort algorithm and use the legendary "dual pivot" technique, you might see a small trade-off at various points in the performance curve, due to the algorithm converging on a fixed point and the overhead associated with an extra pivot operation.

An alternative to strictly beating an algorithm is to ask, for each run, how many runs of the other algorithm did it beat? This classical tortoise-hare race is what motivates the Mann-Whitney U test, and the U-value is exactly how many times t+1 beats t.

Discarding outliers can get you in trouble! Consider a bug I solved within the last year. An engineer had implemented a round-robin system for talking to servers: he would ping each server to see if the service was up or down, and if it was up, he would add it to a list of "up" servers. The server would get taken out of the "up" hotlist only when the service returned an unexpected fault. Well, in his testing, he always tested scenarios where the server was always up! As you can imagine, in production this was not always the case!

@AndyAyersMS Thanks for the nice discussion of quartiles.
I found this piece of text which indicates that about 3 years ago (dotnet/coreclr#16353) the infrastructure did not have .NET Core 2.0 on the build agents, so the tests in this folder could not be compiled with the dotnet SDK. This is no longer true. Moreover, this project has been disabled (https://github.com/dotnet/runtime/blob/main/src/tests/Common/dirs.proj#L17) for the same amount of time (3 years). With this PR I want to raise the question: does somebody need this folder, so that it can be re-enabled, or can it be removed?
So I've been working on this a little while and the scope snowballed from my initial plan, but I wanted to let you know where things are now and get any high-level feedback you guys would like to provide. You are welcome to provide more detailed feedback right now if you want, but I imagine it will need at least one more pass before it's ready to go, and I'll ping again then.
I had two main goals:
At the moment my ad-hoc usage looks about like this:
a) Clone the coreclr repo
b) Build coreclr
c) cd tests/src/performance/scenario/JitBench
d) dotnet run -- --configs:Default,Tiering,Minopts --coreclr-bin-dir F:\github\coreclr\bin\Product\Windows_NT.x64.Release\
There are no required steps to build the test tree, restore anything, generate a layout, invoke xunit perf python scripts, set env vars or collate logs, though (bugs aside) it would still work roughly as before if you did all those things.
Here are the major changes + a few outstanding questions:
It felt like I was cutting against the grain on this, but it also feels pretty useful to be able to use VS F5 and the standard dotnet SDK tools on our own projects. Let me know if you don’t like it though.
a. Xunit performance expects to be given the raw command-line args so that it can parse them, but I find the choice to use ETW by default when no option has been specified aggravating here. The ETW events slow these tests down, and I doubt the majority of our test scenarios make use of the data. Additionally, it means the default case of launching from VS or running at the command line will just give errors saying to relaunch with admin permissions. However, making this not the default behavior meant I had to parse the command-line arguments myself in advance, modify the default case, and then reformat a new fake command line. (It's actually not a lot of code, but it's a weird interaction across components.)
b. The tests do a bunch of setup at the beginning, and if you request ETW but don't give admin permissions, I wanted to fail fast rather than fail when the runner eventually launches the first benchmark. Again this meant I had to pre-parse the command line and reason independently about whether XUnitPerformanceHarness will want to enable ETW.
c. The benchmarks need a working directory for various files they write and XunitPerformanceHarness expects specific command line formatting for how the output directory is passed. This was a third occasion in which I needed to understand and pre-parse the command line the harness wants in order to align the output directories.
d. The tests don’t use XUnitPerformance’s ScenarioBenchmark object model as their primary one. Instead I store the results in a similar OM that fit into the code a bit more smoothly and then spend ~100 lines of code transforming to the ScenarioBenchmark OM if/when I need to serialize the output. I am also not serializing the output at the same time I run the individual benchmarks. I create one XUnitPerformanceHarness instance each time I need to run a test, then optionally create a final one at the end just to serialize output.
e. XunitPerformanceHarness has hard-coded Console.WriteLine usage inside it, and I wanted all that output spew in a file to keep the console output clearer. I wound up hijacking Console.Out whenever using XunitPerformanceHarness in order to capture and redirect the logging.