Add build metrics to rustbuild #93717

pietroalbini · 2022-02-06T22:24:57Z

This PR adds a new module of rustbuild, ci_profiler, whose job is to gather as much information as possible about the CI build as possible and store it in a JSON file uploaded to ci-artifacts. Right now for each step it collects:

Type name and debug representation of the Step object.
Duration of the step (excluding child steps).
Systemwide CPU stats for the duration of the step (both single core and all cores).
Which child steps were executed.

This is capable of replacing both the scripts to collect CPU stats and the [TIMING] lines in build logs (not yet removed, until we port our tooling to use the CI profiler). The format is also extensible to be able in the future to collect more information.

r? @Mark-Simulacrum

pietroalbini · 2022-02-06T22:25:05Z

@bors try

bors · 2022-02-06T22:25:12Z

⌛ Trying commit 23e9357c0f89dab9db81f8d12f317968734b25ce with merge 74513354f36e2f00af506239e4feaa74bf715e30...

pietroalbini · 2022-02-06T22:32:44Z

Some open questions:

I'm not really happy with the "CI profiler" name but couldn't come up with anything better, ideas?
To avoid installing sysinfo on local builds this uses a new cargo feature, enabled by x.py when the config is set. Is that overkill?
The sampling frequency for CPU stats right now is 1s, should we shrink it to 100ms?

src/bootstrap/ci_profiler.rs

Mark-Simulacrum · 2022-02-06T23:01:00Z

src/bootstrap/ci_profiler.rs

+        assert!(state.running_steps.is_empty(), "steps are still executing");
+
+        let steps = std::mem::take(&mut state.finished_steps);
+        let json = vec![JsonNode::RustbuildInvocation {


IMO we should define the top-level structure of this file to be extensible (i.e., an object, rather than an array).

I agree the format should be extensible, but I think the schema I have achieves most of that goal.

As discussed in other comments, we want to track multiple x.py invocations into a single file. But, if we had a top-level object, the object wouldn't be able to store information that differs between invocations, such as the wall clock time of the whole invocation.

In the current schema, the role of the top-level object is in the RustbuildInvocation node, and all the metadata would be stored there. Having the top-level array then allows to easily store multiple of those invocations into a single file.

The only metadata I can see us wanting to store in the top level is maybe a format version number, but I'm not sure it's worth changing the schema just for that. What do you think?

At least some consumers -- e.g., total wall time analysis like my rustc-ci-timing scripts -- don't care about individual rustbuild invocations.

I also think in general a top-level object future-proofs us against any expansion we want to do, and in general unless there's a strong motivation to avoid it I'd prefer to have it (even if the only field is a "invocations" array).

As I mentioned elsewhere there are some properties that are more machine-dependent than invocation-specific that I'd like to add in the future (e.g., CPU model, disk space, etc.) that are useful to have on a per-job level IMO. Those are not really well-mapped to an individual x.py invocation. Likewise, I think it is useful to have the ability to represent e.g. CPU time spent outside x.py invocations, for example if we start trying to track the duration of AWS uploads or docker builds. Both of those don't map well to "RustbuildInvocations", and while we can have an array of differently-typed objects, that gets somewhat unwieldy fast, IMO.

Mark-Simulacrum · 2022-02-06T23:10:44Z

I'm not really happy with the "CI profiler" name but couldn't come up with anything better, ideas?

Moving away from the CI naming might be one approach, something like "metrics" or "build-summary" -- in some sense, we're not doing anything unique to CI, and one might imagine that eventually we'd like to collect these (opt-in) for folks building to measure some sense of what machines/platforms people are using for rustc development, what their build performance looks like, something like that.

To avoid installing sysinfo on local builds this uses a new cargo feature, enabled by x.py when the config is set. Is that overkill?

I think this is a good idea. I debated a little over whether to hand-roll the logic (moving it from Python to Rust code), since we already have a libc dep it's probably not that hard, but it seems likely that we'll want to include more things (more on that in a little bit).

The sampling frequency for CPU stats right now is 1s, should we shrink it to 100ms?

I think 1s is probably more than plenty -- I'd be worried about bootstrap starting to use more CPU time than warranted the lower we go just collecting stats. (Note that we don't optimize bootstrap, so it's code is good to keep simple; in practice it's probably not important but I want to avoid anything hugely complicated).

As I mentioned in one of the inline comments, I would like the JSON file to be an object so it's extensible. Right off the bat, I think we should try to include the following extra bits:

A total wall time (Instant started and stopped around the whole build)
- I want to avoid losing (in aggregate) time if e.g. the x.py code switching between steps starts taking a long time for whatever reason. It's also much easier to pull a single number than trying to do a recursive traversal and sum up a bunch of tiny measurements.

In the long run I'd love to include things like RAM, CPU core count, CPU model, etc; but that can come later. Let's just make sure our format is extensible so those deserializing can just have Option<...> or similar around fields rather than maintaining entirely separate structs for each kind.

Mark-Simulacrum · 2022-02-06T23:18:59Z

For the CPU stats, I think the current design is OK, but I'd probably prefer we restructure so that instead of each Step trying to have a list of measurements taken during its runtime, we instead move things such that we have a single list of CPU measurements, recorded over the full build, and then each step just has a Range<usize> indicating which of those belong to it. We can still have the ad-hoc logic you created to force re-starting the collection right before a Step begins, of course.

Ideally I think we may want to switch over to having that Vec -- similar to the current CPU collection -- have tuples of (time, usage), rather than pretending that we're recording at some set frequency. With the step-start interrupts there's not a consistent frequency regardless.

We may also want to, instead of trying to replace the current CPU measurement, leave that Python script alone for now. At least on builders that do significant work outside of rustbuild, the current approach will either (a) leave gaps in total wall time / CPU measurements or (b) require some work to bring that in. the src/ci/pgo.sh script has a lot of work outside of the core x.py loop, but which is (I believe) captured by the CPU python script.

I'm not sure whether the python script captures docker image re-builds today, but similarly it would be nice to avoid relying on x.py being the root of truth here -- I'm not sure how to best structure this, maybe you have some thoughts.

bors · 2022-02-06T23:40:43Z

💔 Test failed - checks-actions

pietroalbini · 2022-02-08T19:16:17Z

Moving away from the CI naming might be one approach, something like "metrics" or "build-summary" -- in some sense, we're not doing anything unique to CI, and one might imagine that eventually we'd like to collect these (opt-in) for folks building to measure some sense of what machines/platforms people are using for rustc development, what their build performance looks like, something like that.

metrics / build.metrics is a great name, I'll switch to it.

A total wall time (Instant started and stopped around the whole build)

I'll add that.

In the long run I'd love to include things like RAM, CPU core count, CPU model, etc

I'll see how hard it is to fetch that information with the sysinfo crate.

We may also want to, instead of trying to replace the current CPU measurement, leave that Python script alone for now. At least on builders that do significant work outside of rustbuild, the current approach will either (a) leave gaps in total wall time / CPU measurements or (b) require some work to bring that in. the src/ci/pgo.sh script has a lot of work outside of the core x.py loop, but which is (I believe) captured by the CPU python script.

I'm not sure whether the python script captures docker image re-builds today, but similarly it would be nice to avoid relying on x.py being the root of truth here -- I'm not sure how to best structure this, maybe you have some thoughts.

Now that you mention that, I think the Python script and the CPU information collected by this PR can and probably should coexhist:

The Python script tracks CPU usage across the whole build, even when x.py is not executed. Arguably seeing a graph of the CPU usage of a builder should be a feature the CI platform itself provides (for example CircleCI does that and it's extremely useful). This would be used mostly for "general trends" like "we're using 100% of the macOS CPU all the time".
The tracking done in this PR is specific to x.py steps, and is something that's not trivial to pull out of the logs emitted by the Python script (it's definitely doable, just annoying and not that accurate). This would be used mostly when analyzing a report generated by the data in this PR, to see which steps are not that parallel or don't benefit from many cores.

For the CPU stats, I think the current design is OK, but I'd probably prefer we restructure so that instead of each Step trying to have a list of measurements taken during its runtime, we instead move things such that we have a single list of CPU measurements, recorded over the full build, and then each step just has a Range indicating which of those belong to it. We can still have the ad-hoc logic you created to force re-starting the collection right before a Step begins, of course.

Ideally I think we may want to switch over to having that Vec -- similar to the current CPU collection -- have tuples of (time, usage), rather than pretending that we're recording at some set frequency. With the step-start interrupts there's not a consistent frequency regardless.

I agree this would be the right approach for use case 1), but for 2) I kinda prefer the current implementation.

Mark-Simulacrum · 2022-02-08T20:07:49Z

I'm not opposed to indefinitely keeping both, I think that approach definitely has its benefits -- both kinds of data are useful.

Anyway, I think I'm still in favor of a top-level object (with CPU name etc) -- metrics are not necessarily per-invocation.

bors · 2022-02-13T07:44:58Z

☔ The latest upstream changes (presumably #93713) made this pull request unmergeable. Please resolve the merge conflicts.

pietroalbini · 2022-02-13T10:59:42Z

Addressed all review comments.

pietroalbini · 2022-02-13T16:42:41Z

@bors try to generate a report

bors · 2022-02-13T16:42:48Z

⌛ Trying commit bbd125b96f07e92c1df7b3c46fefd9f8cd8c4ca6 with merge 463b48e8399de823c8ce7f7c9a30190c17751b7e...

src/ci/scripts/upload-artifacts.sh

src/bootstrap/metrics.rs

Mark-Simulacrum · 2022-02-13T16:52:41Z

src/bootstrap/metrics.rs

+        // new step starts. This ensures the first sample for each step is collected one interval
+        // after the step started executing, instead of a random time between the start of the step
+        // and the expected interval (depending on when the last sample was collected, as part of
+        // the previous step).


This comment sort of suggests that a step will always get a sample, but for short-lived steps that's not the case.

I think we also need to consider that at least in theory, we should be hitting the reset at step exit as well -- we probably often have situations where we have an outer step that maybe spends 1ms on CPU and an inner step spending 30s on CPU, for example, followed by processing by the outer step (say for 100ms). This sort of interleaving is why I'm a little doubtful about the reset trigger at all, since it doesn't really seem like it's going to produce good results to me.

bors · 2022-02-13T18:16:31Z

☀️ Try build successful - checks-actions
Build commit: 463b48e8399de823c8ce7f7c9a30190c17751b7e (463b48e8399de823c8ce7f7c9a30190c17751b7e)

Mark-Simulacrum · 2022-05-23T14:31:19Z

@bors r+

bors · 2022-05-23T14:31:21Z

📌 Commit 836b2d2 has been approved by Mark-Simulacrum

bors · 2022-05-24T01:08:56Z

⌛ Testing commit 836b2d2 with merge f8a37e6cf0ad737b3d9f4f614370f51e3edf3562...

bors · 2022-05-24T01:51:55Z

💔 Test failed - checks-actions

pietroalbini · 2022-06-04T14:09:30Z

Huh that failure is weird. @GuillaumeGomez any idea why the sysinfo could be failing this way?

thread 'main' panicked at 'attempt to multiply with overflow', C:\Users\runneradmin\.cargo\registry\src\github.com-1285ae84e5963aae\sysinfo-0.23.11\src\windows\system.rs:167:34
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

GuillaumeGomez · 2022-06-04T14:43:09Z

Seems like some multiplications didn't get the right method used and it's still present: https://github.com/GuillaumeGomez/sysinfo/blob/master/src/windows/system.rs#L171-L178

Do you prefer I make a 0.23.x release for fixing this bug or do you prefer that I send a PR where sysinfo is used to update it to 0.24.x?

pietroalbini · 2022-06-04T15:12:39Z

@GuillaumeGomez sysinfo is being introduced by this PR, so a fix would be necessary to land it. I can upgrade the PR to 0.24 without a problem, so feel free to avoid backporting the fix to 0.23 :)

GuillaumeGomez · 2022-06-04T15:20:12Z

Great! I fix the issue in GuillaumeGomez/sysinfo#757. I'll ping you once a made a release with it then.

GuillaumeGomez · 2022-06-04T16:06:20Z

I published the 0.24.1 version.

pietroalbini · 2022-06-04T16:59:38Z

Bumped sysinfo version to include the fix.

@bors r=Mark-Simulacrum

bors · 2022-06-04T16:59:39Z

📌 Commit 70cdd7e has been approved by Mark-Simulacrum

bors · 2022-06-05T01:35:07Z

⌛ Testing commit 70cdd7e with merge 6dadfc0...

bors · 2022-06-05T04:15:57Z

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 6dadfc0 to master...

rust-timer · 2022-06-05T05:34:17Z

Finished benchmarking commit (6dadfc0): comparison url.

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

Primary benchmarks: no relevant changes found
Secondary benchmarks: mixed results

	mean¹	max	count²
Regressions 😿 (primary)	N/A	N/A	0
Regressions 😿 (secondary)	4.0%	4.0%	1
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	-3.2%	-3.2%	1
All 😿🎉 (primary)	N/A	N/A	0

Cycles

This benchmark run did not return any relevant results for this metric.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

the arithmetic mean of the percent change ↩
number of relevant changes ↩

rust-highfive assigned Mark-Simulacrum Feb 6, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 6, 2022

Mark-Simulacrum reviewed Feb 6, 2022

View reviewed changes

This comment has been minimized.

Sign in to view

bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 6, 2022

pietroalbini force-pushed the pa-ci-profiler branch from 23e9357 to b2b1b36 Compare February 13, 2022 10:59

pietroalbini added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 13, 2022

pietroalbini changed the title ~~Add CI profiler to rustbuild~~ Add build metrics to rustbuild Feb 13, 2022

This comment has been minimized.

Sign in to view

pietroalbini force-pushed the pa-ci-profiler branch from b2b1b36 to bbd125b Compare February 13, 2022 16:40

Mark-Simulacrum reviewed Feb 13, 2022

View reviewed changes

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 14, 2022

rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 22, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 23, 2022

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 24, 2022

This comment has been minimized.

Sign in to view

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 24, 2022

bump sysinfo version

70cdd7e

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 4, 2022

bors added the merged-by-bors This PR was explicitly merged by bors. label Jun 5, 2022

bors merged commit 6dadfc0 into rust-lang:master Jun 5, 2022

rustbot added this to the 1.63.0 milestone Jun 5, 2022

pietroalbini deleted the pa-ci-profiler branch June 6, 2022 07:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add build metrics to rustbuild #93717

Add build metrics to rustbuild #93717

pietroalbini commented Feb 6, 2022

pietroalbini commented Feb 6, 2022

bors commented Feb 6, 2022

pietroalbini commented Feb 6, 2022

Mark-Simulacrum Feb 6, 2022

pietroalbini Feb 8, 2022

Mark-Simulacrum Feb 8, 2022

Mark-Simulacrum commented Feb 6, 2022

Mark-Simulacrum commented Feb 6, 2022

This comment has been minimized.

bors commented Feb 6, 2022

pietroalbini commented Feb 8, 2022

Mark-Simulacrum commented Feb 8, 2022

bors commented Feb 13, 2022

pietroalbini commented Feb 13, 2022

This comment has been minimized.

pietroalbini commented Feb 13, 2022

bors commented Feb 13, 2022

Mark-Simulacrum Feb 13, 2022

bors commented Feb 13, 2022

Mark-Simulacrum commented May 23, 2022

bors commented May 23, 2022

bors commented May 24, 2022

bors commented May 24, 2022

This comment has been minimized.

pietroalbini commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

pietroalbini commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

pietroalbini commented Jun 4, 2022

bors commented Jun 4, 2022

bors commented Jun 5, 2022

bors commented Jun 5, 2022

rust-timer commented Jun 5, 2022

Add build metrics to rustbuild #93717

Add build metrics to rustbuild #93717

Conversation

pietroalbini commented Feb 6, 2022

pietroalbini commented Feb 6, 2022

bors commented Feb 6, 2022

pietroalbini commented Feb 6, 2022

Mark-Simulacrum Feb 6, 2022

Choose a reason for hiding this comment

pietroalbini Feb 8, 2022

Choose a reason for hiding this comment

Mark-Simulacrum Feb 8, 2022

Choose a reason for hiding this comment

Mark-Simulacrum commented Feb 6, 2022

Mark-Simulacrum commented Feb 6, 2022

This comment has been minimized.

bors commented Feb 6, 2022

pietroalbini commented Feb 8, 2022

Mark-Simulacrum commented Feb 8, 2022

bors commented Feb 13, 2022

pietroalbini commented Feb 13, 2022

This comment has been minimized.

pietroalbini commented Feb 13, 2022

bors commented Feb 13, 2022

Mark-Simulacrum Feb 13, 2022

Choose a reason for hiding this comment

bors commented Feb 13, 2022

Mark-Simulacrum commented May 23, 2022

bors commented May 23, 2022

bors commented May 24, 2022

bors commented May 24, 2022

This comment has been minimized.

pietroalbini commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

pietroalbini commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

GuillaumeGomez commented Jun 4, 2022

pietroalbini commented Jun 4, 2022

bors commented Jun 4, 2022

bors commented Jun 5, 2022

bors commented Jun 5, 2022

rust-timer commented Jun 5, 2022

Instruction count

Max RSS (memory usage)

Cycles

Footnotes