.bazelrc: enable skymeld by default #4411

Closed

Conversation

sluongng
Contributor

The Skymeld config lets us start build actions without having to wait for
all repository rules to finish executing in the analysis phase.

Effectively, this lets us start the frontend builds earlier, without
having to wait for the Go external dependencies and toolchain to finish
loading.

Some benchmarks:

1. Building `//server` without remote cache

   ```bash
   > hyperfine --prepare 'bazel clean --async' \
               --warmup 1 \
               'bazel build --config=x -k --remote_instance_name="$RANDOM" server' \
               'bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server'

   Benchmark 1: bazel build --config=x -k --remote_instance_name="$RANDOM" server
     Time (mean ± σ):     241.279 s ± 42.673 s    [User: 0.134 s, System: 0.149 s]
     Range (min … max):   169.694 s … 318.005 s    10 runs

   Benchmark 2: bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server
     Time (mean ± σ):     213.551 s ± 75.335 s    [User: 0.118 s, System: 0.129 s]
     Range (min … max):   148.260 s … 400.258 s    10 runs

   Summary
     bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server ran
       1.13 ± 0.45 times faster than bazel build --config=x -k --remote_instance_name="$RANDOM" server
   ```

2. Building `//server` with remote cache

   ```bash
   > hyperfine --prepare 'bazel clean --async' \
               --warmup 1 \
               'bazel build --config=x -k server' \
               'bazel build --config=x -k --config=skymeld server'

   Benchmark 1: bazel build --config=x -k server
     Time (mean ± σ):     19.282 s ±  0.473 s    [User: 0.014 s, System: 0.023 s]
     Range (min … max):   18.656 s … 20.218 s    10 runs

   Benchmark 2: bazel build --config=x -k --config=skymeld server
     Time (mean ± σ):     17.732 s ±  0.407 s    [User: 0.014 s, System: 0.023 s]
     Range (min … max):   17.118 s … 18.626 s    10 runs

   Summary
     bazel build --config=x -k --config=skymeld server ran
       1.09 ± 0.04 times faster than bazel build --config=x -k server
   ```

3. Building everything (`//...`) with remote cache

   ```bash
   > hyperfine --prepare 'bazel clean --async' \
               --warmup 1 \
               'bazel build --config=x -k //...' \
               'bazel build --config=x -k --config=skymeld //...'

   Benchmark 1: bazel build --config=x -k //...
     Time (mean ± σ):     27.113 s ±  1.020 s    [User: 0.014 s, System: 0.020 s]
     Range (min … max):   25.938 s … 28.833 s    10 runs

   Benchmark 2: bazel build --config=x -k --config=skymeld //...
     Time (mean ± σ):     25.349 s ±  1.155 s    [User: 0.014 s, System: 0.021 s]
     Range (min … max):   23.567 s … 27.314 s    10 runs

   Summary
     bazel build --config=x -k --config=skymeld //... ran
       1.07 ± 0.06 times faster than bazel build --config=x -k //...
   ```

Overall, using Skymeld gives a 7-14% speed improvement over the existing setup.

Removed `--experimental_skymeld_ui` as it's a no-op flag.
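
For reference, a minimal sketch of what the `.bazelrc` change could look like. The flag name below is an assumption based on the upstream Bazel flag that backs Skymeld, not a quote from this diff:

```bash
# .bazelrc (sketch): enable Skymeld so analysis and execution run concurrently.
# Assumed flag name; verify against the Bazel version in use.
build --experimental_merged_skyframe_analysis_execution

# Previously this lived behind an opt-in config, roughly:
#   build:skymeld --experimental_merged_skyframe_analysis_execution
#   build:skymeld --experimental_skymeld_ui   # dropped: no-op flag
```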
@bduffany
Member

bduffany commented Jul 27, 2023

IIUC, Skymeld is only expected to speed up multi-target builds, so it's interesting that it has any effect for `//server`?

bazelbuild/bazel#14057 (comment)

Skymeld is expected to improve the end-to-end wall time of multi-target builds with remote execution. All the wall time wins should come from the analysis phase time.

I'm wondering if this could be due to either the --async flag on the bazel clean (maybe the cache isn't getting fully cleaned in time?), or possibly because --remote_instance_name does not affect CAS (only AC) and so the skymeld builds are at an advantage because they run later. (Even with --warmup 1, not all the executors are guaranteed to be warmed up, because of random scheduling)

Do you get the same results if you swap the order of the 2 build commands?

i.e. instead of

            'bazel build --config=x -k --remote_instance_name="$RANDOM" server' \
            'bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server'

do

            'bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server' \
            'bazel build --config=x -k --remote_instance_name="$RANDOM" server'

(Small side note, you could omit -k for benchmarks since if the build fails for any target, you probably want to abort the benchmark immediately)
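
For completeness, a sketch of the re-run with both suggestions applied (command order swapped, `-k` dropped); everything else is unchanged from the original benchmark:

```bash
hyperfine --prepare 'bazel clean --async' \
          --warmup 1 \
          'bazel build --config=x --remote_instance_name="$RANDOM" --config=skymeld server' \
          'bazel build --config=x --remote_instance_name="$RANDOM" server'
```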

@bduffany
Member

bduffany commented Jul 27, 2023

Another thing that's mildly concerning is the relatively high max time (400s) with skymeld enabled, vs 318s disabled (for the //server target w/o remote cache). Might be worth running with a larger number of trials and also dumping the raw data to see the distribution a bit better.

@sluongng
Contributor Author

sluongng commented Jul 27, 2023

IIUC skymeld is only expected to speed up multi-target builds so it's interesting that it has any effect for server?

I think it affects both. It was first implemented as a POC for single-target builds and later expanded to support multi-target builds.

In practice, this implementation starts in `BuildTool.java`, where the logic is roughly:

```java
if (skymeld) {
  // analysis and execution run interleaved (async)
  analysisAndExecution();
  return;
}

// otherwise, run the phases sequentially
analysis();
execution();
```

However, it will have more impact on build graphs whose analysis phase is fine-grained rather than clustered. The reason I think `//server` is an OK candidate is that different Go targets depend on different sets of `go_repository` rules, and some `go_repository` rules might finish before others, allowing some targets to start building earlier.

I'm wondering if this could be due to either the --async flag on the bazel clean (maybe the cache isn't getting fully cleaned in time?), or possibly because --remote_instance_name does not affect CAS (only AC) and so the skymeld builds are at an advantage because they run later.

AFAIK, `bazel clean --async` first moves the cache dir to a /tmp dir and performs the cleanup there. This allows the old location to be emptied out quickly for subsequent writes. However, I think there could be an impact on I/O performance when doing a new build while the old cache is being cleaned up 🤔. I could turn it off next time to avoid noise.

(Even with --warmup 1, not all the executors are guaranteed to be warmed up, because of random scheduling)

I don't have a good solution for this other than bumping up run counts in hyperfine and accepting a certain margin of error 🤔

Do you get the same results if you swap the order of the 2 build commands?

Will try this out.

(Small side note, you could omit -k for benchmarks since if the build fails for any target, you probably want to abort the benchmark immediately)

Agreed. I was using it during some earlier exploratory runs. It probably isn't needed for the real benchmark.

Another thing that's mildly concerning is the relatively high max time (400s) with skymeld enabled, vs 318s disabled (for the //server target w/o remote cache). Might be worth running with a larger number of trials and also dumping the raw data to see the distribution a bit better.

Good point. I think the overall spikes were likely due to my local network conditions.

I'm thinking of 3 ways to improve the general benchmarking methodology:

- Use our Workflows API to do the benchmark.
- Have hyperfine export JSON results and analyze them separately (see the sketch below).
- Leverage https://github.com/stepancheg/absh, which gives a better statistical report than hyperfine.
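
For the JSON-export option, a minimal sketch (the output file name and the `jq` summary are illustrative, not something already in this PR):

```bash
# Re-run the comparison, keeping every raw timing for later analysis.
hyperfine --prepare 'bazel clean --async' \
          --warmup 1 \
          --runs 20 \
          --export-json skymeld_bench.json \
          'bazel build --config=x server' \
          'bazel build --config=x --config=skymeld server'

# Look at each command's distribution (min/median/max and raw times), not just the mean.
jq '.results[] | {command, min, median, max, times}' skymeld_bench.json
```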

@sluongng
Contributor Author

This is the default as of Bazel 7.

@sluongng sluongng closed this Jan 31, 2024