Surface action cache hit rate #90

Open
saraadams opened this issue Jan 26, 2023 · 2 comments
Labels
type/feat Suggests new features.

Comments

@saraadams
Collaborator

saraadams commented Jan 26, 2023

Problem

Surface the action cache hit rate, in particular if remote caching is used.

Suggested solution

The following profile events may help distinguish cache misses from cache hits:

  • no cache hit:
    check cache hit (category remote action cache check) within event ActionContinuation.execute (category general information), followed by execution
    • remote execution: potentially upload missing inputs (category Remote execution upload time), followed by execute remotely (category remote action execution)
    • remote cache only: TODO
    • disk cache: not included in profile?
  • cache hit:
    check cache hit within event ActionContinuation.execute, with no execution afterwards

DataProvider to provide rate and/or absolute numbers (cache checks, successful cache checks)
SuggestionProvider to suggest strategies to increase the cache hit rate, e.g. --incompatible_strict_action_env
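
To make the idea concrete, here is a minimal Python sketch of what such a data provider could compute. It is not the project's implementation. It assumes the profile has been loaded as a list of Chrome-trace-style event dicts with `ph`, `tid`, `ts`, `dur`, `name` and `cat` keys, and it only covers the remote execution case described above. `CacheStats` and `cache_stats` are hypothetical names.

```python
from dataclasses import dataclass

CACHE_CHECK = "remote action cache check"
REMOTE_EXEC = "remote action execution"
ENCLOSING_NAME = "ActionContinuation.execute"


@dataclass(frozen=True)
class CacheStats:
    checks: int  # number of enclosing events that contain a cache check
    hits: int    # checks that were not followed by remote execution

    @property
    def hit_rate(self) -> float:
        return self.hits / self.checks if self.checks else 0.0


def _contains(outer, inner):
    """True if `inner` runs on the same thread and within the time span of `outer`."""
    return (
        outer.get("tid") == inner.get("tid")
        and outer["ts"] <= inner["ts"]
        and inner["ts"] + inner.get("dur", 0) <= outer["ts"] + outer.get("dur", 0)
    )


def cache_stats(events):
    """Count cache checks and hits (assumption: one check per enclosing event)."""
    complete = [e for e in events if e.get("ph") == "X"]
    enclosing = [e for e in complete if e.get("name") == ENCLOSING_NAME]
    checks = hits = 0
    for outer in enclosing:
        # Quadratic scan; acceptable for a sketch, not for large profiles.
        inner = [e for e in complete if e is not outer and _contains(outer, e)]
        check = next((e for e in inner if e.get("cat") == CACHE_CHECK), None)
        if check is None:
            continue
        checks += 1
        executed = any(
            e.get("cat") == REMOTE_EXEC and e["ts"] >= check["ts"] for e in inner
        )
        if not executed:
            hits += 1
    return CacheStats(checks, hits)
```

Calling `cache_stats(events)` yields both absolute numbers and, via `hit_rate`, the rate. The "remote cache only" and disk cache cases are left open here, as in the list above.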

@saraadams saraadams added the type/feat Suggests new features. label Jan 26, 2023
@saraadams saraadams changed the title from "Surface action cache hHit Rate" to "Surface action cache hit rate" Jan 26, 2023
@saraadams
Collaborator Author

If latency is high, running many check cache hit operations in parallel can speed up retrieving remote cache hits, since jobs otherwise sit idle while waiting on the network.
The latency could be estimated by looking for the shortest check cache hit entry.
Increasing --jobs above the number of cores on your machine could help.
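
A rough sketch of the latency estimate, in Python. It assumes events are Chrome-trace-style dicts whose `dur` is in microseconds and selects cache checks by the category remote action cache check. `estimate_latency_us` is a hypothetical helper name.

```python
CACHE_CHECK = "remote action cache check"


def estimate_latency_us(events):
    """Return the shortest cache check duration, or None if there is none.

    The shortest check is used as a lower bound for the round trip to the
    remote cache; this is a heuristic, not a measurement.
    """
    durations = [
        e["dur"]
        for e in events
        if e.get("ph") == "X" and e.get("cat") == CACHE_CHECK and "dur" in e
    ]
    return min(durations) if durations else None
```

If the estimate is large compared to typical action durations, that would be a hint that raising `--jobs` is worth trying.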

saraadams added a commit that referenced this issue Nov 25, 2023
This change

* fixes the percentage shown, it was showing the inverse value (cache miss % instead of cache hit %)
* adds absolute numbers for how many cache checks were performed and how many were misses
* filters out local actions that don't do remote cache checks, as these are not relevant

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
@saraadams
Collaborator Author

  • All actions that weren't in the internal action cache:
    • event with category action processing
  • All events that check a remote cache (disk_cache or remote_cache)
    • event with category remote action cache check
  • Cache hits also have a related event of category remote output download, but not any events indicating remote execution (e.g. of category remote action execution)
  • Cache misses should have a related event for execution (local or remote)
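
These rules can be sketched as a small classifier. The Python below assumes the hard part, relating events to a single action, has already been done, so that each action is represented by the list of categories of its related events. `classify_action` and `summarize` are hypothetical names, and the category strings are the ones listed above.

```python
from collections import Counter

CACHE_CHECK = "remote action cache check"
OUTPUT_DOWNLOAD = "remote output download"
REMOTE_EXEC = "remote action execution"


def classify_action(categories):
    """Classify one action from the categories of its related events."""
    cats = set(categories)
    if CACHE_CHECK not in cats:
        return "not checked"
    # Outputs may also be downloaded after remote execution, so a download
    # only indicates a hit when no remote execution took place.
    if OUTPUT_DOWNLOAD in cats and REMOTE_EXEC not in cats:
        return "hit"
    # Assumption: a check without a download means the action was executed.
    return "miss"


def summarize(actions):
    """Count classifications over an iterable of per-action category lists."""
    return Counter(classify_action(cats) for cats in actions)
```

One known gap in this sketch: a cache hit for which no outputs are downloaded would be counted as a miss.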

@saraadams saraadams self-assigned this Nov 25, 2023
saraadams added a commit that referenced this issue Nov 25, 2023
If remote caching is used and there are remote cache misses, suggest
investigating the misses with a link to Bazel documentation.

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 26, 2023
If remote caching is used and there are remote cache misses, suggest
investigating the misses with a link to Bazel documentation.

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 27, 2023
This change

* fixes the percentage shown, it was showing the inverse value (cache
miss % instead of cache hit %)
* fixes the "remote upload outputs" time, which mistakenly also included
upload times of inputs (for RE).
* filters out local actions that don't do remote cache checks, as these
are not relevant and falsify the %
* adds absolute numbers for how many cache checks were performed and how
many were misses
* adds some documentation on `disk_cache` also being a remote cache

Contributes to #90

---------

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 27, 2023
…c `CompleteEvent`s (#125)

This is a refactor and code cleanup in preparation for addressing #90

* moves some static methods over to a new util class that helps identify
the meaning of `CompleteEvent`s seen in Bazel profiles
* adds some missing copyright notices

---------

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 27, 2023
If remote caching is used and there are remote cache misses, suggest
investigating the misses with a link to Bazel documentation.

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 27, 2023
If remote caching is used and there are remote cache misses, suggest
investigating the misses with a link to Bazel documentation.

Contributes to #90

---------

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 28, 2023
…on location

This data provider scans all actions and splits them into:

* remote cache hit
* remote cache miss
* remote cache not checked

as well as (for non-cache-hits):

* executed locally
* executed remotely
* execution location not reported

"Internal" Bazel actions are included in "remote cache not checked" and
"execution location not reported".
While I invested ample time in trying to single internal actions out reliably,
I did not succeed. I'm not sure it's possible with the information currently
written to profiles. A TODO to look into separating out "internal" actions is
in the code.

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 28, 2023
…on location (#138)

This data provider scans all actions and splits them into:

* remote cache hit
* remote cache miss
* remote cache not checked

as well as (for non-cache-hits):

* executed locally
* executed remotely
* execution location not reported

"Internal" Bazel actions are included in "remote cache not checked" and
"execution location not reported".
While I invested ample time in trying to single internal actions out
reliably, I did not succeed. I'm not sure it's possible with the
information currently written to profiles. A TODO to look into
separating out "internal" actions is in the code.

Contributes to #90

---------

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
saraadams added a commit that referenced this issue Nov 29, 2023
…140)

When highlighting cache misses sorted by their duration, exclude the
time spent for checking the remote cache. This is more reflective of
which cache misses you want to fix.

Contributes to #90

Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
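
The ordering described in this last commit can be illustrated with a short Python sketch. The dict keys `total_us` and `cache_check_us` are hypothetical and stand for an action's overall duration and the time it spent checking the remote cache.

```python
def rank_misses(misses):
    """Sort cache misses by duration, excluding remote cache check time."""
    return sorted(
        misses,
        key=lambda m: m["total_us"] - m["cache_check_us"],
        reverse=True,  # longest remaining duration first
    )
```

With this ordering, an action that spent most of its time waiting on the cache check ranks below one that spent its time executing.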