Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Also path map transitive header jar paths with direct classpath optimization #19872

Closed
wants to merge 2 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Oct 18, 2023

JavaHeaderCompileAction can apply an optimization where it compiles the source files only against direct dependencies, making use of the fact that Turbine includes sufficient information about transitive dependencies into the direct header jars in a special directory.

In order to ensure that downstream consumers of header compilation action can use its .jdeps file regardless of which of these actions actually uses path mapping, all such paths to transitive jars have to be unmapped.

With this commit, actions can declare additional artifacts whose paths they want to apply path mapping to. By always passing all transitive jars into the path mapper, even when only the direct jars are inputs to the action, the transitive header jar paths can be unmapped and stripped path collisions between them correctly result in a noop PathMapper being used for the current action.

Fixes #21091

@fmeum fmeum requested review from a team and lberki as code owners October 18, 2023 14:11
@fmeum fmeum requested review from katre and removed request for a team October 18, 2023 14:11
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Configurability platforms, toolchains, cquery, select(), config transitions team-Performance Issues for Performance teams team-Rules-Java Issues for Java rules labels Oct 18, 2023
@fmeum fmeum requested review from gregestren and removed request for lberki and katre October 18, 2023 14:12
@fmeum fmeum assigned aranguyen and unassigned aranguyen Oct 18, 2023
@fmeum fmeum requested a review from aranguyen October 18, 2023 14:23
@fmeum fmeum changed the title Also path map transitive header jars with direct classpath optimization Also path map transitive header jdeps with direct classpath optimization Oct 18, 2023
@fmeum fmeum changed the title Also path map transitive header jdeps with direct classpath optimization Also path map transitive header jar paths with direct classpath optimization Oct 18, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 18, 2023

Note that there is still one problematic case that this PR isn't meant to address: Since we do not unmap the paths embedded in the META-INF/TRANSITIVE directories in header jars, if a header compile action without path mapping depends on a header jar produced by one with path mapping, it will emit a mapped path into its .jdeps file, which can then result in some compile action falling back to full classpath mode unnecessarily.

To fix that problem, we really need to use a mapping scheme that can't produce collisions, e.g. sorting collidings paths by their hashes and then adding a disambiguating count.

Copy link
Contributor

@gregestren gregestren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In b/303689530 (Google discussion of this issue) @cushon was also concerned about a path-mapped reference being stored in the hjar output. It's stored in a repackaged transitive class which includes a TurbineTransitiveJar class attribute to its original .jar path.

I think that's the source for the jdeps content? And while your PR fixes the .jdeps rewriting, it doesn't change the original reference, which is still embedded in the header jar contents?

I'm curious how this compares to @aranguyen 's idea to make getReduced ClassPath more resilient. And my idea on not disabling path mapping when the same input appears multiple times. And your related idea of not disabling path mapping by remapping away collisions.

@cushon had an interesting idea that if we make getReducedClassPath more resilient, maybe we don't even need createFullOutputDeps. But I think that assumes we can make getReducedClassPath sufficiently resilient. And it assumes we don't have other consumers of jdeps (which I'm not sure is true).

I do like that you're trying to principally make the jdeps content correct from the source.

I'm adding a lot of complicating thoughts here but I'd like to clearly understand how all these different approaches fit together in the bigger picture.

One other critique is the less ad hoc extra logic we need, the better.

@@ -499,7 +522,8 @@ public void build(JavaToolchainProvider javaToolchain) throws RuleErrorException
// If classPathMode == BAZEL, also make sure to inject the dependencies to be
// available to downstream actions. Else just do enough work to locally create the
// full .jdeps from the .stripped .jdeps produced on the executor.
/* insertDependencies= */ classpathMode == JavaClasspathMode.BAZEL));
/* insertDependencies= */ classpathMode == JavaClasspathMode.BAZEL,
additionalArtifactsForPathMapping));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, this adds a new reference to the transitive class path to JavaheaderCompileAction. Does that have any memory consequences in that the reference could live longer? I realize there's already a reference passed to JavaCompileAction (not sure it's the same reference, since JavaHeaderCompileActionBuilder only constructs a JavaCompileAction under certain circumstances). Maybe that reference eliminates this concern?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified with JOL that the instance size of JavaHeaderCompileAction is still 72 after this change (it had 7 bytes of padding previously), so the field itself shouldn't result in higher memory usage. @justinhorvitz Is "HotSpot >= 15, 64-bit, COOPS, CCPS" the correct JOL setting to analyze retained size for Bazel?

The nested set it creates an additional reference to is either empty or comes from JavaTargetAttributes#getCompileTimeClassPath(), which I think is always retained by the corresponding full compilation action here:

NestedSet<Artifact> classpath =
NestedSetBuilder.<Artifact>naiveLinkOrder()
.addTransitive(bootClassPathInfo.auxiliary())
.addTransitive(attributes.getCompileTimeClassPath())
.build();
if (!bootClassPathInfo.auxiliary().isEmpty()) {
builder.setClasspathEntries(classpath);
builder.setDirectJars(
NestedSetBuilder.<Artifact>naiveLinkOrder()
.addTransitive(bootClassPathInfo.auxiliary())
.addTransitive(attributes.getDirectJars())
.build());
} else {
builder.setClasspathEntries(attributes.getCompileTimeClassPath());
builder.setDirectJars(attributes.getDirectJars());
}
.

@cushon
Copy link
Contributor

cushon commented Oct 19, 2023

Thanks for tackling this! Overall this seems like an encouraging approach to me.

cushon was also concerned about a path-mapped reference being stored in the hjar output. It's stored in a repackaged transitive class which includes a TurbineTransitiveJar class attribute to its original .jar path. ... I think that's the source for the jdeps content?

(+1, the implementation comment here mentions google/turbine@f9f2dec, which is what I was talking about. I think we're on the same page about the turbine behaviour.)

I'm curious how this compares to @aranguyen 's idea to make getReduced ClassPath more resilient. And my idea on not disabling path mapping when the same input appears multiple times. And your related idea of not disabling path mapping by remapping away collisions.

If I'm following I think either of the 'not disabling path mapping' options are complementary with this PR, and they would help with the "problematic case that this PR isn't meant to address" in #19872 (comment).

cushon had an interesting idea that if we make getReducedClassPath more resilient, maybe we don't even need createFullOutputDeps. But I think that assumes we can make getReducedClassPath sufficiently resilient. And it assumes we don't have other consumers of jdeps (which I'm not sure is true).

That's a good point: there are other consumers of jdeps, so I think we would want createFullOutputDeps even if getReducedClassPath was more resilient.

@fmeum
Copy link
Collaborator Author

fmeum commented Oct 21, 2023

I can add a test for the situation in which this change is relevant after the merge of the general test setup in #18155.

@gregestren
Copy link
Contributor

@aranguyen did you say that this PR doesn't address the failure you found in Google?

@aranguyen
Copy link
Contributor

@aranguyen did you say that this PR doesn't address the failure you found in Google?

It's this commit that I tested a4e1d7b#diff-738efdd95b5eedf99d16e391201cb00e284b0ccf3542506f6c88cb01ffb96830 and saw mixed paths still. I think it's different from this one.

@fmeum any idea why the presubmits are failing. I'll work on testing this one soon

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2023

@fmeum any idea why the presubmits are failing. I'll work on testing this one soon

Not yet, I will look into it.

@keertk
Copy link
Member

keertk commented Nov 17, 2023

Actually, 7.1.0 may be better since this hasn't been reviewed yet, @fmeum.

`JavaHeaderCompileAction` can apply an optimization where it compiles
the source files only against direct dependencies, making use of the
fact that Turbine includes sufficient information about transitive
dependencies into the direct header jars in a special directory.

In order to ensure that downstream consumers of header compilation
action can use its `.jdeps` file regardless of which of these actions
actually uses path mapping, all such paths to transitive jars have to
be unmapped.

With this commit, actions can declare additional artifacts whose paths
they want to apply path mapping to. By always passing all transitive
jars into the path mapper, even when only the direct jars are inputs to
the action.
Copy link
Contributor

@aranguyen aranguyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@copybara-service copybara-service bot closed this in fd196bf Feb 6, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Feb 6, 2024
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Feb 6, 2024
…ization

`JavaHeaderCompileAction` can apply an optimization where it compiles the source files only against direct dependencies, making use of the fact that Turbine includes sufficient information about transitive dependencies into the direct header jars in a special directory.

In order to ensure that downstream consumers of header compilation action can use its `.jdeps` file regardless of which of these actions actually uses path mapping, all such paths to transitive jars have to be unmapped.

With this commit, actions can declare additional artifacts whose paths they want to apply path mapping to. By always passing all transitive jars into the path mapper, even when only the direct jars are inputs to the action, the transitive header jar paths can be unmapped and stripped path collisions between them correctly result in a noop `PathMapper` being used for the current action.

Fixes bazelbuild#21091

Closes bazelbuild#19872.

PiperOrigin-RevId: 604772306
Change-Id: I21d25785b2e909aace8554251f41b1012185138a
@fmeum fmeum deleted the java-header-path-mapping branch February 6, 2024 23:17
github-merge-queue bot pushed a commit that referenced this pull request Feb 8, 2024
…th optimization (#21227)

`JavaHeaderCompileAction` can apply an optimization where it compiles
the source files only against direct dependencies, making use of the
fact that Turbine includes sufficient information about transitive
dependencies into the direct header jars in a special directory.

In order to ensure that downstream consumers of header compilation
action can use its `.jdeps` file regardless of which of these actions
actually uses path mapping, all such paths to transitive jars have to be
unmapped.

With this commit, actions can declare additional artifacts whose paths
they want to apply path mapping to. By always passing all transitive
jars into the path mapper, even when only the direct jars are inputs to
the action, the transitive header jar paths can be unmapped and stripped
path collisions between them correctly result in a noop `PathMapper`
being used for the current action.

Fixes #21091

Closes #19872.

Commit
fd196bf

PiperOrigin-RevId: 604772306
Change-Id: I21d25785b2e909aace8554251f41b1012185138a

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
@Wyverald
Copy link
Member

Wyverald commented Feb 9, 2024

This was rolled back due to a significant memory regression (f5d76b1).

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 9, 2024

Is there any additional information in the internal version of the rollback commit? It looks like it has some redacted information.

I will try to reproduce this on Bazel itself.

@gregestren
Copy link
Contributor

gregestren commented Feb 9, 2024

Not much. It's a 0.6% memory regression on a large Java project, which crosses some performance check threshold (I can't remember the threshold). It also found a 3.2% reduction in system time and 60% reduction in oldGenGarbage (don't know what that means). I'm not sure if they're noise but the benchmark at b/324218949#comment3 (internal) notes them as "significant".

@oquenchil might be able to offer more benchmark insight.

I'm about to go OOO for a week, but @aranguyen is aware. Conceptually, were we expecting this to have any resource impact if path mapping isn't enabled? The affected build isn't using path mapping at all.

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 10, 2024

Thanks for the additional input.

Just so that I understand it correctly: The change increased memory by 0.6% while reducing system time by 3.2%? That would certainly be interesting (and unexpected).

@gregestren
Copy link
Contributor

gregestren commented Feb 10, 2024

I don't think the benchmark is always reliable, even with "significant" findings. It's a basic story of you have to repeat a huge amount of times to get consistently reliable results.

But @oquenchil did repeat the memory profiling a few times, so I believe that result is valid.

@justinhorvitz
Copy link
Contributor

Just from reading the code:

  • The extra field in JavaHeaderCompileAction adds 8 bytes per instance.
  • The classpathEntries NestedSet is newly retained.
  • The above two show up during analysis, but the majority of the regression only shows up when running a full build. There's some extra unconditional NestedSet flattening happening with your change. That might be the culprit (via instantiating the memo). Regardless, that should be defered to only happen if path mapping is active.

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 11, 2024

Thanks for taking a look!

The extra field in JavaHeaderCompileAction adds 8 bytes per instance.

When I finalized this PR, the instance size was 72 bytes with 4 bytes padding, so this change wouldn't have affected the size. However, in the meantime the runfilesSupplier field has been removed from SpawnAction, which brought down the instance size to 64 and thus made this PR a regression.

I submitted a separate PR (#21290) that removes the boolean field and thus frees up the 4 bytes needed for the new field.

The classpathEntries NestedSet is newly retained.

If I traced the flow of this nested set correctly, while it's newly retained by this action, it should already be retained by the sibling JavaCompileAction (see #19872 (comment)).

Regardless, that should be defered to only happen if path mapping is active.

This looks easy to achieve and I will implement it when submitting the rollforward.

@justinhorvitz
Copy link
Contributor

I confirmed that storing the classpathEntries results in 125k additional NestedSet instances retained on the benchmark build. Note that there's a return after constructing the JavaHeaderCompileAction, so the JavaCompileAction construction below is not invoked on the same call.

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 12, 2024

I confirmed that storing the classpathEntries results in 125k additional NestedSet instances retained on the benchmark build. Note that there's a return after constructing the JavaHeaderCompileAction, so the JavaCompileAction construction below is not invoked on the same call.

Thanks for the additional information!

Does the benchmark build involve "full" (non-header) Java compilation actions? If it's mostly just header compilation actions and that's actually a common use case, then I don't see a way to make the approach in this PR work at all.

My assumption so far has been that in "real-world builds" there would always be a sibling non-header compilation action that retains this nested set (which ultimately comes from https://cs.opensource.google/bazel/bazel/+/af28cae84e93b5d706c9cd374d1b8e324a3e8ca0:src/main/java/com/google/devtools/build/lib/rules/java/JavaTargetAttributes.java;l=42;bpv=1;bpt=1) in https://cs.opensource.google/bazel/bazel/+/af28cae84e93b5d706c9cd374d1b8e324a3e8ca0:src/main/java/com/google/devtools/build/lib/rules/java/JavaCompilationHelper.java;l=328-343. If the benchmark analyzes the non-header compilation actions and still ends up with additional nested sets, then I'm missing something crucial here. I will benchmark this locally to get a better understanding of what's retained where.

@justinhorvitz
Copy link
Contributor

The benchmark build has 194,776 JavaCompileAction instances and 127,013 JavaHeaderCompileAction instances. We use --experimental_java_classpath=bazel, which might be relevant.

@aranguyen
Copy link
Contributor

@fmeum I was also able to confirm the regression myself. Since you mentioned the possibility of this approach not working as the regression is definitely a blocker, WDYT about making getReducedClassPath agnostic to pathMapping logic? It would look something like this. This is actually one of the older fix I had (see #19872 (review) where Greg mentioned it). I was able to get the failing targets to pass with this fix and I'll run bench marking tonight. I will update tomorrow morning.

  @VisibleForTesting
  ReducedClasspath getReducedClasspath(
      ActionExecutionContext actionExecutionContext, JavaCompileActionContext context)
      throws IOException {
    HashSet<String> direct = new HashSet<>();
    for (Artifact directJar : directJars.toList()) {
      direct.add(directJar.getExecPathString());
    }
    for (Artifact depArtifact : dependencyArtifacts.toList()) {
      for (Deps.Dependency dep :
          context.getDependencies(depArtifact, actionExecutionContext).getDependencyList()) {
        direct.add(dep.getPath());
      }
    }
    ImmutableList<Artifact> transitiveCollection = transitiveInputs.toList();
    List<Artifact> reducedJarsList = new ArrayList<>();
    StringStripper stringStripper = new StringStripper(configuration.getBinDir().getExecPath().subFragment(0,1).getPathString());
    for (Artifact input : transitiveCollection) {
      if (direct.contains(input.getExecPathString())) {
        reducedJarsList.add(input);
      } else if (direct.contains(stringStripper.strip(input.getExecPathString()))) {
        reducedJarsList.add(input);
      }
    }

    ImmutableList<Artifact> reducedJars = ImmutableList.copyOf(reducedJarsList);
    return new ReducedClasspath(ImmutableList.copyOf(reducedJars), transitiveCollection.size());
  }

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 13, 2024

@aranguyen Yes, that's a great alternative. I would say go ahead with it!

I will spend more time understanding why this regression exists in the first place (I can reproduce it locally on Bazel). Worst case I'll learn something, best case we can switch to this approach later.

@aranguyen
Copy link
Contributor

@fmeum benchmark came back with no memory regression with the approach i mentioned but there's a bit of a cost in wall time. I'll try and sort that out. Please do let me know how your investigation goes as well.

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 14, 2024

I now understand why the actions weren't sharing the NestedSet as I expected. I submitted a separate PR (#21343) to change that. It looks like it would save one NestedSet instance per action, so ideally this would even have a net positive effect on memory usage.

@fmeum
Copy link
Collaborator Author

fmeum commented Feb 17, 2024

Submitted a rollforward change: #21401

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Configurability platforms, toolchains, cquery, select(), config transitions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add multiplex sandboxing support for Java actions
8 participants