Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLVM12 PCH leads to 2x slower compilation times #49638

Closed
hero101111 mannequin opened this issue May 11, 2021 · 23 comments
Closed

LLVM12 PCH leads to 2x slower compilation times #49638

hero101111 mannequin opened this issue May 11, 2021 · 23 comments
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category

Comments

@hero101111
Copy link
Mannequin

hero101111 mannequin commented May 11, 2021

Bugzilla Link 50294
Resolution FIXED
Resolved on Oct 11, 2021 20:29
Version 12.0
OS Windows NT
Blocks #51489
CC @akrieger,@bcardosolopes,@davidbolvansky,@dwblaikie,@gerboengels,@llunak,@Kojoley,@zygoloid,@tstellar
Fixed by commit(s) 6eda66b a52b7bf

Extended Description

Starting with LLVM12, we are getting a 1.5x - 2x performance hit if we compile our projects using PCH, compared to without PCH with LLVM 12. Exactly the opposite of what one would expect by using precompiled headers.

Compared to LLVM 11, with PCH enabled, we are observing a penalty of 3-4x.

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented May 11, 2021

We have reproduced the issue on three different machines, all on Windows 10, using LLVM 12.0.0 x86_64-pc-windows-msvc.

The issue is observed no matter the C++ project being compiled.

We have a compilation pipeline (~140 projects) which using LLVM11 would take around 30minutes to complete with PCH and 60 without PCH.

On LLVM12 we observe 60minutes without PCH and ~140minutes when using PCH.

@davidbolvansky
Copy link
Collaborator

You should provide some reproducer, testcase.

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented May 12, 2021

I am not really sure how to proceed regarding 'reproducer, testcase'. As as I have said previously, we are seeing this issue with all C++ projects, it's not isolated to a particular one.

We have also observed that '-emit-pch' command line argument is broken in LLVM12, calling clang with -v give a hint that the command line is badly parsed, into e and mit-pch:

-e mit-pch
LINK : warning LNK4044: unrecognized option '/e'; ignored
LINK : fatal error LNK1181: cannot open input file 'mit-pch.obj'

We have been able to get around the issue by simply not providing -emit-pch in the command line, and the PCH still getting generated. We tested this to be working on both LLVM11 and LLVM12.

Should I create a separate issue for the 'emit-pch' problem? I can easily provide a sample zip for that.

@davidbolvansky
Copy link
Collaborator

Yes, please.

@llunak
Copy link
Collaborator

llunak commented May 12, 2021

It should be enough to find one compilation that is slower with a PCH than without and then provide that one as a reproducer. Preprocessing the source with '-E -frewrite-includes' (for both PCH creation and use) should give you all sources as one file. And you could test if using -fpch-instantiate-templates makes a difference for you. If you use clang-cl you'll probably need to prefix these with /Xclang (or whatever the option is exactly).

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented May 13, 2021

Working on getting the reproducer ready.
I've tested using -fpch-instantiate-templates and it does speed the process considerably, yielding aprox 15% better time than not using PCH, but still slower than LLVM11 with PCH.

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented May 18, 2021

Files and powershell script needed for reproducing issue
I've attached a sample project which creates a PCH and uses it to compile a file multiple times. The powershell script outputs the time it took to run the compilation.

On my i9 9900K machine I get 14.8 seconds on LLVM12, compared to 3.2 seconds using LLVM 11.

The compilation script uses the clang++ driver and presumes the existence of an MSVC toolset and Windows 10 SDK, using hardcoded paths. Please update them accordingly.

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented May 18, 2021

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 16, 2021

I might be seeing the same issue. I also noticed that the resulting codegen is not identical with and without PCH enabled, using clang-cl frontend.

When I did a side by side comparison CPU usage trace I saw the extra cycles being spent starting at clang::Sema::ActOnEndOfTranslationUnit going into a deeply recursive clang::Sema::PerformPendingInstantiations <-> clang::Sema::InstantiateFunctionDefinition call cycle. Each entry into InstantiateFunctionDefinition would incur an expensive call to clang::ASTReader::ReadLateParsedTemplates.

The compared compilations were identical modulo the usage of the PCH file. Both compilations used /FI to include the pch header. I've attached a screenshot showing the comparsion, the top trace is from the compilation using a PCH, the bottom without.

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 16, 2021

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 16, 2021

More specifically, in the non-PCH case, there do not appear to be any calls to ReadLateParsedTemplates. I can highlight the difference there specifically if desired. The call depth appears to be the same in ActOnEndOfTranslationUnit, at least as far as I've bothered to expand the CPU trace.

@Kojoley
Copy link
Contributor

Kojoley commented Aug 16, 2021

When I did a side by side comparison CPU usage trace I saw the extra cycles
being spent starting at clang::Sema::ActOnEndOfTranslationUnit going into a
deeply recursive clang::Sema::PerformPendingInstantiations <->
clang::Sema::InstantiateFunctionDefinition call cycle. Each entry into
InstantiateFunctionDefinition would incur an expensive call to
clang::ASTReader::ReadLateParsedTemplates.

This should be related to -fpch-instantiate-templates (https://reviews.llvm.org/D69585) which is on by default in clang-cl.

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 16, 2021

When I did a side by side comparison CPU usage trace I saw the extra cycles
being spent starting at clang::Sema::ActOnEndOfTranslationUnit going into a
deeply recursive clang::Sema::PerformPendingInstantiations <->
clang::Sema::InstantiateFunctionDefinition call cycle. Each entry into
InstantiateFunctionDefinition would incur an expensive call to
clang::ASTReader::ReadLateParsedTemplates.

This should be related to -fpch-instantiate-templates
(https://reviews.llvm.org/D69585) which is on by default in clang-cl.

That flag, or rather, using -fno-pch-instantiate-templates, makes things worse.

Using -fno-pch-instantiate-templates when creating the PCH and when consuming the PCH causes the worst performance of all, approximately 15% worse than before. Using -fno-pch-instantiate-templates only when compiling the target file (but leaving it unset, defaulting to enabled, when compiling the pch) results in performance comparable to the prior case. In both situations, compiling the actual source file still shows the same call pattern and costs in ReadLateParsedTemplates.

Given the earlier report indicated this is a regression from 11 to 12, I will try a bisect to see if a suspect change can be brought to light.

@llunak
Copy link
Collaborator

llunak commented Aug 17, 2021

Just from quickly looking at the source, your bisecting will most likely point you to 2c9dbcd . Notice how it removes 'LateParsedTemplates.clear();', which means repeated calls of the function are no longer no-ops.

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 17, 2021

Confirmeing that the bisect goes to that change (thanks for the pointer!). Was removing the clear() call a bug, so we can add it back, or required to fix the problem the change was addressing?

@llunak
Copy link
Collaborator

llunak commented Aug 17, 2021

I don't know. It seems gargvaibhav64@gmail.com does not have a bugzilla account, so I've sent a direct mail pointing here.

@akrieger
Copy link
Mannequin

akrieger mannequin commented Aug 18, 2021

I think the reason we only see this on Windows is because the default when targeting Windows is to use -fdelayed-template-parsing even when not using the cl frontend.

@zygoloid
Copy link
Mannequin

zygoloid mannequin commented Sep 2, 2021

Thanks for the analysis; I've added back the clear() call. Please can someone test with Clang trunk and see if the problem is still visible?

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented Sep 3, 2021

I've compiled the trunk and after testing on our internal codebase I can confirm the issue is now gone. Compilation is quick & smooth now. Thank you, Richard Smith, for the fix.

@hero101111
Copy link
Mannequin Author

hero101111 mannequin commented Sep 3, 2021

We are very happy with the results of this fix. However, I've noticed that clang --version on the trunk is already 14.0. Is there a chance of this fix being cherry picked for the upcoming LLVM 13 release?

@tstellar
Copy link
Collaborator

tstellar commented Sep 4, 2021

Richard, any objections to backporting this?

@tstellar
Copy link
Collaborator

tstellar commented Sep 7, 2021

Merged: a52b7bf

@tstellar
Copy link
Collaborator

mentioned in issue #51489

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category
Projects
None yet
Development

No branches or pull requests

4 participants