Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make -Z self-profile more efficient #58372

Closed
4 tasks done
michaelwoerister opened this issue Feb 11, 2019 · 4 comments
Closed
4 tasks done

Make -Z self-profile more efficient #58372

michaelwoerister opened this issue Feb 11, 2019 · 4 comments
Assignees
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@michaelwoerister
Copy link
Member

michaelwoerister commented Feb 11, 2019

The self-profiling feature is going to make profiling the compilers performance a lot easier. However, a recent first stab at collecting more detailed information (see #58085) still has too much overhead.

Here are some of the things that could be improved:

  • Move post-processing of the collected data out of the rustc process, as much as possible. SelfProfiler::get_results() does a lot of work for generating the statistics from the collected events. All of this should probably be moved to a separate tool that runs after profiling is done.

  • Reduce the amount of dispatch and locking that needs to be done for each event. For each event we have to get exclusive access to the profiler (RefCell/ parking_lot mutex) and then look up the event stream for the current thread in an FxHashMap. This should probably solved via thread-local data somehow.

  • Reduce the size of events. Events are quite big (32 bytes on x86_64 would be my guess). The timestamp can be reduced to 64 bits if we just measure the time from process start. The &str containing the query name can be replaced by a 4 byte tag.

  • Persist events to disk in a binary format. We should probably open a memory mapped file per thread that we write events to directly. If events don't contain pointers they can be written to disk verbatim. The post-processing tool can then convert them to something platform independent.

Some time soon we also want to record query keys per event. This can already be done efficiently by storing the 32 bit DepNodeIndex that corresponds to a query (which also obviates the need to store the query name in each event). However, in order for the DepNodeIndex to be useful, we'll need to create a persist a mapping of DepNodeIndex -> String at some point before the tcx is destroyed (i.e. in the middle of the compilation process). I expect that creating this map will not be entirely cheap :/

@Mark-Simulacrum, as you can see the whole workflow around self-profiling will change quite a bit, so I
think it's too early to add infrastructure for it to perf.rlo just yet.

cc @wesleywiser @nnethercote (and @rust-lang/wg-compiler-performance for good measure)

@michaelwoerister michaelwoerister added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 11, 2019
@wesleywiser wesleywiser self-assigned this Feb 11, 2019
wesleywiser added a commit to wesleywiser/rust that referenced this issue Feb 20, 2019
wesleywiser added a commit to wesleywiser/rust that referenced this issue Feb 20, 2019
bors added a commit that referenced this issue Feb 20, 2019
[self-profiler] Make the profiler faster/more efficient

Related to #58372

r? @michaelwoerister
@michaelwoerister
Copy link
Member Author

So here are some thoughts on making the part of profiling that is happening in the compiler process as cheap as possibe:

  • Anything in ProfilerEvent should be small and persistable verbatim. E.g. we don't want &str in there because that needs to be expanded to the actual string contents during serialization, which generates a lot of redundant work. I think this basically leaves integer values (which can be indices into side tables that are also persisted, but with less redundancy).
  • Query keys still have to be rendered to some kind of string form during persistence. This is expensive, so we have to be smart about it:
    • It's good to do this in bulk if possible, so that it shows up as a single event in profiles instead of being spread across other events with little chance to account for it during post-processing.
    • The string representation of query keys contain lots of redundancy, e.g. the same DefPath prefixes will occur over and over again. It would be good if the way we generate and persist these strings takes advantage of that.

As a result of these thoughts I have the following proposal:

  • ProfilerEvent should look like this (modulo more efficient packing into "bitfields"):

    struct ProfilerEvent {
      event_kind: EventKind, // u8 - query-provider, query-cache-hit, generic-event, incr-comp-cache-loading ... 
      event_id: u32, // ~ (query-kind, query-key) or (function-name, arguments)
      timestamp_kind: u8, // start, stop, instant
      timestamp: u64 // nanoseconds since profiler was created
    }
  • The event_id is basically a key into a string table that is owned by the profiler. The entry in the string table does not have to be created at the same time as the event_id. In particular, for queries we can assume that the event_id is the DepNodeIndex of the query instance and can then later iterate over all query tables in order to generate the corresponding string table entries in bulk. We can reserve the range 0 .. 2^31 for queries. Other events are less fine-grained so it should be OK to allocate entries on demand without noticeably skewing the profile.

  • The string table in the profiler is tuned for coping with redundancy, that is, entries can refer to other entries by reference instead of copying the contents. So an entry conceptually looks like:

    struct StringTableEntry {
       components: [EntryComponent]
    }
    
    enum EntryComponent {
       Contents(String),
       Ref(StringTableId),
       End,
    } 

    I say conceptually because I expect these entries to be serialized immediately during their allocation. The table itself does not do any de-duplication or interning. This is done from the outside where we have additional knowledge of the data being stored. This also means that entries cannot be compared for equality by just looking at their ID but I don't think we need that.

  • Because DefId and CrateNum are by far the most common query keys we'll take special care of generating their string table entries in an efficient manner. I imagine something similar to what we do for generating CGU names, that is, use caching/dynamic programming to make this really efficient. Other kinds of query keys (like Ty) can be optimized later too since the string table format is reasonably flexible.

  • Since the string table is rather complex, it should be encapsulated into its own re-usable crate so that the post-processing crate does not have to duplicate the logic.

What I like about this approach is that it should be rather efficient but it is still generic. I.e. the string table only knows about strings and references between them but does not need to duplicate more logic about compiler internal data structures like the DefPath or Ty. We are lowering these into a simpler form.

Any feedback is welcome!

wesleywiser added a commit to wesleywiser/rust that referenced this issue Mar 1, 2019
wesleywiser added a commit to wesleywiser/rust that referenced this issue Mar 1, 2019
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue Mar 2, 2019
…r=michaelwoerister

[self-profiler] Make the profiler faster/more efficient

Related to rust-lang#58372

r? @michaelwoerister
wesleywiser added a commit to wesleywiser/rust that referenced this issue Mar 3, 2019
wesleywiser added a commit to wesleywiser/rust that referenced this issue Mar 3, 2019
bors added a commit that referenced this issue Mar 3, 2019
…erister

[self-profiler] Make the profiler faster/more efficient

Related to #58372

r? @michaelwoerister
@pnkfelix
Copy link
Member

pnkfelix commented Mar 7, 2019

cc #58967

michaelwoerister pushed a commit to michaelwoerister/rust that referenced this issue Apr 12, 2019
wesleywiser added a commit to wesleywiser/rust that referenced this issue Apr 13, 2019
@wesleywiser
Copy link
Member

@michaelwoerister I believe this issue can be closed now that we're using measureme right?

@michaelwoerister
Copy link
Member Author

Yes, I think all things listed here are implemented via measureme. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants