
API flexibility #63

Closed
philipc opened this issue Oct 25, 2017 · 14 comments · Fixed by #64

Comments

@philipc
Contributor

philipc commented Oct 25, 2017

Prompted by @main--'s comment, here are a few API changes for consideration. The common theme is to make the API more flexible and less centered on providing an addr2line clone. Note that more flexible may also mean more complicated, so this is a tradeoff. Some of these may be strawman proposals :)

  • Remove the with_functions option, and use a parameter to locate instead (or a different locate method). This allows the caller to decide at each locate call, rather than when constructing the Mapping. This is possible because we lazily read the function information.

  • Remove the with_demangling option and stop doing demangling. This should be handled by the caller. Instead, we return a language value that the caller can use to determine how to demangle. This allows the caller to access both the mangled and demangled names if desired, and to implement their own caching of demangled names if required.

  • Remove the with_symbol_table option. The caller should be able to decide whether to use symbols for each query. The caller should also be able to get both debuginfo and symbol table information simultaneously. This may mean having two independent queries: one for debuginfo and one for the symbol table.

  • Make the object crate dependency optional. Instead, define a trait for the required object loader functionality. We can still implement this trait for object for convenience.

  • Avoid the memmap dependency. Using an OwningHandle to tie something to a memmap is a generic operation that has nothing to do with addr2line, and should be implemented elsewhere.

  • Add an endianity parameter to the mapping instead of using EndianDebugInfo. This allows the caller to decide if they only want to support NativeEndian, without incurring size or performance penalties.
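A rough sketch of the shape these proposals point toward. Everything here is hypothetical illustration — `ObjectLoader`, `Language`, and `Frame` are stand-in names, not the actual addr2line or object APIs:

```rust
/// Hypothetical loader trait that would replace the hard `object` dependency;
/// `object::File` could implement it for convenience.
trait ObjectLoader {
    /// Raw bytes of a named section, e.g. ".debug_info".
    fn section_data(&self, name: &str) -> Option<&[u8]>;
}

/// Returned instead of demangling internally, so the caller picks a demangler
/// (rustc-demangle, cpp_demangle, ...) and can cache results itself.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Language {
    Rust,
    Cpp,
    C,
    Unknown,
}

struct Frame<'a> {
    mangled_name: Option<&'a str>,
    language: Language,
}

/// Caller-side handling: the library hands back the raw name plus the
/// language, and the caller decides how (and whether) to demangle.
fn describe(frame: &Frame) -> String {
    match (frame.mangled_name, frame.language) {
        (Some(name), Language::Rust) => format!("rust symbol {}", name),
        (Some(name), _) => format!("symbol {}", name),
        (None, _) => "??".to_string(),
    }
}

fn main() {
    let frame = Frame { mangled_name: Some("_ZN3foo3barE"), language: Language::Cpp };
    println!("{}", describe(&frame));
}
```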

@main--
Contributor

main-- commented Oct 25, 2017

I mean, the main reason I wrote my own symbolizer is that I needed inlining info and couldn't see how to easily add that to gimli-addr2line so I just started over. As I said in gimli-rs/gimli#248, I'm definitely willing to merge my unwind-rs codebase into what others have as I don't have the time to maintain it on my own (and it doesn't really make sense when I'm the only user).

What hardliner does right now:

  • API:
    • You start by constructing a Context. This parses the debug_loc stuff.
    • If you want, you can now upgrade to a FullContext by calling parse_functions.
    • (I don't like this Context/FullContext split; ideally you would be able to start with either functions only or lines only and go from either to the combined thing that can do both. Also it needs traits to be composable.)
    • If you just want "dumb" line info (no functions, thus no inlining), call find_location(addr)
    • Else call query(addr) and you get an iterator over Frames.
    • Frame has an optional FunctionName (you can print/demangle/get raw str whatever)
    • Frame has an optional Location (file/line/col)
  • The whole thing uses way less memory than gimli-addr2line (both modes) and it's considerably faster when doing functions, probably because it uses an interval tree instead of running through the entire list every time. In lines-only mode, it's slower because it resumes the line program on each query, which is what saves a lot of memory.
    [graphs: time, memory]
    (eu-addr2line disqualified because it didn't do anything on my system - always prints ??, but does that really fast)
  • It also has a nice integration test that makes sure that the output of hardliner-addr2line is byte-equivalent to GNU addr2line for every combination of flags (except for -h and -v of course, also only boolean flags so no -j or -b either, and you can't currently specify a custom demangling style - it's always auto).
  • Oh, and this is so much nicer than the functions-dance in gimli-addr2line 😉
  • I'm quite happy with taking an &object::File to operate on, but offering some flexibility here with a trait certainly makes sense.
  • No symtab parsing (I'm not opposed to it but I also don't personally need it) (but as I saw in Add option to use symbol table #61, most of this happens in the object crate anyway). I agree that it should be a separate interface.
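The call flow above could be mocked roughly like this; the structs are stubs with canned data, standing in for hardliner's real DWARF-backed types:

```rust
struct Location {
    file: String,
    line: u64,
}

struct Frame {
    function: Option<String>,
    location: Option<Location>,
}

/// Parsed line-number info only; cheap to construct.
struct Context {}

/// Upgraded context with function/inlining info parsed as well.
struct FullContext {}

impl Context {
    fn new() -> Context {
        Context {}
    }

    /// "Dumb" line info: no functions, thus no inlining.
    fn find_location(&self, _addr: u64) -> Option<Location> {
        Some(Location { file: "main.rs".into(), line: 1 })
    }

    /// Upgrade: parse function/inlining info on demand.
    fn parse_functions(self) -> FullContext {
        FullContext {}
    }
}

impl FullContext {
    /// Full query: an iterator over (possibly inlined) Frames.
    fn query(&self, _addr: u64) -> impl Iterator<Item = Frame> {
        vec![Frame { function: Some("inlined_fn".into()), location: None }].into_iter()
    }
}

fn main() {
    let ctx = Context::new();
    println!("{:?}", ctx.find_location(0x40).map(|l| l.line));
    for f in ctx.parse_functions().query(0x40) {
        println!("{:?}", f.function);
    }
}
```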

@philipc
Contributor Author

philipc commented Oct 25, 2017

The whole thing uses way less memory than gimli-addr2line (both modes) and it's considerably faster when doing functions, probably because it uses an interval tree instead of running through the entire list every time.

Yep gimli-addr2line should be much faster here than what it is currently. I've given this some thought and plan to work on it as soon as I have time (perhaps tomorrow).
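The interval-style lookup credited above can be approximated with non-overlapping ranges sorted by start address plus a binary search (a real implementation would need an interval tree to handle nested and overlapping function ranges); the data here is illustrative:

```rust
/// Look up the function containing `addr` in a slice of non-overlapping
/// (start, end, name) ranges sorted by start address. O(log n) per query
/// instead of scanning the whole list every time.
fn find_function<'a>(ranges: &'a [(u64, u64, &'a str)], addr: u64) -> Option<&'a str> {
    // Index of the first range whose start is strictly greater than addr.
    let i = ranges.partition_point(|&(start, _, _)| start <= addr);
    if i == 0 {
        return None; // addr is before the first range
    }
    let (start, end, name) = ranges[i - 1];
    // The candidate range may still end before addr (a gap between functions).
    if addr >= start && addr < end {
        Some(name)
    } else {
        None
    }
}

fn main() {
    let ranges = [(0u64, 10, "a"), (10, 20, "b"), (32, 40, "c")];
    println!("{:?}", find_function(&ranges, 35));
}
```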

In lines-only mode, it's slower because it just does this "resume the line program" thing which saves a lot of memory however.

I wanted to do that, but some compilers generate one sequence for the entire compilation unit, so for some files this can be very slow.

In general, I like a lot about your code style in hardliner, and I understand why you started from scratch, but I still plan to implement inlining info in gimli-addr2line and then evaluate the differences from there.

The rest of the unwinding is not something I've worked on (although I may in future) so my opinions on that don't count for much. It doesn't matter to me if that is in gimli-rs or not, it's whatever works best for you. Moving it to gimli-rs won't immediately accomplish anything though if no-one else has time to work on it.

@fitzgen
Member

fitzgen commented Oct 25, 2017

First off: sorry I've been slow to respond in other threads, I've been traveling.


My general take: it doesn't make sense to have competing symbolication libraries without good reason. We should be working together, because none of us have the time/resources to maintain everything ourselves.

If that means throwing out the current code base and replacing it with hardliner's, then we should do that. If it means refactoring addr2line to use techniques from hardliner, then we should do that. If it means experimentation before doing one of the previous options, OK.

But we have to eventually end up with one shared code base (preferably in gimli-rs/addr2line).


As I said in gimli-rs/gimli#248, I'm definitely willing to merge my unwind-rs codebase into what others have as I don't have the time to maintain it on my own (and it doesn't really make sense when I'm the only user).

Yeah, let's do this. I'll send you an org invite in a moment, and then you can transfer the repo over.

The rest of the unwinding is not something I've worked on (although I may in future) so my opinions on that don't count for much. It doesn't matter to me if that is in gimli-rs or not, it's whatever works best for you. Moving it to gimli-rs won't immediately accomplish anything though if no-one else has time to work on it.

What we get is more ecosystem cohesion and reviewers to bounce ideas off of. A bit fluffy, but I want our organization to be the one-stop shop for all kinds of debugging needs and none of us can create that alone :)


I'm still working on findshlibs, which I think we should leverage from unwind-rs. I have a branch where I'm implementing the "give me the eh_frame" bits.

Should I transfer findshlibs to gimli-rs? What about cpp_demangle?

@philipc
Contributor Author

philipc commented Oct 26, 2017

If it means experimentation before doing one of the previous options, OK. But we have to eventually end up with one shared code base (preferably in gimli-rs/addr2line).

Agreed. I'll do that experimentation then write up my thoughts.

Should I transfer findshlibs to gimli-rs? What about cpp_demangle?

Sure. I've also transferred ddbug, but until I'm happier with the state of it, I will continue to develop it by simply pushing to master.

@philipc
Contributor Author

philipc commented Oct 27, 2017

For testing purposes, I've added function inlining support to gimli-addr2line here. It's ugly, so don't look too closely. This algorithm is based heavily on the interval tree usage in @main--'s code, and all credit for the good performance of this goes to @main--. For the binary I'm testing on, the results of this don't fully agree with binutils, but I did a few spot checks and those results do agree with llvm-symbolizer.

I tried comparing the output with hardliner, but currently it is missing many names in C++ code, although it gets the line numbers correct. This is probably because it doesn't follow DW_AT_specification, but I haven't investigated further.

Here's some timing comparisons. I haven't done graphs because they make it harder to compare when there are outliers. The test binary is mostly C++ with about 11MB of text.

~110k consecutive addresses:

==> Benchmarking
binutils   nofunc  2.57   50200
llvm-3.9   nofunc  0.65   30264
gimli      nofunc  0.65   17824
hardliner  nofunc  29.33  18224
==> Benchmarking with -f -i
binutils   func  2.62   49988
llvm-3.9   func  59.76  34856
gimli      func  0.66   24480
hardliner  func  29.96  57228

Every 100th address:

==> Benchmarking
binutils   nofunc  9.22  173416
llvm-3.9   nofunc  0.27  45680
gimli      nofunc  0.27  35772
hardliner  nofunc  6.00  18284
==> Benchmarking with -f -i
binutils   func  9.20   173276
llvm-3.9   func  10.48  87856
gimli      func  0.34   74216
hardliner  func  6.17   57300

Every address:

binutils   nofunc  1888.61  180960
llvm-3.9   nofunc  65.47    45496
gimli      nofunc  64.92    31836
hardliner  nofunc  1570.63  16824

For the consecutive addresses, hardliner's memory usage is higher because it doesn't do lazy parsing, but that can be fixed, so ignore that. A better comparison of memory usage is for the benchmark of every address. I didn't bother waiting for that to do the functions too, but the difference in memory usage between gimli and hardliner should be about the same still.

The main thing to note is that due to hardliner resuming sequences instead of caching them, it is significantly slower. This is also the reason why it uses less memory. However, these benchmarks are for lots of addresses, and the time per address is still quite low. I've been optimizing primarily for speed, but is that the correct thing to optimize for? I don't know...

If we keep gimli-addr2line, we need to:

  • cleanup my inlining patch
  • significantly cleanup the rest of the API

If we switch to hardliner, we need to:

  • fix some minor issues with its parsing
  • add some performance fixes (lazy parsing, avoid multiple iteration of attributes)
  • add symbol table parsing

And then the big difference that could go either way is which line sequence algorithm to use.

I'm happy with using either code base. If we stick with gimli-addr2line then I can do the majority of the work on that. If we change to hardliner then I would like @main-- to do some of that, but I can still work on it too.

@main--
Contributor

main-- commented Oct 27, 2017

Very interesting analysis. I was benchmarking with a debug build of ripgrep (29MB total), where (as you can see from the graphs) the performance penalty from resuming the line number program wasn't nearly as big.

I tried comparing the output with hardliner, but currently it is missing many names in C++ code, although it gets the line numbers correct. This is probably because it doesn't follow DW_AT_specification, but I haven't investigated further.

I'm pretty sure that's the reason - as I'm always testing with Rust binaries, I only implemented DW_AT_abstract_origin but it makes sense that C++ would use DW_AT_specification.

One important point to note is that recent versions (5.0) of llvm-symbolizer are much faster. I wanted to find out why but I'm just not familiar enough with the llvm codebase - on the surface it looks like they just create a sorted list and binary search that??

The main thing to note is that due to hardliner resuming sequences instead of caching them, it is significantly slower. This is also the reason why it uses less memory. However, these benchmarks are for lots of addresses, and the time per address is still quite low. I've been optimizing primarily for speed, but is that the correct thing to optimize for? I don't know...

I know, it's a difficult problem. The way I think about it, the one major use case for symbolization is stack traces. A typical stack trace:

  • doesn't ever have consecutive addresses
  • may contain repeating addresses (in case of recursion)
  • only hits a small subset of the application's functions (but may over time, in a long-running application that generates backtraces regularly, hit everything)

From that perspective, it seems like the key to this is laziness: only parse what we need and cache that. The major downside is that it forces us to allocate when querying. That's a problem when you use the library in a panic handler, as you might be panicking due to OOM. My opinion on this has shifted a bit though - a backtrace for an OOM panic may not be very useful anyway, since you can never be sure who allocated all the memory without a memory profiler.
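The lazy-parse-and-cache strategy could be sketched like this, with a fake "decode" step standing in for running the DWARF line program (note the allocation on first query, the downside mentioned above):

```rust
use std::collections::HashMap;

/// Sketch of lazy parsing with caching: only decode a line sequence the first
/// time an address in its range is queried, then reuse the decoded result.
/// The sequence boundaries and "decode" step are illustrative.
struct Symbolizer {
    cache: HashMap<u64, String>, // decoded result keyed by sequence start
}

impl Symbolizer {
    fn new() -> Self {
        Symbolizer { cache: HashMap::new() }
    }

    fn locate(&mut self, addr: u64) -> &str {
        // Pretend each 0x1000-aligned block is one line-number sequence.
        let seq_start = addr & !0xfff;
        self.cache.entry(seq_start).or_insert_with(|| {
            // Expensive step on first query only: this is where the real
            // implementation would run the line program, and where the
            // allocation happens.
            format!("sequence@{:#x}", seq_start)
        })
    }
}

fn main() {
    let mut s = Symbolizer::new();
    println!("{}", s.locate(0x1234)); // decodes the sequence
    println!("{}", s.locate(0x1567)); // cache hit, no re-decode
}
```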

Now about the symbol table parsing:

  • The data comes basically straight from the object file parser. This is not really compatible with an API that abstracts over this and allows you to use something other than object.
  • It's an acceptable substitute for DWARF data if and only if you don't need inlining info AND the functions actually are available in the object symbol table.
  • Taking all that into account, I'm not sure this functionality even belongs in this crate. It's certainly a useful fallback for an addr2line command line utility (better than nothing), but strictly speaking, what we'd be offering is literally just a search table. I can't really see it being that useful for library users.

@philipc
Contributor Author

philipc commented Oct 27, 2017

on the surface it looks like they just create a sorted list and binary search that??

Yes, I had a bit of a look at that, but didn't compile it to test. They take advantage of the fact that the DIEs are a tree. So your address map only needs to tell you which DIE to start at, and then you can traverse back up the tree to get the rest. DWARF doesn't naturally allow traversing backwards, so they have a Vec<(depth, offset)> to allow that.
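That parent-pointer trick can be sketched as a flat, in-order list of (depth, offset) entries: after finding the DIE covering an address, its parent is the nearest earlier entry with a smaller depth. The tree data below is illustrative:

```rust
/// Given DIEs stored as (depth, offset) in pre-order, find the index of the
/// parent of the DIE at `idx`: the nearest preceding entry that is shallower.
fn parent_of(dies: &[(usize, u64)], idx: usize) -> Option<usize> {
    let depth = dies[idx].0;
    // Scan backwards for the first entry with a smaller depth.
    (0..idx).rev().find(|&i| dies[i].0 < depth)
}

fn main() {
    // Tree: root(depth 0) -> a(1) -> b(2), then root -> c(1).
    let dies = [(0usize, 0u64), (1, 8), (2, 16), (1, 24)];
    // Walk from b back up to the root.
    let mut at = 2;
    while let Some(p) = parent_of(&dies, at) {
        println!("parent offset: {:#x}", dies[p].1);
        at = p;
    }
}
```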

The major downside is that it forces us to allocate when querying.

If I recall correctly from when I briefly looked at libbacktrace, it also allocates during the query. That doesn't mean we shouldn't try to do better though. Benchmarking against it might be interesting too. I think it's the same as the others in that it caches the fully decoded line sequences.

Taking all that into account, I'm not sure this functionality even belongs in this crate.

We can easily keep it all in the object crate.

It's certainly a useful fallback for an addr2line command line utility (better than nothing) but strictly speaking, what we'd be offering is literally just a search table. I can't really see it being that useful for library users.

Sometimes you have to link with code for which you don't have debug info. It's nice to have something in the backtrace instead of simply question marks.

@main--
Contributor

main-- commented Oct 27, 2017

DWARF doesn't naturally allow traversing backwards, so they have a Vec<(depth, offset)> to allow that.

Ah, so this is the part I was missing. I'm curious how this compares to an optimized implementation of my interval tree - I'm not sure which is faster.

Taking all that into account, I'm not sure this functionality even belongs in this crate.

We can easily keep it all in the object crate.

Yeah, this is what I'm proposing: Keep it in the object crate, perhaps add a search table implementation there. Then if you want both dwarf and elf symbols (in addr2line for example), just do the fallback in user code.

Sometimes you have to link with code for which you don't have debug info. It's nice to have something in the backtrace instead of simply question marks.

I guess it depends a lot on what you're doing. But for just printing a trace, the better-than-nothing argument obviously applies, you're right.

@fitzgen
Member

fitzgen commented Oct 27, 2017

I know, it's a difficult problem. The way I think about it, the one major use case for symbolization is stack traces. A typical stack trace:

Note that something like a profiler is also "just" symbolicating stack traces, but typically many stack traces at a time. And while each individual sampled stack won't have consecutive addresses, there likely are consecutive addresses across samples.

@tromey
Member

tromey commented Oct 27, 2017

If I recall correctly from when I briefly looked at libbacktrace, it also allocates during the query

It does, but it has its own allocator that uses mmap, with this comment:

/* Memory allocation on systems that provide anonymous mmap.  This
   permits the backtrace functions to be invoked from a signal
   handler, assuming that mmap is async-signal safe.  */

I think all of libbacktrace is async-signal-safe (if you grant this assumption).

@philipc
Contributor Author

philipc commented Nov 3, 2017

We still need to decide between gimli-addr2line and hardliner.

I think the main factor for deciding between them is the .debug_line handling, and I don't have a strong opinion either way. Long term, I want to investigate using the approach used by hardliner, but adding artificial breaks in long sequences, so maybe hardliner is a better starting point for that.
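The "artificial breaks" idea could look roughly like this: checkpoint the line program every N rows, so a lookup replays at most N rows from the nearest checkpoint instead of the whole sequence. The row data is a simplified stand-in for real line-program state:

```rust
/// Replay at most this many rows from a checkpoint per lookup.
const CHECKPOINT_INTERVAL: usize = 4;

struct Checkpoint {
    row_index: usize,
    address: u64,
}

/// Record a checkpoint every CHECKPOINT_INTERVAL rows. Rows here are just
/// sorted addresses; a real implementation would snapshot full row state.
fn build_checkpoints(rows: &[u64]) -> Vec<Checkpoint> {
    rows.iter()
        .enumerate()
        .step_by(CHECKPOINT_INTERVAL)
        .map(|(i, &addr)| Checkpoint { row_index: i, address: addr })
        .collect()
}

/// Find the row covering `addr` by replaying forward from the nearest
/// checkpoint at or before it, bounding the resume cost for long sequences.
fn lookup(rows: &[u64], checkpoints: &[Checkpoint], addr: u64) -> Option<usize> {
    let c = checkpoints.iter().rev().find(|c| c.address <= addr)?;
    rows[c.row_index..]
        .iter()
        .enumerate()
        .take_while(|&(_, &a)| a <= addr)
        .last()
        .map(|(off, _)| c.row_index + off)
}

fn main() {
    let rows = [0u64, 4, 8, 12, 16, 20, 24, 28];
    let cps = build_checkpoints(&rows);
    println!("{:?}", lookup(&rows, &cps, 10));
}
```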

For the other factors, I think hardliner's API is closer to what we want. Symbol table handling doesn't matter because that's moving to the object crate. So overall I think hardliner is a better starting point.

If we do choose hardliner, how do we go about replacing this crate with it?

It does, but it has its own allocator that uses mmap

So it's conceivable we could do the same with custom allocators. Auditing Rust code for async-signal-safety might be hard though? I think lazy parsing is important to have for large files.

@main--
Contributor

main-- commented Nov 3, 2017

If we do choose hardliner, how do we go about replacing this crate with it?

A few options:

  1. Rename this repo to something else, duplicate the unwind-rs repo, remove everything that isn't hardliner
  2. Merge unwind-rs into this repo (either manually or using git-subtree), remove everything that isn't hardliner
  3. Copy hardliner into this repo (losing git history)

I haven't given this a lot of thought but I think I prefer 2 since it both preserves a meaningful git history in this repo as well as for hardliner.

So it's conceivable we could do the same with custom allocators. Auditing rust for async-signal-safe might be hard though? I think lazy parsing is important to have for large files.

I don't think async-signal-safety is really supported in Rust right now? But I agree, lazy parsing can be a huge performance boost in practice.

@fitzgen
Member

fitzgen commented Nov 3, 2017

+1 for option 2

@philipc
Contributor Author

philipc commented Nov 7, 2017

I prefer option 2 as well.
