
Fix root span sampling #263

Merged
merged 10 commits from dvic-root-span-sampling into open-telemetry:main on Aug 20, 2021

Conversation

dvic (Contributor) commented Aug 5, 2021

This fixes #262

dvic (Contributor, Author) commented Aug 5, 2021

I'm not sure whether sample/7 should receive Ctx or NewSpanCtx.

edit:

Using NewSpanCtx breaks the following test:

    %%% opentelemetry_SUITE ==>
    Failure/Error: ?assertMatch(false, SpanCtx2#span_ctx.is_recording)
      expected: = false
           got: true
          line: 420

so something is still not right 🤔

edit2:

this might just be because the "hack" used in the tests is no longer correct?

    %% hack to set the created span as local not sampled
    ?set_current_span(SpanCtx1#span_ctx{is_remote=false,
                                        is_recording=false,
                                        trace_flags=0}),
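
For context, the failing assertion quoted above (opentelemetry_SUITE, line 420) checks that a span started under a local, non-sampled parent is itself non-recording. A minimal sketch of that check, reconstructed from the snippets in this thread rather than copied from the actual suite, might look like:

    %% Sketch only: reconstructed from the assertion and "hack" quoted above.
    %% Assumes the ?start_span/?set_current_span macros from the opentelemetry
    %% API includes and ?assertMatch from eunit.
    SpanCtx1 = ?start_span(<<"parent">>),
    %% pretend the parent is a local, non-sampled span
    ?set_current_span(SpanCtx1#span_ctx{is_remote=false,
                                        is_recording=false,
                                        trace_flags=0}),
    %% a child started under that parent should also be non-recording
    SpanCtx2 = ?start_span(<<"child">>),
    ?assertMatch(false, SpanCtx2#span_ctx.is_recording).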

codecov bot commented Aug 5, 2021

Codecov Report

Merging #263 (38b964c) into main (ea75b2a) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #263      +/-   ##
==========================================
- Coverage   36.47%   36.39%   -0.09%     
==========================================
  Files          41       41              
  Lines        3169     3165       -4     
==========================================
- Hits         1156     1152       -4     
  Misses       2013     2013              
| Flag | Coverage Δ |
| --- | --- |
| api | 63.06% <ø> (ø) |
| elixir | 15.98% <ø> (ø) |
| erlang | 36.39% <100.00%> (-0.09%) ⬇️ |
| exporter | 19.60% <ø> (ø) |
| sdk | 79.09% <100.00%> (-0.15%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| apps/opentelemetry/src/otel_span_utils.erl | 96.00% <100.00%> (-0.56%) ⬇️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea75b2a...38b964c.

dvic (Contributor, Author) commented Aug 5, 2021

I think I've fixed it properly now.

tsloughter (Member) commented

Yup, I think this is right. Thanks! I've got to look a little closer but will merge soon I think.

dvic marked this pull request as a draft on August 6, 2021 at 11:19
dvic (Contributor, Author) commented Aug 6, 2021

I converted this PR back to a draft. You were right to want a closer look ;) because there is still a problem: I noticed that I sometimes had missing root spans. It seems this PR broke the other use case: when you do have parent spans, they are not exported.

dvic (Contributor, Author) commented Aug 6, 2021

Never mind, I was also testing with #261 and that actually seems to be the problem; if I use both PRs but don't use any samplers, there are no issues.

dvic marked this pull request as ready for review on August 6, 2021 at 11:30
dvic (Contributor, Author) commented Aug 6, 2021

Please merge #264 first; it fixes the failing test in this PR (the underlying issue was not introduced by this PR, hence the separate PR).

tsloughter (Member) commented

@dvic I think the test failure, and the issue you opened the next PR about, is solved by fixing how is_remote is set on new spans. I can get a PR together tomorrow that should fix it, which would make this PR mergeable and your PR #264 no longer needed.

dvic (Contributor, Author) commented Aug 10, 2021

@tsloughter could it be that there is a bug with the sweeper (introduced by or related to this PR)? We're running a long process where we use always_off as the root sampler (see the configuration sketch below), and the otel_span_table ETS table just keeps growing (500+ MB). I haven't looked into it myself yet, but I thought it might ring a bell?

Related to this, I see this comment:
[screenshot]
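
For context, "always_off as the root sampler" above refers to a parent_based sampler whose root decision is always_off; issue #262 describes the analogous setup with trace_id_ratio_based as the root. A configuration sketch (the exact shape of the sampler option is an assumption here, not verified against a specific release) might look roughly like:

    %% sys.config fragment: sketch only; treat the option shape as an
    %% assumption and check the opentelemetry app documentation for the
    %% exact form in your release.
    {opentelemetry,
     [{sampler, {parent_based, #{root => always_off}}}]}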

tsloughter (Member) commented

@dvic are you sure the sweeper is running?

Not recorded is a tricky situation. Originally a span that wasn't sampled wouldn't be inserted into the ets table, but the addition of "non-recording" complicated that. I think this was done to be safe but it may be the case that we don't need it.

I'll see if I can find anything in the spec that relates to a non-recording span being used that would mean we need to keep it in the table.

But either way they should be swept away, so that is a separate problem I think.

dvic (Contributor, Author) commented Aug 10, 2021

> @dvic are you sure the sweeper is running?

It runs by default, right? (I guess with recent Elixir versions, where, as I understand it, deps are started automatically?) Anyway, it seems to be running:

[screenshot showing the sweeper process running]

> Not recorded is a tricky situation. Originally a span that wasn't sampled wouldn't be inserted into the ets table, but the addition of "non-recording" complicated that. I think this was done to be safe but it may be the case that we don't need it.
>
> I'll see if I can find anything in the spec that relates to a non-recording span being used that would mean we need to keep it in the table.
>
> But either way they should be swept away, so that is a separate problem I think.

Understood, I'll have a look and see if I can replicate it with a test.

dvic (Contributor, Author) commented Aug 10, 2021

I can't replicate it with a test (I only tried quickly), but non-recording spans really do seem to stay in the ETS table. How is the sweeper supposed to work? Does it clean items that are 30 minutes or older every 10 minutes? Is that the default setting?

tsloughter (Member) commented

Yea, couldn't remember if it defaults to sweeping anything. But yea, looks like it has defaults of dropping any spans over 30 minutes old.
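
For reference, the defaults described here (sweep every 10 minutes, drop spans older than 30 minutes) can be tuned via the sweeper settings. A sys.config sketch follows; the option names (sweeper, interval, strategy, span_ttl) and their units are assumptions based on this discussion, so check otel_span_sweeper for the exact keys:

    %% sys.config fragment: sketch only, option names and units are assumptions.
    {opentelemetry,
     [{sweeper, #{interval => 600000,      %% sweep every 10 minutes (ms)
                  strategy => drop,        %% drop expired spans outright
                  span_ttl => 1800000}}]}  %% drop spans older than 30 minutes (ms)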

dvic (Contributor, Author) commented Aug 11, 2021

> Yea, couldn't remember if it defaults to sweeping anything. But yea, looks like it has defaults of dropping any spans over 30 minutes old.

Alright, well, if there is no good reason for keeping the non-recorded ones in ETS, I suggest we skip inserting them, because for this particular type of operation (a cron job "sync" operation that involves many database operations) we now see a 10x increase in memory usage :)

dvic (Contributor, Author) commented Aug 11, 2021

I've just tested a patch that no longer inserts non-recording spans, and it indeed solves our memory problems (1100+ MB vs ~200 MB usage). Let me know if you want me to open a PR for this. By the way, all tests seem to pass, so if there is some behavior associated with these non-recording spans, it's not tested yet.
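
Roughly, the idea behind such a patch (a sketch with hypothetical names, not the actual change to otel_span_ets) is to guard the ETS insert on the span context's is_recording flag:

    %% Hypothetical sketch: maybe_store_span/3 and its arguments are
    %% illustrative only. The point is simply to skip ets:insert/2 for
    %% non-recording span contexts so they never occupy the table.
    maybe_store_span(_Tab, #span_ctx{is_recording=false}, _Span) ->
        ok;
    maybe_store_span(Tab, _SpanCtx, Span) ->
        ets:insert(Tab, Span).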

tsloughter (Member) commented

Thanks, yes, I think a separate PR would be great. I'm guessing it should be merged but am still not positive, will try to figure out today.

It would also be great to figure out why the sweeper isn't working in this case. I can try to look into it today as well, but if you have the opportunity to run a regular Erlang trace on the sweeper to see what it is doing in your case where the table is growing, that would be a huge help. Hit me up on Slack if you are able to do that and want some help.

dvic (Contributor, Author) commented Aug 11, 2021

> Thanks, yes, I think a separate PR would be great. I'm guessing it should be merged but am still not positive, will try to figure out today.

Sure, I'll submit one then.

> It would also be great to figure out why the sweeper isn't working in this case. I can try to look into it today as well, but if you have the opportunity to run a regular Erlang trace on the sweeper to see what it is doing in your case where the table is growing, that would be a huge help. Hit me up on Slack if you are able to do that and want some help.

The problem is not that the sweeper isn't working, but that the default config results in high memory usage. We run a process that executes many queries, and each query generates a span. This puts a lot of spans in ETS, and since they are only cleaned every 10 minutes (those older than 30 minutes), it takes around 30 minutes to clear all the non-recording spans. Here is an example log statement that was output:

14:11:05.842 module=otel_span_sweeper [info]  sweep old spans: ttl=1800000000000 num_dropped=5109
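
(For reference: ttl=1800000000000 is in nanoseconds, i.e. 1800 seconds, or 30 minutes, matching the default span TTL discussed above.)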

tsloughter (Member) commented

@dvic aaah, ok. Yea, those defaults were picked to be very conservative.

tsloughter merged commit c0487cf into open-telemetry:main on Aug 20, 2021
dvic deleted the dvic-root-span-sampling branch on August 20, 2021 at 06:44
Development

Successfully merging this pull request may close these issues.

root spans are not recorded with trace_id_ratio_based