Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Add EventPipe Config File Option MultiFileSec #20548

Merged
merged 5 commits into from
Oct 25, 2018

Conversation

brianrob
Copy link
Member

Adds the capability to have EventPipe roll-over to a new file every N seconds. This is configured via the eventpipeconfig file and is specified via the MultiFileSec keyword.

Rundown only occurs in the last trace during the disable operation, which ensures that we don't negatively impact the running process. No events should be lost on transition to a new file - they continue to be written to the circular buffer for inclusion in the next file.

@brianrob brianrob requested a review from vancem October 23, 2018 19:02
@brianrob
Copy link
Member Author

FYI @DamianEdwards, @anurse

@brianrob
Copy link
Member Author

@dotnet-bot test Windows_NT x86 Release Innerloop Build and Test

@@ -189,7 +201,7 @@ internal static class EventPipeInternal
// These PInvokes are used by the configuration APIs to interact with EventPipe.
//
[DllImport(JitHelpers.QCall, CharSet = CharSet.Unicode)]
internal static extern UInt64 Enable(string outputFile, uint circularBufferSizeInMB, long profilerSamplingRateInNanoseconds, EventPipeProviderConfiguration[] providers, int numProviders);
internal static extern UInt64 Enable(string outputFile, uint circularBufferSizeInMB, long profilerSamplingRateInNanoseconds, EventPipeProviderConfiguration[] providers, int numProviders, ulong multiFileTraceLengthInSeconds);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK for now, but as a future edit we really should pass a class/struct containing all this config information (with defaults if you pass null), That way adding more is easier and the number of args does not get out of control.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -307,13 +324,18 @@ class EventPipe
static EventPipeConfiguration *s_pConfig;
static EventPipeSession *s_pSession;
static EventPipeBufferManager *s_pBufferManager;
static LPCWSTR s_pOutputPath;
static unsigned long s_nextFileIndex;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is OK right now to make this a static (since everything else is as well), but it would have been just as easy to make it a member of the session so there was just that much less to move over to the session when we support multi-session.

@@ -267,9 +281,19 @@ private static string BuildConfigFileName()
return GetAppName() + ConfigFileSuffix;
}

private static string BuildTraceFileName()
private static string BuildTraceFileName(bool multiFile)
Copy link

@vancem vancem Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this wacky invariant that the TraceFile name includes the suffix when you are doing single file but DOESN'T included it for multi-file. I want it to not be conditional. You can do it either way but my bias is to make the 'common' case simple and use names that mean what they say. FileNames are normally the whole path, so my recommendation is that you do this uniformly. Then when you need multiple files, you simply put the number 'before' the netperf suffix (if Present), or simply add it as a suffix (if the .netperf is not there). There are other variations but the key is that if you don't care about mulit-file (which is the common case, everything is what you expect. You have pushed all the complexity into the Multi-file case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like the idea of just pushing the "file numbering" logic down entirely to EventPipe. If s_nextFileIndex is 0, the file name is left unmodified, otherise the number is injected before the netperf suffix (the managed initialization code can ensure the input string either always has or never has the .netperf suffix, whichever is easier for working with in the native code). Then the file naming logic is consistent and just tied to the number of times the file is rolled. A "single-file" tracing session would just be one where the roll interval is "infinite" and so s_nextFileIndex is always 0.

That's all internal details to us though so I'm not worried either way :)

@vancem
Copy link

vancem commented Oct 24, 2018

I know that there was also a desire to insure that the file is flushed at some reasonable rate so tools could do incremental monitoring that was as well (although it seems like it creates a 'leak' (since the output file only grows). Have we addressed this abilty to flush? (It is probably best that they ask for it explictly (e.g. by touching the control file).

Thinking about it, there is a issue with deciding the value of MultiFileTraceSizeInSec parameter. Too big, and you don't get 'realtime behavior', too small, and you end up with way to many files. Having the flush would solve things (you can use a large value for multi-file e.g. 1 min or more), and rely on a short flushing interval (e.g. 1 sec), to get 'real time' behavior.

Copy link

@vancem vancem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like the FileName to be uniformly a file path (or a file without the suffix). I would also like to know if we have a 'flush' capability. Those are my 'blocking' requests.

@analogrelay
Copy link

I know that there was also a desire to insure that the file is flushed at some reasonable rate so tools could do incremental monitoring that was as well

Yeah, our ideal scenario is probably something like this:

Two new options are added: FlushRateMS and MaxFileSizeKB (KB seems like a useful unit, but I'm happy with Bytes if that seems more appropriate). The flush rate controls how often the file is flushed. Instead of rolling files on flush, just write more events to the current file. After each flush, if the file size is now over MaxFileSizeKB, roll to a new numbered file.

Thoughts on that @davidfowl / @DamianEdwards ?

I think that this kind of file rolling (rolling on every flush) is workable and won't block our scenarios, but if we wanted to get events every second (or even at a smaller inteval) we're forced to create and delete a bunch of files in order to do so. For shorter trace scenarios, it's nice to be able to just say: "Flush every X seconds, and roll files every Y bytes".

I agree that growing infinitely is bad and we should set defaults that ensure we don't lead to that. We can have a "reasonable" default MaxFileSizeKB (100KB maybe?). It's also fine for the files to end up larger than MaxFileSizeKB (if the file was at 99KB, the limit was 100KB and the next flush wrote 10KB, for example) since it's much simpler than trying to stop and roll files mid-flush. It's also fine to me if the rolling logic only kicks in when the flush rate is set (since you can view the "single-file" scenario as just being FlushRateMS=Infinity).

Also totally fine with rundown events only occurring when the session ends, that makes sense to me.

@brianrob
Copy link
Member Author

@vancem, thanks for the review.

I am fixing the FileName issue that you point out so that the file path uniformly has the .netperf extension on it.

With respect to flush - I have not added any support for flushing. We could add this as part of the timer and just have it flush every N seconds. Adding a flush rate requires that we plumb more stuff down to native code, but it can be done. The current timer period is 1 second, and I don't really want to reduce this any further to get millisecond granularity. I like the low-overhead nature of triggering once per second.

Also, @anurse, we don't have a good way of limiting by size today. It's do-able but the sizes need to be tracked, and today they are not. I would recommend we hold this for 3.0.

@vancem, @anurse do you have an opinion on whether or not we want flush rate to configurable or just flush once per second for all traces (including single file)? I'm tempted to just make the flush rate 1 second for simplicity and call it a day. We'll get the benefit of "reasonably realtime" traces and traces that are usable even after a crash (for data that doesn't require rundown), and at the same time we won't need to plumb a flush value and have it be part of the configuration.

Thoughts?

@analogrelay
Copy link

analogrelay commented Oct 24, 2018

The current timer period is 1 second, and I don't really want to reduce this any further to get millisecond granularity

That's OK. I suggested milliseconds for flexibility but I don't see a clear scenario for sub-second granularities.

do you have an opinion on whether or not we want flush rate to configurable or just flush once per second for all traces (including single file)? I'm tempted to just make the flush rate 1 second for simplicity and call it a day.

So in this case you'd basically have a non-configurable "flush rate" and a configurable "roll rate" (based on time rather size)? I'm OK with that too. I do prefer rolling on size (since it's what the user is more likely to want to restrict) but I can live with rolling on time :). I'd definitely suggest renaming the config field from MultiFileSec to RollIntervalSec (or something similar). I like that the repeated flushing would also occur in the single-file case, that's 👍 🎉 🎆

Also, @anurse, we don't have a good way of limiting by size today. It's do-able but the sizes need to be tracked, and today they are not. I would recommend we hold this for 3.0.

I wondered if that might be the case. If it's a lot of work to track the sizes then I don't want to bog the 2.2 work down on making that happen :). My only concern is that if we add it in 3.0, we need to reconcile it with the logic to rolling based on time. I think that's not too difficult though.

@vancem
Copy link

vancem commented Oct 24, 2018

I think @brianrob's most preferred option is to simply add a non-configurable 1 second flushing behavior to the EventPipe. It would not be hard to make it a configuration option, but think the 1 second default is frankly OK 99% of the time (put in an optimization to do nothing if there are no events to flush), and has the useful property that by default you will get up to the last 1 second in case of a crash.

In my initial reply I too had the thought that @anurse had that users care more about file size than file time, but I ripped that out of my reply before I posted it, because it was not a trivial thing to change from the current code. However it is also really not 'that bad'. You remove the timers we currently have, and instead when you flush you keep track of the number of bytes you wrote (note it does not have to be precise, we can use the size of the 'frame' in which we put the events as the 'size' (which we already had to compute). When it gets bigger than that size, it closes the file and open a new one (basically the same logic we have in the timer). I actualy think it is 'simpler' because it has less 'asynchonous' (timers) in it. That kind of asychronly is typially where bugs hide.

@anurse, Just, FYI 'flush' with a circular buffer is a bit more problematic. Today circular buffers are easy, because we only flush them at the very end. If you flush them in between, you have this ugly case where you have to start overwriting the beginning of the file. First, our file does have a prefix before we get to events, and second there will be this 'undefined' area between the new data written, and the next block of the 'old data written in the last cycle' that you have to 'skip'. This is possible (we actually have a marker in our frames that is probably pretty reliable to search for), but to do a truly 'perfect' job, you probably have to remember the file offsets of where you put all the frames and put something in the frame you just wrote that tells you where to skip to get the the 'old' frames in the circular buffer. Not, THAT hard, but definitely also not completely trivial. Multi-file side-steps this (which is nice).

@brianrob - at this point we shoudl certainly prioritize. The highest priority is to have the ability to flush (either explictly (thus the tool needs to do this at whatever rate it feels is necessary), or by having a timer in the runtime. Given that today accepting commands is a polling operation, it is probably best if the runtime just do it (ideally you trigger it from the managed side because ultimately we want all 'commands' to originate there). Adding it as a parameter in the file I think is easy and useful (it allows us to set the default to something reasonably big like 10 seconds.

That should be the higest priority. Chaning it to be file size, does not sound too hard, and the main value of it is that we won't have to support the time-based option forever (which is probably good). If it is as easy as I indicate above, it is worth considering, but if there are complications I am OK with doing it later (arguably, all of these options are currently 'private' (indeed the concept of the control file is private), so I guess it is OK to leave the time based option and remove it later.

So the thing I think we should close on first is the flushing.

@brianrob
Copy link
Member Author

I think that flushing or switching files based on file size is bigger than I want to do in this PR. It requires us to know the approximate size of the events, and we don't calculate this today. None of this is super-difficult, but does add enough risk to this PR that I don't want to include it.

I can include a configurable time-based flush - that's just more plumbing, but not hard. I've started working on the flush code but need to investigate first why it isn't working. That will determine if we are able to do flush at all vs. having to open and close the file.

@brianrob
Copy link
Member Author

Unfortunately, it looks like CFileStream, which is the file writer that we use doesn't allow for flushing without closing the file. I've determined this through reading of the code and by testing. It looks like there is some platform-specific behavior here. On UNIX writes occur immediately, but on Windows, they don't.

Furthermore, we aren't able to configure CFileStream to open and append, which would allow us to simulate flush with a Close/Re-open operation.

Based on this, I don't think we're going to be able to implement the separate flush control for 2.2. I think we'll need to live with the file rollover, which does work.

@brianrob
Copy link
Member Author

I've just posted another commit to fix the trace file name logic as requested by @vancem.

@vancem
Copy link

vancem commented Oct 25, 2018

@brianrob: Does CFileStream has a 'seek' API? If it does see if it flushes if you seek to the begining of the file and then back to the end again. There is a pretty good chance that it does, that may be an easy work-around.

We can probably live without flush. If you turn on CPU sampling you will get enough events that frankly you should get a flush reasonably quickly without forcing it.

I don't have a problem with the feature as it currently stands, however it does not make sense to check it in if we are not going to use it in the 2.2 time frame. Thus we really need @anurse to decide if what we can make for 2.2. is worth having...

But otherwise I am OK with it.

@brianrob
Copy link
Member Author

Agreed @vancem. @anurse, if we aren't able to do periodic flushes and can only do file-rollovers, is this worth having for 2.2?

I will check on the seek operation..

@brianrob
Copy link
Member Author

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness

@brianrob
Copy link
Member Author

Confirmed this morning that CFileStream does not support seek. @anurse, please let me know if you believe that the file rollover is worth having even without the periodic flush. If so, I will get this merged and then port to release/2.2.

Copy link

@analogrelay analogrelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we can get some data out every X seconds (via flushing or rolling files, etc.), it'll have some value to us. We'll need to sync back up in the 3.0 time frame to look at things like flushing more regularly (or other things we've discussed like named pipes) but this will give us something to work with in 2.2 (as far as I can tell)

@brianrob
Copy link
Member Author

@anurse, sounds good. I think that what you'll really want is a streaming protocol rather than rolling files. That said, the rolling files are still a valuable feature for other things as well, so it's not as though this is something that won't get use.

I will get this merged and ported. Thanks!

@brianrob brianrob merged commit c99e2dd into dotnet:master Oct 25, 2018
@brianrob brianrob deleted the flush_trace branch October 25, 2018 16:47
brianrob added a commit to brianrob/coreclr that referenced this pull request Oct 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants