
Rework Profile tool to not require Spark to run and process files faster #3161

Merged: 246 commits on Aug 9, 2021

Conversation

tgravescs
Collaborator

fixes #3046

Redesign the profile tool to not require Spark and SQL, similar to the Qualification tool changes. It now runs much faster and with less memory. We could probably improve memory usage further if we run into issues, but I was able to run on a file with all 105 NDS queries without changing the heap size.

It now takes 9.7 seconds to run collect mode on all 105 NDS query event logs. It took a minute to profile all 105 NDS queries that were in a single event log (using the default 1G heap size), and 16.7 seconds to compare all 105 NDS queries using the default heap size. By comparison, the spark-submit approach took 10.5 minutes to run collect mode on the 105 NDS queries and required much more memory; for compare mode we previously suggested limiting to 10 applications, and that limit has been removed now.

This does change a few things:

  1. Runs with java instead of spark-submit/spark-shell.
  2. In collect mode you get one file per application, where the file is -profile.log.
  3. Comparison mode: the only differences were 2 tables to compare SQL and job IDs; everything else goes through the normal path, and applications are just listed with a different app index in the same table. The comparison file is now called rapids_4_spark_tools_compare.log.
  4. Added user to application info.
  5. All sections, when empty, just print a message to keep things consistent; it used to have some empty tables and some messages.
  6. Plan descriptions file name changed to be consistent: {app.appId}-planDescriptions.log.
  7. A threadpool is used to process applications. In compare mode the threadpool reads each application's event log file; in collect mode the threadpool reads the file and processes each application separately.
  8. I changed the way a bunch of things are stored in the application; some are now mutable to be more like the Spark history server, which helps consolidate some things.
  9. If event logs fail to parse or have an error, we now simply skip them instead of erroring out.
  10. Since we were using Spark to create formatted strings, I created a class ProfileOutputWriter that borrows much of Spark's showString functionality to keep the output very similar. In general we create a *ProfileResult case class from the functions, and that class provides the table headers and converts the data to a Seq for output. This should be easy to extend to write to something like CSV.

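The threadpool approach in item 7, combined with the skip-on-failure behavior in item 9, could look roughly like the following sketch. The names EventLogProcessor, EventLogInfo, and processLog are illustrative placeholders, not the tool's actual API:

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.{Executors, TimeUnit}

object EventLogProcessor {
  // hypothetical stand-in for the real event log descriptor
  case class EventLogInfo(path: String, broken: Boolean = false)

  // stand-in for the real per-application parsing work
  private def processLog(log: EventLogInfo): Unit =
    if (log.broken) throw new RuntimeException(s"cannot parse ${log.path}")

  // submit one task per event log; logs that fail to parse are skipped
  def processAll(logs: Seq[EventLogInfo], numThreads: Int = 4): Int = {
    val processed = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(numThreads)
    logs.foreach { log =>
      pool.submit(new Runnable {
        override def run(): Unit =
          try {
            processLog(log)
            processed.incrementAndGet()
          } catch {
            case e: Exception =>
              System.err.println(s"skipping ${log.path}: ${e.getMessage}")
          }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)
    processed.get()
  }
}
```

A fixed-size pool also bounds peak memory, since at most numThreads event logs are being parsed at any one time.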
All tests pass. I spot checked our other integration tests; I still need to run all of them and update their expected results. I will do that in parallel with this being reviewed.
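The *ProfileResult pattern from item 10 can be sketched as follows. The trait and case class here are simplified stand-ins, and the table rendering is a bare-bones approximation of Spark's showString, not a copy of the actual ProfileOutputWriter:

```scala
// each result type knows its own headers and how to render itself as rows,
// so the same data can drive text tables now and CSV output later
trait ProfileResult {
  def outputHeaders: Seq[String]
  def convertToSeq: Seq[String]
}

// hypothetical example result carrying the new user field (item 4)
case class AppInfoProfileResult(appIndex: Int, appId: String, user: String)
    extends ProfileResult {
  override def outputHeaders: Seq[String] = Seq("appIndex", "appId", "user")
  override def convertToSeq: Seq[String] = Seq(appIndex.toString, appId, user)
}

object ProfileOutputWriter {
  // minimal text-table formatter loosely modeled on Spark's showString
  def showString(headers: Seq[String], rows: Seq[Seq[String]]): String = {
    val widths = headers.indices.map { i =>
      (headers(i) +: rows.map(_(i))).map(_.length).max
    }
    def fmt(cells: Seq[String]): String =
      cells.zip(widths).map { case (c, w) => c.padTo(w, ' ') }
        .mkString("|", "|", "|")
    val sep = widths.map("-" * _).mkString("+", "+", "+")
    (Seq(sep, fmt(headers), sep) ++ rows.map(fmt) :+ sep).mkString("\n")
  }
}
```

Because every result is just headers plus a Seq of cells, a CSV writer only needs to join the same cells with commas instead of padding them into columns.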

@tgravescs tgravescs added the tools label Aug 6, 2021
@tgravescs tgravescs added this to the Aug 2 - Aug 13 milestone Aug 6, 2021
@tgravescs tgravescs self-assigned this Aug 6, 2021
@tgravescs
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment


I did a quick pass through it, but I didn't look deeply enough to truly feel confident enough to approve it yet. My biggest concern is methods like Analysis.jobAndStageMetricsAggregation that write things out but also produce a result. It was odd before when it would print things out and update state, but now it is kind of a hybrid, and it makes me a bit nervous that it could get called too frequently.

@tgravescs
Collaborator Author

Almost all of the functions do this and did it before, with the caveat that the writer has to be defined. The main reason they return things is just for testing purposes. We could separate out fetching it and writing it.

@revans2
Collaborator

revans2 commented Aug 6, 2021

If it is just for testing then a comment or two about that would be fine. It just confused me.

Collaborator

@revans2 revans2 left a comment


I have had more time to go through it. I don't know if I got everything, but it does look good to me.

@tgravescs
Collaborator Author

Merging this one and I am working on adding writing to CSV and I'll look at either commenting or splitting things into get functions and print functions in that PR.

@tgravescs tgravescs merged commit f59de4f into NVIDIA:branch-21.10 Aug 9, 2021
Successfully merging this pull request may close these issues.

[FEA] Profiling tool: Scale to run large number of event logs.