Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect task failures in benchmarks #1750

Merged
merged 7 commits into from
Feb 19, 2021

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Feb 17, 2021

This is a breaking change to the JSON reporting of query times.

Previously, the queryTime field would contain -1 if the query failed and >= 0 if the query was successful.

With these changes, queryTime will always be the elapsed time and there is a new queryStatus field which will have one of the following values: Completed, CompletedWithTaskFailures, or Failed.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the benchmark Benchmarking, benchmarking tools label Feb 17, 2021
@andygrove andygrove added this to the Feb 16 - Feb 26 milestone Feb 17, 2021
@andygrove andygrove self-assigned this Feb 17, 2021
abellina
abellina previously approved these changes Feb 17, 2021
@abellina
Copy link
Collaborator

LGTM!

@andygrove andygrove marked this pull request as ready for review February 17, 2021 23:16
@andygrove andygrove changed the title WIP: Benchmark detect task failure Benchmark detect task failure Feb 17, 2021
@andygrove andygrove changed the title Benchmark detect task failure Detect task failures in benchmarks Feb 17, 2021
@andygrove
Copy link
Contributor Author

build

@tgravescs
Copy link
Collaborator

I was actually wanting this today, so its great to see this. Is there a way we can just mark it special for this case - passed but with failures so probably affected overall time - or what does mark the query as failed mean? It probably gets complicated talking when you want to graph it. Does it fail but still report query Time?

@abellina
Copy link
Collaborator

Maybe an option "--fail-on-task-error" defaulting to true?

@andygrove
Copy link
Contributor Author

I was actually wanting this today, so its great to see this. Is there a way we can just mark it special for this case - passed but with failures so probably affected overall time - or what does mark the query as failed mean? It probably gets complicated talking when you want to graph it. Does it fail but still report query Time?

I'll look at this before merging this PR. It would be good to report time and status separately perhaps (status = Completed, Failed, CompletedWithTaskFailures)?

@tgravescs
Copy link
Collaborator

I think separate makes sense

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

@abellina @tgravescs This is ready for another review.

abellina
abellina previously approved these changes Feb 18, 2021
@andygrove
Copy link
Contributor Author

build

gerashegalov
gerashegalov previously approved these changes Feb 19, 2021
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 214 to 220
val status = if (taskFailureListener.taskFailures.isEmpty) {
STATUS_COMPLETED
} else {
// add the first task failure to the list of exceptions in the report
exceptions.append(taskFailureListener.taskFailures.head.toString)
STATUS_COMPLETED_WITH_TASK_FAILURES
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

val failureOpt = taskFailureListener.taskFailures.headOption
val status = failureOpt.map(_ =>  STATUS_COMPLETED_WITH_TASK_FAILURES).getOrElse(STATUS_COMPLETED)
failureOpt.foreach(exceptions.add(_.toString))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gerashegalov I went ahead and implemented this suggestion since it is much cleaner. Could you re-approve, please.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

val failureOpt = taskFailureListener.taskFailures.headOption
val status = failureOpt.map(_ => STATUS_COMPLETED_WITH_TASK_FAILURES)
.getOrElse(STATUS_COMPLETED)
failureOpt.foreach(failure => exceptions.append(failure.toString))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry if this comment comes through twice, you changed as I was commenting:

do we need to/want to change the suffix on the file at all? I think adding this will make the file name show up with -failed right? line 257 below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Now that we are recording the status per run it probably doesn't make sense to have the convention of adding -failed to the filename. What do you think @abellina (I think you may have asked for this feature?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and pushed a change to remove the -failed suffix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way, if we keep it might want to change it to not be failed if query succeeded but had failures

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit 5758be5 into NVIDIA:branch-0.4 Feb 19, 2021
@andygrove andygrove deleted the benchmark-detect-task-failure branch February 19, 2021 18:52
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmark Benchmarking, benchmarking tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants