-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect task failures in benchmarks #1750
Detect task failures in benchmarks #1750
Conversation
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
LGTM! |
build |
I was actually wanting this today, so its great to see this. Is there a way we can just mark it special for this case - passed but with failures so probably affected overall time - or what does mark the query as failed mean? It probably gets complicated talking when you want to graph it. Does it fail but still report query Time? |
Maybe an option "--fail-on-task-error" defaulting to true? |
I'll look at this before merging this PR. It would be good to report time and status separately perhaps (status = Completed, Failed, CompletedWithTaskFailures)? |
I think separate makes sense |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@abellina @tgravescs This is ready for another review. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
val status = if (taskFailureListener.taskFailures.isEmpty) { | ||
STATUS_COMPLETED | ||
} else { | ||
// add the first task failure to the list of exceptions in the report | ||
exceptions.append(taskFailureListener.taskFailures.head.toString) | ||
STATUS_COMPLETED_WITH_TASK_FAILURES | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
val failureOpt = taskFailureListener.taskFailures.headOption
val status = failureOpt.map(_ => STATUS_COMPLETED_WITH_TASK_FAILURES).getOrElse(STATUS_COMPLETED)
failureOpt.foreach(exceptions.add(_.toString))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gerashegalov I went ahead and implemented this suggestion since it is much cleaner. Could you re-approve, please.
integration_tests/src/main/scala/com/nvidia/spark/rapids/tests/common/BenchUtils.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
val failureOpt = taskFailureListener.taskFailures.headOption | ||
val status = failureOpt.map(_ => STATUS_COMPLETED_WITH_TASK_FAILURES) | ||
.getOrElse(STATUS_COMPLETED) | ||
failureOpt.foreach(failure => exceptions.append(failure.toString)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if this comment comes through twice, you changed as I was commenting:
do we need to/want to change the suffix on the file at all? I think adding this will make the file name show up with -failed right? line 257 below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Now that we are recording the status per run it probably doesn't make sense to have the convention of adding -failed
to the filename. What do you think @abellina (I think you may have asked for this feature?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and pushed a change to remove the -failed
suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine either way, if we keep it might want to change it to not be failed if query succeeded but had failures
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
This is a breaking change to the JSON reporting of query times.
Previously, the
queryTime
field would contain-1
if the query failed and>= 0
if the query was successful.With these changes,
queryTime
will always be the elapsed time and there is a newqueryStatus
field which will have one of the following values:Completed
,CompletedWithTaskFailures
, orFailed
.