Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance: [cherry-pick] Add monitoring metrics for task execution time in datacoord #35141

Merged
merged 1 commit into from
Aug 2, 2024

Conversation

xiaocai2333
Copy link
Contributor

issue: #35138

master pr: #35139

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Jul 31, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Jul 31, 2024
case indexpb.JobState_JobStateNone:
return
case indexpb.JobState_JobStateInit:
queueTime := time.Since(task.GetQueueTime())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please log warn if the pending or executing time of a task with a given taskID exceeds 6 hours ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 6 hours is too long, as it may delay the timely detection of issues. However, I made this duration configurable so it can be adjusted as needed. default duration is 5 minutes.

Copy link

codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 98.23009% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.16%. Comparing base (756922e) to head (5154b9b).
Report is 8 commits behind head on 2.4.

Files Patch % Lines
internal/datacoord/task_scheduler.go 97.46% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              2.4   #35141      +/-   ##
==========================================
+ Coverage   72.12%   72.16%   +0.03%     
==========================================
  Files        1052     1052              
  Lines      136846   136958     +112     
==========================================
+ Hits        98707    98832     +125     
+ Misses      33925    33912      -13     
  Partials     4214     4214              
Files Coverage Δ
internal/datacoord/task_analyze.go 86.97% <100.00%> (+1.02%) ⬆️
internal/datacoord/task_index.go 86.88% <100.00%> (+0.79%) ⬆️
pkg/metrics/datacoord_metrics.go 65.62% <100.00%> (+0.54%) ⬆️
pkg/util/paramtable/component_param.go 98.55% <100.00%> (+<0.01%) ⬆️
internal/datacoord/task_scheduler.go 99.14% <97.46%> (-0.86%) ⬇️

... and 27 files with indirect coverage changes

@mergify mergify bot added the ci-passed label Jul 31, 2024
@mergify mergify bot added ci-passed and removed ci-passed labels Jul 31, 2024
@xiaocai2333 xiaocai2333 added this to the 2.4.7 milestone Aug 1, 2024
log.Warn("task running time is too long", zap.Int64("taskID", taskID),
zap.Int64("running time(ms)", runningTime.Milliseconds()))
}
metrics.DataCoordTaskExecuteLatency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we report this every time?
I believe we only need to report this once when index excution done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will only be report once here. This is the task queue, and when a task is finished and the meta is successfully recorded, the task will be cleared from the queue.


task := s.getTask(taskID)
if task == nil {
s.taskLock.Unlock(taskID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to unblock it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's wrong, will remove it.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@czs007
Copy link
Contributor

czs007 commented Aug 2, 2024

/approve
/lgtm

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: czs007, xiaocai2333

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added the ci-passed label Aug 2, 2024
@sre-ci-robot sre-ci-robot merged commit 2534b30 into milvus-io:2.4 Aug 2, 2024
13 of 15 checks passed
@xiaocai2333 xiaocai2333 deleted the support_queue_time-4 branch September 25, 2024 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm size/L Denotes a PR that changes 100-499 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants