
feat(queue): observability for queue #2721

Draft: fgozdz wants to merge 22 commits into base master from feat/opentelemetry

Changes shown from 4 of the 22 commits.

Commits (22)
722fb5d  feat(queue): observability for queue (fgozdz, Aug 22, 2024)
1b2fd0b  feat(worker): observability for worker (fgozdz, Aug 23, 2024)
e6dfb85  feat(queue, worker): fix telemetry setup (fgozdz, Aug 26, 2024)
a75a8dd  feat(queue, worker): describe features (fgozdz, Aug 27, 2024)
1b176c6  Merge branch 'master' of github.com:taskforcesh/bullmq into feat/open… (fgozdz, Aug 28, 2024)
66b85c5  feat(queue, worker): add exception handler (fgozdz, Aug 29, 2024)
54a2c18  feat(queue, worker): remove redundancy in code by creating helper met… (fgozdz, Aug 29, 2024)
2e3fe37  feat(queue, worker): telemetry attributes change, and fix promises (fgozdz, Sep 4, 2024)
c4cb517  feat(queue, worker): resolve conflict (fgozdz, Sep 4, 2024)
14b2f2e  Merge branch 'master' of github.com:taskforcesh/bullmq into feat/open… (fgozdz, Sep 5, 2024)
7f6c674  Revert "feat(queue, worker): resolve conflict" (fgozdz, Sep 5, 2024)
059d50b  fix(documentation): fix documentation formatting (fgozdz, Sep 5, 2024)
f70cc8a  feat(queue-base): add context manager for telemetry (fgozdz, Sep 5, 2024)
8f38087  feat(worker, queue): add spankind and distributed context propagation (fgozdz, Sep 10, 2024)
379abf5  fix(worker, queue): remove unused import (fgozdz, Sep 10, 2024)
3e8e0d8  Merge branch 'master' into feat/opentelemetry (roggervalf, Sep 10, 2024)
6169ea3  feat(queue, worker): documentation and changes to propagation, basic … (fgozdz, Sep 16, 2024)
908ec13  feat(test_telemetry_interface): remove unused imports (fgozdz, Sep 16, 2024)
2dab35f  feat(queue, worker): correct tests and spankind, use property mapping… (fgozdz, Sep 22, 2024)
846e3ef  feat(worker): minor changes (fgozdz, Sep 23, 2024)
fc733e9  feat(queue, worker): distributed tracing (fgozdz, Oct 9, 2024)
4c199f8  feat(queue, worker): resolve conflicts (fgozdz, Oct 9, 2024)
291 changes: 280 additions & 11 deletions src/classes/queue.ts

Large diffs are not rendered by default.

182 changes: 180 additions & 2 deletions src/classes/worker.ts
@@ -35,6 +35,8 @@ import {
RATE_LIMIT_ERROR,
WaitingChildrenError,
} from './errors';
import { Tracer } from '../interfaces';
import { TelemetryAttributes } from '../enums';

// 10 seconds is the maximum time a BRPOPLPUSH can block.
const maximumBlockTimeout = 10;
@@ -190,6 +192,13 @@ export class Worker<
protected processFn: Processor<DataType, ResultType, NameType>;
protected running = false;

/**
* Instance of a telemetry tracer, taken from the worker options if one was provided.
* Methods that should be observed wrap their logic in an `if (this.tracer)` guard,
* starting a span at the beginning and ending it when the work completes.
* When no tracer is configured, the worker behaves exactly as before.
*/
private tracer: Tracer | undefined;

static RateLimitError(): Error {
return new RateLimitError();
}
@@ -314,6 +323,8 @@
this.blockingConnection.on('ready', () =>
setTimeout(() => this.emit('ready'), 0),
);

this.tracer = opts?.telemetry?.tracer;
}

emit<U extends keyof WorkerListener<DataType, ResultType, NameType>>(
@@ -402,11 +413,30 @@
}

async run() {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.run`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerOptions]: JSON.stringify(this.opts),
});
}

if (!this.processFn) {
if (this.tracer) {
span.end();
}

throw new Error('No process function is defined.');
}

if (this.running) {
if (this.tracer) {
span.end();
}

throw new Error('Worker is already running.');
}

@@ -507,10 +537,14 @@
}

this.running = false;
return asyncFifoQueue.waitAll();
return await asyncFifoQueue.waitAll();
} catch (error) {
this.running = false;
throw error;
} finally {
Review comment (Contributor): This made me think: why do we need to make sure the span ends even in the case of an exception here, but not in all the other places? Most of these functions perform asynchronous calls, so they could also raise exceptions. Have you thought about that?

Review comment (Contributor): Also, in the case of an error, shouldn't we store that information in the span? Just wondering here: how does the current OpenTelemetry package for BullMQ handle exceptions?

if (this.tracer) {
span.end();
}
}
}
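
One possible way to address both comments above (a sketch only, not part of this diff): factor the guard, attribute setting, and span lifecycle into a hypothetical private helper, here called withSpan, so that every traced method ends its span even when the wrapped call throws. The helper assumes the same Tracer/Span shape already used in this file; how an error would be attached to the span depends on what the Span interface exposes (OpenTelemetry, for example, has recordException and setStatus), so that part is left as a comment.

  private async withSpan<T>(
    operation: string,
    attributes: Record<string, any>,
    fn: () => Promise<T>,
  ): Promise<T> {
    // No tracer configured: run the operation untouched.
    if (!this.tracer) {
      return fn();
    }

    const span = this.tracer.startSpan(`${this.name} ${this.id} ${operation}`);
    span.setAttributes(attributes);

    try {
      return await fn();
    } catch (err) {
      // If the Span interface exposes error reporting (e.g. OpenTelemetry's
      // recordException / setStatus), this would be the place to call it.
      throw err;
    } finally {
      // The span always ends, whether the operation succeeded or threw.
      span.end();
    }
  }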

@@ -520,12 +554,30 @@
* @returns a Job or undefined if no job was available in the queue.
*/
async getNextJob(token: string, { block = true }: GetNextJobOptions = {}) {
return this._getNextJob(
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.getNextJob`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerToken]: token,
[TelemetryAttributes.WorkerOptions]: JSON.stringify({ block }),
});
}

const nextJob = await this._getNextJob(
await this.client,
await this.blockingConnection.client,
token,
{ block },
);

Review comment (Contributor): It is possible that nextJob is undefined here, so we need to check that before trying to set the attribute.

Review comment (Contributor): Actually, I am not sure we need to create a span at all if there is no next job :thinking_face:

if (this.tracer) {
Review comment (Contributor): In order to match the job that was added to the queue, we also need to specify the job.id here in the span attributes.

span.end();
}

return nextJob;
}
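
Putting the three comments above together, one possible shape for this method (a sketch, not the final implementation): only touch the span when a job was actually fetched, and record its id using the JobId attribute this PR already defines so the span can be correlated with the producer side.

  async getNextJob(token: string, { block = true }: GetNextJobOptions = {}) {
    const nextJob = await this._getNextJob(
      await this.client,
      await this.blockingConnection.client,
      token,
      { block },
    );

    // Only create a span when a job was actually fetched, and record the job id
    // so this span can be matched with the span created when the job was added.
    if (this.tracer && nextJob) {
      const span = this.tracer.startSpan(`${this.name} ${this.id} Worker.getNextJob`);
      span.setAttributes({
        [TelemetryAttributes.WorkerName]: this.name,
        [TelemetryAttributes.WorkerId]: this.id,
        [TelemetryAttributes.WorkerToken]: token,
        [TelemetryAttributes.JobId]: nextJob.id,
      });
      span.end();
    }

    return nextJob;
  }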

private async _getNextJob(
@@ -585,6 +637,17 @@
* @param expireTimeMs - expire time in ms of this rate limit.
*/
async rateLimit(expireTimeMs: number): Promise<void> {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.rateLimit`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerRateLimit]: expireTimeMs,
});
}

await this.client.then(client =>
client.set(
this.keys.limiter,
@@ -593,6 +656,10 @@
expireTimeMs,
),
);

if (this.tracer) {
span.end();
}
}

get minimumBlockTimeout(): number {
@@ -743,7 +810,23 @@ will never work with more accuracy than 1ms. */
fetchNextCallback = () => true,
jobsInProgress: Set<{ job: Job; ts: number }>,
): Promise<void | Job<DataType, ResultType, NameType>> {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.processJob`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerToken]: token,
[TelemetryAttributes.JobId]: job.id,
});
}

if (!job || this.closing || this.paused) {
if (this.tracer) {
span.end();
}

return;
}

@@ -808,6 +891,10 @@ will never work with more accuracy than 1ms. */
return handleFailed(<Error>err);
} finally {
jobsInProgress.delete(inProgressItem);

if (this.tracer) {
span.end();
}
}
}

@@ -816,6 +903,17 @@ will never work with more accuracy than 1ms. */
* Pauses the processing of this queue only for this worker.
*/
async pause(doNotWaitActive?: boolean): Promise<void> {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.pause`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerDoNotWaitActive]: doNotWaitActive,
});
}

if (!this.paused) {
this.paused = new Promise(resolve => {
this.resumeWorker = function () {
@@ -827,17 +925,35 @@ will never work with more accuracy than 1ms. */
await (!doNotWaitActive && this.whenCurrentJobsFinished());
this.emit('paused');
}

if (this.tracer) {
span.end();
}
}

/**
*
* Resumes processing of this worker (if paused).
*/
resume(): void {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.resume`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
});
}

if (this.resumeWorker) {
this.resumeWorker();
this.emit('resumed');
}

if (this.tracer) {
span.end();
}
}

/**
@@ -872,7 +988,22 @@ will never work with more accuracy than 1ms. */
* @returns Promise that resolves when the worker has been closed.
*/
close(force = false): Promise<void> {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.close`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerForceClose]: force,
});
}

if (this.closing) {
if (this.tracer) {
span.end();
}

return this.closing;
}
this.closing = (async () => {
@@ -905,6 +1036,10 @@ will never work with more accuracy than 1ms. */

this.closed = true;
this.emit('closed');

if (this.tracer) {
span.end();
}
})();
return this.closing;
}
@@ -922,6 +1057,16 @@ will never work with more accuracy than 1ms. */
* @see {@link https://docs.bullmq.io/patterns/manually-fetching-jobs}
*/
async startStalledCheckTimer(): Promise<void> {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.startStalledCheckTimer`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
});
}

if (!this.opts.skipStalledCheck) {
Review comment (Contributor): Same here: do not create a span if we are skipping the stalled checker. Spans are not just for wrapping methods; they should only exist when there is actually something useful to record in them! :)

clearTimeout(this.stalledCheckTimer);

@@ -936,6 +1081,10 @@ will never work with more accuracy than 1ms. */
}
}
}

if (this.tracer) {
span.end();
}
}
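
Following the comment above, the span creation could move behind the skipStalledCheck branch so that nothing is recorded when the check is skipped. A sketch under that assumption; the body of the scheduling logic is elided and kept as in the original method:

  async startStalledCheckTimer(): Promise<void> {
    if (this.opts.skipStalledCheck) {
      return;
    }

    // Only trace when the stalled checker actually runs.
    let span;
    if (this.tracer) {
      span = this.tracer.startSpan(`${this.name} ${this.id} Worker.startStalledCheckTimer`);
      span.setAttributes({
        [TelemetryAttributes.WorkerName]: this.name,
        [TelemetryAttributes.WorkerId]: this.id,
      });
    }

    try {
      clearTimeout(this.stalledCheckTimer);
      // ...schedule the next stalled check exactly as in the original method...
    } finally {
      if (this.tracer) {
        span.end();
      }
    }
  }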

private startLockExtenderTimer(
@@ -1017,6 +1166,17 @@ will never work with more accuracy than 1ms. */
}

protected async extendLocks(jobs: Job[]) {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.extendLocks`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
[TelemetryAttributes.WorkerJobsInvolved]: JSON.stringify(jobs),
Review comment (Contributor): Wouldn't we also want to store all the IDs of the jobs whose locks we are extending?

});
}

try {
const pipeline = (await this.client).pipeline();
for (const job of jobs) {
@@ -1040,10 +1200,24 @@ will never work with more accuracy than 1ms. */
}
} catch (err) {
this.emit('error', <Error>err);
} finally {
if (this.tracer) {
span.end();
}
}
}
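
A small adjustment that would address the comment above (sketch only): store the job IDs rather than the fully serialized jobs in the WorkerJobsInvolved attribute this PR already defines.

      span.setAttributes({
        [TelemetryAttributes.WorkerName]: this.name,
        [TelemetryAttributes.WorkerId]: this.id,
        // IDs are enough to correlate with producer-side spans and keep the attribute small.
        [TelemetryAttributes.WorkerJobsInvolved]: JSON.stringify(jobs.map(job => job.id)),
      });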

private async moveStalledJobsToWait() {
let span;
if (this.tracer) {
const spanName = `${this.name} ${this.id} Worker.moveStalledJobsToWait`;
span = this.tracer.startSpan(spanName);
span.setAttributes({
[TelemetryAttributes.WorkerName]: this.name,
[TelemetryAttributes.WorkerId]: this.id,
});
}

const chunkSize = 50;
const [failed, stalled] = await this.scripts.moveStalledJobsToWait();

@@ -1065,6 +1239,10 @@ will never work with more accuracy than 1ms. */
}

this.notifyFailedJobs(await Promise.all(jobPromises));

Review comment (Contributor): I think we need to store the IDs of the failed/stalled jobs here, because they need to be mapped to the jobs added to the queue so that we can trace a job all the way from where it is added to the queue to where it completes, fails, or stalls.

if (this.tracer) {
span.end();
}
}
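
A sketch of what recording the affected job IDs could look like here; the two attribute keys below are hypothetical and are not part of the TelemetryAttributes enum added by this PR.

    if (this.tracer) {
      span.setAttributes({
        // Hypothetical attribute names, shown for illustration only.
        'bullmq.worker.failed.job.ids': JSON.stringify(failed),
        'bullmq.worker.stalled.job.ids': JSON.stringify(stalled),
      });
      span.end();
    }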

private notifyFailedJobs(failedJobs: Job<DataType, ResultType, NameType>[]) {
1 change: 1 addition & 0 deletions src/enums/index.ts
@@ -2,3 +2,4 @@ export * from './child-command';
export * from './error-code';
export * from './parent-command';
export * from './metrics-time';
export * from './telemetry-attributes';
25 changes: 25 additions & 0 deletions src/enums/telemetry-attributes.ts
@@ -0,0 +1,25 @@
export enum TelemetryAttributes {
QueueName = 'bullmq.queue.name',
WorkerName = 'bullmq.worker.name',
BulkCount = 'bullmq.job.bulk.count',
BulkNames = 'bullmq.job.bulk.names',
JobName = 'bullmq.job.name',
JobId = 'bullmq.job.id',
JobKey = 'bullmq.job.key',
JobOptions = 'bullmq.job.options',
JobProgress = 'bullmq.job.progress',
QueueDrainDelay = 'bullmq.queue.drain.delay',
QueueGrace = 'bullmq.queue.grace',
QueueCleanLimit = 'bullmq.queue.clean.limit',
JobType = 'bullmq.job.type',
JobTimestamp = 'bullmq.job.timestamp',
QueueOptions = 'bullmq.queue.options',
QueueEventMaxLength = 'bullmq.queue.event.max.length',
WorkerOptions = 'bullmq.worker.options',
WorkerToken = 'bullmq.worker.token',
WorkerId = 'bullmq.worker.id',
WorkerRateLimit = 'bullmq.worker.rate.limit',
WorkerDoNotWaitActive = 'bullmq.worker.do.not.wait.active',
WorkerForceClose = 'bullmq.worker.force.close',
WorkerJobsInvolved = 'bullmq.worker.jobs.involved',
}
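
For context, here is roughly how an application might feed a tracer into this API so that these attributes show up on spans. The adapter is an assumption: it wraps @opentelemetry/api behind the minimal startSpan/setAttributes/end surface that queue.ts and worker.ts call in this diff, and the telemetry option shape mirrors the `opts?.telemetry?.tracer` read in the Worker constructor above, which may not be the final option shape.

import { trace } from '@opentelemetry/api';
import { Worker } from 'bullmq';

const otelTracer = trace.getTracer('bullmq');

// Minimal adapter exposing only what this PR uses: startSpan() returning an
// object with setAttributes() and end().
const telemetry = {
  tracer: {
    startSpan(name: string) {
      const span = otelTracer.startSpan(name);
      return {
        setAttributes: (attrs: Record<string, any>) => span.setAttributes(attrs),
        end: () => span.end(),
      };
    },
  },
};

const worker = new Worker(
  'paint',
  async job => {
    // Each processed job would be covered by a "paint <workerId> Worker.processJob"
    // span carrying bullmq.worker.name, bullmq.worker.id, bullmq.job.id, and so on.
    return job.data;
  },
  { connection: { host: 'localhost', port: 6379 }, telemetry },
);
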
1 change: 1 addition & 0 deletions src/interfaces/index.ts
@@ -23,3 +23,4 @@ export * from './repeat-options';
export * from './sandboxed-job-processor';
export * from './sandboxed-job';
export * from './worker-options';
export * from './telemetry';
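
The new src/interfaces/telemetry.ts file itself is not rendered in this diff. Inferred purely from how tracer is used in worker.ts above (startSpan returning an object with setAttributes and end), the exported types are presumably close to the following sketch, which may differ from the actual file:

export interface Span {
  setAttributes(attributes: Record<string, any>): void;
  end(): void;
}

export interface Tracer {
  startSpan(name: string): Span;
}

export interface Telemetry {
  tracer: Tracer;
}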