Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash/restart. #757

Closed
matthewjhands opened this issue May 13, 2024 · 11 comments

Comments

@matthewjhands

matthewjhands commented May 13, 2024

Summary

Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash or restart, even on worker hosts. This seems like an oversight to me, given that Run All mode is documented as being able to re-run jobs where the Server running the job crashed, or where the Server running the job was shut down. As far as I can find, there is no mention that jobs need to have been started on a schedule (i.e. not manually) for this to happen.

Note that event retries (where just the job crashes, and not the server) do work for manually started jobs.

Recovering from server crashes where the jobs were running on a Primary Server
I understand per the docs that jobs running on a primary server are not and cannot be brought back up when that server crashes. However, my understanding is that this should happen on workers when they crash, but I don't see it happening either in the case of manually started jobs. For simplicity, the reproduction steps below recreate the issue using a single-server setup, and by restarting (rather than crashing) the Cronicle daemon.

Steps to reproduce the problem

Using a Single Server Cronicle setup:

  1. Create a job "Service1" using the shell plugin. It can be any long-lasting job, such as:
#!/bin/bash
while true
do
	uptime
	sleep 10
done
  2. Give Service1 a schedule that doesn't start any time soon, e.g. Fridays at 9am.
  3. Enable Run All mode.
  4. Start Service1 manually (from the UI, or via the API as sketched after this list).
  5. Copy Service1, and call it Service2.
  6. Update Service2 and give it a schedule that will start the job imminently.
  7. Wait for Service2 to be started by the scheduler.
  8. Trigger a restart of the Cronicle daemon, e.g. by running systemctl restart cronicle.service.
  9. Wait a few minutes for the Cronicle daemon to come back up and elect itself as Primary.
  10. Observe that Service1 (started manually) isn't automatically restarted, but Service2 (started by the scheduler) is automatically restarted.
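
For step 4, the manual start can be done from the event's page in the UI, or via the run_event API. A minimal sketch, assuming the default web port (3012) and an API key that is allowed to run events:

// Sketch only: manually start "Service1" via Cronicle's run_event API.
// Adjust host, port and api_key for your setup.
fetch('http://localhost:3012/api/app/run_event/v1', {
	method: 'POST',
	headers: { 'Content-Type': 'application/json' },
	body: JSON.stringify({ title: 'Service1', api_key: 'YOUR_API_KEY' })
})
	.then((res) => res.json())
	.then((data) => console.log(data));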

Your Setup

Operating system and version?

WSL, Ubuntu 22.04

Node.js version?

v20.12.1

Cronicle software version?

0.9.48

Are you using a multi-server setup, or just a single server?

Single Server

Are you using the filesystem as back-end storage, or S3/Couchbase?

Local FS.

Can you reproduce the crash consistently?

Yes

Log Excerpts

@jhuckaby
Owner

I'm extremely sorry for this oversight. It will be fixed in the very next release, coming out in a few minutes.

@jhuckaby
Owner

Fixed in v0.9.50. Thank you so much for this detailed issue report.

@matthewjhands
Author

No worries at all!

Thanks for creating/maintaining this excellent OS tool, and for your exceptionally quick fix!

@matthewjhands
Author

Hi @jhuckaby,

Unfortunately this fix doesn't seem to have worked: following the same steps as above, the manually started job still doesn't get automatically restarted. I have updated to v0.9.51.

Looking at your changes and the logs, I can see the "Reset event" log messages for Service1 and Service2, but Service1 still doesn't get restarted.

Any ideas? Log extract attached.

Many thanks,
Matt

Cronicle-Issue757-log-extract.txt

@jhuckaby
Owner

Ah, so here is the problem. "Run All Mode" is really only for scheduled jobs, because all it really does is "rewind" the event cursor to a point in history, and then when the master server comes back up it "ticks" all the missing minutes, running any missed jobs along the way. But those jobs have to be actually scheduled to run on those missed minutes for it to trigger a job launch.

[1715783069.3][2024-05-15 15:24:29][myHostname][766139][Cronicle][debug][6][Aborted Job: Server 'myHostname' shut down unexpectedly.][]
[1715783069.653][2024-05-15 15:24:29][myHostname][766139][Cronicle][debug][4][Scheduler catching events up to: 2024/05/15 15:24:00][]

Since your job wasn't actually scheduled to run on 2024/05/15 15:24:00, it didn't fire off a new one.
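
To illustrate the behaviour being described (a simplified sketch with hypothetical helper and property names, not the actual Cronicle source):

function catchUpScheduler(events, cursorEpoch, nowEpoch) {
	// After a restart, the cursor is rewound and every missed minute is "ticked".
	for (let minute = cursorEpoch; minute <= nowEpoch; minute += 60) {
		for (const event of events) {
			// Only events whose schedule matches the missed minute are launched...
			if (event.enabled && event.catch_up && scheduleMatches(event.timing, minute)) {
				launchJob(event, minute); // missed scheduled runs get replayed here
			}
			// ...so a job that was started manually, with no matching schedule
			// entry for that minute, is never relaunched by this loop.
		}
	}
}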

Adding "Retries" to the event won't work here either, because retries don't kick in for "aborted" jobs. When a server shuts down, the job is aborted. Hmmm....

This is a design flaw. I'll re-open this issue and keep thinking about ways to solve this. But please understand, Cronicle v0 is in maintenance mode, and I'm hard at work on a huge ground-up rewrite for the big v2 (coming out later this year, with any luck).

I may not have time to solve this in Cronicle v0, as it looks like a core design oversight -- a truly missing feature that was never implemented properly.

I'll put a huge warning in the docs that explains this issue.

@jhuckaby jhuckaby reopened this May 15, 2024
jhuckaby added a commit that referenced this issue May 15, 2024
@matthewjhands
Author

matthewjhands commented Jun 3, 2024

Hey @jhuckaby - I've been looking at this off and on over the last few weeks; would something like the below work in the monitorAllActiveJobs() function here? I can't quite work out what the callback should be to be able to test this, but hopefully you get the idea of where I'm going with this.

if (job.catch_up && job.source) {
	// If manually started and catch-up enabled, attempt to relaunch or queue
	this.launchOrQueueJob(job, CALLBACK);
} else {
	// otherwise, just rewind cursor instead
	this.rewindJob(job);
}

@jhuckaby
Owner

jhuckaby commented Jun 5, 2024

Oh hey, cool idea! This might just work, and is a very small code change. I need to consider all the ramifications and do a bunch of testing, however. I'll dive into this as soon as I can.

@jhuckaby
Owner

jhuckaby commented Jun 7, 2024

Ah yes, so, as I suspected, it's not quite as simple as your suggested change (but it's a start!). There are a number of cases that may result in an aborted job due to an unexpected server loss. Another one is that the server may have been rebooted, or Cronicle was restarted, in which case it detects the leftover job log on disk and "finishes" (aborts) the job on startup. That case also has to be handled, as it should trigger a rerun if the job has catch-up enabled and was manually started.
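
A rough sketch of that startup path (hypothetical directory layout and helper names; not the actual implementation):

const fs = require('fs');
const path = require('path');

function finishLeftoverJobs(jobLogDir) {
	// Any job log still on disk at startup belongs to a job that died with the
	// server, so it gets finalized as aborted -- and that abort path would also
	// need to trigger the catch-up re-run for manually started jobs.
	for (const file of fs.readdirSync(jobLogDir)) {
		if (!file.endsWith('.log')) continue;
		const jobId = path.basename(file, '.log');
		finalizeAbortedJob(jobId, "Aborted Job: Server shut down unexpectedly.");
	}
}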

Working through things...

@jhuckaby
Owner

jhuckaby commented Jun 7, 2024

So, there are actually a bunch of different cases that have to be handled:

  1. A "dead job" that timed out due to its remote server going down unexpectedly and never coming back up (this is the case you highlighted above).
  2. A remote job on a worker server that was killed when the remote server went down unexpectedly, but came back up before the dead job timeout.
  3. A local job (running on the master) that was killed when the server went down unexpectedly, but the server came back up.
  4. A local job that was running on the primary when the server was gracefully (deliberately) restarted.
  5. A remote job on a worker server when the server was gracefully (deliberately) restarted, and came back up before the dead job timeout.

So, in all of these different cases what I need to do is add a custom flag to the job object, and then detect that flag when the job is finalized (completed and cleaned up). If the flag is set, and the job has catch-up mode enabled, and was manually started, then and only then should Cronicle re-run the job.

But I also have to write a custom re-run function to facilitate this, because you can't really just shove the job object into launchOrQueueJob() as in your example above. That function is expecting an event object, not a previously run job object (different properties in each). So there's some massaging that has to happen in there.
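
A rough sketch of that shape, written as if it were a method on the scheduler component (hypothetical flag and property names throughout; not the actual fix that shipped):

finalizeJob(job) {
	this.cleanupJob(job); // normal completion/cleanup path (hypothetical name)

	if (job.aborted_due_to_server_loss && job.catch_up && job.source) {
		// Massage the completed job record back into an event-shaped object,
		// since launchOrQueueJob() expects an event, not a job.
		const event = {
			id: job.event,
			title: job.event_title,
			plugin: job.plugin,
			category: job.category,
			params: job.params,
			catch_up: job.catch_up,
			source: job.source
		};
		this.launchOrQueueJob(event, (err) => {
			// error handling omitted in this sketch
		});
	}
	else {
		// Scheduled catch-up jobs keep the existing behaviour: rewind the cursor.
		this.rewindJob(job);
	}
}

The point of the flag is that each of the abort paths in the five cases above can simply set it, and the re-run decision then lives in one place.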

Anyway, I'm working through all the cases and trying to test as much as I can. It has turned out to be a can of worms. I would normally not do this in Cronicle v0, because it's in "maintenance mode" (no new features), and I'm focusing all my efforts on the big v2 rewrite, but I'm going to make an exception for this issue, because this really is an unimplemented feature that should have existed from the start.

It may take me a while to finish the code changes and test all the cases, but I'm working on it...

jhuckaby added a commit that referenced this issue Jun 8, 2024
- New optional client config param: `prompt_before_run` (see Discussion #771)
- Removed duplicate TOC entry in doc.
- Fixed manual run-all jobs when servers shut down unexpectedly (fixes #757)
@jhuckaby
Owner

jhuckaby commented Jun 8, 2024

Should be fixed in v0.9.52.

I was unable to test all the cases, but I tested some of them. This is the best I can do for v0, I'm afraid (maintenance mode).

@matthewjhands
Author

Many thanks, this seems to have done the trick.
