Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash/restart. #757

Closed
matthewjhands opened this issue May 13, 2024 · 11 comments

Comments

@matthewjhands

matthewjhands commented May 13, 2024

Summary

Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash or restart, even on worker hosts. This seems like an oversight to me, given that Run All mode is documented as being able to re-run jobs where the Server running the job crashed, or where the Server running the job was shut down. As far as I can find, there is no mention that jobs need to have been started on a schedule (i.e. not manually) for this to happen.

Note that event retries (where just the job crashes, and not the server) do work for manually started jobs.

Recovering from server crashes where the jobs were running on a Primary Server
I understand per the docs that jobs running on a primary server are not and cannot be brought back up when that server crashes. However, my understanding is that this should happen on workers when they crash, but I don't see it happening either in the case of manually started jobs. For simplicity, the reproduction steps below recreate the issue using a single-server setup, and by restarting (rather than crashing) the Cronicle daemon.

Steps to reproduce the problem

Using a Single Server Cronicle setup:

  1. Create a job "Service1" using the shell plugin. It can be any long-lasting job, such as:
#!/bin/bash
while true
do
	uptime
	sleep 10
done
  2. Give Service1 a schedule that doesn't start any time soon, e.g. Fridays at 9am.
  3. Enable Run All mode.
  4. Start Service1 manually (from the UI, or via the API as sketched after this list).
  5. Copy Service1, and call it Service2.
  6. Update Service2 and give it a schedule that will start the job imminently.
  7. Wait for Service2 to be started by the scheduler.
  8. Trigger a restart of the Cronicle daemon, e.g. by running systemctl restart cronicle.service.
  9. Wait a few minutes for the Cronicle daemon to come back up and elect itself as Primary.
  10. Observe that Service1 (started manually) isn't automatically restarted, but Service2 (started by the scheduler) is automatically restarted.
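
For step 4, the manual start can be done from the event's page in the UI, or via the run_event API. A minimal sketch, assuming the default web port (3012) and an API key that is allowed to run events:

// Sketch only: manually start "Service1" via Cronicle's run_event API.
// Adjust host, port and api_key for your setup.
fetch('http://localhost:3012/api/app/run_event/v1', {
	method: 'POST',
	headers: { 'Content-Type': 'application/json' },
	body: JSON.stringify({ title: 'Service1', api_key: 'YOUR_API_KEY' })
})
	.then((res) => res.json())
	.then((data) => console.log(data));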

Your Setup

Operating system and version?

WSL, Ubuntu 22.04

Node.js version?

v20.12.1

Cronicle software version?

0.9.48

Are you using a multi-server setup, or just a single server?

Single Server

Are you using the filesystem as back-end storage, or S3/Couchbase?

Local FS.

Can you reproduce the crash consistently?

Yes

Log Excerpts

@jhuckaby
Owner

I'm extremely sorry for this oversight. It will be fixed in the very next release, coming out in a few minutes.

@jhuckaby
Owner

Fixed in v0.9.50. Thank you so much for this detailed issue report.

@matthewjhands
Author

No worries at all!

Thanks for creating/maintaining this excellent OS tool, and for your exceptionally quick fix!

@matthewjhands
Author

Hi @jhuckaby,

Unfortunately this fix doesn't seem to have worked: following the same steps as above, the manually started job still doesn't get automatically restarted. I have updated to v0.9.51.

Looking at your changes and the logs, I can see the "Reset event" log messages for Service1 and Service2, but Service1 still doesn't get restarted.

Any ideas? Log extract attached.

Many thanks,
Matt

Cronicle-Issue757-log-extract.txt

@jhuckaby
Owner

Ah, so here is the problem. "Run All Mode" is really only for scheduled jobs, because all it really does is "rewind" the event cursor to a point in history, and then when the master server comes back up it "ticks" all the missing minutes, running any missed jobs along the way. But those jobs have to be actually scheduled to run on those missed minutes for it to trigger a job launch.

[1715783069.3][2024-05-15 15:24:29][myHostname][766139][Cronicle][debug][6][Aborted Job: Server 'myHostname' shut down unexpectedly.][]
[1715783069.653][2024-05-15 15:24:29][myHostname][766139][Cronicle][debug][4][Scheduler catching events up to: 2024/05/15 15:24:00][]

Since your job wasn't actually scheduled to run on 2024/05/15 15:24:00, it didn't fire off a new one.
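
To illustrate the behaviour being described (a simplified sketch with hypothetical helper and property names, not the actual Cronicle source):

function catchUpScheduler(events, cursorEpoch, nowEpoch) {
	// After a restart, the cursor is rewound and every missed minute is "ticked".
	for (let minute = cursorEpoch; minute <= nowEpoch; minute += 60) {
		for (const event of events) {
			// Only events whose schedule matches the missed minute are launched...
			if (event.enabled && event.catch_up && scheduleMatches(event.timing, minute)) {
				launchJob(event, minute); // missed scheduled runs get replayed here
			}
			// ...so a job that was started manually, with no matching schedule
			// entry for that minute, is never relaunched by this loop.
		}
	}
}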

Adding "Retries" to the event won't work here either, because retries don't kick in for "aborted" jobs. When a server shuts down, the job is aborted. Hmmm....

This is a design flaw. I'll re-open this issue and keep thinking about ways to solve this. But please understand, Cronicle v0 is in maintenance mode, and I'm hard at work on a huge ground-up rewrite for the big v2 (coming out later this year, with any luck).

I may not have time to solve this in Cronicle v0, as it looks like a core design oversight -- a truly missing feature that was never implemented properly.

I'll put a huge warning in the docs that explains this issue.

@jhuckaby jhuckaby reopened this May 15, 2024
jhuckaby added a commit that referenced this issue May 15, 2024
@matthewjhands
Author

matthewjhands commented Jun 3, 2024

Hey @jhuckaby - I've been looking at this off and on over the last few weeks; would something like the below work in the monitorAllActiveJobs() function here? I can't quite work out what the callback should be to be able to test this, but hopefully you get the idea of where I'm going with this.

if (job.catch_up && job.source) {
	// If manually started and catch-up enabled, attempt to relaunch or queue
	this.launchOrQueueJob(job, CALLBACK);
} else {
	// otherwise, just rewind cursor instead
	this.rewindJob(job);
}

@jhuckaby
Owner

jhuckaby commented Jun 5, 2024

Oh hey, cool idea! This might just work, and is a very small code change. I need to consider all the ramifications and do a bunch of testing, however. I'll dive into this as soon as I can.

@jhuckaby
Owner

jhuckaby commented Jun 7, 2024

Ah yes, so, as I suspected, it's not quite as simple as your suggested change (but it's a start!). There are a number of cases that may result in an aborted job due to an unexpected server loss. Another one is that the server may have been rebooted, or Cronicle was restarted, in which case it detects the leftover job log on disk and "finishes" (aborts) the job on startup. That case also has to be handled, as it should trigger a rerun if the job has catch-up enabled and was manually started.
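
A rough sketch of that startup path (hypothetical directory layout and helper names; not the actual implementation):

const fs = require('fs');
const path = require('path');

function finishLeftoverJobs(jobLogDir) {
	// Any job log still on disk at startup belongs to a job that died with the
	// server, so it gets finalized as aborted -- and that abort path would also
	// need to trigger the catch-up re-run for manually started jobs.
	for (const file of fs.readdirSync(jobLogDir)) {
		if (!file.endsWith('.log')) continue;
		const jobId = path.basename(file, '.log');
		finalizeAbortedJob(jobId, "Aborted Job: Server shut down unexpectedly.");
	}
}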

Working through things...

@jhuckaby
Owner

jhuckaby commented Jun 7, 2024

So, there are actually a bunch of different cases that have to be handled:

  1. A "dead job" that timed out due to its remote server going down unexpectedly and never coming back up (this is the case you highlighted above).
  2. A remote job on a worker server that was killed when the remote server went down unexpectedly, but came back up before the dead job timeout.
  3. A local job (running on the master) that was killed when the server went down unexpectedly, but the server came back up.
  4. A local job that was running on the primary when the server was gracefully (deliberately) restarted.
  5. A remote job on a worker server when the server was gracefully (deliberately) restarted, and came back up before the dead job timeout.

So, in all of these different cases what I need to do is add a custom flag to the job object, and then detect that flag when the job is finalized (completed and cleaned up). If the flag is set, and the job has catch-up mode enabled, and was manually started, then and only then should Cronicle re-run the job.

But I also have to write a custom re-run function to facilitate this, because you can't really just shove the job object into launchOrQueueJob() as in your example above. That function is expecting an event object, not a previously run job object (different properties in each). So there's some massaging that has to happen in there.
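
A rough sketch of that shape, written as if it were a method on the scheduler component (hypothetical flag and property names throughout; not the actual fix that shipped):

finalizeJob(job) {
	this.cleanupJob(job); // normal completion/cleanup path (hypothetical name)

	if (job.aborted_due_to_server_loss && job.catch_up && job.source) {
		// Massage the completed job record back into an event-shaped object,
		// since launchOrQueueJob() expects an event, not a job.
		const event = {
			id: job.event,
			title: job.event_title,
			plugin: job.plugin,
			category: job.category,
			params: job.params,
			catch_up: job.catch_up,
			source: job.source
		};
		this.launchOrQueueJob(event, (err) => {
			// error handling omitted in this sketch
		});
	}
	else {
		// Scheduled catch-up jobs keep the existing behaviour: rewind the cursor.
		this.rewindJob(job);
	}
}

The point of the flag is that each of the abort paths in the five cases above can simply set it, and the re-run decision then lives in one place.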

Anyway, I'm working through all the cases and trying to test as much as I can. It has turned out to be a can of worms. I would normally not do this in Cronicle v0, because it's in "maintenance mode" (no new features), and I'm focusing all my efforts on the big v2 rewrite, but I'm going to make an exception for this issue, because this really is an unimplemented feature that should have existed from the start.

It may take me a while to finish the code changes and test all the cases, but I'm working on it...

jhuckaby added a commit that referenced this issue Jun 8, 2024
- New optional client config param: `prompt_before_run` (see Discussion #771)
- Removed duplicate TOC entry in doc.
- Fixed manual run-all jobs when servers shut down unexpectedly (fixes #757)
@jhuckaby
Owner

jhuckaby commented Jun 8, 2024

Should be fixed in v0.9.52.

I was unable to test all the cases, but I tested some of them. This is the best I can do for v0, I'm afraid (maintenance mode).

@matthewjhands
Author

Many thanks, this seems to have done the trick.
