Signals cannot be processed in final workflow task #258

jeffschoner · 2023-08-04T16:48:43Z

Description

If it's unacceptable for signals sent to workflow to be lost, all outstanding signals must be processed before a workflow returns. If signals arrive while this final workflow task is being processed, the server will fail the workflow task, add the signal to the history, and schedule a new workflow task. This behavior guarantees that workflow code is able to run on the signal history event. Today, however, that signal cannot be processed because its dispatch handlers will only be invoked after the workflow has returned.

Repro

Consider this example workflow code:

class ReproWorkflow < Temporal::Workflow
  def execute(sleep_for)
    signals = []
    workflow.on_signal do |signal|
      signals << signal
    end

    workflow.sleep(sleep_for)
    signals.count
  end
end

invoked in this test,

  it 'signals repro workflow' do
    workflow_id = SecureRandom.uuid
    run_id = Temporal.start_workflow(
      ReproWorkflow,
      1, # seconds
      options: {
        workflow_id: workflow_id,
        signal_name: 'a_signal',
        signal_input: 'foo',
        timeouts: { execution: 10 },
      }
    )

    sent = 1
    loop do
      begin
        Temporal.signal_workflow(ReproWorkflow, 'a_signal', workflow_id, run_id, 'foo')
      rescue
        # Keep going until there's an error such as the workflow finishing
        break
      end
      sent += 1
      sleep 0.01
    end

    result = Temporal.await_workflow_result(
      SignalWithStartWorkflow,
      workflow_id: workflow_id,
      run_id: run_id,
    )

    expect(result).to eq(sent)
  end

Although the exact results can vary because there is a race condition here, almost always this test fails like:

Failures:

  1) signal with start signals repro workflow
     Failure/Error: expect(result).to eq(sent)

       expected: 78
            got: 73

       (compared using ==)
     # ./spec/integration/signal_with_start_spec.rb:108:in `block (2 levels) in <top (required)>'

Looking at the workflow history (JSON collected during a test run), you'll see a workflow task failure of UnhandledCommand. This is an expected failure which ensures all signals can be processed mentioned above. When the code is evaluated on replay, it exits before processing the last batch of signals following the failed task. If a debugger is placed in the code, it's clear that the workflow function returns before the signal handlers fire.

In this example, the signal handlers only modify state that determines what will be returned. However, if these run activities, timers, or other workflow code, other errors complaining about commands being performed after a workflow has returned can occur, causing the workflow task to perpetually fail and put it into a "stuck" state where it cannot complete.

Root cause and solutions

This seems to be happening because signal dispatch handlers are invoked in history order. Other Temporal SDKs evaluate all signal history entries within a history window first. temporal-ruby does this for markers today, but also needs to do it for signals. I confirmed this is the expected behavior with Chad Retz from Temporal.

Making this change at its core is fairly simple. However, it will affect determinism of all existing workflows that use signals. In more recent versions of Temporal server, there is now an SDK metadata field available on the workflow task completion request (source source). This field can be set by the SDK to indicate which features were used in evaluating a workflow task, and read it from workflow task completed history entries to ensure deterministic behavior changes are safe. Moreover, because older versions of the SDK will not be able to understand these SDK flags and will have different signal determinism, a compatibility configuration option will need to be added so that the new version can be safely rolled out "dark" before being switched on.

The text was updated successfully, but these errors were encountered:

jeffschoner mentioned this issue Aug 26, 2023

Signals first ordering #261

Merged

DeRauk closed this as completed in #261 Sep 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Signals cannot be processed in final workflow task #258

Signals cannot be processed in final workflow task #258

jeffschoner commented Aug 4, 2023

Signals cannot be processed in final workflow task #258

Signals cannot be processed in final workflow task #258

Comments

jeffschoner commented Aug 4, 2023

Description

Repro

Root cause and solutions