Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signals cannot be processed in final workflow task #258

Closed
jeffschoner opened this issue Aug 4, 2023 · 0 comments · Fixed by #261
Closed

Signals cannot be processed in final workflow task #258

jeffschoner opened this issue Aug 4, 2023 · 0 comments · Fixed by #261

Comments

@jeffschoner
Copy link
Contributor

Description

If it's unacceptable for signals sent to workflow to be lost, all outstanding signals must be processed before a workflow returns. If signals arrive while this final workflow task is being processed, the server will fail the workflow task, add the signal to the history, and schedule a new workflow task. This behavior guarantees that workflow code is able to run on the signal history event. Today, however, that signal cannot be processed because its dispatch handlers will only be invoked after the workflow has returned.

Repro

Consider this example workflow code:

class ReproWorkflow < Temporal::Workflow
  def execute(sleep_for)
    signals = []
    workflow.on_signal do |signal|
      signals << signal
    end

    workflow.sleep(sleep_for)
    signals.count
  end
end

invoked in this test,

  it 'signals repro workflow' do
    workflow_id = SecureRandom.uuid
    run_id = Temporal.start_workflow(
      ReproWorkflow,
      1, # seconds
      options: {
        workflow_id: workflow_id,
        signal_name: 'a_signal',
        signal_input: 'foo',
        timeouts: { execution: 10 },
      }
    )

    sent = 1
    loop do
      begin
        Temporal.signal_workflow(ReproWorkflow, 'a_signal', workflow_id, run_id, 'foo')
      rescue
        # Keep going until there's an error such as the workflow finishing
        break
      end
      sent += 1
      sleep 0.01
    end

    result = Temporal.await_workflow_result(
      SignalWithStartWorkflow,
      workflow_id: workflow_id,
      run_id: run_id,
    )

    expect(result).to eq(sent)
  end

Although the exact results can vary because there is a race condition here, almost always this test fails like:

Failures:

  1) signal with start signals repro workflow
     Failure/Error: expect(result).to eq(sent)

       expected: 78
            got: 73

       (compared using ==)
     # ./spec/integration/signal_with_start_spec.rb:108:in `block (2 levels) in <top (required)>'

Looking at the workflow history (JSON collected during a test run), you'll see a workflow task failure of UnhandledCommand. This is an expected failure which ensures all signals can be processed mentioned above. When the code is evaluated on replay, it exits before processing the last batch of signals following the failed task. If a debugger is placed in the code, it's clear that the workflow function returns before the signal handlers fire.

In this example, the signal handlers only modify state that determines what will be returned. However, if these run activities, timers, or other workflow code, other errors complaining about commands being performed after a workflow has returned can occur, causing the workflow task to perpetually fail and put it into a "stuck" state where it cannot complete.

Root cause and solutions

This seems to be happening because signal dispatch handlers are invoked in history order. Other Temporal SDKs evaluate all signal history entries within a history window first. temporal-ruby does this for markers today, but also needs to do it for signals. I confirmed this is the expected behavior with Chad Retz from Temporal.

Making this change at its core is fairly simple. However, it will affect determinism of all existing workflows that use signals. In more recent versions of Temporal server, there is now an SDK metadata field available on the workflow task completion request (source source). This field can be set by the SDK to indicate which features were used in evaluating a workflow task, and read it from workflow task completed history entries to ensure deterministic behavior changes are safe. Moreover, because older versions of the SDK will not be able to understand these SDK flags and will have different signal determinism, a compatibility configuration option will need to be added so that the new version can be safely rolled out "dark" before being switched on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

1 participant