You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If it's unacceptable for signals sent to workflow to be lost, all outstanding signals must be processed before a workflow returns. If signals arrive while this final workflow task is being processed, the server will fail the workflow task, add the signal to the history, and schedule a new workflow task. This behavior guarantees that workflow code is able to run on the signal history event. Today, however, that signal cannot be processed because its dispatch handlers will only be invoked after the workflow has returned.
Repro
Consider this example workflow code:
class ReproWorkflow < Temporal::Workflow
def execute(sleep_for)
signals = []
workflow.on_signal do |signal|
signals << signal
end
workflow.sleep(sleep_for)
signals.count
end
end
invoked in this test,
it 'signals repro workflow' do
workflow_id = SecureRandom.uuid
run_id = Temporal.start_workflow(
ReproWorkflow,
1, # seconds
options: {
workflow_id: workflow_id,
signal_name: 'a_signal',
signal_input: 'foo',
timeouts: { execution: 10 },
}
)
sent = 1
loop do
begin
Temporal.signal_workflow(ReproWorkflow, 'a_signal', workflow_id, run_id, 'foo')
rescue
# Keep going until there's an error such as the workflow finishing
break
end
sent += 1
sleep 0.01
end
result = Temporal.await_workflow_result(
SignalWithStartWorkflow,
workflow_id: workflow_id,
run_id: run_id,
)
expect(result).to eq(sent)
end
Although the exact results can vary because there is a race condition here, almost always this test fails like:
Failures:
1) signal with start signals repro workflow
Failure/Error: expect(result).to eq(sent)
expected: 78
got: 73
(compared using ==)
# ./spec/integration/signal_with_start_spec.rb:108:in `block (2 levels) in <top (required)>'
Looking at the workflow history (JSON collected during a test run), you'll see a workflow task failure of UnhandledCommand. This is an expected failure which ensures all signals can be processed mentioned above. When the code is evaluated on replay, it exits before processing the last batch of signals following the failed task. If a debugger is placed in the code, it's clear that the workflow function returns before the signal handlers fire.
In this example, the signal handlers only modify state that determines what will be returned. However, if these run activities, timers, or other workflow code, other errors complaining about commands being performed after a workflow has returned can occur, causing the workflow task to perpetually fail and put it into a "stuck" state where it cannot complete.
Root cause and solutions
This seems to be happening because signal dispatch handlers are invoked in history order. Other Temporal SDKs evaluate all signal history entries within a history window first. temporal-ruby does this for markers today, but also needs to do it for signals. I confirmed this is the expected behavior with Chad Retz from Temporal.
Making this change at its core is fairly simple. However, it will affect determinism of all existing workflows that use signals. In more recent versions of Temporal server, there is now an SDK metadata field available on the workflow task completion request (sourcesource). This field can be set by the SDK to indicate which features were used in evaluating a workflow task, and read it from workflow task completed history entries to ensure deterministic behavior changes are safe. Moreover, because older versions of the SDK will not be able to understand these SDK flags and will have different signal determinism, a compatibility configuration option will need to be added so that the new version can be safely rolled out "dark" before being switched on.
The text was updated successfully, but these errors were encountered:
Description
If it's unacceptable for signals sent to workflow to be lost, all outstanding signals must be processed before a workflow returns. If signals arrive while this final workflow task is being processed, the server will fail the workflow task, add the signal to the history, and schedule a new workflow task. This behavior guarantees that workflow code is able to run on the signal history event. Today, however, that signal cannot be processed because its dispatch handlers will only be invoked after the workflow has returned.
Repro
Consider this example workflow code:
invoked in this test,
Although the exact results can vary because there is a race condition here, almost always this test fails like:
Looking at the workflow history (JSON collected during a test run), you'll see a workflow task failure of
UnhandledCommand
. This is an expected failure which ensures all signals can be processed mentioned above. When the code is evaluated on replay, it exits before processing the last batch of signals following the failed task. If a debugger is placed in the code, it's clear that the workflow function returns before the signal handlers fire.In this example, the signal handlers only modify state that determines what will be returned. However, if these run activities, timers, or other workflow code, other errors complaining about commands being performed after a workflow has returned can occur, causing the workflow task to perpetually fail and put it into a "stuck" state where it cannot complete.
Root cause and solutions
This seems to be happening because signal dispatch handlers are invoked in history order. Other Temporal SDKs evaluate all signal history entries within a history window first. temporal-ruby does this for markers today, but also needs to do it for signals. I confirmed this is the expected behavior with Chad Retz from Temporal.
Making this change at its core is fairly simple. However, it will affect determinism of all existing workflows that use signals. In more recent versions of Temporal server, there is now an SDK metadata field available on the workflow task completion request (source source). This field can be set by the SDK to indicate which features were used in evaluating a workflow task, and read it from workflow task completed history entries to ensure deterministic behavior changes are safe. Moreover, because older versions of the SDK will not be able to understand these SDK flags and will have different signal determinism, a compatibility configuration option will need to be added so that the new version can be safely rolled out "dark" before being switched on.
The text was updated successfully, but these errors were encountered: