Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eager workflow dispatch #3835

Merged
merged 27 commits into from
Feb 2, 2023
Merged

Conversation

bergundy
Copy link
Member

@bergundy bergundy commented Jan 25, 2023

Implement eager workflow dispatch for StartWorkflowExecution.

I've only added a single unit test to the repo, the rest are in this PR in the features repo.
Added a counter metric workflow_eager_execution to count the number of eager execution requests per namespace + task queue.

I've added a TODO to add support for eager signal with start, I figured that can be added later.

@bergundy bergundy self-assigned this Jan 25, 2023
@bergundy bergundy requested a review from a team as a code owner January 25, 2023 01:22
service/history/api/create_workflow_util.go Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
service/history/api/create_workflow_util.go Show resolved Hide resolved
service/history/api/create_workflow_util.go Outdated Show resolved Hide resolved
// The current workflow task is not inflight or not the first task or we exceeded the first attempt and fell back to
// matching based dispatch.
if !mutableStateInfo.hasInflight || mutableStateInfo.workflowTaskInfo.StartedEventID != 3 || mutableStateInfo.workflowTaskInfo.Attempt > 1 {
return nil, serviceerror.NewWorkflowExecutionAlreadyStarted(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why return error here? Start workflow is still a success.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we cannot return an eager task.
Caller should be notified with an error and handle this case.
I think it's clearer than returning nil task.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence whether this should result in an error or not.
In any case what the caller should do is get a handle to the workflow and wait for its completion.
With the error approach there's at least a way to let the caller know what happened.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Spikhalskiy @cretz I could use your opinion here

Copy link
Member

@cretz cretz Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it's clearer/safer to consider eagerness to be a request of the server, not a requirement. The server is allowed to use whatever heuristics it wants to deny the request and do a non-eager and it still be a successful task. The absence of a task in the response I think is enough to tell the caller the server denied the request.

I understand your use case is a "require to be eager" but that use case isn't supported for activities either. If we want a require-to-be-eager to be a thing, it should be a thing on both. I could support a server-side/namespace option of "fail if eager requested but cannot be given".

(my opinion is not super strong here...if we agree that eager workflow tasks are a server requirement not a request, we can just doc clearly and then error if it cannot be granted)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm...k.

So if I call this start, may a workflow start because of it? If so, IMO that should never error even if something post-start can't happen. To me success means "workflow started because of this request" and failure means "workflow did not start because of this request". Unsure if that's related. Sorry, not familiar w/ details here so I don't have a big opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a client gets this error it means that a retried request came in too late and the workflow task cannot be dispatched eagerly (likely due to it already being dispatched via the standard, matching based, path).

The only thing an error gives you over the alternative where the server responds successfully but omits the inline task is that additional piece of information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply - if success means workflow started by this request and failure means workflow not started by request, works for me. If failure can mean workflow still started by this request, that's confusing to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the workflow was started from a previous attempt of the same request. So it seems like you're saying it should not be an error.

I tend to agree but I will add a log to avoid losing some of this information.

Copy link
Member Author

@bergundy bergundy Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up returning successful result and recording a metric saying eager execution was denied with a reason tag.

service/history/api/startworkflow/api.go Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
config/dynamicconfig/development-sql.yaml Show resolved Hide resolved
service/history/api/startworkflow/api.go Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
service/history/api/startworkflow/api.go Show resolved Hide resolved
service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
TaskToken: serializedToken,
WorkflowExecution: &commonpb.WorkflowExecution{WorkflowId: workflowID, RunId: runID},
WorkflowType: request.GetWorkflowType(),
PreviousStartedEventId: 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ideally we should get the value from mutable state.

service/history/api/startworkflow/api.go Outdated Show resolved Hide resolved
return nil, err
}

// If first workflow task should back off (e.g. cron or workflow retry) a workflow task will not be scheduled.
if requestEagerExecution && scheduledEventID != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for readability use newMutableState.HasPendingWorkflowTask()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I find it more readable but I'm fine with making this change

@bergundy bergundy enabled auto-merge (squash) February 1, 2023 01:18
Copy link
Contributor

@MichaelSnowden MichaelSnowden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we abstract an ExecutionStrategy interface here in a separate PR to keep things SOLID and make the review easier? The code looks good to me, but it's hard to review with all of the modifications to the existing code.


// ReasonTag is a generic tag can be used anywhere a reason is needed.
// Make sure that the value is of limited cardinality.
func ReasonTag(value string) Tag {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take in an opaque enum type here to prevent misuse. I think the safety benefit outweighs the cost of the tediousness involved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow, do you mean defining a type like:

// ReasonString is just a string but used to remind anyone using ReasonTag to limit the cardinality of the possible reasons.
type ReasonString string

If that's the case, I see little benefit to that over documenting the ask to limit the cardinality of values.
I already have a custom string enum type where this is used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up adding this.

@@ -99,16 +99,32 @@ func NewWorkflowWithSignal(
return nil, err
}
}

requestEagerExecution := startRequest.StartRequest.GetRequestEagerExecution()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first usage of startRequest.StartRequest that I see in this function, so I'm worried about potential NPEs. How do we know this is always non-nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, we count on this request to be validated before this method is called.

@@ -109,7 +109,7 @@ type (
AddWorkflowTaskCompletedEvent(int64, int64, *workflowservice.RespondWorkflowTaskCompletedRequest, int) (*historypb.HistoryEvent, error)
AddWorkflowTaskFailedEvent(scheduledEventID int64, startedEventID int64, cause enumspb.WorkflowTaskFailedCause, failure *failurepb.Failure, identity, binChecksum, baseRunID, newRunID string, forkEventVersion int64) (*historypb.HistoryEvent, error)
AddWorkflowTaskScheduleToStartTimeoutEvent(int64) (*historypb.HistoryEvent, error)
AddFirstWorkflowTaskScheduled(*historypb.HistoryEvent) error
AddFirstWorkflowTaskScheduled(event *historypb.HistoryEvent, bypassTaskGeneration bool) (int64, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd split this into two separate methods instead of adding a flag argument

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface already has way too many methods IMHO.

}

// prepare applies request overrides, validates the request, and records eager execution metrics.
func (s *Starter) prepare(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider parsing the proto into a new StartRequest type instead of modifying the request: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that is commonly don't in Go?
If this was JS or a functional language or a language that has the concept of immutable data I would totally clone the request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the same pattern we use for tasks here: https://github.com/temporalio/temporal/blob/66db3aebeead81869dcc864e0f174063b167ee10/common/persistence/serialization/task_serializer.go

Basically taking the proto and parsing it into a plain Go object with the fields already validated and parsed into more structured types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I conceptually agree here but since the raw protos are used in the internal APIs, parsing does not make sense in this case.


// creationContext is a container for all information obtained from creating the uncommitted execution.
// The information is later used to create a new execution and handle conflicts.
type creationContext struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rename this to creationParams to avoid having another somethingCtx param floating around because devs will be unsure whether it embeds an actual Context or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

metricsHandler.Counter(metrics.WorkflowEagerExecutionDeniedCounter.GetMetricName()).Record(
1,
metrics.NamespaceTag(s.namespace.Name().String()),
metrics.TaskQueueTag(s.request.StartRequest.TaskQueue.Name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where using a parsed type instead of the raw StartRequest is better because it prevents Law of Demeter violations here and elsewhere

if err == nil {
return s.generateResponse(creationCtx.runID, creationCtx.workflowTaskInfo, extractHistoryEvents(creationCtx.workflowEventBatches))
}
t, ok := err.(*persistence.CurrentWorkflowConditionFailedError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Errors.As

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from the original implementation but yes, As is better here.

Comment on lines 176 to 177
// The history and mutable state we generated above should be deleted by a background process.
return s.handleConflict(ctx, creationCtx, t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we crash before reaching this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the question

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this comment:

	// The history and mutable state we generated above should be deleted by a background process.

Is handleConflict the method that deletes them?

Copy link
Member

@alexshtin alexshtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WT state machine changes are not in conflict with workflow update changes.

opTag := tag.WorkflowActionWorkflowTaskScheduled
if err := ms.checkMutability(opTag); err != nil {
return err
return 0, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is common.EmptyEventID (which is 0) and it fits perfectly here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 534 to 536
if err != nil {
return nil, h.convertError(err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error check needs to be inside if block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't matter much but you're right.

@bergundy bergundy enabled auto-merge (squash) February 2, 2023 01:06
@bergundy bergundy merged commit 6ef7749 into temporalio:master Feb 2, 2023
@bergundy bergundy deleted the eager-workflow-dispatch branch February 2, 2023 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants