
Fix Worker context usage #562

Closed
cognifloyd opened this issue Apr 14, 2022 · 2 comments · Fixed by go-vela/worker#304
Labels
bug Indicates a bug

Comments

@cognifloyd
Member

Description

The context is not used consistently in the Worker, especially in the Kubernetes runtime.

In testing go-vela/worker#302 and go-vela/worker#303, I noticed that some goroutines are still printing logs after the build has completed, so I traced the use of context through the worker to see if that might be the cause.

Value

Make sure we don't have zombie goroutines from completed builds.

Useful Information

  1. What is the output of vela --version?

I'm working on a fork of the master branch.

  2. What operating system is being used?

linux

  3. Any other important details?

From investigating where we do and don't pass the context, there are several places where we ignore the contextcheck linter, and those are the primary problem spots.

Here are my notes about the changes I think are needed (a consolidated sketch of the exec() changes follows this list):

  • cmd/vela-worker/operate.go

    • pass gctx to w.exec()
      • - err = w.exec(id)
        + err = w.exec(gctx, id)
  • cmd/vela-worker/exec.go

    • accept ctx arg in exec()
      • - func (w *Worker) exec(index int) error {
        + func (w *Worker) exec(ctx context.Context, index int) error {
    • replace context.Background() to ensure we cancel all build-specific bits:
      • - ctx := context.Background()
        + buildCtx, buildDone := context.WithCancel(ctx)
        + defer buildDone()
      • - ctx, timeout := context.WithTimeout(ctx, t)
        + ctx, timeout := context.WithTimeout(buildCtx, t)
    • use new buildCtx instead of context.Background() (i.e. DestroyBuild continues even after the timeout)
      • - err = _executor.DestroyBuild(context.Background())
        + err = _executor.DestroyBuild(buildCtx)
  • do we need logs.Wait() (logs is an errgroup.WithContext())? see:

  • runtime/kubernetes/build.go:

    • replace context.Background() with ctx in API calls
      • PipelinePodsTemplate().Get(ctx, ...)
        - context.Background(), c.config.PipelinePodsTemplateName, metav1.GetOptions{},
        + ctx, c.config.PipelinePodsTemplateName, metav1.GetOptions{},
      • Pods().Create(ctx, ...)
        - Create(context.Background(), c.Pod, metav1.CreateOptions{})
        + Create(ctx, c.Pod, metav1.CreateOptions{})
      • Pods().Delete(ctx, ...)
        - Delete(context.Background(), c.Pod.ObjectMeta.Name, opts)
        + Delete(ctx, c.Pod.ObjectMeta.Name, opts)
  • runtime/kubernetes/container.go:

    • replace context.Background() with ctx in API calls
      • Pods().Patch(ctx, ...)
        - context.Background(),
        + ctx,
      • Pods().GetLogs().Stream(ctx)
        - Stream(context.Background())
        + Stream(ctx)
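
Pulling the exec.go notes above together, the context layering could look roughly like the sketch below. This is only a sketch: t, _executor, and the logging call are stand-ins for the existing code in cmd/vela-worker/exec.go, not exact signatures.

func (w *Worker) exec(ctx context.Context, index int) error {
    // the build context inherits from the operate() errgroup context so
    // build goroutines stop when the worker shuts down
    buildCtx, buildDone := context.WithCancel(ctx)
    defer buildDone()

    // the timeout only applies to executing the build...
    timeoutCtx, timeout := context.WithTimeout(buildCtx, t)
    defer timeout()

    // ...while cleanup runs on buildCtx so DestroyBuild is not cut short
    // by the build timeout but still ends on worker shutdown
    defer func() {
        if err := _executor.DestroyBuild(buildCtx); err != nil {
            logrus.Errorf("unable to destroy build: %v", err)
        }
    }()

    // CreateBuild / PlanBuild / AssembleBuild / ExecBuild all receive timeoutCtx
    return _executor.ExecBuild(timeoutCtx)
}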
@cognifloyd
Member Author

I'm studying contexts to understand why the current implementation is the way it is.

@jbrockopp said that they didn't inherit the context everywhere on purpose. At a minimum, those places need more descriptive comments that explain why, not just what (i.e. "new build context does not inherit the global context because ..." instead of "create a background context").

Some places should not be creating a new background context (like the k8s API calls). So, the goal is to figure out and document which background contexts are still needed and which should come from the inherited context.

So far, I think these are the significant contexts. The question is how they should interact.

  1. main context created in Worker.Start()

    • new background context
    • gets cancelled on interrupt signal (worker should shut down)
  2. executor context created with an errgroup in Worker.operate()

    • inherits from main context
    • wait for the errgroup at the end of operate().
    • should probably be used when popping from the queue in exec()
    • does NOT pass the context to Worker.exec() (why?)
  3. build context created in Worker.exec()

    • new background context
    • never gets cancelled
    • it makes sense to separate the build and main contexts, as a build is roughly analogous to an HTTP request in an HTTP server.
    • when a worker gets an interrupt signal, what should happen to any active builds? Right now, they just get killed without any chance to clean up.
  4. timeout build context also created in Worker.exec()

    • inherits from build context
    • gets passed to executor.*Build() (except for executor.DestroyBuild(), which should not be subject to the timeout)
      • gets passed to other executor and runtime methods
  5. stage context created with an errgroup in executor.ExecBuild()

    • inherits from timeout build context
    • wait for the errgroup at the end of exec().
    • gets passed to executor.ExecStage() and related executor and runtime methods
  6. log context created with an errgroup in executor.ExecService() and executor.ExecStep()

    • inherits from timeout build context
    • nothing waits for the errgroup!
    • gets passed to executor.StreamService()/executor.StreamStep() and then runtime.TailContainer()

Things get muddy in executor.ExecBuild() because the context passed to it is a context with a deadline/timeout and some things should not be subject to the timeout. The log context created in executor.ExecService() and executor.ExecStep() should really not be subject to the timeout. So, it should inherit from the build context instead of the timeout build context.

In the docker runtime, TailContainer just ignores the context altogether when creating its log capture goroutine, and the kubernetes runtime isn't any better with its use of wait.ExponentialBackOff() (instead of ExponentialBackoffWithContext()). I suspect one or more of those log capture goroutines are leaking / not ending with the build.
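
For illustration only, a context-aware version of a log-capture retry loop could look like the sketch below. This is a generic pattern (stdlib context, io, and time), not the runtime's actual code; wait.ExponentialBackoffWithContext from k8s.io/apimachinery is the existing helper that could serve the same purpose in the kubernetes runtime.

// tailLogs is a sketch of a log-capture goroutine that stops retrying when
// the context is cancelled instead of running until the backoff is exhausted.
func tailLogs(ctx context.Context, open func(context.Context) (io.ReadCloser, error), out io.Writer) error {
    backoff := 250 * time.Millisecond

    for {
        stream, err := open(ctx)
        if err == nil {
            defer stream.Close()

            _, copyErr := io.Copy(out, stream)
            return copyErr
        }

        // stop retrying once the build (or worker) context ends
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(backoff):
            backoff *= 2
        }
    }
}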

I wonder if executor.StreamService() and executor.StreamStep() should be called in executor.ExecBuild() instead of executor.ExecService() and executor.ExecStep(). A channel would have to be passed to both ExecStep and StreamStep so that they can coordinate when to begin streaming logs. That would highlight the special context requirements of streaming logs. But that doesn't make both contexts (build and timeout build) available in ExecBuild, and passing two contexts from Worker.exec() to executor.ExecBuild() causes all sorts of lint errors. I'm not sure of the best way to accomplish this.

Maybe some kind of merged context? https://medium.com/@dlagoza/playing-with-multiple-contexts-in-go-9f72cbcff56e
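
For reference, a minimal sketch of that merged-context idea (cancellation from either parent, values only from one; the name mergeCancel is mine):

// mergeCancel returns a context derived from parent that is additionally
// cancelled when other is done. Values are only inherited from parent.
func mergeCancel(parent, other context.Context) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(parent)

    go func() {
        select {
        case <-other.Done():
            cancel()
        case <-ctx.Done():
            // parent was cancelled or cancel() was called; stop watching
        }
    }()

    return ctx, cancel
}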

@cognifloyd
Member Author

Next thought:

  • Under Worker.exec():

    • create a logStream channel
    • run a new executor.StreamBuild() method in a goroutine for the build
      • pass <-logStream to executor.StreamBuild() (read side)
    • pass logStream<- to executor.ExecBuild() (write side)
  • new executor.StreamBuild(ctx, <-chan) method

    • gets the build context instead of the timeout build context because logging does not count toward the timeout.
    • gets <-logStream (read side)
    • for each message received from logStream
      • kick off a new goroutine (in an errgroup) that runs executor.StreamService() or executor.StreamStep()
    • on logStream close, Wait for all streaming goroutines (the errgroup) to finish and return.
  • Add arg to ExecBuild: executor.ExecBuild(ctx, chan<-)

    • pass the logStream<- channel to executor.ExecService(), executor.ExecStep(), executor.ExecStage()
    • after all stages and services have finished, close the logStream channel
  • Add arg to ExecStage: executor.ExecStage(ctx, chan<-)

    • pass logStream<- to ExecStep
  • Add arg to ExecService, ExecStep: executor.ExecService(ctx, chan<-), executor.ExecStep(ctx, chan<-)

    • after RunContainer (and before WaitContainer for ExecStep), send a message on the logStream channel to begin streaming logs

What should the message and channel types be? It needs to

  • distinguish between step and service since those are distinct methods to call
    (maybe the message can be a struct that includes the method?)
  • include the required arguments (or how to retrieve them)
    (maybe include the args on the struct too)

Current method signatures:

type Engine interface {
    StreamService(context.Context, *pipeline.Container) error
    StreamStep(context.Context, *pipeline.Container) error
}

So, maybe the message would be:

type streamFunc = func(context.Context, *pipeline.Container) error
type streamRequest struct {
    stream    streamFunc
    container *pipeline.Container
}

And then executor.ExecService()/executor.ExecStep() would do

logStream <- streamRequest{
    stream:    c.StreamService, // or c.StreamStep in ExecStep
    container: ctn,
}

This would effectively divorce the context used in Exec* from the context used in Stream*.
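
For completeness, a sketch of the read side using the streamRequest type above; the receiver name and the errgroup usage are assumptions, not existing worker code (requires golang.org/x/sync/errgroup):

// StreamBuild drains the logStream channel, running each requested stream in
// its own goroutine, and returns once ExecBuild closes the channel and all
// in-flight streams have finished.
func (c *client) StreamBuild(ctx context.Context, requests <-chan streamRequest) error {
    streams, streamCtx := errgroup.WithContext(ctx)

    for req := range requests {
        req := req // capture for the goroutine below

        streams.Go(func() error {
            return req.stream(streamCtx, req.container)
        })
    }

    // the channel is closed: wait for all streaming goroutines to finish
    return streams.Wait()
}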
