
Fix Worker context usage #562

Closed
cognifloyd opened this issue Apr 14, 2022 · 2 comments · Fixed by go-vela/worker#304
Labels
bug Indicates a bug

Comments

@cognifloyd
Member

Description

The context is not used consistently in the Worker, especially in the Kubernetes runtime.

In testing go-vela/worker#302 and go-vela/worker#303, I noticed that some goroutines are still printing logs after the build has completed, so I traced the use of context through the worker to see if that might be the cause.

Value

Make sure we don't have zombie goroutines from completed builds.

Useful Information

  1. What is the output of vela --version?

I'm working on a fork of the master branch.

  2. What operating system is being used?

linux

  3. Any other important details?

From investigating where we do and don't pass the context, there are several places where we ignore the contextcheck linter, and those are the primary problem spots.

Here are my notes about the changes I think are needed (a consolidated sketch of the exec() changes follows this list):

  • cmd/vela-worker/operate.go

    • pass gctx to w.exec()
      • - err = w.exec(id)
        + err = w.exec(gctx, id)
  • cmd/vela-worker/exec.go

    • accept ctx arg in exec()
      • - func (w *Worker) exec(index int) error {
        + func (w *Worker) exec(ctx context.Context, index int) error {
    • replace context.Background() to ensure we cancel all build-specific bits:
      • - ctx := context.Background()
        + buildCtx, buildDone := context.WithCancel(ctx)
        + defer buildDone()
      • - ctx, timeout := context.WithTimeout(ctx, t)
        + ctx, timeout := context.WithTimeout(buildCtx, t)
    • use new buildCtx instead of context.Background() (i.e. DestroyBuild continues even after the timeout)
      • - err = _executor.DestroyBuild(context.Background())
        + err = _executor.DestroyBuild(buildCtx)
  • do we need logs.Wait() (logs is an errgroup.WithContext())? see:

  • runtime/kubernetes/build.go:

    • replace context.Background() with ctx in API calls
      • PipelinePodsTemplate().Get(ctx, ...)
        - context.Background(), c.config.PipelinePodsTemplateName, metav1.GetOptions{},
        + ctx, c.config.PipelinePodsTemplateName, metav1.GetOptions{},
      • Pods().Create(ctx, ...)
        - Create(context.Background(), c.Pod, metav1.CreateOptions{})
        + Create(ctx, c.Pod, metav1.CreateOptions{})
      • Pods().Delete(ctx, ...)
        - Delete(context.Background(), c.Pod.ObjectMeta.Name, opts)
        + Delete(ctx, c.Pod.ObjectMeta.Name, opts)
  • runtime/kubernetes/container.go:

    • replace context.Background() with ctx in API calls
      • Pods().Patch(ctx, ...)
        - context.Background(),
        + ctx,
      • Pods().GetLogs().Stream(ctx)
        - Stream(context.Background())
        + Stream(ctx)
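
Pulling the exec.go notes above together, the context layering could look roughly like the sketch below. This is only a sketch: t, _executor, and the logging call are stand-ins for the existing code in cmd/vela-worker/exec.go, not exact signatures.

func (w *Worker) exec(ctx context.Context, index int) error {
    // the build context inherits from the operate() errgroup context so
    // build goroutines stop when the worker shuts down
    buildCtx, buildDone := context.WithCancel(ctx)
    defer buildDone()

    // the timeout only applies to executing the build...
    timeoutCtx, timeout := context.WithTimeout(buildCtx, t)
    defer timeout()

    // ...while cleanup runs on buildCtx so DestroyBuild is not cut short
    // by the build timeout but still ends on worker shutdown
    defer func() {
        if err := _executor.DestroyBuild(buildCtx); err != nil {
            logrus.Errorf("unable to destroy build: %v", err)
        }
    }()

    // CreateBuild / PlanBuild / AssembleBuild / ExecBuild all receive timeoutCtx
    return _executor.ExecBuild(timeoutCtx)
}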
@cognifloyd
Member Author

I'm studying contexts to understand why the current implementation is the way it is.

@jbrockopp said that they didn't inherit the context everywhere on purpose. At a minimum, those places need more descriptive comments that explain why, not just what (i.e. "new build context does not inherit the global context because ..." instead of "create a background context").

Some places should not be creating a new background context (like the k8s API calls). So, the goal is to figure out and document which background contexts are still needed and which should come from the inherited context.

So far, I think these are the significant contexts. The question is how they should interact.

  1. main context created in Worker.Start()

    • new background context
    • gets cancelled on interrupt signal (worker should shut down)
  2. executor context created with an errgroup in Worker.operate()

    • inherits from main context
    • wait for the errgroup at the end of operate().
    • should probably be used when popping from the queue in exec()
    • does NOT pass the context to Worker.exec() (why?)
  3. build context created in Worker.exec()

    • new background context
    • never gets cancelled
    • it makes sense to separate the build and main contexts, as a build is roughly analogous to an HTTP request in an HTTP server.
    • when a worker gets an interrupt signal, what should happen to any active builds? Right now, they just get killed without any chance to clean up.
  4. timeout build context also created in Worker.exec()

    • inherits from build context
    • gets passed to executor.*Build() (except for executor.DestroyBuild(), which should not be subject to the timeout)
      • gets passed to other executor and runtime methods
  5. stage context created with an errgroup in executor.ExecBuild()

    • inherits from timeout build context
    • wait for the errgroup at the end of exec().
    • gets passed to executor.ExecStage() and related executor and runtime methods
  6. log context created with an errgroup in executor.ExecService() and executor.ExecStep()

    • inherits from timeout build context
    • nothing waits for the errgroup!
    • gets passed to executor.StreamService()/executor.StreamStep() and then runtime.TailContainer()

Things get muddy in executor.ExecBuild() because the context passed to it is a context with a deadline/timeout and some things should not be subject to the timeout. The log context created in executor.ExecService() and executor.ExecStep() should really not be subject to the timeout. So, it should inherit from the build context instead of the timeout build context.

In the docker runtime, TailContainer just ignores the context altogether when creating its log capture goroutine, and the kubernetes runtime isn't any better with its use of wait.ExponentialBackOff() (instead of ExponentialBackoffWithContext()). I suspect one or more of those log capture goroutines are leaking / not ending with the build.
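
For illustration only, a context-aware version of a log-capture retry loop could look like the sketch below. This is a generic pattern (stdlib context, io, and time), not the runtime's actual code; wait.ExponentialBackoffWithContext from k8s.io/apimachinery is the existing helper that could serve the same purpose in the kubernetes runtime.

// tailLogs is a sketch of a log-capture goroutine that stops retrying when
// the context is cancelled instead of running until the backoff is exhausted.
func tailLogs(ctx context.Context, open func(context.Context) (io.ReadCloser, error), out io.Writer) error {
    backoff := 250 * time.Millisecond

    for {
        stream, err := open(ctx)
        if err == nil {
            defer stream.Close()

            _, copyErr := io.Copy(out, stream)
            return copyErr
        }

        // stop retrying once the build (or worker) context ends
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(backoff):
            backoff *= 2
        }
    }
}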

I wonder if executor.StreamService() and executor.StreamStep() should be called in executor.ExecBuild() instead of executor.ExecService() and executor.ExecStep(). A channel would have to be passed to both ExecStep and StreamStep so that they can coordinate when to begin streaming logs. That would highlight the special context requirements of streaming logs. But that doesn't make both contexts (build and timeout build) available in ExecBuild, and passing two contexts from Worker.exec() to executor.ExecBuild() causes all sorts of lint errors. I'm not sure of the best way to accomplish this.

Maybe some kind of merged context? https://medium.com/@dlagoza/playing-with-multiple-contexts-in-go-9f72cbcff56e
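
For reference, a minimal sketch of that merged-context idea (cancellation from either parent, values only from one; the name mergeCancel is mine):

// mergeCancel returns a context derived from parent that is additionally
// cancelled when other is done. Values are only inherited from parent.
func mergeCancel(parent, other context.Context) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(parent)

    go func() {
        select {
        case <-other.Done():
            cancel()
        case <-ctx.Done():
            // parent was cancelled or cancel() was called; stop watching
        }
    }()

    return ctx, cancel
}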

@cognifloyd
Member Author

Next thought:

  • Under Worker.exec():

    • create a logStream channel
    • run a new executor.StreamBuild() method in a goroutine for the build
      • pass <-logStream to executor.StreamBuild() (read side)
    • pass logStream<- to executor.ExecBuild() (write side)
  • new executor.StreamBuild(ctx, <-chan) method

    • gets the build context instead of the timeout build context because logging does not count toward the timeout.
    • gets <-logStream (read side)
    • for each message received from logStream
      • kick off a new goroutine (in an errgroup) that runs executor.StreamService() or executor.StreamStep()
    • on logStream close, Wait for all streaming goroutines (the errgroup) to finish and return.
  • Add arg to ExecBuild: executor.ExecBuild(ctx, chan<-)

    • pass the logStream<- channel to executor.ExecService(), executor.ExecStep(), executor.ExecStage()
    • after all stages and services have finished, close the logStream channel
  • Add arg to ExecStage: executor.ExecStage(ctx, chan<-)

    • pass logStream<- to ExecStep
  • Add arg to ExecService, ExecStep: executor.ExecService(ctx, chan<-), executor.ExecStep(ctx, chan<-)

    • after RunContainer (and before WaitContainer for ExecStep), send a message on the logStream channel to begin streaming logs

What should the message and channel types be? It needs to

  • distinguish between step and service since those are distinct methods to call
    (maybe the message can be a struct that includes the method?)
  • include the required arguments (or how to retrieve them)
    (maybe include the args on the struct too)

Current method signatures:

type Engine interface {
    StreamService(context.Context, *pipeline.Container) error
    StreamStep(context.Context, *pipeline.Container) error
}

So, maybe the message would be:

type streamFunc = func(context.Context, *pipeline.Container) error
type streamRequest struct {
    stream    streamFunc
    container *pipeline.Container
}

And then executor.ExecService()/executor.ExecStep() would do

logStream <- streamRequest{
    stream:    c.StreamService, // or c.StreamStep in ExecStep
    container: ctn,
}

This would effectively divorce the context used in Exec* from the context used in Stream*.
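
For completeness, a sketch of the read side using the streamRequest type above; the receiver name and the errgroup usage are assumptions, not existing worker code (requires golang.org/x/sync/errgroup):

// StreamBuild drains the logStream channel, running each requested stream in
// its own goroutine, and returns once ExecBuild closes the channel and all
// in-flight streams have finished.
func (c *client) StreamBuild(ctx context.Context, requests <-chan streamRequest) error {
    streams, streamCtx := errgroup.WithContext(ctx)

    for req := range requests {
        req := req // capture for the goroutine below

        streams.Go(func() error {
            return req.stream(streamCtx, req.container)
        })
    }

    // the channel is closed: wait for all streaming goroutines to finish
    return streams.Wait()
}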
