[component] component.go - when is component.Start returning errors? #9324

Closed
atoulme opened this issue Jan 20, 2024 · 12 comments

atoulme (Contributor) commented Jan 20, 2024

The component.Start function returns an error.

When is an error expected to be returned upon starting? Configuration validation happens outside of the lifecycle of the component.

The error is handled as reporting a permanent error:

if compErr := comp.Start(ctx, host); compErr != nil {
    g.telemetry.Status.ReportStatus(
        instanceID,
        component.NewPermanentErrorEvent(compErr),
    )
    return compErr
}

A few examples from contrib, encountered while working on lifecycle tests recently:

  • A Cassandra exporter starts, but cannot connect to Cassandra right away. It returns an error.
  • An AWS component starts, but cannot create an AWS client because environment variables are missing. It returns an error.

Typically, we see those when the components cannot pass lifecycle tests:
https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen+skip+lifecycle
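For illustration, the failing pattern looks roughly like the sketch below. This is not taken from a real component; the type, fields, and dialing logic are invented for this example.

```go
package cassandraexporter // hypothetical package, for illustration only

import (
    "context"
    "net"

    "go.opentelemetry.io/collector/component"
)

// cassandraExporter is an invented stand-in for the contrib exporter.
type cassandraExporter struct {
    endpoint string
    conn     net.Conn
}

// Start dials the backend immediately and returns any error. With today's
// behavior that error aborts collector startup, which is why components
// written this way fail the lifecycle tests whenever the backend is
// unreachable at start time.
func (e *cassandraExporter) Start(ctx context.Context, host component.Host) error {
    var d net.Dialer
    conn, err := d.DialContext(ctx, "tcp", e.endpoint)
    if err != nil {
        return err // handled as a permanent error by the graph (see snippet above)
    }
    e.conn = conn
    return nil
}
```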

Given that we have component status reporting now, would it make sense to ask component implementers to handle start failures via component status instead of returning an error?

@atoulme atoulme changed the title [component] component.go - component.Start returns errors [component] component.go - when is component.Start returning errors? Jan 20, 2024
mwear (Member) commented Jan 29, 2024

Thanks for opening this issue @atoulme. I don't think we have well-defined rules for how components should handle errors during start, but now that we have component status reporting, and an in-progress health check extension that makes use of that data, it makes sense for us to come up with guidelines for how this should work together.

This table has the definitions that I have in mind for each status:

| Status | Meaning |
| --- | --- |
| Starting | The component is starting. |
| OK | The component is running without issue. |
| RecoverableError | The component has experienced a transient error and may recover. |
| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced a fatal runtime error and will shut down. |
| Stopping | The component is in the process of shutting down. |
| Stopped | The component has completed shutdown. |

Now that we have the permanent error state, I think it makes sense to ask whether or not we need fatal error any longer. The reason it was included with status reporting is that it was a preexisting error state. The intent of the permanent error state is to allow the collector to continue running in a degraded mode and alert users that there was a condition detected at runtime that will require human intervention to fix.

I would propose the following guidelines for start:

  • If a component experiences a recoverable error, it should report it, but not return an error from start
  • If an error is not detected by the end of start, but becomes apparent later, it should be reported as a permanent error. The same should apply to async errors that could arise from start.
  • If a non-recoverable error is detected during start, a component should, in most cases, report a permanent error and allow the collector to continue to run. In an extreme case, it could return an error from start and trigger the collector to shut down. We should define what an extreme case would be; if we cannot identify any, we could remove the error return value altogether.

Users can use the health check extension and / or logs to monitor permanent and recoverable errors. If we agree that a collector that starts without an error should continue running, albeit in a possibly degraded state, we can remove fatal errors altogether and communicate these via permanent error status events instead.

In the examples you cite:

A cassandra exporter starts, but cannot connect to Cassandra right away. It returns an error.

This should be reported as a recoverable error, and recovery should be monitored / assessed via the health check extension. The extension allows a user to set a recovery duration before a recoverable error is considered unhealthy.

An aws component starts, but cannot create an aws client, because environment variables are missing. It returns an error.

Presumably this could be discovered before start as a part of config validation. However, if there are complications that make that impossible, it should report a permanent error and let the collector run in a degraded mode. The permanent error can be monitored via the health check extension and / or logs.
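To make the two guidelines concrete, a rough sketch of what these examples could look like is below. The helper methods and the reportStatus hook are invented for illustration; the event constructors follow the NewPermanentErrorEvent call shown in the issue description, with the recoverable variant assumed to be analogous.

```go
// Recoverable case (Cassandra-style exporter): report the error and keep
// going; connection retries happen in the background, and recovery is
// assessed by the health check extension rather than by failing Start.
func (e *cassandraExporter) Start(ctx context.Context, host component.Host) error {
    if err := e.connect(ctx); err != nil { // connect is a hypothetical helper
        e.reportStatus(component.NewRecoverableErrorEvent(err))
        go e.retryConnect() // hypothetical background retry loop
    }
    return nil
}

// Non-recoverable case (AWS-style component): report a permanent error so the
// collector keeps running in a degraded mode instead of shutting down.
func (a *awsComponent) Start(ctx context.Context, host component.Host) error {
    client, err := a.newClient() // hypothetical; fails if required env vars are missing
    if err != nil {
        a.reportStatus(component.NewPermanentErrorEvent(err))
        return nil
    }
    a.client = client
    return nil
}
```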

mwear (Member) commented Apr 4, 2024

I'll propose an alternative to the guidelines I previously mentioned. The previous proposal is still on the table, as well as any other alternatives anyone wants to propose. For this alternative, I suggest that we remove the error return value from start and that we remove FatalErrors altogether. Components will use the status API to report RecoverableErrors or PermanentErrors during startup. The graph will automatically report StatusOK in the absence of the component reporting an explicit status.

This would eliminate the confusion between a PermanentError and FatalError and promote the collector running in a degraded mode rather than shutting down. We can surface failures through StatusWatchers, such as the healthcheck extension, and we could consider enhancing status reporting to automatically log PermanentErrors and possibly RecoverableErrors (although recoverables will be noisier). If we like these guidelines for component.Start, we can adopt the same for component.Shutdown.
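For concreteness, a rough sketch of the interface shape under this alternative (hypothetical only; today's Start still returns an error):

```go
// Hypothetical component interface if the error return were removed from Start.
type Component interface {
    // Start begins the component's work. Problems are reported via status
    // events (RecoverableError, PermanentError); the graph reports StatusOK
    // automatically when the component reports nothing explicit.
    Start(ctx context.Context, host Host)

    // Shutdown is unchanged in this sketch.
    Shutdown(ctx context.Context) error
}
```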

TylerHelmuth (Member) commented:

Regardless of the Start interface we decide on, I think we 100% should be logging any of the Error statuses.

I like that your alternative approach keeps things unambiguous. Right now we also have no ambiguity: if a component returns an error during Start for any reason, the whole collector stops. The Cassandra and AWS component examples are totally within their rights to return errors and stop the Collector today.

As you mentioned in your guideline proposal, if we try to write rules for when a component should report an error, there are always going to be edge cases. We're also leaving the state of the collector up to component authors, hoping they use component status correctly. If a component incorrectly reports FatalError instead of PermanentError, our goal of letting the collector start up in a degraded state, where components with StatusOK are allowed to keep running, is jeopardized.

With your alternative approach a misused status no longer causes the whole collector to stop. It also is unambiguous about what the collector does when a component has an error on startup - it keeps running.

I think my one fear with the alternative proposal is that it would allow a collector to start with all components experiencing StatusPermanentError, resulting in 0 components in StatusOK state. So there'd be an executable running, but nothing happening except the logging/status reporting that the error occurred.

Maybe that's OK? It feels like it will lead to confusion for end users. Is it possible to detect this situation and stop the collector?

Or maybe it is better to leave Start returning an error, and add options to the collector that allow ignoring those errors on startup.

mwear (Member) commented Apr 23, 2024

I think my one fear with the alternative proposal is that it would allow a collector to start with all components experiencing StatusPermanentError, resulting in 0 components in StatusOK state. So there'd be an executable running, but nothing happening except the logging/status reporting that the error occurred.

This situation is one of the downsides to allowing a collector to run in a degraded state. While not impossible, having all components in a permanent error state should be an edge case.

Is it possible to detect this situation and stop the collector?

The collector itself doesn't keep or aggregate component status. The (new) health check extension does and can surface this information, although, it wouldn't shut the collector down.

TylerHelmuth (Member) commented:

In my mind the blockers for this idea are then:

  1. Are we ok with that edge case?
  2. Should the behavior be configurable - should the user be able to switch between the current state (shutdown when experiencing an error on startup) and the proposed state (allow the collector to run in a degraded state when erroring on startup)?

If we allow configuring the capability then I think there is no problem with the edge case.

My opinion is that we should allow it to be configurable.

If the behavior is configurable, I think there are a couple solutions:

  • Leave error as a return value on the interface. Components would be expected to report a status and return an error on Startup when appropriate. service would ignore the errors when configured to run in a degraded state.
  • Remove error from Start and create a new function on the interface, MustStart(ctx context.Context, host Host) error. When configured to shut down on error, service will call MustStart; when allowed to run in a degraded state, service would call Start (sketched below).
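A sketch of the second option (hypothetical; neither this error-less Start nor MustStart exists today):

```go
// Hypothetical split between a status-reporting Start and a fail-fast MustStart.
type Component interface {
    // Start is called when the collector is configured to run in a degraded
    // state: startup problems are reported via component status, not returned.
    Start(ctx context.Context, host Host)

    // MustStart is called when the collector is configured to shut down on
    // startup errors: a non-nil error aborts collector startup.
    MustStart(ctx context.Context, host Host) error

    Shutdown(ctx context.Context) error
}
```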

mwear (Member) commented Apr 26, 2024

I'm ultimately ok with allowing Start to have a fail fast mechanism. My suggestion to consider removing it was to simplify the distinction between a PermanentError and FatalError, and error reporting during Start generally. Making it configurable only complicates the matter, so I would much rather we pick one.

As things currently stand, the primary way to fail fast is returning an error from Start; however, because some Start methods spawn asynchronous work, there is the ReportFatalError API as a second way to fail fast. This is for errors that aren't known at the time Start returns, but occur as a result of work started in Start. If we want to allow fail fast from Start, we should retain the FatalError status. The distinction is that FatalError should only be used from Start or async work spawned in Start. During runtime, components should use the PermanentError status.

To sum up this secondary proposal:

  • Change nothing about how Start currently works
  • Retain FatalError, but clarify its distinction from PermanentError

mwear (Member) commented May 1, 2024

After thinking about this a little more, I think we can retain the behavior of the previous proposal and simplify how errors are handled during start. I suggest that we remove the error return value from Start and rely on component status reporting to handle errors during startup. A component can report StatusFatalError to halt startup in the same way that returning an error does today. This will simplify things by giving us one way to report and handle errors during startup.

To clarify this proposal suggests we:

  • Remove the error return value from Start
  • Components wishing to halt startup and stop the collector can report StatusFatalError
  • Clarify that StatusFatalError should only be reported from Start or async work initiated during Start, as a fail fast mechanism (a sketch follows below)
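A rough sketch of a component failing fast under this proposal. It is hypothetical on two counts: Start no longer returns an error, and the reportStatus hook stands in for whatever reporting mechanism the component would use; the FatalError constructor is assumed to mirror the PermanentError one shown earlier in the thread.

```go
// Hypothetical receiver halting collector startup under the proposal above.
// Reporting a FatalError event plays the role that returning an error from
// Start plays today.
func (r *myReceiver) Start(ctx context.Context, host component.Host) {
    ln, err := net.Listen("tcp", r.endpoint)
    if err != nil {
        r.reportStatus(component.NewFatalErrorEvent(err)) // fail fast: collector shuts down
        return
    }
    go r.serve(ln) // hypothetical helper serving requests on the listener
}
```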

TylerHelmuth (Member) commented May 2, 2024

Reading through this issue and #9823, I have an alternative proposal to consider:

  • Keep error on Start. If an error is returned from Start it means the collector should shutdown.
    • this is the Collector's current behavior.
  • Remove StatusFatalError.
    • If a component experiences an error on start that should stop the collector it should return it. If the error should not stop the collector the component should report StatusPermanentError instead.
  • After each Start() has been called and returned, if all receivers are reporting StatusPermanentError, shut down.
    • Maybe this doesn't work because of async work within start that might not have set StatusPermanentError?

This approach has the benefits of removing any ambiguity between Permanent and Fatal status and doesn't require a breaking change to all components. Components that are erroring when they should be reporting StatusPermanentError would need updating, but the components would still be functional (no breaking change) until that happens.

I think this proposal has a downside, and that's how to stop the collector when async work from Start fails. I'd like to better understand this use case; are there any components you can share as examples? I am hoping the async work could report PermanentError instead of Fatal OR, if it is actually critical for startup, be made synchronous instead.

mwear (Member) commented May 2, 2024

There are two issues with this proposal. One is that FatalErrors can happen as a result of Start, but not necessarily by the time Start returns. This is discussed a little more here: #9823 (comment). Here is a real world example in the zipkin receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/33b15734edb65cbd861832030790768883f8ede4/receiver/zipkinreceiver/trace_receiver.go#L89-L115.
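Paraphrasing the linked zipkin code, the async fail-fast pattern is roughly the following. This is a simplified sketch: field names and the exact reporting hook differ in the actual receiver.

```go
// Start returns nil once the listener is bound, but the HTTP server runs in a
// goroutine. If serving fails after Start has already returned, the only way
// to fail fast is to report a FatalError asynchronously.
func (zr *zipkinReceiver) Start(ctx context.Context, host component.Host) error {
    ln, err := net.Listen("tcp", zr.endpoint)
    if err != nil {
        return err // synchronous failure: still returned from Start
    }
    zr.server = &http.Server{Handler: zr.handler}
    go func() {
        if errHTTP := zr.server.Serve(ln); errHTTP != nil && !errors.Is(errHTTP, http.ErrServerClosed) {
            zr.reportStatus(component.NewFatalErrorEvent(errHTTP)) // asynchronous failure
        }
    }()
    return nil
}
```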

The other issue is that the collector does not maintain or aggregate status for components. It dispatches status events to StatusWatchers, an optional interface for extensions. The extensions do whatever they see fit with the events. The new healthcheck extension is a StatusWatcher that aggregates events. It could detect this situation, and report a FatalError, supposing we keep it, thereby shutting down the collector. This is a reasonable option to add to the extension, but it doesn't help us eliminate FatalError or simplify the return value of Start.

TylerHelmuth (Member) commented May 13, 2024

@mwear in that specific zipkin example, would it be inappropriate to report a PermanentError instead? It feels like at that point in the startup process all synchronous checks have passed, and what's left is to start up the server. If the server doesn't start up is that PermanentError or a FatalError? If other components/pipelines are able to start it feels more like a PermanentError for this receiver.

The other issue is that the collector does not maintain or aggregate status for components.

Could it? Is there a reason why the collector's Run loop couldn't read from the Status channel and react accordingly?

mwear (Member) commented May 15, 2024

If the server doesn't start up is that PermanentError or a FatalError?

The FatalError reported by the Zipkin receiver and FatalErrors generally are a result of the host.ReportFatalError API that predates component status reporting. While introducing component status reporting, it was pointed out that we could and probably should replace host.ReportFatalError with the status reporting equivalent. It was ultimately deprecated, replaced, and removed. For this reason, all existing calls to host.ReportFatalError were replaced with ReportStatus with a FatalError event, see: open-telemetry/opentelemetry-collector-contrib#30501. As far as I know, this API was introduced to allow the collector to fail fast from async work initiated in component.Start. Synchronous failures have always used an error return value. The tl;dr is that nothing has fundamentally changed RE FatalErrors since the introduction of status reporting. What has changed is the API called to initiate the fail fast behavior. Changing this to be a PermanentError would be a change in behavior we could consider, but I think FatalError still has a place for reasons I'll describe.

Could it? Is there a reason why the collector's Run loop couldn't read from the Status channel and react accordingly?

It could, but I don't think that it should. This goes against the design of component status reporting. The system was designed with the idea that components report their status via events, the collector automates this where possible, and it dispatches the events to extensions (implementers of the StatusWatcher interface). The extensions can process events as they see fit. The collector provides the machinery to pass events between components and extensions. It makes no attempt to interpret the events. That responsibility is left to extensions.
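As a sketch of that division of responsibility, an extension opts in by implementing the StatusWatcher interface and does whatever it wants with the events; the collector never interprets them. The watcher below is invented for illustration, and the exact package and type names for the interface and events have shifted across collector versions.

```go
// statusLogger is a hypothetical extension that only logs status events.
// (The usual extension Start/Shutdown methods are omitted from this sketch.)
type statusLogger struct {
    logger *zap.Logger
}

// ComponentStatusChanged is the StatusWatcher hook. The collector dispatches
// every status event here; interpreting it (aggregation, health, alerting)
// is entirely up to the extension.
func (s *statusLogger) ComponentStatusChanged(source *component.InstanceID, event *component.StatusEvent) {
    s.logger.Info("component status changed",
        zap.Any("status", event.Status()),
        zap.Error(event.Err()),
    )
}
```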

As a concrete example, we can briefly mention the new health check extension. It has an event aggregation system in order to answer questions about pipeline health and overall collector health; however, this system is specific to the extension and would not likely be adopted into the core collector. The aggregation itself varies based on user configuration. There is a preliminary set of knobs to tune aggregation depending on what the user wants to consider healthy or not, and there will likely be many more to follow. By default both Permanent and Recoverable errors are considered healthy. Users have to opt in to consider either, or both, to be unhealthy. Recoverables have the additional feature that you can specify a recovery interval during which they should recover; they will be considered healthy until the interval has elapsed and unhealthy afterwards.

While the health check extension is an early example of a StatusWatcher, I expect more will be added over time. I believe we will eventually have an extension to export the raw events (as OTLP logs) to backends that will process them according to their own rules.

Getting back on topic, component status reporting was not designed to determine what is healthy, only to facilitate the reporting of events between components and extensions; the extensions handle the events as they choose. Moving aggregation into the collector, or having a parallel system, is not really in line with the original design, and I don't think there is a good reason to do this if we keep FatalError as a fail-fast mechanism.

codeboten pushed a commit that referenced this issue Aug 22, 2024
#### Description
Adds an RFC for component status reporting. The main goal is to define
what component status reporting is, our current implementation, and how
such a system interacts with a 1.0 component. When merged, the following
issues will be unblocked:
- #9823
- #10058
- #9957
- #9324
- #6506

---------

Co-authored-by: Matthew Wear <matthew.wear@gmail.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
TylerHelmuth (Member) commented:

With #10413 merged, the decision to keep Start returning an error is solidified.

sfc-gh-sili pushed a commit to sfc-gh-sili/opentelemetry-collector that referenced this issue Aug 23, 2024