Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error flagging with status codes #136

Merged
merged 33 commits into from
Sep 17, 2020
Merged
Changes from 1 commit
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
bd0177d
Add error flagging proposal
tedsuo Sep 9, 2020
54121cd
removed half finished sentence
tedsuo Sep 9, 2020
725cbc0
Remove error.message, it is redundant with the status message.
tedsuo Sep 9, 2020
ae86379
whitespace
tedsuo Sep 9, 2020
74dd7f5
whitespace
tedsuo Sep 9, 2020
0d34ea1
clarify convenience methods are in a seprate package
tedsuo Sep 9, 2020
d951674
whitespace
tedsuo Sep 9, 2020
c814974
whitespace
tedsuo Sep 9, 2020
e787f94
use correct RFC language
tedsuo Sep 9, 2020
3a25b4c
spelling
tedsuo Sep 9, 2020
4a42c3a
spellingUpdate text/trace/0000-error_flagging.md
tedsuo Sep 9, 2020
d204bc7
capitalization
tedsuo Sep 9, 2020
f551758
Capitalization
tedsuo Sep 9, 2020
44b1fa4
error mapping -> status mapping
tedsuo Sep 9, 2020
bb1bfb6
whitespace
tedsuo Sep 9, 2020
ccc0f5e
replace override status codes with user_override boolean
tedsuo Sep 9, 2020
241e2aa
rewrite based on error WG feedback
tedsuo Sep 10, 2020
db506d6
indicate that status source is a new field.
tedsuo Sep 10, 2020
7e5ef2b
remove some garbage
tedsuo Sep 10, 2020
6fb4bcb
Consolidate OPERATOR and APPLICATION into USER
tedsuo Sep 10, 2020
639e661
spelling
tedsuo Sep 10, 2020
0c8899b
nominal -> normal
tedsuo Sep 10, 2020
ec77065
markdownlint
tedsuo Sep 11, 2020
788844d
Add PR number to file name
tedsuo Sep 11, 2020
ed50989
more lint
tedsuo Sep 11, 2020
a6de9cc
clarify end users
tedsuo Sep 15, 2020
19a3d23
clarify end user
tedsuo Sep 15, 2020
cc8f305
clarify terms, update old intro
tedsuo Sep 15, 2020
2532107
fix intro
tedsuo Sep 15, 2020
7de13e9
clarify the meaning of normal
tedsuo Sep 15, 2020
1537cc3
clarify status mapping
tedsuo Sep 15, 2020
2b9e943
Update text/trace/0136-error_flagging.md
tedsuo Sep 16, 2020
76f0597
final nits
tedsuo Sep 16, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
replace override status codes with user_override boolean
  • Loading branch information
tedsuo committed Sep 9, 2020
commit ccc0f5ead0ae05f2d308eb10b196c93b0ee57ee2
41 changes: 12 additions & 29 deletions text/trace/0000-error_flagging.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,29 +4,22 @@ This proposal adds two status codes explicitly for use as overrides by the end u
## Motivation
Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes."
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is dependent on the situation.
However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages.

There is one major exception. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate both when a span should and should not be counted as an error.
There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation.

A second exception is supporting analysis tools which require explicit error flagging in the data which they receive. In this case, an operator must be able to apply an error flagging schema at some point during the OTLP data processing pipeline.
While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what consitutes an error. In order to accomplish this, there must be a way to differntiate between errors flagged by instrumentation, and errors flagged by the end user.

## Explanation
The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today.

### `span.user_override(boolean)`
tedsuo marked this conversation as resolved.
Show resolved Hide resolved
The `user_override` method indicates that the end user has confirmed that the status code is correct. When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it is a communication from the end-user and contains valuable information.

### Status Codes
The following status codes are added to our current schema.

* `DEFAULT` No status has been set. Any errors must be detected by the analysis tool. (This replaces `OK`.)
* `ERROR` Instrumentation has marked a span as an error. (This replaces `UNKNOWN`.)
* `OK_OVERRIDE` The user has provided an override. The span should NOT be flagged as an error, regardless of other analysis.
* `ERROR_OVERRIDE` The user has provided an override. The span SHOULD be flagged as an error, regardless of other analysis.

(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.)

`OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector).

Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end-user and contain valuable information.

Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.

### Status Mapping Schema
As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved
Expand All @@ -41,15 +34,7 @@ Note that these convenience methods simply wire together multiple API calls. The


## Internal details
Except for the renaming of two status codes, this proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge.

**OK is renamed to DEFAULT**
Using the term “OK” as the default implies more meaning than we intend. The span is not necessarily OK - it simply has not triggered our standard status mapping. The default status code should be renamed to `DEFAULT` to avoid confusion.

Note: I intentionally avoided terms like "unset" as it may imply to users that they are required to set the status code, which is not the intention.

**UNKNOWN is renamed to ERROR**
In the new schema, ERROR is the primary status code for reporting errors. Especially in a reduced list of status codes, the term "unknown" is vague and may accidentally imply a meaning to users which we do not intend.
This proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge.


## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS?
Expand All @@ -59,11 +44,9 @@ While flagging errors can be a subjective decision, it is true that many semanti

Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch.

Note that analysis tools are free to disregard Span Status, and do their own error analysis. For these systems, the only Status codes of import are `OK_OVERRIDE` and `ERROR_OVERRIDE`.

If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs.
Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when `user_override` is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it represents a subjective decision made by either the operator or application developer.

The minimal number of status codes would be `DEFAULT`, `ERROR`, `OK_OVERRIDE` and `ERROR_OVERRIDE`. `ERROR` will be used to differentiate between standard errors applied by instrumentation and overrides provided by the end user.
If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs. The minimal number of status codes would be `OK` and `ERROR`.

## Remind me why we need status codes again?
Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. This reduces overhead and is a benefit for many systems.
Expand Down