LLM common metrics for Generative AI #955

Merged · 34 commits · May 28, 2024
Commits
da4fb55
Initial LLM metrics
drewby Apr 24, 2024
1169756
Add link references
drewby Apr 24, 2024
383fa1f
Add LLM Metrics to README
drewby Apr 24, 2024
c4308be
Add changelog
drewby Apr 24, 2024
1239fbd
Fix yamllint error on chloggen
drewby Apr 24, 2024
9565f2c
Update reference to LLM
drewby Apr 24, 2024
b579374
Change metric name to match semconv
drewby Apr 25, 2024
c938263
Add gen_ai.system
drewby Apr 25, 2024
d5b59dc
Updates for review comments
drewby Apr 25, 2024
2b942db
Rename/scope LLM to Gen AI metrics
drewby Apr 26, 2024
5335175
Remove trailing spaces
drewby Apr 26, 2024
4f415b3
Update operation examples.
drewby Apr 30, 2024
f3e6586
Replace pluralized tokens with token
drewby Apr 30, 2024
fd01e65
Update table of contents
drewby May 5, 2024
b1ccbd6
Update token type
drewby May 5, 2024
de89866
Update requirement levels
drewby May 7, 2024
cfd8e86
Override error.type note
drewby May 7, 2024
9ac4406
Allow custom values to true
drewby May 7, 2024
b2828f8
Add ExplicitBucketBoundaries
drewby May 7, 2024
979f732
Make token metric recommended
drewby May 7, 2024
84d78eb
Remove trailing space
drewby May 7, 2024
72cc2c9
Fix recommended label.
drewby May 8, 2024
04c6fb3
Update metrics to be for 'client'
drewby May 8, 2024
7db4c52
Update title
drewby May 8, 2024
d9ab4d8
Update registry table
drewby May 8, 2024
58b10b3
Move error.type from common to duration metric.
drewby May 15, 2024
b361e4a
Add clarifation on used vs billed tokens.
drewby May 15, 2024
aa02859
Regenerate tables
drewby May 22, 2024
b351811
Regenerate tables
drewby May 23, 2024
b90fdc5
Merge branch 'main' into drewby/llm_metrics
drewby May 23, 2024
e48e635
Merge branch 'open-telemetry:main' into drewby/llm_metrics
drewby May 24, 2024
2cd0b90
Remove unnecessary elements
drewby May 25, 2024
fa99804
Update description for error.type
drewby May 25, 2024
a405081
Merge branch 'main' into drewby/llm_metrics
lmolkova May 28, 2024
4 changes: 4 additions & 0 deletions .chloggen/811.yaml
@@ -0,0 +1,4 @@
change_type: enhancement
component: gen-ai
note: Adding metrics for GenAI clients.
issues: [811]
9 changes: 9 additions & 0 deletions docs/attributes-registry/gen-ai.md
@@ -13,6 +13,7 @@ This document defines the attributes used to describe telemetry in the context o
| Attribute | Type | Description | Examples | Stability |
| -------------------------------- | -------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `gen_ai.completion` | string | The full response received from the LLM. [1] | `[{'role': 'assistant', 'content': 'The capital of France is Paris.'}]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.operation.name` | string | The name of the operation being performed. | `chat`; `completion` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.prompt` | string | The full prompt sent to an LLM. [2] | `[{'role': 'user', 'content': 'What is the capital of France?'}]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.request.max_tokens` | int | The maximum number of tokens the LLM generates for a request. | `100` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.request.model` | string | The name of the LLM a request is being made to. | `gpt-4` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Expand All @@ -22,6 +23,7 @@ This document defines the attributes used to describe telemetry in the context o
| `gen_ai.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.response.model` | string | The name of the LLM a response was generated from. | `gpt-4-0613` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.system` | string | The Generative AI product as identified by the client instrumentation. [3] | `openai` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.token.type` | string | The type of token being counted. | `input`; `output` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.usage.completion_tokens` | int | The number of tokens used in the LLM response (completion). | `180` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.usage.prompt_tokens` | int | The number of tokens used in the LLM prompt. | `100` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

Expand All @@ -36,3 +38,10 @@ This document defines the attributes used to describe telemetry in the context o
| Value | Description | Stability |
| -------- | ----------- | ---------------------------------------------------------------- |
| `openai` | OpenAI | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

`gen_ai.token.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| -------- | ------------------------------------------ | ---------------------------------------------------------------- |
| `input` | Input tokens (prompt, input, etc.) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `output` | Output tokens (completion, response, etc.) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
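Providers report token counts under their own field names; instrumentation maps those onto the `input`/`output` values above before recording. A minimal sketch of that mapping (the `prompt_tokens`/`completion_tokens` field names are OpenAI-style and assumed for illustration, not part of this convention):

```python
def usage_to_measurements(usage: dict) -> list[tuple[str, int]]:
    """Map a provider usage payload to (gen_ai.token.type, count) pairs.

    Only fields present in the payload produce measurements, so a
    streaming response that reports only output tokens yields one pair.
    """
    mapping = {"prompt_tokens": "input", "completion_tokens": "output"}
    return [(token_type, usage[field])
            for field, token_type in mapping.items()
            if field in usage]

# An OpenAI-style usage block becomes one measurement per token type:
print(usage_to_measurements({"prompt_tokens": 100, "completion_tokens": 180}))
# [('input', 100), ('output', 180)]
```

Each pair would then be recorded on the `gen_ai.client.token.usage` histogram with `gen_ai.token.type` set accordingly.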
4 changes: 4 additions & 0 deletions docs/gen-ai/README.md
@@ -18,6 +18,10 @@ This document defines semantic conventions for the following kind of Generative

* LLMs

Semantic conventions for Generative AI operations are defined for the following signals:

* [Metrics](gen-ai-metrics.md): Semantic Conventions for Generative AI operations - *metrics*.

Semantic conventions for LLM operations are defined for the following signals:

* [LLM Spans](llm-spans.md): Semantic Conventions for LLM requests - *spans*.
184 changes: 184 additions & 0 deletions docs/gen-ai/gen-ai-metrics.md
@@ -0,0 +1,184 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Generative AI metrics
--->

# Semantic Conventions for Generative AI Client Metrics

**Status**: [Experimental][DocumentStatus]

The conventions described in this section are specific to Generative AI client
applications.

**Disclaimer:** These are initial Generative AI client metric instruments
and attributes; more may be added in the future.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Generative AI Client Metrics](#generative-ai-client-metrics)
- [Metric: `gen_ai.client.token.usage`](#metric-gen_aiclienttokenusage)
- [Metric: `gen_ai.client.operation.duration`](#metric-gen_aiclientoperationduration)

<!-- tocstop -->

## Generative AI Client Metrics

The following metric instruments describe Generative AI operations. An
operation may be a request to an LLM, a function call, or some other
distinct action within a larger Generative AI workflow.

### Metric: `gen_ai.client.token.usage`

This metric is [recommended][MetricRecommended] when an operation involves the usage
of tokens and the count is readily available.

For example, if the GenAI system returns usage information in a streaming response, that information SHOULD be used. If the GenAI system returns each token independently, instrumentation SHOULD count the number of output tokens and record the result.

If instrumentation cannot efficiently obtain the number of input and/or output tokens, it MAY allow users to enable offline token counting. Otherwise, it MUST NOT report the usage metric.

When systems report both used tokens and billable tokens, instrumentation MUST report billable tokens.

This metric SHOULD be specified with [ExplicitBucketBoundaries] of [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864].
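These boundaries grow geometrically (roughly 4×) so the histogram covers everything from single-token calls to very large context windows. How a recorded token count lands in a bucket can be sketched in plain Python (an illustration of explicit-bucket semantics, not the OTel SDK itself; the boundaries come from this convention, the function is hypothetical):

```python
import bisect

# Advisory bucket boundaries for gen_ai.client.token.usage, as listed above.
TOKEN_BUCKETS = [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536,
                 262144, 1048576, 4194304, 16777216, 67108864]

def bucket_index(token_count: int) -> int:
    """Return the index of the histogram bucket a measurement falls into.

    OTel explicit-bucket histograms use upper-inclusive bounds: a value v
    belongs to the first bucket whose boundary is >= v; values above the
    last boundary fall into the overflow bucket (index len(boundaries)).
    """
    return bisect.bisect_left(TOKEN_BUCKETS, token_count)

# A completion that used 180 tokens falls in the (64, 256] bucket.
print(bucket_index(180))  # 4
```

An SDK would apply these boundaries via a View or the instrument's advisory parameters; this sketch only shows which bucket a given count increments.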

<!-- semconv metric.gen_ai.client.token.usage(metric_table) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability |
| -------- | --------------- | ----------- | -------------- | --------- |
| `gen_ai.client.token.usage` | Histogram | `{token}` | Measures number of input and output tokens used | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

<!-- semconv metric.gen_ai.client.token.usage(full) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.operation.name`](/docs/attributes-registry/gen-ai.md) | string | The name of the operation being performed. | `chat`; `completion` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.request.model`](/docs/attributes-registry/gen-ai.md) | string | The name of the LLM a request is being made to. | `gpt-4` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client instrumentation. [1] | `openai` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.token.type`](/docs/attributes-registry/gen-ai.md) | string | The type of token being counted. | `input`; `output` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`server.port`](/docs/attributes-registry/server.md) | int | Server port number. [2] | `80`; `8080`; `443` | `Conditionally Required` If `server.address` is set. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`gen_ai.response.model`](/docs/attributes-registry/gen-ai.md) | string | The name of the LLM a response was generated from. | `gpt-4-0613` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`server.address`](/docs/attributes-registry/server.md) | string | Server domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name. [3] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | `Recommended` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

**[1]:** The actual GenAI product may differ from the one identified by the client. For example, when using OpenAI client libraries to communicate with Mistral, the `gen_ai.system` is set to `openai` based on the instrumentation's best knowledge.

**[2]:** When observed from the client side, and when communicating through an intermediary, `server.port` SHOULD represent the server port behind any intermediaries, for example proxies, if it's available.

**[3]:** When observed from the client side, and when communicating through an intermediary, `server.address` SHOULD represent the server address behind any intermediaries, for example proxies, if it's available.



`gen_ai.system` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `openai` | OpenAI | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


`gen_ai.token.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `input` | Input tokens (prompt, input, etc.) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `output` | Output tokens (completion, response, etc.) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |



<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

### Metric: `gen_ai.client.operation.duration`

This metric is [required][MetricRequired].

This metric SHOULD be specified with [ExplicitBucketBoundaries] of [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92].

<!-- semconv metric.gen_ai.client.operation.duration(metric_table) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability |
| -------- | --------------- | ----------- | -------------- | --------- |
| `gen_ai.client.operation.duration` | Histogram | `s` | GenAI operation duration | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

<!-- semconv metric.gen_ai.client.operation.duration(full) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`gen_ai.operation.name`](/docs/attributes-registry/gen-ai.md) | string | The name of the operation being performed. | `chat`; `completion` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.request.model`](/docs/attributes-registry/gen-ai.md) | string | The name of the LLM a request is being made to. | `gpt-4` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client instrumentation. [1] | `openai` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`error.type`](/docs/attributes-registry/error.md) | string | Describes a class of error the operation ended with. [2] | `timeout`; `java.net.UnknownHostException`; `server_certificate_invalid`; `500` | `Conditionally Required` if the operation ended in an error | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.port`](/docs/attributes-registry/server.md) | int | Server port number. [3] | `80`; `8080`; `443` | `Conditionally Required` If `server.address` is set. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`gen_ai.response.model`](/docs/attributes-registry/gen-ai.md) | string | The name of the LLM a response was generated from. | `gpt-4-0613` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`server.address`](/docs/attributes-registry/server.md) | string | Server domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name. [4] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | `Recommended` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

**[1]:** The actual GenAI product may differ from the one identified by the client. For example, when using OpenAI client libraries to communicate with Mistral, the `gen_ai.system` is set to `openai` based on the instrumentation's best knowledge.

**[2]:** The cardinality of `error.type` SHOULD be low.

When working across multiple models, it is RECOMMENDED to use a common set of error types.

Additional details may be captured in domain-specific attributes.

**[3]:** When observed from the client side, and when communicating through an intermediary, `server.port` SHOULD represent the server port behind any intermediaries, for example proxies, if it's available.

**[4]:** When observed from the client side, and when communicating through an intermediary, `server.address` SHOULD represent the server address behind any intermediaries, for example proxies, if it's available.



`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `_OTHER` | A fallback error value to be used when the instrumentation doesn't define a custom value. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |


`gen_ai.system` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `openai` | OpenAI | ![Experimental](https://img.shields.io/badge/-experimental-blue) |



<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/document-status.md
[MetricRequired]: /docs/general/metric-requirement-level.md#required
[MetricRecommended]: /docs/general/metric-requirement-level.md#recommended
[ExplicitBucketBoundaries]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.31.0/specification/metrics/api.md#instrument-advisory-parameters
47 changes: 47 additions & 0 deletions model/metrics/gen-ai.yaml
@@ -0,0 +1,47 @@
groups:
- id: metric_attributes.gen_ai
type: attribute_group
brief: 'This group describes GenAI metrics attributes'
attributes:
- ref: server.address
requirement_level: recommended
- ref: server.port
requirement_level:
      conditionally_required: If `server.address` is set.
- ref: gen_ai.response.model
requirement_level: recommended
- ref: gen_ai.request.model
requirement_level: required
- ref: gen_ai.system
requirement_level: required
- ref: gen_ai.operation.name
requirement_level: required
- id: metric.gen_ai.client.token.usage
type: metric
metric_name: gen_ai.client.token.usage
brief: 'Measures number of input and output tokens used'
instrument: histogram
unit: "{token}"
stability: experimental
extends: metric_attributes.gen_ai
attributes:
- ref: gen_ai.token.type
requirement_level: required
- id: metric.gen_ai.client.operation.duration
type: metric
metric_name: gen_ai.client.operation.duration
brief: 'GenAI operation duration'
instrument: histogram
unit: "s"
stability: experimental
extends: metric_attributes.gen_ai
attributes:
- ref: error.type
requirement_level:
conditionally_required: "if the operation ended in an error"
note: |
The cardinality of `error.type` SHOULD be low.

When working across multiple models, it is RECOMMENDED to use a common set of error types.

Additional details may be captured in domain-specific attributes.
22 changes: 22 additions & 0 deletions model/registry/gen-ai.yaml
@@ -75,6 +75,22 @@ groups:
brief: The number of tokens used in the LLM response (completion).
examples: [180]
tag: llm-generic-response
- id: token.type
stability: experimental
type:
allow_custom_values: true
members:
- id: input
stability: experimental
value: "input"
brief: 'Input tokens (prompt, input, etc.)'
        - id: output
stability: experimental
value: "output"
brief: 'Output tokens (completion, response, etc.)'
brief: The type of token being counted.
examples: ['input', 'output']
tag: llm-generic-metrics
- id: prompt
stability: experimental
type: string
Expand All @@ -89,3 +105,9 @@ groups:
note: It's RECOMMENDED to format completions as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)
examples: ["[{'role': 'assistant', 'content': 'The capital of France is Paris.'}]"]
tag: llm-generic-events
- id: operation.name
stability: experimental
type: string
brief: The name of the operation being performed.
examples: ['chat', 'completion']
tag: llm-generic-metrics