Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add protocol for dynamic configuration service #155

Closed

Conversation

wtong98
Copy link
Contributor

@wtong98 wtong98 commented Jun 5, 2020

Changes:

  • New protocol for dynamic config service

Goal

We suggest a new protocol for a dynamic configuration service that enables updating metric collection schedules, trace sampling parameters, and other configurations without having to restart an instrumented application. An early draft of the proposal is attached:
OpenTelemetry Configuration Service.pdf

We'd love to hear your thoughts! An example implementation is currently underway, implemented as an extension on the collector and modifications to the Go SDK. Our goal is to deliver a functioning prototype in the upcoming weeks to give an example of what this system may look like.

About

We're an intern group at Google. Members include:

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 5, 2020

CLA Check
The committers are authorized under a signed CLA.

@jmacd
Copy link
Contributor

jmacd commented Jun 5, 2020

I think we will probably want to coordinate this with the Metric Views API. This looks good as a start, for configuring collection periods, but I think people are even more interested in configuring the labels and aggregations used. Thanks!

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jmacd that tighter integration with views API would be beneficial. Left a few comments. Indeed a good start.

@open-telemetry/technical-committee should this be developed as another experimental feature?

3. `metrics` package contains the Metrics Service protos.
1. `trace` package contains the Trace Service protos.
2. `metrics` package contains the Metrics Service protos.
3. `dynamicconfig`package contains the Dynamic Config Service protos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you adding a note here, also add a it to the maturity matrix: https://github.com/open-telemetry/opentelemetry-proto#maturity-level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


message ConfigRequest{
// Describes the version of the config service protocol to use. Required.
int32 proto_version = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proto_version may be misunderstood as a proto2 or proto3. Perhaps a better name can be used here.

Suggested change
int32 proto_version = 1;
int32 config_format_version = 1;


message ConfigRequest{
// Describes the version of the config service protocol to use. Required.
int32 proto_version = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would uint32 be a better type?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this be used? Protocol Buffers by nature handle versioning, and if a drastic version change was needed you'd probably use a v2 package instead. If it's for the opaque metadata, it would be better to have a struct to contains metadata and metadata_format_version, but it also seems reasonable to leave that versioning to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point, proto_version removed

// whether an update to the configs had occurred -- if the client request's
// fingerprint differs from the server's, then the server knows that a change
// had occurred and an updated configuration should be sent.
string last_fingerprint = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also consider adding some error reporting from SDK to collector when it failed to apply the config?


// Optional. The client is suggested to wait this long (in seconds) before
// pinging the configuration service again.
int32 suggested_wait_time = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int32 suggested_wait_time = 5;
int32 suggested_wait_time_sec = 5;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// whether an update to the configs had occurred -- if the client request's
// fingerprint differs from the server's, then the server knows that a change
// had occurred and an updated configuration should be sent.
string last_fingerprint = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
string last_fingerprint = 3;
string last_known_fingerprint = 3;

just a suggestion. I think it make the name more descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

// Describes the version of the config service protocol to use. Required.
int32 proto_version = 1;

// The target resource. This resource pings the config server requesting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required or optional?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is ConfigRequest, target is confusing, since target is something you usually send a request to. So maybe

// The resource for which configuration should be returned.

Also don't think we have to talk about pings / responses since those are implementation details and not part of the protocol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, changed

// protocol, but still desire to communicate these settings to
// instrumented applications. An application may in turn piggyback
// metadata on a vendor's metric exporter to communicate information back
// to its metric backend. In this way, metadata offers a channel to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how exporter will know this metadata?

}

// Metrics with names that match a rule in the inclusion_patterns are
// targeted by this schedule. Metrics that match the exclusion_patterns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about labels matching?

Copy link

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It'll be cool to have a configuration service if we can work out good semantics that apply to many systems. For reference, here's X-Ray's tracing sapling rules. They're fairly simple and probably similar to others, giving the link in case it provides information about filling out the trace config section

https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sampling.html


message ConfigRequest{
// Describes the version of the config service protocol to use. Required.
int32 proto_version = 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this be used? Protocol Buffers by nature handle versioning, and if a drastic version change was needed you'd probably use a v2 package instead. If it's for the opaque metadata, it would be better to have a struct to contains metadata and metadata_format_version, but it also seems reasonable to leave that versioning to the user.

// Describes the version of the config service protocol to use. Required.
int32 proto_version = 1;

// The target resource. This resource pings the config server requesting
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is ConfigRequest, target is confusing, since target is something you usually send a request to. So maybe

// The resource for which configuration should be returned.

Also don't think we have to talk about pings / responses since those are implementation details and not part of the protocol.

// updated configs, and the server responds with the config information.
opentelemetry.proto.resource.v1.Resource resource = 2;

// Optional. Identifies the last configuration fingerprint / version number.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Optional. The value of ConfigResponse.fingerprint for the last configuration a resource received.

And don't think we need details about how the fingerprint is used in this protocol, it's up to the server whether it uses it or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

message ConfigResponse {
// Describes the version of the config service protocol to use. Should match
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

// the request’s version.
int32 proto_version = 1;

// Optional. The fingerprint associated with the current config settings. Each
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fingerprint associated with this ConfigResponse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// change in configs yields a different fingerprint. The client may cache a
// fingerprint, then return it on the next request to enable the server to
// quickly check whether a change in the configs occurred. If no changes
// occurred, then all other fields in the response are optional.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear, especially with the unimplemented blocks, whether it's trivial for a client to understand the difference between a response containing no fields because it didn't change or containing default fields. How about adding a boolean indicating whether this response is valid or not?

Copy link
Contributor Author

@wtong98 wtong98 Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow. The original intention was that should nothing change, then the response will have only fingerprint, which equals the client's last_known_fingerprint, by which the client knows that no change happened. Does that sound like a practical outcome? I'll also update the comments to make the process more clear

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a fingerprint is like a hash I don't think comparison with a previous value can guarantee nothing has changed due to hash collisions - it's more a hint which is more useful in the server which actually can compute the fingerprints.

So I think here for the client to know no change happened, it seems it needs to check the actual content. This would probably be many hasX checks using the generated code for proto and might work but may be indirect.

Another thought is perhaps this fingerprint is overoptimization, at least for now so maybe ok to remove for now. The configs don't seem so large and comparing fingerprints (etag + 204 responses in the http world) may not be worth it in practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points, thanks for the quick response! I'll bring it up with the rest of the group, and we'll update our proposal (which will hopefully be made available soon)

// occurred, then all other fields in the response are optional.
string fingerprint = 2;

// Dynamic configs specific to metrics, namely the metric collection schedule
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might add different configs later, don't think there's value in restricting to , namely the ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

// adopts the configuration specified by the schedule.
message Schedule {

// Pattern dictates a light-weight pattern that can match 1 or more
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A light-weight pattern that can...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

repeated Pattern exclusion_patterns = 2;

// CollectionPeriod describes the sampling period for each metric. All
// larger units are divisible by all smaller ones.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some background on why this should be an enum and not just arbitrary uint32 would make it clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intent was to enforce the property that all larger units be divisible by smaller ones (which makes the implementation much cleaner), though we're thinking of altering this -- for example in case a vendor has an extremely good reason to use a collection period of, say, 2 days

// * Using alternate representations for collection schedules, matching
// metrics, resources, etc.
// * Enabling other optimizations
string metadata = 4;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes is better for opaque data, easy to store a string in bytes field, impossible to store arbitrary binary in a string field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great idea, changed

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Jun 8, 2020

This is an interesting proposal that I think warrants an OTEP where it can be discussed. The attached proposal in PDF format is not suitable for commenting.

I would like to see a few things addressed that I don't see in the proposal:

  • What are the security implications of remote configuration. So far OpenTelemetry's operation mode is that instrumented applications voluntarily emit telemetry and the applications and instrumenting libraries control what information to include in the telemetry and when to emit. Remote configuration capabilities create a new attack vector via which applications may be compelled to expose sensitive information (such attack vector perhaps is not viable based on current Config Request/Response definition but may become a possibility once it evolves to include more configuration options). I would like to see this discussed.
  • What are performance implications of mass remote configuration. A malicious or accidental incorrect configuration that increases sampling and collection rates dramatically may result in a storm of telemetry data generation and bring down the collection network and/or the application. I would like to understand if we need safeguards that mitigate this.
  • It is not clear how Collector should proxy config requests. Collector operation is currently based on the notion of data pipelines that are of specific types (traces or metrics). Unclear if config requests should be part of the pipelines, are a new pipeline type or are a separate concept that needs to passthrough.

I think it is best to discuss these as part of an OTEP before discussing the protocol since protocol design/schema will be likely driven by the choices made in that OTEP.

@wtong98
Copy link
Contributor Author

wtong98 commented Jun 9, 2020

Thanks for all the comments! Really appreciate the interest. We'll finalize the proposal and open an OTEP as soon as possible

@wtong98
Copy link
Contributor Author

wtong98 commented Jul 29, 2020

An updated PR has been made with suggestions from here and the OTEP: #183

@wtong98 wtong98 closed this Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants