Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[connector/servicegraph] Coalesce different attr sets into single ScopeMetrics metric entry #34070

Merged
merged 4 commits into from
Jul 17, 2024

Conversation

evantorrie
Copy link
Contributor

Tests in connector/servicegraph were failing because the servicegraph buildMetrics() code was creating multiple metric entries for the same metric name within a single MetricScope. Although this may not be forbidden by the Otel specification, I think there is a general assumption that a metric name does not appear more than once within the same MetricScope.
Instead, different values (e.g. with different sets of attribute values) should be created as separate datapoints within the same metric.

The pmetrictest.CompareScopeMetrics test functionality is not designed to handle multiple metric entries with the same Name(). Instead, it is assumed that in cases where Order is ignored, the first entry found in the actual metrics which matches the name of the expected metrics must be the metric to compare.

This fix changes the buildMetrics() code to create one metric within a scope, and instead create multiple datapoints per metric when there are entries where the datapoint attribute set is unique (i.e. all entries in the internal maps serviceGraphConnector.req{Total,FailedTotal,ServerDurationSecondsCount} are coalesced into a single named metric as appropriate.)

Note: I don't have any past experience working with servicegraphconnector, but just observing that collectClientLatencyMetrics() and collectServerLatencyMetrics() both range over the same map - p.reqServerDurationSecondsCount, although the actual values collected in collectClientLatencyMetrics() are from p.reqServerDurationSeconds{Count,Sum,BucketCounts} and the values collected in collectServerLatencyMetrics() are from p.reqClientDurationSeconds{Count,Sum,BucketCounts}.

This seems a little asymmetrical, but I don't have enough experience to say whether this is an error or not.

Description: Fixes #33998

Testing: All other unit tests now complete, and the previously failing unit test now works reliably.

Documentation: No documentation added. This is a unit test fix.

Tests were failing because the `buildMetrics()` code was creating
multiple metric entries inside a single MetricScope. The `pdatatest`
test functionality is not designed to handle this.

The code now creates one metric within a scope, but creates one
datapoint for each unique set of attributes.
@evantorrie evantorrie marked this pull request as ready for review July 15, 2024 03:04
@evantorrie evantorrie requested a review from a team July 15, 2024 03:04
@evantorrie
Copy link
Contributor Author

No changelog entry as this is a fix for a flaky unit-test.

@jpkrohling jpkrohling added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jul 15, 2024
Copy link
Contributor

@t00mas t00mas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense 👍🏼

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the following PR has the same purpose as this, but might be a better solution. If this PR here has further improvements, we can consider merging it as well. I'm blocking this, so that we can take that into account before merging.

#34076

@evantorrie
Copy link
Contributor Author

Restore skipped test introduced in #34120

@evantorrie
Copy link
Contributor Author

I believe the following PR has the same purpose as this, but might be a better solution. If this PR here has further improvements, we can consider merging it as well. I'm blocking this, so that we can take that into account before merging.

#34076

#34076 was not successful in fixing the servicegraph connector flaky test. The root cause of the error is at a higher level than the order of datapoint attributes within a metric. As mentioned earlier

The pmetrictest.CompareScopeMetrics test functionality is not designed to handle multiple metric entries with the same Name(). Instead, it is assumed that in cases where Order is ignored, the first entry found in the actual metrics which matches the name of the expected metrics must be the metric to compare.

@jpkrohling
Copy link
Member

Would it make sense to undo #34076 as part of this PR?

@evantorrie
Copy link
Contributor Author

As far as I can tell from #34076, it attempts to sort all datapoint slices based on a canonicalized order of the attributes within each datapoint.

The existing compare{Number,Histogram,Summary}DataPointSlices functions in pmetrictest check for each "expected" datapoint whether there exists an "actual" datapoint using the following test to determine if the datapoints match.

if reflect.DeepEqual(edp.Attributes().AsRaw(), adp.Attributes().AsRaw())

If there is a match, but they are not in the same index order if e != a, then it will append an error indicating the out-of-order data attributes.

The only way to avoid an OutOfOrderErr and get to the point of comparing the actual values is to ensure that the order of the metric datapoints within a DataPointSlice are exactly the same between the expected and the actual which is what #34076 does by sorting both expected and actual in the same way.

So no, I don't think #34076 needs to be reverted.

It's a useful option, particularly if there are multiple datapoints within a single Resource::Scope::Metric structure, and it's non-deterministic as to which order they will be added.

I do see that the merged commit didn't fix the copy/paste error on the function comment for IgnoreDatapointAttributesOrder() though (it still refers to IgnoreMetricAttributeValue).

Fixing that comment is not really part of this PR, but I can put it in as an additional commit to avoid opening/merging another PR just for that comment change.

@jpkrohling
Copy link
Member

Thanks for the comments. I'll merge this as is, the comment can be fixed on a follow up PR

@jpkrohling jpkrohling merged commit f2cfc2d into open-telemetry:main Jul 17, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
connector/servicegraph Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[connector/servicegraph] Unit test failure: TestConnectorConsume/test_fix_failed_label_not_work
4 participants