Invalid gzip payload sent by splunk HEC exporter #34255

Closed
PaulBernier opened this issue Jul 25, 2024 · 3 comments
Labels: bug, exporter/splunkhec, needs triage

Comments

@PaulBernier

PaulBernier commented Jul 25, 2024

Component(s)

exporter/splunkhec

What happened?

Description

At high throughput, the Splunk HEC exporter occasionally returns errors. I collected four different ones:

  • Post "https://<redacted>/services/collector/raw?index=main&sourcetype=test&source=eventhub://pbernier-premium1.servicebus.windows.net/zscaler_eh_cef-2%3B&host=<redacted>": net/http: HTTP/1.x transport connection broken: http: ContentLength=2916 with Body length 0
  • flate: closed writer
  • "HTTP/1.1 400 Unparsable gzip header in request data\r\nContent-Length: 261\r\nConnection: keep-alive\r\nContent-Type: text/html; charset=UTF-8\r\nDate: Wed, 24 Jul 2024 20:39:34 GMT\r\nServer: Splunkd\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n<!doctype html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"><title>400 Unparsable gzip header in request data</title></head><body><h1>Unparsable gzip header in request data</h1><p>HTTP Request was malformed.</p></body></html>\r\n"
  • Permanent error: "HTTP/1.1 400 Bad Request\r\nContent-Length: 27\r\nConnection: keep-alive\r\nContent-Type: application/json; charset=UTF-8\r\nDate: Thu, 25 Jul 2024 17:49:23 GMT\r\nServer: Splunkd\r\nVary: Authorization\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n{\"text\":\"No data\",\"code\":5}"

Those errors, and especially the second one, make me wonder whether there is an issue with the cancellableGzipWriter that would occasionally leave it in an inconsistent state. The only way I can see this happening is if a buffer were being used concurrently. I've looked at the code myself and couldn't find anything obvious; the usage of sync.Pool does make sense.
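
To illustrate how the second error can arise, here is a minimal standalone sketch using only the standard library. It is not the exporter's cancellableGzipWriter; the function names writeAfterClose and writeAfterReset are made up for this example. It closes a gzip.Writer and then writes to it again, which is roughly what would happen if two users ended up sharing one writer. On recent Go versions the failed write should report `flate: closed writer`.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// writeAfterClose closes a gzip.Writer and then writes to it again without
// calling Reset. It returns the error produced by that second write.
func writeAfterClose() error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte("first event")); err != nil {
		return fmt.Errorf("unexpected error on first write: %w", err)
	}
	if err := zw.Close(); err != nil {
		return fmt.Errorf("unexpected error on close: %w", err)
	}
	// Simulates a second user touching the writer after it was closed.
	_, err := zw.Write([]byte("second event"))
	return err
}

// writeAfterReset shows the safe reuse path: Reset re-arms a closed writer
// so it can compress into a new destination.
func writeAfterReset() error {
	var first, second bytes.Buffer
	zw := gzip.NewWriter(&first)
	if _, err := zw.Write([]byte("first event")); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	zw.Reset(&second)
	if _, err := zw.Write([]byte("second event")); err != nil {
		return err
	}
	return zw.Close()
}

func main() {
	fmt.Println("write after Close:", writeAfterClose())
	fmt.Println("write after Reset:", writeAfterReset())
}
```

A writer that is reset before every use should never hit this path on its own, which is why the error points toward the writer or its buffer being touched by more than one request.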

Steps to Reproduce

I have 300 collectors sending a total of 309 MB/s at 39,000 events/s. Events are sent one by one (a single event per HEC HTTP request) to a single Splunk Cloud stack. About 1 request/s fails (roughly 1 out of 39,000).

Expected Result

No errors; Splunk should not receive incomplete payloads.

Actual Result

The errors listed above.

Collector version

v0.104.0

Environment information

Environment

alpine linux
Go 1.22

OpenTelemetry Collector configuration

exporters:
  splunk_hec/1:
    token: "{{.hecToken}}"
    endpoint: "{{.hecEndpoint}}/services/collector/raw?index={{.splunkIndex}}&sourcetype={{.splunkSourceType}}&source=eventhub://{{.eventHubFullyQualifiedNamespace}}/{{.eventHubName}}"
    sourcetype: "{{.splunkSourceType}}"
    index: "{{.splunkIndex}}"
    export_raw: true
    max_content_length_logs: {{.exporterMaxContentLengthLogs}}
    retry_on_failure:
      enabled: false
    sending_queue:
      enabled: false
    hec_metadata_to_otel_attrs:
      source: source
      host: host
    tls:
      min_version: 1.2

Log output

No response

Additional context

No response

@PaulBernier added the bug and needs triage labels Jul 25, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@PaulBernier

I found the root cause. From https://pkg.go.dev/net/http#Client.Do:

The request Body, if non-nil, will be closed by the underlying Transport, even on errors. The Body may be closed asynchronously after Do returns.

Because the Body (here, the buffer) can still be closed asynchronously, it is unsafe to return it to the pool: it might be closed after it has already been picked up again, corrupting the data. There is a GitHub issue about this in the Go repo, golang/go#51907, where you can see that some high-profile projects such as Kubernetes were affected as well.
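
One possible way to avoid the hazard is sketched below, under stated assumptions. It is not the exporter's actual code or its eventual fix, and bufPool, pooledBody, and newPooledBody are hypothetical names. The idea is to let the request body itself own the pooled buffer and return it to the sync.Pool only from Close, so that the buffer cannot be reused while the Transport may still hold it. A mutex guards Read and Close in case they are called from different goroutines.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// bufPool is a hypothetical pool of reusable request buffers.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// errBodyClosed is returned when the body is read after Close.
var errBodyClosed = errors.New("pooledBody: read after close")

// pooledBody is a hypothetical io.ReadCloser that owns a pooled buffer and
// returns it to the pool only when Close is called.
type pooledBody struct {
	mu     sync.Mutex
	buf    *bytes.Buffer // nil once the buffer has gone back to the pool
	offset int           // number of bytes already handed out by Read
}

// newPooledBody takes a buffer from the pool and copies payload into it.
func newPooledBody(payload []byte) *pooledBody {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(payload)
	return &pooledBody{buf: buf}
}

// Read copies unread bytes out of the owned buffer.
func (b *pooledBody) Read(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.buf == nil {
		return 0, errBodyClosed
	}
	data := b.buf.Bytes()
	if b.offset >= len(data) {
		return 0, io.EOF
	}
	n := copy(p, data[b.offset:])
	b.offset += n
	return n, nil
}

// Close returns the buffer to the pool exactly once; later calls are no-ops.
func (b *pooledBody) Close() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.buf != nil {
		bufPool.Put(b.buf)
		b.buf = nil
	}
	return nil
}

func main() {
	body := newPooledBody([]byte("event payload"))
	data, err := io.ReadAll(body)
	fmt.Println(string(data), err)
	fmt.Println("close:", body.Close())
}
```

If such a body were passed to http.NewRequest, the caller would need to set req.ContentLength explicitly, since net/http only infers the length for a few known body types such as *bytes.Buffer and *bytes.Reader. The simpler alternative is to not pool buffers that are handed to the HTTP client at all.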

@crobert-1

Thanks for filing @PaulBernier, and for including all of the information!

I'm going to close this as a duplicate of #34357, based on the code owner's response in that issue.

@crobert-1 closed this as not planned Aug 2, 2024