CPU stuck at 100% upon network drops to Azure #850

Open
bradley-carrion opened this issue Aug 13, 2024 · 7 comments

@bradley-carrion

bradley-carrion commented Aug 13, 2024

Describe the question/issue

After we deployed the aws-for-fluent-bit image at scale with our own Fluent Bit configuration, which includes new Azure Blob outputs, we occasionally see errors like this:

[error] [http_client] broken connection to {our_storage_account}.blob.core.windows.net:443

After enough of these we see the container reach a point of no return where CPU spikes to 100% and stays there until the ALB finally marks the task as unhealthy.
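
To quantify how many of these errors precede the CPU spike, the matching log lines can be tallied per minute. The following is a minimal sketch, not part of Fluent Bit: it assumes each line carries the `[YYYY/MM/DD HH:MM:SS]` timestamp prefix seen in the other log excerpts in this thread, and the function name `broken_connections_per_minute` is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape (timestamp prefix taken from other excerpts in this thread):
# [2024/08/13 10:00:01] [error] [http_client] broken connection to host:443
BROKEN_CONN = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}\]\s+"
    r"\[\s*error\]\s+\[http_client\]\s+broken connection to\s+\S+"
)


def broken_connections_per_minute(lines: Iterable[str]) -> Counter:
    """Count 'broken connection' errors, keyed by 'YYYY/MM/DD HH:MM'."""
    counts: Counter = Counter()
    for line in lines:
        match = BROKEN_CONN.match(line)
        if match:
            # group(1) is the timestamp truncated to the minute.
            counts[match.group(1)] += 1
    return counts
```

Feeding it the container's log output (for example, an open file object) yields a per-minute count that can be lined up against the time the CPU pins at 100%.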

We had to move off the aws-for-fluent-bit image and onto the latest Fluent Bit release, v3.1.4.

Configuration

Fluent Bit Log Output

We have enabled debug logs, and nothing in the logs indicates that the CPU should be having issues.

Fluent Bit Version Info

amazon/aws-for-fluent-bit:2.32.2
which uses v1.9.10 of fluent bit under the hood.

Cluster Details

We're running ECS Fargate w/ sidecar deployment of aws-for-fluent-bit.

(This repros locally btw)

Application Details

I was able to repro this locally with the following throughput:

  • ~80 logs / sec
  • ~1 KB / log

Steps to reproduce issue

  1. Start the Fluent Bit container locally, pointed at an Azure Blob output
  2. Start sending as many logs as you can locally (see the throughput details above)
  3. Turn off your network connection so that requests to Azure start failing; your requests to Fluent Bit should continue to succeed
  4. Wait about 30-60s (longer if you want to really pressure test it)
  5. Turn your network connection back on
  6. Repeat steps 2 - 5 or watch the fluent bit container explode
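
For step 2, a small generator can produce roughly the throughput described above (~80 logs/sec at ~1 KB each). This is a sketch under assumptions: it only writes JSON lines to a stream and leaves delivery to whatever input your Fluent Bit configuration exposes, which this issue does not specify. `make_record`, `paced_records`, and `run` are hypothetical helper names.

```python
import json
import sys
import time
from typing import Iterator, TextIO


def make_record(seq: int, size_bytes: int = 1024) -> str:
    """Return one JSON log line padded to size_bytes characters."""
    record = {"seq": seq, "msg": ""}
    overhead = len(json.dumps(record))
    # Pad the message so the serialized line hits the target size.
    record["msg"] = "x" * max(0, size_bytes - overhead)
    return json.dumps(record)


def paced_records(
    rate_per_sec: float, duration_sec: float, size_bytes: int = 1024
) -> Iterator[str]:
    """Yield records at about rate_per_sec for duration_sec seconds."""
    interval = 1.0 / rate_per_sec
    total = round(rate_per_sec * duration_sec)
    start = time.monotonic()
    for seq in range(total):
        # Sleep until this record's scheduled time so pacing does not drift.
        delay = start + seq * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        yield make_record(seq, size_bytes)


def run(
    out: TextIO = sys.stdout, rate_per_sec: float = 80, duration_sec: float = 60
) -> int:
    """Write paced records to out, one per line; return how many were written."""
    written = 0
    for line in paced_records(rate_per_sec, duration_sec):
        out.write(line + "\n")
        out.flush()
        written += 1
    return written
```

Calling `run()` with the defaults emits about 4,800 lines over one minute, which covers the 30-60s outage window in step 4; raise `duration_sec` to pressure test for longer.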

Related Issues

No related issues, but a suspected fix is in fluent/fluent-bit#5918.

My suggestion would be to consider upgrading to the latest fluent bit version.

@swapneils
Contributor

Two questions here to clarify the specific code-segments that are involved:

  1. So upgrading to build aws-for-fluent-bit with 3.1.4 prevented this issue from occurring?
  2. Which output plugin are you using here?

@guidoiaquinti

Since ~2 hours, this is broken on latest too.

@bradley-carrion
Author

@swapneils Apologies for the delayed response.

Two questions here to clarify the specific code-segments that are involved:

  1. So upgrading to build aws-for-fluent-bit with 3.1.4 prevented this issue from occurring?

No, we completely dropped the aws-for-fluent-bit image and are purely using the standard fluent-bit 3.1.4 image.

  2. Which output plugin are you using here?

We are using the Azure Blob plugin.

@swapneils
Contributor

Since ~2 hours, this is broken on latest too.

@guidoiaquinti Are you saying you tested this case ~2 hours ago, or that this case was previously working for you and is now failing with the latest tag?

In the latter case, is the public.ecr.aws/aws-observability/aws-for-fluent-bit:init-debug-2.32.2.20240820 image working without issues? The latest release shouldn't be exhibiting different behavior from stable since we didn't change any fluent-bit code.

@guidoiaquinti

Maybe this is completely unrelated, and to be honest, I'm not sure what has changed (I'm currently on mobile with limited connectivity), but all our deployments started failing approximately two hours ago with the following errors:

[2024/10/07 20:17:15] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2024/10/07 20:17:15] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

The timeframe aligns with the update of the latest tag. Reverting to stable fixes it. While not strictly related to this GitHub issue, I arrived here because the bug above seems to be occurring in the same Fluent Bit version of the report.

@bradley-carrion
Author

bradley-carrion commented Oct 7, 2024

Maybe this is completely unrelated, and to be honest, I'm not sure what has changed (I'm currently on mobile with limited connectivity), but all our deployments started failing approximately two hours ago with the following errors:

[2024/10/07 20:17:15] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2024/10/07 20:17:15] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

The timeframe aligns with the update of the latest tag. Reverting to stable fixes it. While not strictly related to this GitHub issue, I arrived here because the bug above seems to be occurring in the same Fluent Bit version of the report.

This seems unrelated: my issue is not exclusive to the new latest tag, I did not see the error message you're referring to, and the underlying Fluent Bit version has not been upgraded from 1.9.10, which is the compatibility issue I'm calling out here. I'd recommend always using the stable version and creating a new issue for what you're seeing, @guidoiaquinti.

@swapneils
Contributor

Thanks Bradley (and sorry for this additional ping :) )

@guidoiaquinti After making the new Issue, could you pin to 2.32.2.20240820 for the moment and email me an AWS Account ID at swapneis@amazon.com?

I'm asking you to pin because we plan to update our stable image later this week unless we see issues in stability testing (which I don't expect).
Delaying the update further without a clear availability risk would harm other customers' workflows (e.g. security scanning), but I also don't want to break yours.

The account ID is so I can share test aws-for-fluent-bit images with you to facilitate investigation.
