
[Possible Bug] fluent-bit engine shutdown and FB pod stays RUNNING after SIGTERM #859

Open
containerckf opened this issue Oct 5, 2024 · 1 comment

@containerckf

Describe the question/issue

We are experiencing what appears to be a fluent-bit related bug (low frequency and sporadic) in which the FB pod stops correctly sending logs from the node. In addition, node disk space slowly fills up as flb files are leaked onto the disk. The affected FB pod stays in the RUNNING state even after a SIGTERM is received.

The fluent-bit engine shut down after 5 seconds; however, child processes/tasks such as input:tail:tail.0 kept running and collecting flb files. The container was left running in a non-working state until manual intervention.
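For context, the 5-second window mentioned above matches Fluent Bit's shutdown grace period, which is the Grace key in the [SERVICE] section and defaults to 5 seconds. A minimal sketch of the relevant settings (values are illustrative assumptions, not our exact config; storage.path is inferred from the disk listing further below):

[SERVICE]
    # Grace is how long the engine waits for in-flight tasks/flushes at shutdown.
    # The default of 5 seconds matches "[engine] service will shutdown in max 5 seconds".
    Grace             5
    # Filesystem buffering path where the flb chunk files accumulate.
    storage.path      /var/fluent-bit/state/flb-storage/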

Fluent Bit Log Output

[engine] caught signal (SIGTERM)
[ info] [input] pausing tail.0
[ info] [input] pausing tail.1
[ info] [input] pausing tail.2
[ info] [input] pausing systemd.3
[ info] [input] pausing tail.4
[ info] [input] pausing tail.5
[ info] [input] pausing tail.6
[ info] [input] pausing tail.7
[ info] [input] pausing storage_backlog.8
[ warn] [engine] service will shutdown in max 5 seconds
[ info] [task] tail/tail.0 has 128 pending task(s):
...
[ info] [task]   task_id=0 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=1 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=2 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=3 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=4 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=5 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=6 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=7 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=8 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=9 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
...
[ info] [engine] service has stopped (215 pending tasks)
[output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopping...
  • The following shows the flb files being leaked to the disk:
root@ip-:/var/fluent-bit/state/flb-storage/tail.0# while true; do echo "number of flb files" $(ls -1 | wc -l); sleep 1; done
number of flb files 5871
number of flb files 5866
number of flb files 5862
number of flb files 5859
number of flb files 5860
number of flb files 5856
number of flb files 5854
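  • For reference, a rough sketch of checks that can confirm the stuck state (assumes kubectl access and that ps is available in the image; namespace flags omitted):

# Pod is still reported as Running even though the engine logged "service has stopped".
kubectl get pod aws-for-fluent-bit-xn9hn -o jsonpath='{.status.phase}'

# The fluent-bit process (PID 1) is still alive inside the container.
kubectl exec aws-for-fluent-bit-xn9hn -- ps -o pid,stat,comm

# Total size of the leaked chunk files on the node.
du -sh /var/fluent-bit/state/flb-storage/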

Fluent Bit Version Info

  • aws-for-fluent-bit version 2.31.12.20231011
  • Pod Configuration:
Name:                 aws-for-fluent-bit-xn9hn
...
Controlled By:  DaemonSet/aws-for-fluent-bit
Containers:
  aws-for-fluent-bit:
    Container ID:   containerd://fe13c77f1c340a68b76a7b749b32d5359aa85905b69f208b9941b8d49eaf6d71
    Image:          public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.12.20231011
    Image ID:       public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:70d9a689cd23bd1f37ad61e1a31853a1dc32f504926c071ffc60375f68d5ce31
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 13 Sep 2024 10:04:39 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  400Mi
    Requests:
      cpu:     500m
      memory:  100Mi
    Liveness:  http-get http://:2020/api/v1/health delay=30s timeout=10s period=10s #success=1 #failure=2
    Environment:
      AWS_REGION:                   us-east-1
      CLUSTER_NAME:              x
      HTTP_SERVER:                  
      HTTP_PORT:                    2020
      READ_FROM_HEAD:               Off
      READ_FROM_TAIL:               On
      HOST_NAME:                     (v1:spec.nodeName)
      HOSTNAME:                     aws-for-fluent-bit-xn9hn (v1:metadata.name)
      NODE_NAME:                     (v1:spec.nodeName)
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_ROLE_ARN:                 arn:aws:iam::807800687496:role/mosh-prodb-useast1-eks-fluent-bit
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /fluent-bit/etc/ from fluentbit-config (rw)
      /run/log/journal from runlogjournal (ro)
      /var/fluent-bit/state from fluentbitstate (rw)
      /var/log from varlog (ro)
      /var/log/dmesg from dmesg (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4hv8q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  fluentbit-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      aws-for-fluent-bit
    Optional:  false
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:  
  dmesg:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/dmesg
    HostPathType:  
  fluentbitstate:
    Type:          HostPath (bare host directory volume)
    Path:          /var/fluent-bit/state
    HostPathType:  
  kube-api-access-4hv8q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>
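  • One possible reason the pod stays RUNNING: the liveness probe above only checks the Fluent Bit HTTP server, and if that endpoint keeps responding after the engine has stopped, the kubelet never restarts the container. A quick manual check (hypothetical; substitute the real pod IP):

# If this still returns a healthy response while tasks are stuck, the liveness probe keeps passing.
curl -sv http://<pod-ip>:2020/api/v1/health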

Cluster Details

Version Information
Kubernetes: 1.28
Platform: eks.18
  • Addon Information:
kube-proxy: v1.28.2-eksbuild.2
coredns: v1.10.1-eksbuild.5
vpc-cni: v1.16.4-eksbuild.2
aws-ebs-csi-driver: v1.24.1-eksbuild.1

Application Details

Steps to reproduce issue

  • We have not been able to reproduce this on demand; the issue is low frequency and sporadic. A rough, unconfirmed sketch of how one might try to trigger it is below.
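Hypothetical reproduction attempt (pod name taken from the output above; this has not been confirmed to reproduce the bug):

# 1. Build a backlog of pending chunks, e.g. by blocking log delivery to CloudWatch
#    (throttle egress or temporarily remove logs:PutLogEvents from the pod's IAM role)
#    while generating enough log volume that tail.0 accumulates pending tasks in flb-storage.
# 2. Send SIGTERM the way the kubelet does, by deleting the pod with a grace period:
kubectl delete pod aws-for-fluent-bit-xn9hn --grace-period=60
# 3. Watch whether the container actually exits, or stays Running with flb files still growing:
kubectl get pod aws-for-fluent-bit-xn9hn -w
watch -n1 'ls -1 /var/fluent-bit/state/flb-storage/tail.0 | wc -l'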

Related Issues

  • We have combed through existing issues a few times and have not been able to find a similar tracker.
@mw-tlhakhan

Thanks @containerckf for creating this issue. I can help provide further details on it.
