Silently dropped logs with out of box config #1255

Closed · laupow opened this issue Dec 17, 2020 · 6 comments
Labels: bug (Something isn't working)

laupow commented Dec 17, 2020

Describe the bug
Higher-volume logging environments need better configuration guardrails to ensure logs aren't dropped silently.

Recently, two different engineers expected to find logs in our production environment and found none.

One instance was a long-running service with intermittently missing messages (screenshot attached). Another instance was a new Deployment whose logs were not captured at all (logs were verified with kubectl logs <pod>; screenshot in ticket).

[Screenshot: missing logs in Sumo]

Logs
Logs available in ticket

Command used to install/upgrade Collection

helm upgrade -i -f sumologic-collector/base-eks-values.yaml \
  -f sumologic-collector/${ENVIRONMENT}-values.yaml \
  --namespace $NAMESPACE \
  --kube-context ${KUBECTL_CONTEXT} \
  $RELEASE_NAME \
  --version v1.3.1 \
  --set sumologic.accessKey=$SUMOLOGIC_ACCESS_KEY \
  sumologic/sumologic 

with Helm 2

Configuration

fluentd:
  logs:
    autoscaling:
      enabled: true
    containers:
      sourceCategory: '%{pod_name}'
      sourceCategoryPrefix: production/
      sourceCategoryReplaceDash: '-'
      sourceName: '%{namespace}.%{pod}.%{container}'
    default:
      sourceCategoryPrefix: production/
    kubelet:
      sourceCategoryPrefix: production/
    statefulset:
      nodeSelector:
        company.com/nodegroup-name: general-public
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: general-public
    systemd:
      sourceCategoryPrefix: production/
  persistence:
    enabled: true
prometheus-operator:
  enabled: false
sumologic:
  accessId: <removed>
  accessKey: <removed>
  clusterName: us-east-1-eks-production
  metrics:
    enabled: false

To Reproduce
I have not been able to reproduce the issue. On Dec 15 we manually changed the HPA minimum from 3 to 7; nobody has reported issues since then, but 🤷

The issue occurs in our production environment so there is somewhat of a disincentive to reproduce it :)
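
For reference, the Dec 15 change was essentially a one-off patch of the HPA minimum. A hedged sketch (the namespace and HPA name below are guesses based on the chart's default naming; check with kubectl get hpa first, and pinning the equivalent autoscaling minimum in the values file would be the more durable fix):

# Check the actual HPA name and namespace first
kubectl -n sumologic get hpa

# One-off bump of the fluentd logs HPA minimum (resource name assumed; adjust to match)
kubectl -n sumologic patch hpa collection-sumologic-fluentd-logs \
  --patch '{"spec": {"minReplicas": 7}}'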

Expected behavior
Provide a clear signal (pod crash, log message) when there is a capacity issue or another condition that might cause logs to be dropped.
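
Until there is such a signal, a rough interim check is to look at the fluentd aggregator's own error log and its buffer/retry counters. This is a sketch only: the pod placeholder, namespace, and metrics port are assumptions (24231 is the fluent-plugin-prometheus default), so adjust to match the deployment:

# Look for buffer overflow / retry warnings in the aggregator's own log
kubectl -n sumologic logs <fluentd-logs-pod> --tail=500 | grep -iE "bufferoverflow|retry|drop"

# Inspect buffer and retry counters exposed by fluent-plugin-prometheus (port assumed)
kubectl -n sumologic port-forward <fluentd-logs-pod> 24231:24231 &
curl -s http://localhost:24231/metrics | grep -E "fluentd_output_status_(buffer_queue_length|retry_count)"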

Environment (please complete the following information):

  • Collection version (e.g. helm ls -n sumologic): 1.3.1
  • Kubernetes version (e.g. kubectl version): 1.15.11
  • Cloud provider: AWS
  • Others:

Anything else we need to know

  • I noticed 1.3.4 includes a memory-related fix, which we don't have running yet. I have observed a few fluentd pod restarts, but they haven't correlated with the times when logs were missing.
  • The out-of-the-box HPA in our environment is very spiky. I have to imagine this much pod churn is not a good configuration (see the command sketch after this list).
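
A quick way to watch that churn, with the same caveat that the resource name and namespace are assumptions:

# Watch the fluentd logs HPA scale up and down in real time (name assumed; adjust to match)
kubectl -n sumologic get hpa collection-sumologic-fluentd-logs --watch

# Recent scaling decisions and events
kubectl -n sumologic describe hpa collection-sumologic-fluentd-logs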

fluentd pod metrics, HPA minimum adjusted Dec 15

[Screenshot: fluentd pod metrics, 7 days]

Sumo collector volume

[Screenshot: Sumo collector volume, Dec 17 2020]

laupow added the bug (Something isn't working) label on Dec 17, 2020

laupow commented Dec 17, 2020

Actually, a bit of reproducibility. Scaling fluentd down to one pod triggered these logs in staging.
[Screenshot: log messages in staging, Dec 17 2020]

This makes sense. I think what I'm asking for is how to guarantee we don't drop logs with the fluent-bit/fluentd aggregator architecture. I'm not 100% convinced the HPA guarantees no missing messages.
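
For the record, the staging scale-down above was essentially the following (statefulset name and namespace are assumptions based on the chart's default naming; with autoscaling enabled the HPA may scale it back up shortly afterwards):

# Scale the fluentd logs aggregator down to a single replica to provoke the errors
kubectl -n sumologic scale statefulset collection-sumologic-fluentd-logs --replicas=1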

perk-sumo (Contributor) commented

Hi @laupow - thank you for reporting!
We are taking a look at this.

sumo-drosiek (Contributor) commented

@laupow Sorry for the late response.

By using a graceful shutdown period together with liveness and readiness probes, we ensure that logs come through without gaps.

We observed that fluent-bit below 1.6.10 had been losing logs due to invalid log rotation handling. We recommend using the latest version of our collection, which uses the fixed image.

In addition, we are going to improve load balancing and HPA behavior by disabling keepalive for fluent-bit (#1495).
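
A rough sketch of that upgrade, reusing the values files from the original install (2.1.0 per the follow-up below; the documented 1.x to 2.x migration steps still apply and are not shown here):

# Upgrade the collection to a release that ships the fixed fluent-bit image
helm upgrade -i $RELEASE_NAME sumologic/sumologic \
  --namespace $NAMESPACE \
  --version 2.1.0 \
  -f sumologic-collector/base-eks-values.yaml \
  -f sumologic-collector/${ENVIRONMENT}-values.yaml \
  --set sumologic.accessKey=$SUMOLOGIC_ACCESS_KEY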


laupow commented Mar 29, 2021

Awesome, thanks for the update. Looking forward to v2.1 👍

sumo-drosiek (Contributor) commented

@laupow 2.1.0 is released 🎉
Please check how it works for you :)

perk-sumo (Contributor) commented

Hi @laupow, let me close this issue. Please let me know if the problem still exists.
