Config-reloader hangs during config check #153

Closed
a-b-v opened this issue Nov 20, 2020 · 11 comments

@a-b-v

a-b-v commented Nov 20, 2020

We use the operator in several k8s clusters, but in one of them the config-reloader does not create config files.

The log with a bad config is:

time="2020-11-20T07:29:07Z" level=info msg="Version: v1.13.0-beta.2-5-g0d3a268"
time="2020-11-20T07:29:07Z" level=info msg="Config: &{Master: KubeConfig: FluentdRPCPort:24444 TemplatesDir:/templates OutputDir:/fluentd/etc LogLevel:debug AnnotConfigmapName:logging.csp.vmware.com/fluentd-configmap AnnotStatus:logging.csp.vmware.com/fluentd-status DefaultConfigmapName:fluentd-config IntervalSeconds:45 Datasource:default CRDMigrationMode:false FsDatasourceDir: AllowFile:false ID:kfo-log-router FluentdValidateCommand:/usr/local/bundle/bin/fluentd -p /fluentd/plugins MetaKey:metadata MetaValues:cluster=cst_local, LabelSelector: KubeletRoot:/var/lib/kubelet Namespaces:[test] PrometheusEnabled:false AllowTagExpansion:false AdminNamespace:kube-system level:0 ParsedMetaValues:map[] ParsedLabelSelector:}"
time="2020-11-20T07:29:08Z" level=info msg="Validator using /usr/local/bundle/bin/fluentd at version fluentd 1.9.1"
W1120 07:29:08.469015       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-11-20T07:29:08Z" level=info msg="Connected to cluster at https://172.23.0.1:443"
time="2020-11-20T07:29:08Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:29:08Z" level=info msg="Synced local informer with upstream Kubernetes API"
time="2020-11-20T07:29:08Z" level=info msg="Running main control loop"
time="2020-11-20T07:29:08Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:29:08Z" level=debug msg="Loaded config data from config map: test/fluentd-config"
time="2020-11-20T07:29:09Z" level=debug msg="Checked config for namespace test with fluentd and got: 2020-11-20 07:29:09 +0000 [error]: config error in:\n<match kube.test.**>\n  @type http\n</match>\n\n2020-11-20 07:29:09 +0000 [error]: config error file=\"/tmp/validate-ext-test131905260\" error_class=Fluent::ConfigError error=\"'endpoint_url' parameter is required\""
time="2020-11-20T07:29:09Z" level=info msg="Configuration for namespace test cannot be validated with fluentd"
time="2020-11-20T07:29:09Z" level=debug msg="Saving status: &Namespace{ObjectMeta:{test      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[logging.csp.vmware.com/fluentd-status:2020-11-20 07:29:09 +0000 [error]: config error in:\n<match kube.test.**>\n  @type http\n</match>\n\n2020-11-20 07:29:09 +0000 [error]: config error file=\"/tmp/validate-ext-test131905260\" error_class=Fluent::ConfigError error=\"'endpoint_url' parameter is required\"] [] []  []},Spec:NamespaceSpec{Finalizers:[],},Status:NamespaceStatus{Phase:,Conditions:[]NamespaceCondition{},},}, <nil>"
time="2020-11-20T07:29:09Z" level=info msg="cannot notify fluentd: Post http://127.0.0.1:24444/api/config.reload: dial tcp 127.0.0.1:24444: connect: connection refused"
time="2020-11-20T07:29:09Z" level=info msg="Running main control loop"
time="2020-11-20T07:29:09Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:29:09Z" level=debug msg="Loaded config data from config map: test/fluentd-config"
time="2020-11-20T07:29:10Z" level=debug msg="Checked config for namespace test with fluentd and got: 2020-11-20 07:29:10 +0000 [error]: config error in:\n<match kube.test.**>\n  @type http\n</match>\n\n2020-11-20 07:29:10 +0000 [error]: config error file=\"/tmp/validate-ext-test723134811\" error_class=Fluent::ConfigError error=\"'endpoint_url' parameter is required\""
time="2020-11-20T07:29:10Z" level=info msg="Configuration for namespace test cannot be validated with fluentd"

After the next config change:

    <match **>
      @type null
    </match>

these lines are added to the log:

time="2020-11-20T07:35:32Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:35:32Z" level=info msg="Running main control loop"
time="2020-11-20T07:35:32Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:35:32Z" level=debug msg="Loaded config data from config map: test/fluentd-config"
time="2020-11-20T07:35:33Z" level=debug msg="Checked config for namespace test with fluentd and got: "
time="2020-11-20T07:35:33Z" level=debug msg="Saving status: &Namespace{ObjectMeta:{test      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[logging.csp.vmware.com/fluentd-status:] [] []  []},Spec:NamespaceSpec{Finalizers:[],},Status:NamespaceStatus{Phase:,Conditions:[]NamespaceCondition{},},}, <nil>"

But with a config like this:

    <match **>
      @type elasticsearch
    </match>

the log stops after loading the config:

time="2020-11-20T07:38:26Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:38:26Z" level=info msg="Running main control loop"
time="2020-11-20T07:38:26Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:38:26Z" level=debug msg="Loaded config data from config map: test/fluentd-config"

No notification is sent to fluentd, and the config files in /fluentd/etc are absent.
Here is the log after the next config changes:

time="2020-11-20T07:39:53Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"
time="2020-11-20T07:40:35Z" level=debug msg="Using default configmap name ('fluentd-config') for namespace 'test'"

config-reloader command-line:

    - /bin/config-reloader
    - --datasource=default
    - --default-configmap=fluentd-config
    - --interval=45
    - --log-level=debug
    - --output-dir=/fluentd/etc
    - --templates-dir=/templates
    - --id=kfo-log-router
    - --fluentd-binary
    - /usr/local/bundle/bin/fluentd -p /fluentd/plugins
    - --kubelet-root
    - /var/lib/kubelet
    - --meta-key=metadata
    - --meta-values=cluster=cluster_local,
    - --admin-namespace=kube-system
    - --namespaces
    - test

If I run config-reloader via exec with the following command line, it works:

/bin/config-reloader  --datasource=fs --fs-dir=/tmp --default-configmap=fluentd-config --interval=0 --log-level=debug --output-dir=/tmp/out --templates-dir=/templates --id=kfo-log-router --fluentd-binary "/usr/local/bundle/bin/fluentd -p /fluentd/plugins" --kubelet-root /var/lib/kubelet --meta-key=metadata --meta-values=cluster=ccluster_local,

But with a command line like this, it doesn't work:

/bin/config-reloader  --datasource=default --default-configmap=fluentd-config --interval=0 --log-level=debug --output-dir=/tmp/out --templates-dir=/templates --id=kfo-log-router --fluentd-binary "/usr/local/bundle/bin/fluentd -p /fluentd/plugins" --kubelet-root /var/lib/kubelet --meta-key=metadata --meta-values=cluster=cluster_local,
@a-b-v
Author

a-b-v commented Nov 25, 2020

I figured out the problem: there was no access to Elasticsearch. Still, I suggest adding the --dry-run parameter to the fluentd command line used during config checking.

@kirek007

Hey, I'm having exactly the same issue. Have you found a workaround? I'm afraid it's going to hang on the production cluster and I'll have to restart the whole stack to bring it back to life. The problem is that it's not clear which piece of config is responsible for that (I'm using CRs for configuration).

@a-b-v
Author

a-b-v commented Jan 24, 2021

> Hey, I'm having exactly the same issue. Have you found a workaround? I'm afraid it's going to hang on the production cluster and I'll have to restart the whole stack to bring it back to life. The problem is that it's not clear which piece of config is responsible for that (I'm using CRs for configuration).

Hi. In my case the target server parameters were set incorrectly. config-reloader does not actually hang; it makes 15 connection attempts, which takes a long time. You can add --dry-run to the fluentd command line in the config-reloader deployment, or reduce the number of attempts in the ConfigMap that holds the fluentd settings for the specific namespace.
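
As a rough illustration of the second option (an editorial sketch, not taken from the reporter's cluster): assuming the 15 attempts come from the Elasticsearch output plugin checking the server version at startup, fluent-plugin-elasticsearch has the parameters `max_retry_get_es_version` and `verify_es_version_at_startup` that could be set in the namespace's fluentd-config ConfigMap. The host and port below are hypothetical placeholders, and whether these parameters are available depends on the plugin version bundled in the image.

```
<match **>
  @type elasticsearch
  # Hypothetical target; replace with the real Elasticsearch endpoint.
  host elasticsearch.example.svc
  port 9200
  # Assumption: the slow validation is caused by the startup version check.
  # Lower the number of retries so validation gives up sooner.
  max_retry_get_es_version 2
  # Alternatively, skip the startup check entirely:
  # verify_es_version_at_startup false
</match>
```

Fewer retries shorten the validation run, but they do not fix the underlying connectivity problem, so the output will still fail at runtime if Elasticsearch is unreachable.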

@kirek007

> Hi. In my case the target server parameters were set incorrectly. config-reloader does not actually hang; it makes 15 connection attempts, which takes a long time. You can add --dry-run to the fluentd command line in the config-reloader deployment, or reduce the number of attempts in the ConfigMap that holds the fluentd settings for the specific namespace.

Thanks, I'll take a look into configuration then!

@jliao2011
Contributor

> Hey, I'm having exactly the same issue. Have you found a workaround? I'm afraid it's going to hang on the production cluster and I'll have to restart the whole stack to bring it back to life. The problem is that it's not clear which piece of config is responsible for that (I'm using CRs for configuration).

> Hi. In my case the target server parameters were set incorrectly. config-reloader does not actually hang; it makes 15 connection attempts, which takes a long time. You can add --dry-run to the fluentd command line in the config-reloader deployment, or reduce the number of attempts in the ConfigMap that holds the fluentd settings for the specific namespace.

@a-b-v hey, I seem to have this issue too. When you say "the fluentd command line in the config-reloader deployment", where do you mean? I tried adding it to the daemonset here: https://github.com/vmware/kube-fluentd-operator/blob/master/charts/log-router/templates/daemonset.yaml#L113 but that didn't seem to work.

@a-b-v
Author

a-b-v commented Feb 25, 2021

> Hey, I'm having exactly same issue. Have you found a workaround? I'm afraid it's going to hang on production cluster and

You have to add --dry-run at lines 113 & 115
https://github.com/vmware/kube-fluentd-operator/blob/master/charts/log-router/templates/daemonset.yaml#L113
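
A sketch of what that change might look like, based only on the config-reloader command line shown in the original report (the exact content of lines 113 and 115 in the chart template is not reproduced here and may be templated differently). The assumption is that `--dry-run` is appended to the value passed to `--fluentd-binary`, so that fluentd only checks the configuration during validation instead of starting it:

```
    - --fluentd-binary
    - /usr/local/bundle/bin/fluentd -p /fluentd/plugins --dry-run
```

The flag belongs inside the `--fluentd-binary` value because it is an option of fluentd itself, not of config-reloader.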

@kirek007

No, I've migrated to another solution.

@Cryptophobia
Contributor

We have recently merged in some timeouts and log fixes (#180) to make config-reloader more resilient. When it cannot validate the fluentd configurations, it now fails more quickly and logs more verbose messages.

We will be making a new release soon.

@mridu23

mridu23 commented Mar 16, 2021

@Cryptophobia
Any update on when the new release will be published?

@Cryptophobia
Contributor

@mridu23 , it should be sometime this week.

@Cryptophobia
Contributor

New releases have been made, and another one should be coming soon. This issue was fixed in the latest releases. We validate as before, except that we now time out the validator command and output its WARN log messages.
