Attempt to kill a single instance of etcd on SNO results in crash since the cluster is non-responsive while the pod reconciles. #431

Open
achuzhoy opened this issue May 25, 2023 · 1 comment

Comments

@achuzhoy

How to reproduce:

Have a config with
chaos_scenarios:  # List of policies/chaos scenarios to load
  - container_scenarios:  # List of chaos pod scenarios to load
      - - scenarios/openshift/container_etcd.yml

`
cat scenarios/openshift/container_etcd.yml
scenarios:
  - name: "kill etcd container"
    namespace: "openshift-etcd"
    label_selector: "k8s-app=etcd"
    container_name: "etcd"
    action: "kill 1"
    count: 1
    expected_recovery_time: 60

`

Run python3.9 run_kraken.py --config config/kill-etcd.yaml

result:

`

[kraken ASCII-art banner]

2023-05-25 12:28:26,437 [INFO] Starting kraken
2023-05-25 12:28:26,449 [INFO] Initializing client to talk to the Kubernetes cluster
2023-05-25 12:28:29,884 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,885 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:28:29,886 [INFO] Starting http server at http://0.0.0.0:8085

2023-05-25 12:28:29,886 [INFO] Fetching cluster info
2023-05-25 12:28:29,894 [INFO] Cluster version is 4.13.0
2023-05-25 12:28:29,895 [INFO] Server URL: https://api.sno-0.qe.lab.redhat.com:6443
2023-05-25 12:28:29,895 [INFO] Generated a uuid for the run: 4c51a145-9664-4339-8735-a4a09da5d43f
2023-05-25 12:28:29,895 [INFO] Daemon mode not enabled, will run through 1 iterations

2023-05-25 12:28:29,895 [INFO] Executing scenarios for iteration 0
2023-05-25 12:28:29,895 [INFO] connection set up
127.0.0.1 - - [25/May/2023 12:28:29] "GET / HTTP/1.1" 200 -
2023-05-25 12:28:29,896 [INFO] response RUN
2023-05-25 12:28:29,897 [INFO] Running container scenarios
2023-05-25 12:28:30,798 [INFO] Killing container etcd in pod etcd-sno-0-0 (ns openshift-etcd)
2023-05-25 12:28:30,953 [INFO] Scenario kill etcd container successfully injected
2023-05-25 12:29:11,186 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,234 [WARNING] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/namespaces/openshift-etcd/pods?pretty=True
2023-05-25 12:29:11,236 [WARNING] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9b062ce20>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/openshift-etcd/pods?pretty=True
Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb9b062cc70>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/krkn/krkn/run_kraken.py", line 421, in <module>
    main(options.cfg)
  File "/root/krkn/krkn/run_kraken.py", line 218, in main
    failed_post_scenarios = pod_scenarios.container_run(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 92, in container_run
    failed_post_scenarios = check_failed_containers(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 191, in check_failed_containers
    pod_output = kubecli.get_pod_info(killed_container[0], killed_container[1])
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 544, in get_pod_info
    pod_exists = check_if_pod_exists(name=name, namespace=namespace)
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 721, in check_if_pod_exists
    pod_list = list_pods(namespace=namespace)
  File "/root/krkn/krkn/kraken/kubernetes/client.py", line 209, in list_pods
    ret = cli.list_namespaced_pod(namespace, pretty=True)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs) # noqa: E501
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 214, in request
    r = self.pool_manager.request(method, url,
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/root/krkn/krkn/chaos/lib64/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.sno-0.qe.lab.redhat.com', port=6443): Max retries exceeded with url: /api/v1/namespaces/openshift-etcd/pods?pretty=True (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9b062cc70>: Failed to establish a new connection: [Errno 111] Connection refused'))
`

This is probably because the cluster cannot be contacted while etcd is restarting, but the app should not crash.
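One possible way to avoid the crash, sketched here as an illustration rather than as kraken's actual fix: wrap the post-scenario API call in a helper that keeps retrying until a deadline passes. The helper name `call_with_retry` and its parameters are hypothetical and do not exist in kraken; the sketch uses only the standard library so it runs on its own.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    timeout: float = 60.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call func() until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper, not part of kraken. Exceptions listed in
    `retry_on` are logged and retried; once the deadline has passed,
    the last exception is re-raised to the caller.
    """
    deadline = clock() + timeout
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except retry_on as err:
            if clock() >= deadline:
                # Recovery window is over: surface the original error.
                raise
            logging.warning(
                "API unreachable (attempt %d): %s; retrying in %ss",
                attempt, err, interval,
            )
            sleep(interval)
```

In kraken this could, for example, wrap the `list_pods` call seen in the traceback, with `retry_on=(urllib3.exceptions.MaxRetryError,)` and the scenario's `expected_recovery_time` as `timeout`. `MaxRetryError` is not a subclass of the built-in `ConnectionError`, so it would have to be passed explicitly. Whether the maintainers would prefer this placement is an open question.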


achuzhoy commented Aug 1, 2023

cc @tsebastiani

This one still reproduces in OCP 4.13.6
