Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zha "EZSP is not running" error #107798

Open
ThePapaG opened this issue Jan 11, 2024 · 19 comments
Open

Zha "EZSP is not running" error #107798

ThePapaG opened this issue Jan 11, 2024 · 19 comments
Assignees

Comments

@ThePapaG
Copy link
Contributor

The problem

Zha regularly drops to "initialising" state while using a Sonoff ZBDonlge E. This causes all of my zigbee devices to become unavailable and unusable. The only recovery I have found is reboot the home assistant setup

What version of Home Assistant Core has the issue?

2024.1.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Zigbee Home Automation

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-01-09 14:16:49.719 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-09 14:16:49.720 DEBUG (MainThread) [bellows.ezsp] Couldn't send command readCounters(()). EZSP is not running
2024-01-09 14:16:49.720 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: EzspError('EZSP is not running')
2024-01-09 14:16:49.721 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 665, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 647, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 999, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 212, in _command
    raise EzspError("EZSP is not running")
bellows.exception.EzspError: EZSP is not running
2024-01-09 14:16:49.739 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: EzspError('EZSP is not running')
2024-01-09 14:16:49.745 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop

Additional information

Unknown if ever working. I have not had this ever working. I am aware that the sonoff P dongle is reported to have issues but the newer E dongle uses a whole new chip and stack. I have seen reports of the ZBBridge having issues.

The setup is a RPI 3B+ with default RPI PSU. I have tested the dongle connected directly to the serial bus and through a 1 m extension. The RPI is using wifi to connect to the network.

@home-assistant
Copy link

Hey there @dmulcahey, @Adminiuga, @puddly, @TheJulianJES, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of zha can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign zha Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation
zha source
(message by IssueLinks)

@puddly
Copy link
Contributor

puddly commented Jan 11, 2024

If you're able to reproduce the crash, can you enable ZHA debug logging and post the whole log?

@ThePapaG
Copy link
Contributor Author

ThePapaG commented Jan 12, 2024

If you're able to reproduce the crash, can you enable ZHA debug logging and post the whole log?

Absolutely @puddly. This is the log that the snippet came from with debug logging enabled
home-assistant_zha_2024-01-09T06-07-12.411Z.log

@ThePapaG
Copy link
Contributor Author

Is this related to #106185? Seems there's a problem with zha and ezsp?

@puddly
Copy link
Contributor

puddly commented Jan 12, 2024

There's nearly a nine second break in your log during which radio communication (and all other activity within HA's event loop) completely stopped.

EZSP radios, like your Sonoff stick, are extremely timing sensitive. During that period, ZHA is not able to acknowledge data coming in from your radio and the radio resets. ZHA is already running this all in a separate thread for latency purposes but it seems like this stall is system-wide.

Some integration or external process is causing this stutter. So that we can figure out what's responsible, can you list what integrations and addons are you running?

@ThePapaG
Copy link
Contributor Author

Sure!

Add-ons:
Advanced SSH & web terminal
Cloudflared
Home assistant google drive backup
Studio code server

Integrations:
Android tv remote
Bureau of meteorology
Dlna digital media renderer
Google assistant
Google cast
Google translate text to speech
Hacs
Home assistant supervisor
Local tuya
Mobile app
Raspberry pi power supply checker
Smarter things
Sun
Tapo: cameras control
Thermal comfort
Tuya
Upnp/igd
Uptime
Watchman
Zha

@ThePapaG
Copy link
Contributor Author

So the radio resets because it isn't acknowledged and assumes a comms error? How come there's no recovery mechanism on that?

My experience is when it is in this error case, it is just forever gone. I can only recover by reboot of the whole pi (triggered from the os)

@ThePapaG
Copy link
Contributor Author

Are you able to share with me the timestamp where this lock up is occurring

@puddly
Copy link
Contributor

puddly commented Jan 12, 2024

There is error recovery. The problem is that it fails to set up during recovery. It'll be retried indefinitely until it works but the underlying problem remains.

I've annotated a portion of your log, hopefully it makes sense:

# Normal radio traffic. Notice how sending a command immediately results in data being sent 0.001s later.
2024-01-09 14:09:20.679 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getValue: (<EzspValueId.VALUE_FREE_BUFFERS: 3>,)
2024-01-09 14:09:20.680 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'31c021a9fe2a16767f7e'
2024-01-09 14:09:20.688 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'14c0a1a9fe2a15b3ae10647e'
2024-01-09 14:09:20.689 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:09:20.692 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getValue: [<EzspStatus.SUCCESS: 0>, b'\xf7']
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] Free buffers status EzspStatus.SUCCESS, value: 247
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] ezsp_counters: [MAC_RX_BROADCAST = 58, MAC_TX_BROADCAST = 42, MAC_RX_UNICAST = 26, MAC_TX_UNICAST_SUCCESS = 12, MAC_TX_UNICAST_RETRY = 0, MAC_TX_UNICAST_FAILED = 0, APS_DATA_RX_BROADCAST = 1, APS_DATA_TX_BROADCAST = 1, APS_DATA_RX_UNICAST = 11, APS_DATA_TX_UNICAST_SUCCESS = 7, APS_DATA_TX_UNICAST_RETRY = 32, APS_DATA_TX_UNICAST_FAILED = 16, ROUTE_DISCOVERY_INITIATED = 9, NEIGHBOR_ADDED = 2, NEIGHBOR_REMOVED = 0, NEIGHBOR_STALE = 0, JOIN_INDICATION = 0, CHILD_REMOVED = 0, ASH_OVERFLOW_ERROR = 0, ASH_FRAMING_ERROR = 0, ASH_OVERRUN_ERROR = 0, NWK_FRAME_COUNTER_FAILURE = 0, APS_FRAME_COUNTER_FAILURE = 0, UTILITY = 0, APS_LINK_KEY_NOT_AUTHORIZED = 0, NWK_DECRYPTION_FAILURE = 0, APS_DECRYPTION_FAILURE = 0, ALLOCATE_PACKET_BUFFER_FAILURE = 0, RELAYED_UNICAST = 0, PHY_TO_MAC_QUEUE_LIMIT_REACHED = 0, PACKET_VALIDATE_LIBRARY_DROPPED_COUNT = 0, TYPE_NWK_RETRY_OVERFLOW = 0, PHY_CCA_FAIL_COUNT = 0, BROADCAST_TABLE_FULL = 0, PTA_LO_PRI_REQUESTED = 0, PTA_HI_PRI_REQUESTED = 0, PTA_LO_PRI_DENIED = 0, PTA_HI_PRI_DENIED = 0, PTA_LO_PRI_TX_ABORTED = 0, PTA_HI_PRI_TX_ABORTED = 0, ADDRESS_CONFLICT_SENT = 0, EZSP_FREE_BUFFERS = 247]

# pychromecast seems to start having issues
2024-01-09 14:09:31.229 ERROR (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error reading from socket.
2024-01-09 14:09:31.230 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error communicating with socket, resetting connection

# We feed the watchdog to make sure the radio is still alive
2024-01-09 14:09:31.261 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-09 14:09:31.261 DEBUG (MainThread) [bellows.ezsp.protocol] Send command readCounters: ()

# Nine seconds later (!), not a single byte has been sent
2024-01-09 14:10:40.065 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:10:40.068 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection

# Finally, data starts being sent and received in one massive block, meaning it has been queuing up in an OS buffer for a long time
2024-01-09 14:10:40.084 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'42c121a9a52a73fd7e'
2024-01-09 14:10:40.095 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'24c0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778e0a037e'
2024-01-09 14:10:40.096 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.102 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.103 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.105 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.106 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.108 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.109 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'

# The radio says that it is resetting because nothing has responded to it in many seconds
2024-01-09 14:10:40.110 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.111 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'34c0b1a9112a15b658914e24ab5593499ceebf69b4469874f5de1483fc7e1ec1f3f7a97e'
2024-01-09 14:10:40.113 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'8430fc7e'
2024-01-09 14:10:40.120 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.124 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.125 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.127 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.131 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.135 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
...
2024-01-09 14:10:40.976 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: <NcpResetCode.ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT: 81>
...
# ZHA is restarting again a few seconds later
2024-01-09 14:10:50.830 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/serial/by-id/usb-ITEAD_SONOFF_Zigbee_3.0_USB_Dongle_Plus_V2_20231122145443-if00' (115200 baudrate)
...
2024-01-09 14:10:54.778 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getCurrentSecurityState: ()
2024-01-09 14:10:54.779 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'016d21a93d2ae7007e'
2024-01-09 14:10:54.786 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'116da1a93d2a154659ccf3b754aa2112d0c8a47e'
2024-01-09 14:10:54.787 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:10:54.789 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getCurrentSecurityState: [<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.GLOBAL_LINK_KEY|HAVE_TRUST_CENTER_LINK_KEY|TRUST_CENTER_USES_HASHED_LINK_KEY|96: 244>, trustCenterLongAddress=4c:5b:b3:ff:fe:92:b9:58)]
# Same problem again: 32 (!) second break in communication with the radio
2024-01-09 14:11:28.988 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'216db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc0fff7e'
2024-01-09 14:11:28.989 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:28.992 DEBUG (MainThread) [bellows.ezsp.protocol] Send command setPolicy: (<EzspPolicyId.TRUST_CENTER_POLICY: 0>, <EzspDecisionId.ALLOW_PRECONFIGURED_KEY_JOINS: 1>)
2024-01-09 14:11:29.001 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.002 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# pychromecast having issues again
2024-01-09 14:11:29.006 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.007 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.011 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.013 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.014 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# The radio resets again
2024-01-09 14:11:29.041 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'

@meramsey
Copy link

having same issue with but with skyconnect (multiprotocol enabled) it only recovers if full reboot is done
core_silabs_multiprotocol_2024-01-12T14-29-57.419Z.log

@puddly
Copy link
Contributor

puddly commented Jan 12, 2024

@meramsey Multiprotocol issues aren't something Core has any influence on. Please upload your logs here: home-assistant/addons#3408

@ThePapaG
Copy link
Contributor Author

ThePapaG commented Jan 12, 2024

I have disabled any integrations I don't actively require right now (cast being one of them). Is cast always having issues around the error?

Although I'm realising the two cases you pointed out, the first cast has already errored before the issue, the second it errors after the issue

@Padrecc
Copy link

Padrecc commented Jan 13, 2024

I'm experiencing the same issue. It started somewhere in the middle of December.
Similar config. Docker,zha,sonoff wifi bridge.

@ThePapaG
Copy link
Contributor Author

I'm experiencing the same issue. It started somewhere in the middle of December. Similar config. Docker,zha,sonoff wifi bridge.

I did read that the wifi bridge is know for issues due to dropped wifi packets. This issue is with a serial connection to dongle. That's why there shouldn't be an issue

@ThePapaG
Copy link
Contributor Author

ThePapaG commented Jan 13, 2024

@puddly I have disabled all integrations I am not using:
Cast
RPI power
Bluetooth
Google translate
Upnp
Android remote
Dlna

Error still occured.
error_log-7.txt
error_log-8.txt

@akunia
Copy link

akunia commented Feb 26, 2024

Issues for me as well, all zigbee devices went offline at night. I see in the logs:
2024-02-26 01:03:14.172 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart 2024-02-26 01:03:14.173 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart 2024-02-26 01:03:14.178 WARNING (MainThread) [zigpy.backups] Failed to create a network backup Traceback (most recent call last): File "/usr/local/lib/python3.12/site-packages/zigpy/backups.py", line 220, in _backup_loop await self.create_backup() File "/usr/local/lib/python3.12/site-packages/zigpy/backups.py", line 143, in create_backup await self.app.load_network_info(load_devices=load_devices) File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 243, in load_network_info await self._ensure_network_running() File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 165, in _ensure_network_running (state,) = await self._ezsp.networkState() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 212, in _command raise EzspError("EZSP is not running") bellows.exception.EzspError: EZSP is not running

@steenlarsen
Copy link

For what it is worth : I had the same issue with Homeassistant 2024.3.1 running on a Synology inside a container created from homeassistant/home-assistant:stable with Sonoff Zigbee 3.0 Plus-E Dongle. I resolved the problem by downgrading to 2024.2.3.
First everything was running fine on 2024.3.1 but then my wife unplugged the Zigbee dongle and it stopped working. Restarting the container and even the entire Synology did not solve the problem. Downgrading to 2024.2.3 solved it.

@jandirk39
Copy link

Hi I use the Skyconnect zigbee stick and I have the same issues. It started with crashing a couple of months ago after a home assistant upgrade (can be coincidence). now it happens multiple times per day. I'm desperate. I have search for days to find a solution on home assistant forum, github, reddit etc. no fix so far. @puddly You seem to have a lot of knowledge about this. Is it ok if I share my log with you as well?

@pipiche38
Copy link

I guess similar as zigpy/bellows#617 (which is not Home Assistant)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants