Zha "EZSP is not running" error #107798

ThePapaG · 2024-01-11T12:23:43Z

The problem

Zha regularly drops to "initialising" state while using a Sonoff ZBDonlge E. This causes all of my zigbee devices to become unavailable and unusable. The only recovery I have found is reboot the home assistant setup

What version of Home Assistant Core has the issue?

2024.1.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Zigbee Home Automation

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-01-09 14:16:49.719 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-09 14:16:49.720 DEBUG (MainThread) [bellows.ezsp] Couldn't send command readCounters(()). EZSP is not running
2024-01-09 14:16:49.720 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: EzspError('EZSP is not running')
2024-01-09 14:16:49.721 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 665, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 647, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 999, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 212, in _command
    raise EzspError("EZSP is not running")
bellows.exception.EzspError: EZSP is not running
2024-01-09 14:16:49.739 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: EzspError('EZSP is not running')
2024-01-09 14:16:49.745 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop

Additional information

Unknown if ever working. I have not had this ever working. I am aware that the sonoff P dongle is reported to have issues but the newer E dongle uses a whole new chip and stack. I have seen reports of the ZBBridge having issues.

The setup is a RPI 3B+ with default RPI PSU. I have tested the dongle connected directly to the serial bus and through a 1 m extension. The RPI is using wifi to connect to the network.

home-assistant · 2024-01-11T12:23:49Z

Hey there @dmulcahey, @Adminiuga, @puddly, @TheJulianJES, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of zha can trigger bot actions by commenting:

@home-assistant close Closes the issue.
@home-assistant rename Awesome new title Renames the issue.
@home-assistant reopen Reopen the issue.
@home-assistant unassign zha Removes the current integration label and assignees on the issue, add the integration domain after the command.
@home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
@home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

_{^{(message by CodeOwnersMention)}}

zha documentation
zha source
_{^{(message by IssueLinks)}}

puddly · 2024-01-11T14:39:34Z

If you're able to reproduce the crash, can you enable ZHA debug logging and post the whole log?

ThePapaG · 2024-01-12T08:34:49Z

If you're able to reproduce the crash, can you enable ZHA debug logging and post the whole log?

Absolutely @puddly. This is the log that the snippet came from with debug logging enabled
home-assistant_zha_2024-01-09T06-07-12.411Z.log

ThePapaG · 2024-01-12T11:17:45Z

Is this related to #106185? Seems there's a problem with zha and ezsp?

puddly · 2024-01-12T11:46:05Z

There's nearly a nine second break in your log during which radio communication (and all other activity within HA's event loop) completely stopped.

EZSP radios, like your Sonoff stick, are extremely timing sensitive. During that period, ZHA is not able to acknowledge data coming in from your radio and the radio resets. ZHA is already running this all in a separate thread for latency purposes but it seems like this stall is system-wide.

Some integration or external process is causing this stutter. So that we can figure out what's responsible, can you list what integrations and addons are you running?

ThePapaG · 2024-01-12T12:53:56Z

Sure!

Add-ons:
Advanced SSH & web terminal
Cloudflared
Home assistant google drive backup
Studio code server

Integrations:
Android tv remote
Bureau of meteorology
Dlna digital media renderer
Google assistant
Google cast
Google translate text to speech
Hacs
Home assistant supervisor
Local tuya
Mobile app
Raspberry pi power supply checker
Smarter things
Sun
Tapo: cameras control
Thermal comfort
Tuya
Upnp/igd
Uptime
Watchman
Zha

ThePapaG · 2024-01-12T12:55:14Z

So the radio resets because it isn't acknowledged and assumes a comms error? How come there's no recovery mechanism on that?

My experience is when it is in this error case, it is just forever gone. I can only recover by reboot of the whole pi (triggered from the os)

ThePapaG · 2024-01-12T12:55:53Z

Are you able to share with me the timestamp where this lock up is occurring

puddly · 2024-01-12T14:17:21Z

There is error recovery. The problem is that it fails to set up during recovery. It'll be retried indefinitely until it works but the underlying problem remains.

I've annotated a portion of your log, hopefully it makes sense:

# Normal radio traffic. Notice how sending a command immediately results in data being sent 0.001s later.
2024-01-09 14:09:20.679 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getValue: (<EzspValueId.VALUE_FREE_BUFFERS: 3>,)
2024-01-09 14:09:20.680 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'31c021a9fe2a16767f7e'
2024-01-09 14:09:20.688 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'14c0a1a9fe2a15b3ae10647e'
2024-01-09 14:09:20.689 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:09:20.692 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getValue: [<EzspStatus.SUCCESS: 0>, b'\xf7']
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] Free buffers status EzspStatus.SUCCESS, value: 247
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] ezsp_counters: [MAC_RX_BROADCAST = 58, MAC_TX_BROADCAST = 42, MAC_RX_UNICAST = 26, MAC_TX_UNICAST_SUCCESS = 12, MAC_TX_UNICAST_RETRY = 0, MAC_TX_UNICAST_FAILED = 0, APS_DATA_RX_BROADCAST = 1, APS_DATA_TX_BROADCAST = 1, APS_DATA_RX_UNICAST = 11, APS_DATA_TX_UNICAST_SUCCESS = 7, APS_DATA_TX_UNICAST_RETRY = 32, APS_DATA_TX_UNICAST_FAILED = 16, ROUTE_DISCOVERY_INITIATED = 9, NEIGHBOR_ADDED = 2, NEIGHBOR_REMOVED = 0, NEIGHBOR_STALE = 0, JOIN_INDICATION = 0, CHILD_REMOVED = 0, ASH_OVERFLOW_ERROR = 0, ASH_FRAMING_ERROR = 0, ASH_OVERRUN_ERROR = 0, NWK_FRAME_COUNTER_FAILURE = 0, APS_FRAME_COUNTER_FAILURE = 0, UTILITY = 0, APS_LINK_KEY_NOT_AUTHORIZED = 0, NWK_DECRYPTION_FAILURE = 0, APS_DECRYPTION_FAILURE = 0, ALLOCATE_PACKET_BUFFER_FAILURE = 0, RELAYED_UNICAST = 0, PHY_TO_MAC_QUEUE_LIMIT_REACHED = 0, PACKET_VALIDATE_LIBRARY_DROPPED_COUNT = 0, TYPE_NWK_RETRY_OVERFLOW = 0, PHY_CCA_FAIL_COUNT = 0, BROADCAST_TABLE_FULL = 0, PTA_LO_PRI_REQUESTED = 0, PTA_HI_PRI_REQUESTED = 0, PTA_LO_PRI_DENIED = 0, PTA_HI_PRI_DENIED = 0, PTA_LO_PRI_TX_ABORTED = 0, PTA_HI_PRI_TX_ABORTED = 0, ADDRESS_CONFLICT_SENT = 0, EZSP_FREE_BUFFERS = 247]

# pychromecast seems to start having issues
2024-01-09 14:09:31.229 ERROR (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error reading from socket.
2024-01-09 14:09:31.230 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error communicating with socket, resetting connection

# We feed the watchdog to make sure the radio is still alive
2024-01-09 14:09:31.261 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-09 14:09:31.261 DEBUG (MainThread) [bellows.ezsp.protocol] Send command readCounters: ()

# Nine seconds later (!), not a single byte has been sent
2024-01-09 14:10:40.065 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:10:40.068 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection

# Finally, data starts being sent and received in one massive block, meaning it has been queuing up in an OS buffer for a long time
2024-01-09 14:10:40.084 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'42c121a9a52a73fd7e'
2024-01-09 14:10:40.095 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'24c0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778e0a037e'
2024-01-09 14:10:40.096 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.102 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.103 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.105 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.106 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.108 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.109 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'

# The radio says that it is resetting because nothing has responded to it in many seconds
2024-01-09 14:10:40.110 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.111 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'34c0b1a9112a15b658914e24ab5593499ceebf69b4469874f5de1483fc7e1ec1f3f7a97e'
2024-01-09 14:10:40.113 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'8430fc7e'
2024-01-09 14:10:40.120 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.124 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.125 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.127 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.131 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.135 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
...
2024-01-09 14:10:40.976 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: <NcpResetCode.ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT: 81>
...
# ZHA is restarting again a few seconds later
2024-01-09 14:10:50.830 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/serial/by-id/usb-ITEAD_SONOFF_Zigbee_3.0_USB_Dongle_Plus_V2_20231122145443-if00' (115200 baudrate)
...
2024-01-09 14:10:54.778 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getCurrentSecurityState: ()
2024-01-09 14:10:54.779 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'016d21a93d2ae7007e'
2024-01-09 14:10:54.786 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'116da1a93d2a154659ccf3b754aa2112d0c8a47e'
2024-01-09 14:10:54.787 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:10:54.789 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getCurrentSecurityState: [<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.GLOBAL_LINK_KEY|HAVE_TRUST_CENTER_LINK_KEY|TRUST_CENTER_USES_HASHED_LINK_KEY|96: 244>, trustCenterLongAddress=4c:5b:b3:ff:fe:92:b9:58)]
# Same problem again: 32 (!) second break in communication with the radio
2024-01-09 14:11:28.988 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'216db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc0fff7e'
2024-01-09 14:11:28.989 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:28.992 DEBUG (MainThread) [bellows.ezsp.protocol] Send command setPolicy: (<EzspPolicyId.TRUST_CENTER_POLICY: 0>, <EzspDecisionId.ALLOW_PRECONFIGURED_KEY_JOINS: 1>)
2024-01-09 14:11:29.001 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.002 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# pychromecast having issues again
2024-01-09 14:11:29.006 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.007 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.011 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.013 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.014 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# The radio resets again
2024-01-09 14:11:29.041 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'

meramsey · 2024-01-12T14:57:16Z

having same issue with but with skyconnect (multiprotocol enabled) it only recovers if full reboot is done
core_silabs_multiprotocol_2024-01-12T14-29-57.419Z.log

puddly · 2024-01-12T14:59:28Z

@meramsey Multiprotocol issues aren't something Core has any influence on. Please upload your logs here: home-assistant/addons#3408

ThePapaG · 2024-01-12T20:52:43Z

I have disabled any integrations I don't actively require right now (cast being one of them). Is cast always having issues around the error?

Although I'm realising the two cases you pointed out, the first cast has already errored before the issue, the second it errors after the issue

Padrecc · 2024-01-13T06:45:59Z

I'm experiencing the same issue. It started somewhere in the middle of December.
Similar config. Docker,zha,sonoff wifi bridge.

ThePapaG · 2024-01-13T06:58:18Z

I'm experiencing the same issue. It started somewhere in the middle of December. Similar config. Docker,zha,sonoff wifi bridge.

I did read that the wifi bridge is know for issues due to dropped wifi packets. This issue is with a serial connection to dongle. That's why there shouldn't be an issue

ThePapaG · 2024-01-13T09:46:18Z

@puddly I have disabled all integrations I am not using:
Cast
RPI power
Bluetooth
Google translate
Upnp
Android remote
Dlna

Error still occured.
error_log-7.txt
error_log-8.txt

akunia · 2024-02-26T09:42:04Z

Issues for me as well, all zigbee devices went offline at night. I see in the logs:
2024-02-26 01:03:14.172 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart 2024-02-26 01:03:14.173 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart 2024-02-26 01:03:14.178 WARNING (MainThread) [zigpy.backups] Failed to create a network backup Traceback (most recent call last): File "/usr/local/lib/python3.12/site-packages/zigpy/backups.py", line 220, in _backup_loop await self.create_backup() File "/usr/local/lib/python3.12/site-packages/zigpy/backups.py", line 143, in create_backup await self.app.load_network_info(load_devices=load_devices) File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 243, in load_network_info await self._ensure_network_running() File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 165, in _ensure_network_running (state,) = await self._ezsp.networkState() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 212, in _command raise EzspError("EZSP is not running") bellows.exception.EzspError: EZSP is not running

steenlarsen · 2024-03-18T13:31:04Z

For what it is worth : I had the same issue with Homeassistant 2024.3.1 running on a Synology inside a container created from homeassistant/home-assistant:stable with Sonoff Zigbee 3.0 Plus-E Dongle. I resolved the problem by downgrading to 2024.2.3.
First everything was running fine on 2024.3.1 but then my wife unplugged the Zigbee dongle and it stopped working. Restarting the container and even the entire Synology did not solve the problem. Downgrading to 2024.2.3 solved it.

jandirk39 · 2024-04-12T05:52:03Z

Hi I use the Skyconnect zigbee stick and I have the same issues. It started with crashing a couple of months ago after a home assistant upgrade (can be coincidence). now it happens multiple times per day. I'm desperate. I have search for days to find a solution on home assistant forum, github, reddit etc. no fix so far. @puddly You seem to have a lot of knowledge about this. Is it ok if I share my log with you as well?

pipiche38 · 2024-04-19T08:22:40Z

I guess similar as zigpy/bellows#617 (which is not Home Assistant)

home-assistant bot added the integration: zha label Jan 11, 2024

home-assistant bot assigned dmulcahey, Adminiuga and puddly Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zha "EZSP is not running" error #107798

Zha "EZSP is not running" error #107798

ThePapaG commented Jan 11, 2024

home-assistant bot commented Jan 11, 2024

puddly commented Jan 11, 2024

ThePapaG commented Jan 12, 2024 •

edited

Loading

ThePapaG commented Jan 12, 2024

puddly commented Jan 12, 2024 •

edited

Loading

ThePapaG commented Jan 12, 2024

ThePapaG commented Jan 12, 2024

ThePapaG commented Jan 12, 2024

puddly commented Jan 12, 2024 •

edited

Loading

meramsey commented Jan 12, 2024

puddly commented Jan 12, 2024

ThePapaG commented Jan 12, 2024 •

edited

Loading

Padrecc commented Jan 13, 2024

ThePapaG commented Jan 13, 2024

ThePapaG commented Jan 13, 2024 •

edited

Loading

akunia commented Feb 26, 2024

steenlarsen commented Mar 18, 2024

jandirk39 commented Apr 12, 2024

pipiche38 commented Apr 19, 2024

Zha "EZSP is not running" error #107798

Zha "EZSP is not running" error #107798

Comments

ThePapaG commented Jan 11, 2024

The problem

What version of Home Assistant Core has the issue?

What was the last working version of Home Assistant Core?

What type of installation are you running?

Integration causing the issue

Link to integration documentation on our website

Diagnostics information

Example YAML snippet

Anything in the logs that might be useful for us?

Additional information

home-assistant bot commented Jan 11, 2024

puddly commented Jan 11, 2024

ThePapaG commented Jan 12, 2024 • edited Loading

ThePapaG commented Jan 12, 2024

puddly commented Jan 12, 2024 • edited Loading

ThePapaG commented Jan 12, 2024

ThePapaG commented Jan 12, 2024

ThePapaG commented Jan 12, 2024

puddly commented Jan 12, 2024 • edited Loading

meramsey commented Jan 12, 2024

puddly commented Jan 12, 2024

ThePapaG commented Jan 12, 2024 • edited Loading

Padrecc commented Jan 13, 2024

ThePapaG commented Jan 13, 2024

ThePapaG commented Jan 13, 2024 • edited Loading

akunia commented Feb 26, 2024

steenlarsen commented Mar 18, 2024

jandirk39 commented Apr 12, 2024

pipiche38 commented Apr 19, 2024

ThePapaG commented Jan 12, 2024 •

edited

Loading

puddly commented Jan 12, 2024 •

edited

Loading

puddly commented Jan 12, 2024 •

edited

Loading

ThePapaG commented Jan 12, 2024 •

edited

Loading

ThePapaG commented Jan 13, 2024 •

edited

Loading