Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNS resolution errors are sometimes swallowed #9447

Closed
1 task done
bdraco opened this issue Oct 10, 2024 · 8 comments
Closed
1 task done

DNS resolution errors are sometimes swallowed #9447

bdraco opened this issue Oct 10, 2024 · 8 comments
Assignees
Labels
bug reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR

Comments

@bdraco
Copy link
Member

bdraco commented Oct 10, 2024

Describe the bug

There is definitely some weirdness going on in aiohttp or more likely in aiodns: When I drop packets to my DNS server, I get the following stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohttp/resolver.py", line 105, in resolve
    resp = await self._resolver.getaddrinfo(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
aiodns.error.DNSError: (12, 'Timeout while contacting DNS servers')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1294, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 945, in _resolve_host
    return await asyncio.shield(resolved_host_task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 982, in _resolve_host_with_throttle
    addrs = await self._resolver.resolve(host, port, family=self._family)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/resolver.py", line 114, in resolve
    raise OSError(msg) from exc
OSError: Timeout while contacting DNS servers

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/test3.py", line 16, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/test3.py", line 12, in main
    html = await fetch(client)
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/test3.py", line 6, in fetch
    async with client.head("https://checkonline.home-assistant.io/online.txt") as resp:
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 1357, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 661, in _request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 556, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1009, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1300, in _create_direct_connection
    raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host checkonline.home-assistant.io:443 ssl:default [None]

So far so good, this is expected.
It is a bit weird that sometimes we get a different exception. This does indicate that the previous request is somehow kept around and reused.

What is weird though that even when I add another DNS server in my resolver (which fixes resolving) this script continuous to return errors 😢

It really seems as if the aiohttp/aiodns instance used keeps reusing cached or queued resolve attempts.

Originally posted by @agners in home-assistant/supervisor#5332 (comment)

To Reproduce

When I use this small reproducer:

import aiohttp
import asyncio
import datetime
import logging

async def fetch(client):
    timeout = aiohttp.ClientTimeout(total=10)
    async with client.head("https://checkonline.home-assistant.io/online.txt", timeout=timeout) as resp:
        assert resp.status == 200
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as client:
        while True:
            print(datetime.datetime.now())
            try:
                html = await fetch(client)
                print(html)
            except Exception as e:
                logging.exception(e)

logging.basicConfig(
    format='%(asctime)s %(levelname)s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S'
)
asyncio.run(main())

Expected behavior

The OSError is not lost and it should produce
ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Could not contact DNS servers'))

Logs/tracebacks

Returns the following:

docker exec -it hassio_supervisor python /tmp/test4.py
2024-10-09 16:23:42 INFO Start fetch
2024-10-09 16:23:44 ERROR Error is TimeoutError()
2024-10-09 16:23:44 INFO Start fetch
2024-10-09 16:23:46 ERROR Error is TimeoutError()
2024-10-09 16:23:46 INFO Start fetch
2024-10-09 16:23:48 ERROR Error is TimeoutError()
2024-10-09 16:23:48 INFO Start fetch
2024-10-09 16:23:50 ERROR Error is TimeoutError()
2024-10-09 16:23:50 INFO Start fetch
2024-10-09 16:23:51 ERROR Error is ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Timeout while contacting DNS servers'))
2024-10-09 16:23:51 INFO Start fetch
2024-10-09 16:23:53 ERROR Error is TimeoutError()
2024-10-09 16:23:53 INFO Start fetch
2024-10-09 16:23:55 ERROR Error is TimeoutError()
2024-10-09 16:23:55 INFO Start fetch
2024-10-09 16:23:57 ERROR Error is TimeoutError()
2024-10-09 16:23:57 INFO Start fetch
2024-10-09 16:23:59 ERROR Error is TimeoutError()
2024-10-09 16:23:59 INFO Start fetch
2024-10-09 16:24:00 ERROR Error is ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Timeout while contacting DNS servers'))


### Python Version

```console
Python 3.12.4

aiohttp Version

3.10.9

multidict Version

n/a

propcache Version

n/a

yarl Version

n/a

OS

Linux, MacOS

Related component

Client

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@bdraco bdraco added the bug label Oct 10, 2024
@bdraco bdraco self-assigned this Oct 10, 2024
@bdraco bdraco added the reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR label Oct 10, 2024
@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

Changing the host to xcheckonline.home-assistant.io is enough to show its always swallowing it

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

It looks like strerror is never set from aiodns

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

we raise OSError(msg) but the first arg to OSError is the errno

So it should be OSError(None, msg)

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

Another issue

    #
    # If multiple connectors are resolving the same host, we wait
    # for the first one to resolve and then use the result for all of them.
    # We use a throttle event to ensure that we only resolve the host once
    # and then use the result for all the waiters.
    #
    # In this case we need to create a task to ensure that we can shield
    # the task from cancellation as cancelling this lookup should not cancel
    # the underlying lookup or else the cancel event will get broadcast to
    # all the waiters across all connections.
    #

So if it never times out we wait forever...

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

So dns request one is in flight, the other requests back up against it.

EventResultOrError is also quite inefficient as it creates tasks

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

I'm thinking reason you see the timeout error on the subsequent requests the dns resolution is still in progress from the first request even though the request itself timed out, and since the timeout of the request is lower than dns resolution timeout of the getaddrinfo so the 2nd/3rd/4th etc request is waiting for the dns resolution and raises timeout error waiting for the future.

That's super confusing though.....

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

timeout = aiohttp.ClientTimeout(total=10)

When the connect timeout is higher than the overall timeout the connect timeout error gets swallowed

@bdraco
Copy link
Member Author

bdraco commented Oct 10, 2024

the missing errors are fixed in #9451

@bdraco bdraco closed this as completed Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR
Projects
None yet
Development

No branches or pull requests

1 participant