Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NameResolutionPal.Unix enabled async name resolution #34633

Merged
merged 37 commits into from
Jan 14, 2021

Conversation

gfoidl
Copy link
Member

@gfoidl gfoidl commented Apr 7, 2020

Fixes #33378

Also

  • added basic tests in System.Net.NameResolution.Pal.Tests
  • same order of methods as in the Windows-Pal

Edit: this issue is fixed.

Marking this PR as draft, as there is some bug anywhere which I can't find.
To check the native implementation 8c955bc adds some (c)tests -- this will be reverted later -- and they pass without any problem.
The managed implementation is analogous to the windows' one.

The bug causes either a hang or a segfault.
For the latter I can see GDB Output (see core-dump.txt).

Maybe it's a dumb failure on my side, which I don't see atm, or something different.

If the test (added with 4614267) doesn't hang, the native side it will call into the OS with return code 0. The crash happens anywhere after the OS-call and the invocation of the callback. Unfortunately I'm not able to debug it further (due to lack of knowledge here deep down that stack).

Copy link
Member Author

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor notes for review

src/libraries/Native/Unix/CMakeLists.txt Outdated Show resolved Hide resolved
Before I didn't last the async operation as it was on the stack. Now it's part of the GetAddrInfoAsyncState which gets heap allocted, so it last.
I'm not sure if addrinfo needs to last the async operation, as the native tests pass either way. This change makes it more correct, nevertheless.
Comment on lines 467 to 474
static void GetHostEntryForNameCreateHint(struct addrinfo* hint)
{
// Get all address families and the canonical name

memset(hint, 0, sizeof(struct addrinfo));
hint->ai_family = AF_UNSPEC;
hint->ai_flags = AI_CANONNAME;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the parameter to getaddrinfo() is constant, this could be defined statically and the address passed like that everywhere where it's needed:

Suggested change
static void GetHostEntryForNameCreateHint(struct addrinfo* hint)
{
// Get all address families and the canonical name
memset(hint, 0, sizeof(struct addrinfo));
hint->ai_family = AF_UNSPEC;
hint->ai_flags = AI_CANONNAME;
}
static const struct addrinfo s_hostEntryForNameCreateHint = { .ai_family = AI_UNSPEC, .ai_flags = AI_CANONNAME };

Comment on lines 445 to 463
int result = gai_error(&state->gai_request);
HostEntry* entry = state->entry;

int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(result);

if (result == 0)
{
const uint8_t* address = state->address;
struct addrinfo* info = state->gai_request.ar_result;

ret = GetHostEntries(address, info, entry);
}

free(state);

if (callback != NULL)
{
callback(entry, ret);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int result = gai_error(&state->gai_request);
HostEntry* entry = state->entry;
int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(result);
if (result == 0)
{
const uint8_t* address = state->address;
struct addrinfo* info = state->gai_request.ar_result;
ret = GetHostEntries(address, info, entry);
}
free(state);
if (callback != NULL)
{
callback(entry, ret);
}
int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(gai_error(&state->gai_request));
if (ret == 0)
{
const uint8_t* address = state->address;
struct addrinfo* info = state->gai_request.ar_result;
ret = GetHostEntries(address, info, entry);
}
if (callback != NULL)
{
callback(state->entry, ret);
}
free(state);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the free(state) intentionally before calling the callback.

  • entry is passed from managed to native and via callback back to managed -- so the position of free(state) doesn't change anything
  • if the callback (managed side) has an exception state won't be freed an so this can be a memory leak -- although ProcessResult on the managed side (the "callback worker") has a try-finally

I prefer leaving the free(state) before the callback. When there is a good reason to move it after the callback, I'll do it for sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are equivalent, yes, but the suggested change is slightly shorter (and doesn't have a variable that's attributed to but never read from, which is cleaner).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It's just a suggestion, no need to change it at all.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I like it as the code looks cleaner.
Anyway I highly appreciate the feedback here! (My C is a bit rusty, so it's good to get some hints.)

Comment on lines 476 to 483
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync()
{
#if HAVE_GETADDRINFO_A
return 1;
#else
return 0;
#endif
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid this P/Invoke every time?

Also, since cmakedefine01 is used, maybe this could be rewritten as:

Suggested change
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync()
{
#if HAVE_GETADDRINFO_A
return 1;
#else
return 0;
#endif
}
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync()
{
return HAVE_GETADDRINFO_A;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cached on the managed side.

Nice trick to return the define. I didn't think about that. Thx.

Will address the other points tomorrow.

Comment on lines 517 to 531
struct GetAddrInfoAsyncState* state = malloc(sizeof(struct GetAddrInfoAsyncState));

state->gai_request.ar_name = (const char*)address;
state->gai_request.ar_service = NULL;
state->gai_request.ar_request = &hint;
state->gai_request.ar_result = NULL;
state->gai_requests = &state->gai_request;

state->sigevent.sigev_notify = SIGEV_THREAD;
state->sigevent.sigev_value.sival_ptr = state;
state->sigevent.sigev_notify_function = GetHostEntryForNameAsyncComplete;

state->address = address;
state->entry = entry;
state->callback = callback;
Copy link
Contributor

@lpereira lpereira Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
struct GetAddrInfoAsyncState* state = malloc(sizeof(struct GetAddrInfoAsyncState));
state->gai_request.ar_name = (const char*)address;
state->gai_request.ar_service = NULL;
state->gai_request.ar_request = &hint;
state->gai_request.ar_result = NULL;
state->gai_requests = &state->gai_request;
state->sigevent.sigev_notify = SIGEV_THREAD;
state->sigevent.sigev_value.sival_ptr = state;
state->sigevent.sigev_notify_function = GetHostEntryForNameAsyncComplete;
state->address = address;
state->entry = entry;
state->callback = callback;
struct GetAddrInfoAsyncState* state = malloc(sizeof(*state));
if (state == NULL) return oops_no_memory;
*state = (struct GetAddrInfoAsyncState) {
.gai_request = {
.ar_name = ...,
.ar_service = ...,
...,
},
.sigevent = {
.sigev_notify = ...,
...,
},
.address = ...,
...
};

Comment on lines 545 to 549
// Actually not needed, but otherwise the parameters are unused.
if (address != NULL && entry != NULL && callback != NULL)
{
return -1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Actually not needed, but otherwise the parameters are unused.
if (address != NULL && entry != NULL && callback != NULL)
{
return -1;
}
(void)address;
(void)entry;
(void)calback;

{
struct GetAddrInfoAsyncState* state = (struct GetAddrInfoAsyncState*)context.sival_ptr;

atomic_thread_fence(memory_order_acquire);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the fence necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This callback comes in on another thread. It ensures visibility to this thread of everything in state.

getaddrinfo_a probably does something like this already, so it might be safe to remove, but it isn't documented.

}
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do this in a static field initializer?

private static readonly bool s_supportsGetAddrInfoAsync = Interop.Sys.PlatformSupportsGetAddrInfoAsync();

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to have it the same as the windows-implementation. But I see (now) that there is a need to start the socket, which isn't needed on linux.

I'll update to this.

@lpereira
Copy link
Contributor

lpereira commented Apr 8, 2020

Are you still experiencing the crash/hang you mentioned in the PR message?

@gfoidl
Copy link
Member Author

gfoidl commented Apr 8, 2020

Yes, unfortunately.

@lpereira
Copy link
Contributor

lpereira commented Apr 8, 2020

Have you tried either running under Valgrind or building with AddressSanitizer to have a better insight?

@gfoidl
Copy link
Member Author

gfoidl commented Apr 8, 2020

With pure native execution the bug does not show up. Only when called via P/Invoke, i.e. from the managed unit test with testhost.

Do you know how to incorporate AddressSanitizer with the testhost?

@lpereira
Copy link
Contributor

lpereira commented Apr 8, 2020

I've never tested this with testhost, but it looks like if you build a debug build with the $DEBUG_SANITIZERS environment variable set to asan, it should create an AddressSanitizer-enabled build for you. (Looking in src/coreclr/configurecompiler.cmake, around line 117.)

@gfoidl
Copy link
Member Author

gfoidl commented Apr 8, 2020

Thanks. Will try this tomorrow.

When I read the GDB dump (see attachment in top post) correct, I have the assumption that P/Invoke for Linux and threads has an issue. But I'm not really able to interpret this correctly, because of lack of knowledge with these internals. Hopefully that's not the case and P/Invoke is correct.

@gfoidl
Copy link
Member Author

gfoidl commented Apr 9, 2020

Valgrind outputs quite some messages -- even for dotnet new console in a fresh docker container. I'd expect there shouldn't be only some message for OS-memory cleanup. Is this known / expected or should we file an issue to investigate and fix that?

Running valgrind dotnet build /t:test in runtime/src/libraries/System.Net.NameResolution/tests/PalTests the valgrind-output is a bit longer.
So I removed the failling tests (the ones for the async code added with this PR), tests pass, and compared it to the valgrind-output for the failing tests (core dumps as described in the OP).
Except the process-id there's only one difference (leaving out the final heap summary):

-vex amd64->IR: unhandled instruction bytes: 0x48 0xE9 0xB0 0x35 0x42 0xC9 0xE8 0x2B
+vex amd64->IR: unhandled instruction bytes: 0x48 0xE9 0xB0 0x35 0x41 0xC9 0xE8 0x2B
 vex amd64->IR:   REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
 vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
 vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
-==xxxx== valgrind: Unrecognised instruction at address 0x3d9bafca.
-==xxxx==    at 0x3D9BAFCA: ???
-==xxxx==    by 0x3D41F58F: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
-==xxxx==    by 0x3D443245: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
-==xxxx==    by 0x3D45BCC2: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx== valgrind: Unrecognised instruction at address 0x3d9cafca.
+==xxxx==    at 0x3D9CAFCA: ???
+==xxxx==    by 0x3D42F58F: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx==    by 0x3D453245: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx==    by 0x3D46BCC2: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
 ==xxxx==    by 0x6DDD8BE: CallDescrWorkerInternal (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so)
 ==xxxx==    by 0x6D09CCA: MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so)
 ==xxxx==    by 0x6C5675D: CorHost2::_CreateAppDomain(char16_t const*, unsigned int, char16_t const*, char16_t const*, int, char16_t const**, char16_t const**, unsigned int*) (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so)

See attached files for raw output.

Will investigate further and try to incorparate some sanitizers.
valgrind_new_console.txt
valgrind_test_OK.txt
valgrind_test_KO.txt

@gfoidl
Copy link
Member Author

gfoidl commented Apr 9, 2020

I have a clue now. The address is passed in as string, so the runtime needs to marshal it, and the marshalled string will only last until SystemNative_GetHostEntryForNameAsync returns. Then at the callback this reference has gone and it crashes.
This explains why the native tests pass, and the managed ones fail.

This can be seen in the following trace:

MANAGED: start with hostName: microsoft.com
MANAGED: addr context: 0x7f8dc8003560
MANAGED: addr state: 0x7f8e8f731220
NATIVE: arg address: microsoft.com
NATIVE: entry: 0x7f8dc8003560
NATIVE: callback: 0x7f8e13a93544
MANAGED: result from os-call: errorCode: 0
MANAGED: returning task
NATIVE (callback): entry: 0x7f8dc8003560
NATIVE (callback): callback: 0x7f8e13a93544
NATIVE (callback): address:  BÎwŽ�
NATIVE (callback): gai_error: -2
NATIVE (callback): pal_error: 5
MANAGED: entering callback with error: 5
MANAGED: entering ProcessResult with errorCode: HostNotFound
MANAGED: addr context: 0x7f8dc8003560
MANAGED: addr state: 0x7f8e8f731220

In the callback the address is garbage.

A trivial workaround / fix would be duplicating the string in the native side (strdup).
Is there any .NET-way to tell P/Invoke how long the marshalled memory (string) should last?

@stephentoub
Copy link
Member

Is there any .NET-way to tell P/Invoke how long the marshalled memory (string) should last?

No (and if there were, the best you could hope for would be "until it's manually free'd by the native code", since the runtime has no idea what the lifetime is if it's not tied to the synchronous method call's lifetime). If you want to manage it yourself, you can just change the P/Invoke to take a char* and use something like Marshal.StringToHGlobalAnsi, then free it manually later.

@VSadov
Copy link
Member

VSadov commented Dec 18, 2020

It looks like anl is not necessarily present on Linux. I see the failures at least on Linux_musl and Android where the build could not find the library. That is a bit inconvenient.

One way to deal with this is to make the library optional and make the feature conditional on HAVE_GETADDRINFO_A.
That would imply making HAVE_GETADDRINFO_A work in installer partition.

@wfurt
Copy link
Member

wfurt commented Dec 18, 2020

yes, this is part of glibc - so it will not be there in MUSL based distributions @VSadov . I look as I was concerned about extra dependency but I think it is ok to use swhen part of glibc bundle.

@VSadov
Copy link
Member

VSadov commented Dec 18, 2020

If you go the route of making this conditional on Linux, you would need to add similar probe for HAVE_GETADDRINFO_A to installer/corehost/cli/apphost/static.

We should think about reducing duplication like this going forward, but for the time being it could be acceptable solution.

@wfurt
Copy link
Member

wfurt commented Jan 6, 2021

Do you have plan how to move forward @gfoidl ? (I'm not pushing but making sure you are not stuck) It seems like this got more complicated beyond base DNS changes ;(

@gfoidl
Copy link
Member Author

gfoidl commented Jan 7, 2021

Completed

  • code changes to native and managed side are done
  • tests added
  • a follow-up PR is ready to address cancellation
  • local build and test on ubuntu succeedes

Open questions (from review)

What's left?

Async name resolution on Unix relies on glibc-functionality (getaddrinfo_a from lib anl). It won't work on distros that don't provide these symbols, hence we check for this and emit the symbol HAVE_GETADDRINFO_A.

Here comes the trouble with "single file host builds in installer partition" and for me it's hard to follow what's going on, since this is pretty much down the entrails of .NET 😉

From #34633 (comment):
What is meant by "other partitions"?

If I read #34633 (comment) correct, so the check for HAVE_GETADDRINFO_A should be done in configure.cmake too and CMakeLists.txt updated to conditionally link based on HAVE_GETADDRINFO_A?
Also undo fa9c6f9

@gfoidl
Copy link
Member Author

gfoidl commented Jan 8, 2021

@wfurt good news: CI is green 🎉 -- PTAL

Copy link
Member

@wfurt wfurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @gfoidl. thanks for the contribution.

I caused one more merge conflict by rushing fix to servicing. I think that should be simple to resolve.
I will merge as soon as the PR is green again (and sorry for the troubles)

@@ -66,29 +64,31 @@ public void GetHostEntry_InvalidHost_LogsError()

[ConditionalFact]
[PlatformSpecific(~TestPlatforms.Windows)] // Unreliable on Windows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Could be use CI stability while back. Perhaps best would be do this separately e.g. put this in and do separate PR to enable this in ci and watch the outcome. If we see failures, we can try to reproduce locally or we can try to collect more info.

[Theory]
[InlineData(false)]
[InlineData(true)]
public async Task GetAddrInfoAsync_ExternalHost(bool justAddresses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably would not at this point. Since all DNS tests depend on external server, we would need to make all of them outer loop IMHO. For now, I would assume that base resolution work and I would leave it as inner loop. was thinking while back about running batch of resolutions to make sure DNS work and perhaps prime caches so test runs are more predictable. I think we can leave is is for now and react if we see CI failures.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the cmake changes

@gfoidl
Copy link
Member Author

gfoidl commented Jan 14, 2021

iOS.Simulator.Aot.Test Work Item fails due to this test TryGetAddrInfo_EmptyHost.

The check for the operating system got taken over from existing tests, which have the comment

// On Unix, we are not guaranteed to be able to resove the local host. The ability to do so depends on the
// machine configurations, which varies by distro and is often inconsistent.

As more and more different OS come into play, maybe it's safer to flip this check and test for !Windows here?
Is there a better way to handle this fact, except just returning from the test-code?

@gfoidl
Copy link
Member Author

gfoidl commented Jan 14, 2021

@wfurt CI looks green 😃
Please have a look at the last commit (dd1e245) -- I'm not entirely happy with this.

For the test on windows I'll follow #34633 (comment) and create a separate PR for this, so it can be handled in isolation.
Also there will come a PR to add cancellation-support.

@karelz
Copy link
Member

karelz commented Jan 15, 2021

@gfoidl thanks a lot for your PR!!! Also, appreciate your patience with this one ;)

@gfoidl
Copy link
Member Author

gfoidl commented Jan 15, 2021

This was an interesting journey with some very valuable input (some C hints, and being down at the PAL to see how things work / a place where a usual dotnet-developer won't be), so I'd say: perfect (except that it took that long, but now it's in so this doesn't matter).

MichalStrehovsky added a commit to dotnet/runtimelab that referenced this pull request Jan 21, 2021
dotnet/runtime#34633 introduced a dependency on this.
jkotas pushed a commit to dotnet/runtimelab that referenced this pull request Jan 21, 2021
dotnet/runtime#34633 introduced a dependency on this.
@karelz karelz added this to the 6.0.0 milestone Jan 26, 2021
gfoidl added a commit to gfoidl/dotnet-runtime that referenced this pull request Feb 23, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use async name resolution on Linux
8 participants