Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: Add servername to grpc dial options for DNS stores (Fixes #1507) #2407

Closed
wants to merge 5 commits into from

Conversation

j3p0uk
Copy link

@j3p0uk j3p0uk commented Apr 9, 2020

Closes #1507

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.

Verification

Building via make docker and running a query against multiple remote query services through nginx proxies.

Comment on lines 447 to 461
// New store or was unactive and was removed in the past - create new one.
conn, err := grpc.DialContext(ctx, addr, s.dialOpts...)
tlsCfg, err := tls.NewClientConfig(s.logger, s.cert, s.key, s.caCert, spec.ServerName())
if err != nil {
level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "setting TLS"), "address", addr)
return
}
dialOpts := append(s.dialOpts, grpc.WithTransportCredentials(credentials.NewTLS(tlsCfg)))
conn, err := grpc.DialContext(ctx, addr, dialOpts...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this makes an assumption that store is listening with TLS & won't work in insecure mode. Could you please verify that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, That would be absolutely true. I need to conditionally do this based on the combination of the secure flag and the new grpc-client-dns-server-name flag.

To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.

Signed-off-by: JP Sullivan <jonpsull@cisco.com>
The grpc-client-dns-server-name needs to propogate to the storeset so that
TLS is conditionally added to the grpc dialOpts based on whether it is set.

Test that the grpc-client-tls-secure flag is also set when the
grpc-client-dns-server-name flag is set to make it clear that a secure store
connection is required when using the latter flag.

Signed-off-by: JP Sullivan <jonpsull@cisco.com>
Signed-off-by: JP Sullivan <jonpsull@cisco.com>
Signed-off-by: JP Sullivan <jonpsull@cisco.com>
@j3p0uk j3p0uk requested a review from povilasv April 10, 2020 10:41
@EvertonSA
Copy link

looks legit, can we have this approved and released?

Copy link
Member

@squat squat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi thanks for writing this contribution. I have a few questions about the implementation. Please take a look and let know what you think :)

conn, err := grpc.DialContext(ctx, addr, s.dialOpts...)
dialOpts := s.dialOpts
// Update the dialOpts if we are asked to add dnsServerName.
if s.dnsServerName {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this seems a bit odd. The NewStoreSet func already accepts a slice of dialopts. I think we should expect that the dialopts has the correct WithTransportCredentials already set, no? ie correctly configure this option in cmd/Thanos/query rather than leaking configuration into this struct. Also, it looks to me like the server name is already set in the dialopts generated with extgrpc https://github.com/j3p0uk/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/pkg/extgrpc/client.go#L56

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, taking out this new dial opt and server name bool from this struct would eliminate the need for extending the StoreSpec interface and adding the extra field to the grpcStoreSpec struct

Copy link
Author

@j3p0uk j3p0uk Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug, as it stands, is that the arguments provided to the query at start time do not allow for individual server names per configured store. The store set gets populated with the resolved addresses of the stores in the store spec and not the original server name, as far as I could tell. Given this, I attempted to pass minimal data through to generate the TLS portion of the dialOpts on demand at connection time by retrieving the server name from the dns provider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is a bit weird because with this we not only have two places where these options are set but also this newly introduced option overrides the other one. It should probably be refactored so that query.NewGRPCStoreSpec* would hook into some function which would give it a list of opts to use. We could discuss it on #thanos-dev.
Also, I wonder if there any downsides to enabling this kind of feature by default for everyone, and then only allow overriding the server name via the option that exists already? That would probably simplify things too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, happy to discuss further.

My thoughts are that this looks odd because a StoreSet is a set of Stores, and the code assumes that each of the stores shares a particular set of characteristics, which may not be true (and is the basis for this bug fix). The StoreSet currently holds attributes that can differ on a per-store basis. The ServerName is one such attribute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think we need to refactor this because the current approach causes the StoreSet to have duplicate WithTransportCredentials dialopts, one of which overrides the other. This makes it hard to reason about the code IMO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chat in slack...

Sounds like implementation of this would be something like removing dialOpts from the StoreSet, and adding it to the StoreSpec.
We should be able to generate the dialOpts at creation of the StoreSpec array in the StoreSet, so would only be once at startup.
Need to decide on what flags should exist, and how would they interact and how to handle address-specified stores and DNS SRV records.

For the SRV case, I think the correct behaviour would be to enable lookup of the A record resolved from the SRV lookup. This data is currently lost during resolution, as only the IPs are returned to the provider from resolver, and stored under the key of the original store spec.
To improve the DNS Provider lookup of address to server name, we could pass back the data to stash in the provider as a reverse lookup cache. This would allow SRV to do addr -> A lookup, and others to do addr -> host lookup, simplifying the Provider.ServerName() function.

I can drop the “--grpc-client-dns-server-name” flag and then use “dns” in the “--grpc-client-server-name” to trigger addition of server name from DNS in the dialOpts, as “dns” is likely an illegal server name anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka @squat ping ^

CHANGELOG.md Outdated Show resolved Hide resolved
cmd/thanos/query.go Outdated Show resolved Hide resolved
pkg/query/storeset.go Outdated Show resolved Hide resolved
pkg/query/storeset.go Outdated Show resolved Hide resolved
pkg/query/storeset.go Outdated Show resolved Hide resolved
* Comment formatting
* Changelog formatting
* Remove nginx references

Signed-off-by: JP Sullivan <jonpsull@cisco.com>
@@ -240,7 +247,11 @@ func runQuery(
func() (specs []query.StoreSpec) {
// Add DNS resolved addresses.
for _, addr := range dnsProvider.Addresses() {
specs = append(specs, query.NewGRPCStoreSpec(addr, false))
if dnsServerName {
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, dnsProvider.ServerName(addr)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, dnsProvider.ServerName(addr)))
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, serverName)

can we just use serverName provided from --grpc-client-server-name instead of relying on dns?

Copy link
Author

@j3p0uk j3p0uk May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the code as-is will use a provided server-name (--grpc-client-server-name). This works when using a single store from the querier. But adding multiple store options to collect data from many stores will not be able to use individual server names as the code currently stands.

The change intended here is to allow for multiple stores to each use different server names.

@stale
Copy link

stale bot commented Jul 10, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jul 10, 2020
@j3p0uk
Copy link
Author

j3p0uk commented Jul 10, 2020

Needs updating with new design as detailed in #2407 (comment) and

@stale stale bot removed the stale label Jul 10, 2020
@stale
Copy link

stale bot commented Sep 9, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Sep 9, 2020
@kakkoyun
Copy link
Member

Needs updating with new design as detailed in #2407 (comment)

@j3p0uk Friendly ping. Are you still planning to work on this?

@stale stale bot removed the stale label Sep 10, 2020
@j3p0uk
Copy link
Author

j3p0uk commented Sep 10, 2020

Needs updating with new design as detailed in #2407 (comment)

@j3p0uk Friendly ping. Are you still planning to work on this?

I want to, but I haven't yet found the time to do so, and don't see myself getting to it in the short term :(

@kakkoyun
Copy link
Member

All good. I just asked so that we can keep this open.

@stale
Copy link

stale bot commented Nov 9, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Nov 9, 2020
@stale stale bot closed this Nov 16, 2020
@billylindeman
Copy link

Is this still being tracked? @kakkoyun @j3p0uk I just hit this issue today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

thanos+ingress-nginx+grpc: impossible setup due missing host header
7 participants