Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: Add servername to grpc dial options for DNS stores (Fixes #1507) #2407

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,8 @@ We use *breaking* word for marking changes that are not backward compatible (rel

## Unreleased

- [#2407](https://github.com/thanos-io/thanos/pull/2407) Query: Add servername to gRPC dial options for DNS stores (Fixes #1507)

### Fixed

- [#2288](https://github.com/thanos-io/thanos/pull/2288) Ruler: Fixes issue #2281 bug in ruler with parsing query addr with path prefix
Expand Down
17 changes: 16 additions & 1 deletion cmd/thanos/query.go
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application) {
key := cmd.Flag("grpc-client-tls-key", "TLS Key for the client's certificate").Default("").String()
caCert := cmd.Flag("grpc-client-tls-ca", "TLS CA Certificates to use to verify gRPC servers").Default("").String()
serverName := cmd.Flag("grpc-client-server-name", "Server name to verify the hostname on the returned gRPC certificates. See https://tools.ietf.org/html/rfc4366#section-3.1").Default("").String()
dnsServerName := cmd.Flag("grpc-client-dns-server-name", "For stores that are DNS, use the dns name as the server name for connection. Only applicable if using the secure flag").Default("false").Bool()

webRoutePrefix := cmd.Flag("web.route-prefix", "Prefix for API and UI endpoints. This allows thanos UI to be served on a sub-path. This option is analogous to --web.route-prefix of Promethus.").Default("").String()
webExternalPrefix := cmd.Flag("web.external-prefix", "Static prefix for all HTML links and redirect URLs in the UI query web interface. Actual endpoints are still served on / or the web.route-prefix. This allows thanos UI to be served behind a reverse proxy that strips a URL sub-path.").Default("").String()
Expand Down Expand Up @@ -107,6 +108,10 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application) {
storeResponseTimeout := modelDuration(cmd.Flag("store.response-timeout", "If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.").Default("0ms"))

m[comp.String()] = func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ <-chan struct{}, _ bool) error {
if *dnsServerName && !*secure {
return errors.New("grpc-client-dns-server-name specified without grpc-client-tls-secure")
}

selectorLset, err := parseFlagLabels(*selectorLabels)
if err != nil {
return errors.Wrap(err, "parse federation labels")
Expand Down Expand Up @@ -147,6 +152,7 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application) {
*key,
*caCert,
*serverName,
*dnsServerName,
*httpBindAddr,
time.Duration(*httpGracePeriod),
*webRoutePrefix,
Expand Down Expand Up @@ -188,6 +194,7 @@ func runQuery(
key string,
caCert string,
serverName string,
dnsServerName bool,
httpBindAddr string,
httpGracePeriod time.Duration,
webRoutePrefix string,
Expand Down Expand Up @@ -240,7 +247,11 @@ func runQuery(
func() (specs []query.StoreSpec) {
// Add DNS resolved addresses.
for _, addr := range dnsProvider.Addresses() {
specs = append(specs, query.NewGRPCStoreSpec(addr, false))
if dnsServerName {
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, dnsProvider.ServerName(addr)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, dnsProvider.ServerName(addr)))
specs = append(specs, query.NewGRPCStoreSpecServerName(addr, false, serverName)

can we just use serverName provided from --grpc-client-server-name instead of relying on dns?

Copy link
Author

@j3p0uk j3p0uk May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the code as-is will use a provided server-name (--grpc-client-server-name). This works when using a single store from the querier. But adding multiple store options to collect data from many stores will not be able to use individual server names as the code currently stands.

The change intended here is to allow for multiple stores to each use different server names.

} else {
specs = append(specs, query.NewGRPCStoreSpec(addr, false))
}
}
// Add strict & static nodes.
for _, addr := range strictStores {
Expand All @@ -253,6 +264,10 @@ func runQuery(
},
dialOpts,
unhealthyStoreTimeout,
dnsServerName,
cert,
key,
caCert,
)
proxy = store.NewProxyStore(logger, reg, stores.Get, component.Query, selectorLset, storeResponseTimeout)
queryableCreator = query.NewQueryableCreator(logger, proxy)
Expand Down
4 changes: 4 additions & 0 deletions docs/components/query.md
Original file line number Diff line number Diff line change
Expand Up @@ -288,6 +288,10 @@ Flags:
Server name to verify the hostname on the
returned gRPC certificates. See
https://tools.ietf.org/html/rfc4366#section-3.1
--grpc-client-dns-server-name
For stores that are DNS, use the dns name as
the server name for connection. Only applicable
if using the secure flag
--web.route-prefix="" Prefix for API and UI endpoints. This allows
thanos UI to be served on a sub-path. This
option is analogous to --web.route-prefix of
Expand Down
16 changes: 16 additions & 0 deletions pkg/discovery/dns/provider.go
Original file line number Diff line number Diff line change
Expand Up @@ -150,3 +150,19 @@ func (p *Provider) Addresses() []string {
}
return result
}

// ServerName returns the server name for an address.
func (p *Provider) ServerName(addr string) string {
p.Lock()
defer p.Unlock()

for name, addrs := range p.resolved {
for _, a := range addrs {
if addr == a {
_, n := GetQTypeName(name)
return strings.SplitN(n, ":", 2)[0]
}
}
}
return ""
}
41 changes: 39 additions & 2 deletions pkg/query/storeset.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,9 @@ import (
"github.com/thanos-io/thanos/pkg/runutil"
"github.com/thanos-io/thanos/pkg/store"
"github.com/thanos-io/thanos/pkg/store/storepb"
"github.com/thanos-io/thanos/pkg/tls"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
)

const (
Expand All @@ -38,6 +40,8 @@ type StoreSpec interface {
Metadata(ctx context.Context, client storepb.StoreClient) (labelSets []storepb.LabelSet, mint int64, maxt int64, storeType component.StoreAPI, err error)
// StrictStatic returns true if the StoreAPI has been statically defined and it is under a strict mode.
StrictStatic() bool
// ServerName returns the StoreAPI's server name for the store spec.
ServerName() string
}

type StoreStatus struct {
Expand All @@ -53,12 +57,19 @@ type StoreStatus struct {
type grpcStoreSpec struct {
addr string
strictstatic bool
serverName string
}

// NewGRPCStoreSpecServerName creates store pure gRPC spec with a server name.
// It uses Info gRPC call to get Metadata.
func NewGRPCStoreSpecServerName(addr string, strictstatic bool, serverName string) StoreSpec {
return &grpcStoreSpec{addr: addr, strictstatic: strictstatic, serverName: serverName}
}

// NewGRPCStoreSpec creates store pure gRPC spec.
// It uses Info gRPC call to get Metadata.
func NewGRPCStoreSpec(addr string, strictstatic bool) StoreSpec {
return &grpcStoreSpec{addr: addr, strictstatic: strictstatic}
return &grpcStoreSpec{addr: addr, strictstatic: strictstatic, serverName: ""}
}

// StrictStatic returns true if the StoreAPI has been statically defined and it is under a strict mode.
Expand All @@ -71,6 +82,10 @@ func (s *grpcStoreSpec) Addr() string {
return s.addr
}

func (s *grpcStoreSpec) ServerName() string {
return s.serverName
}

// Metadata method for gRPC store API tries to reach host Info method until context timeout. If we are unable to get metadata after
// that time, we assume that the host is unhealthy and return error.
func (s *grpcStoreSpec) Metadata(ctx context.Context, client storepb.StoreClient) (labelSets []storepb.LabelSet, mint int64, maxt int64, storeType component.StoreAPI, err error) {
Expand Down Expand Up @@ -165,6 +180,10 @@ type StoreSet struct {
storeSpecs func() []StoreSpec
dialOpts []grpc.DialOption
gRPCInfoCallTimeout time.Duration
dnsServerName bool
cert string
key string
caCert string

updateMtx sync.Mutex
storesMtx sync.RWMutex
Expand All @@ -186,6 +205,10 @@ func NewStoreSet(
storeSpecs func() []StoreSpec,
dialOpts []grpc.DialOption,
unhealthyStoreTimeout time.Duration,
dnsServerName bool,
cert string,
key string,
caCert string,
) *StoreSet {
storesMetric := newStoreSetNodeCollector()
if reg != nil {
Expand All @@ -208,6 +231,10 @@ func NewStoreSet(
stores: make(map[string]*storeRef),
storeStatuses: make(map[string]*StoreStatus),
unhealthyStoreTimeout: unhealthyStoreTimeout,
dnsServerName: dnsServerName,
cert: cert,
key: key,
caCert: caCert,
}
return ss
}
Expand Down Expand Up @@ -420,7 +447,17 @@ func (s *StoreSet) getActiveStores(ctx context.Context, stores map[string]*store
st, seenAlready := stores[addr]
if !seenAlready {
// New store or was unactive and was removed in the past - create new one.
conn, err := grpc.DialContext(ctx, addr, s.dialOpts...)
dialOpts := s.dialOpts
// Update the dialOpts if we are asked to add dnsServerName.
if s.dnsServerName {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this seems a bit odd. The NewStoreSet func already accepts a slice of dialopts. I think we should expect that the dialopts has the correct WithTransportCredentials already set, no? ie correctly configure this option in cmd/Thanos/query rather than leaking configuration into this struct. Also, it looks to me like the server name is already set in the dialopts generated with extgrpc https://github.com/j3p0uk/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/pkg/extgrpc/client.go#L56

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, taking out this new dial opt and server name bool from this struct would eliminate the need for extending the StoreSpec interface and adding the extra field to the grpcStoreSpec struct

Copy link
Author

@j3p0uk j3p0uk Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug, as it stands, is that the arguments provided to the query at start time do not allow for individual server names per configured store. The store set gets populated with the resolved addresses of the stores in the store spec and not the original server name, as far as I could tell. Given this, I attempted to pass minimal data through to generate the TLS portion of the dialOpts on demand at connection time by retrieving the server name from the dns provider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is a bit weird because with this we not only have two places where these options are set but also this newly introduced option overrides the other one. It should probably be refactored so that query.NewGRPCStoreSpec* would hook into some function which would give it a list of opts to use. We could discuss it on #thanos-dev.
Also, I wonder if there any downsides to enabling this kind of feature by default for everyone, and then only allow overriding the server name via the option that exists already? That would probably simplify things too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, happy to discuss further.

My thoughts are that this looks odd because a StoreSet is a set of Stores, and the code assumes that each of the stores shares a particular set of characteristics, which may not be true (and is the basis for this bug fix). The StoreSet currently holds attributes that can differ on a per-store basis. The ServerName is one such attribute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think we need to refactor this because the current approach causes the StoreSet to have duplicate WithTransportCredentials dialopts, one of which overrides the other. This makes it hard to reason about the code IMO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chat in slack...

Sounds like implementation of this would be something like removing dialOpts from the StoreSet, and adding it to the StoreSpec.
We should be able to generate the dialOpts at creation of the StoreSpec array in the StoreSet, so would only be once at startup.
Need to decide on what flags should exist, and how would they interact and how to handle address-specified stores and DNS SRV records.

For the SRV case, I think the correct behaviour would be to enable lookup of the A record resolved from the SRV lookup. This data is currently lost during resolution, as only the IPs are returned to the provider from resolver, and stored under the key of the original store spec.
To improve the DNS Provider lookup of address to server name, we could pass back the data to stash in the provider as a reverse lookup cache. This would allow SRV to do addr -> A lookup, and others to do addr -> host lookup, simplifying the Provider.ServerName() function.

I can drop the “--grpc-client-dns-server-name” flag and then use “dns” in the “--grpc-client-server-name” to trigger addition of server name from DNS in the dialOpts, as “dns” is likely an illegal server name anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka @squat ping ^

tlsCfg, err := tls.NewClientConfig(s.logger, s.cert, s.key, s.caCert, spec.ServerName())
if err != nil {
level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "setting TLS"), "address", addr)
return
}
dialOpts = append(s.dialOpts, grpc.WithTransportCredentials(credentials.NewTLS(tlsCfg)))
}
conn, err := grpc.DialContext(ctx, addr, dialOpts...)
if err != nil {
s.updateStoreStatus(&storeRef{addr: addr}, err)
level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "dialing connection"), "address", addr)
Expand Down
6 changes: 3 additions & 3 deletions pkg/query/storeset_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -184,7 +184,7 @@ func TestStoreSet_Update(t *testing.T) {
specs = append(specs, NewGRPCStoreSpec(addr, false))
}
return specs
}, testGRPCOpts, time.Minute)
}, testGRPCOpts, time.Minute, false, "", "", "")
storeSet.gRPCInfoCallTimeout = 2 * time.Second
defer storeSet.Close()

Expand Down Expand Up @@ -532,7 +532,7 @@ func TestStoreSet_Update_NoneAvailable(t *testing.T) {
specs = append(specs, NewGRPCStoreSpec(addr, false))
}
return specs
}, testGRPCOpts, time.Minute)
}, testGRPCOpts, time.Minute, false, "", "", "")
storeSet.gRPCInfoCallTimeout = 2 * time.Second

// Should not matter how many of these we run.
Expand Down Expand Up @@ -596,7 +596,7 @@ func TestQuerierStrict(t *testing.T) {
NewGRPCStoreSpec(st.StoreAddresses()[0], true),
NewGRPCStoreSpec(st.StoreAddresses()[1], false),
}
}, testGRPCOpts, time.Minute)
}, testGRPCOpts, time.Minute, false, "", "", "")
defer storeSet.Close()
storeSet.gRPCInfoCallTimeout = 1 * time.Second

Expand Down