
Write last xds snapshot to persisted storage #6115

Open
kdorosh opened this issue Mar 18, 2022 · 8 comments
Labels
Area: Stability (issues related to stability of the product, engineering, tech debt) · no stalebot (this issue won't be closed by stalebot even after inactivity) · stale (issues that are stale; these will not be prioritized without further engagement on the issue) · Type: Enhancement (new feature or request) · Type: EPIC

Comments

@kdorosh
Contributor

kdorosh commented Mar 18, 2022

Version

No response

Is your feature request related to a problem? Please describe.

Ensure Gloo Edge is reliable across all pod restarts and invalid configuration.

The shortest path to ensuring reliable xDS configuration is always being served is to write the last-acked xDS snapshot to persistent storage, and load that if Gloo translation is unable to complete (related: #6114).

Describe the solution you'd like

Alternative solution: write the last xDS cache to persistent storage. Size: M, Risk: M

Describe alternatives you've considered

No response

Additional Context

Downside: this breaks multitenancy, which may or may not be a product requirement in all Gloo deployments.

@kdorosh
Contributor Author

kdorosh commented Mar 21, 2022

Per discussion with @nrjpoddar and @kcbabo, the preferred long-term solution is #6114.

That change is larger and riskier; in the meantime we will add this support (temporarily), then deprecate and remove it once the other feature is implemented and well tested in the field.

@chrisgaun

chrisgaun commented Mar 23, 2022

@kdorosh we need to consider the implications of adding private keys (certificates, when not using SDS) to a PV. They would like to have the persistence in HA Redis. This can be follow-up work.

@kdorosh
Contributor Author

kdorosh commented Mar 23, 2022

@chrisgaun as noted earlier, the preferred long-term solution is #6114 so all state is stored in etcd.

In the meantime, an encrypted volume (e.g. https://kubernetes.io/docs/concepts/storage/storage-classes/#aws-ebs) may be acceptable.
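For reference, the linked Kubernetes docs show how encryption can be requested through a StorageClass for the in-tree AWS EBS provisioner; a sketch (the name `encrypted-gp2` is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
```

A PVC referencing this StorageClass would then get an encrypted EBS volume, mitigating (though not eliminating) the private-key exposure concern above.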

We could explore HA Redis, but that seems similar to making xds-relay HA, which might be preferable; #6114 is still preferred in my opinion, though.

@kdorosh
Contributor Author

kdorosh commented Mar 29, 2022

Related blocker I ran into while doing the work: solo-io/solo-kit#461

@kdorosh
Contributor Author

kdorosh commented Apr 7, 2022

related: #5022

steps to reproduce:

  • kind create cluster --name kind --image kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6
  • glooctl install gateway enterprise --version 1.11.0-beta8 --license-key $LICENSE_KEY
    • glooctl install gateway --version 1.12.0-beta1
  • kubectl scale -n gloo-system deploy/discovery --replicas 0
  • kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.2.9/example/petstore/petstore.yaml
  • apply the following yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - exact: /all-pets
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore-8080
            namespace: gloo-system
    - matchers:
      - exact: /all-pets2
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore2-8080
            namespace: gloo-system
---
apiVersion: v1
kind: Service
metadata:
  name: petstore
  namespace: default
  labels:
    service: petstore
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: petstore
---
apiVersion: v1
kind: Service
metadata:
  name: petstore2
  namespace: default
  labels:
    service: petstore
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: petstore
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore
    serviceNamespace: default
    servicePort: 8080
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore2-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore2
    serviceNamespace: default
    servicePort: 8080
  • kubectl port-forward -n gloo-system deploy/gateway-proxy 8080
  • see both curls work
    • curl -H "Host: foo" localhost:8080/all-pets
    • curl -H "Host: foo" localhost:8080/all-pets2
  • kubectl delete svc petstore2
  • see first curl work, second returns http 503
    • curl -H "Host: foo" localhost:8080/all-pets
    • curl -H "Host: foo" localhost:8080/all-pets2
  • kubectl delete po -n gloo-system --all
  • see both curls FAIL
    • curl -H "Host: foo" localhost:8080/all-pets
    • curl -H "Host: foo" localhost:8080/all-pets2 --> only this one should fail
  • notice in the logs
{"level":"warn","ts":"2022-04-07T14:55:55.263Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.reporter","caller":"reporter/reporter.go:255","msg":"failed to write status state:Warning reason:\"warning: \\n  1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace \"default\"\n\n]","version":"1.11.0-beta7"}
{"level":"error","ts":"2022-04-07T14:55:55.265Z","logger":"gloo-ee.v1.event_loop.setup","caller":"setup/setup_syncer.go:668","msg":"gloo main event loop","version":"1.11.0-beta7","error":"event_loop.gloo: 1 error occurred:\n\t* writing reports: 1 error occurred:\n\t* failed to write status state:Warning reason:\"warning: \\n  1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace \"default\"\n\n]\n\n\n\n","errorVerbose":"1 error occurred:\n\t* writing reports: 1 error occurred:\n\t* failed to write status state:Warning reason:\"warning: \\n  1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace 
\"default\"\n\n]\n\n\n\n\nevent_loop.gloo\ngithub.com/solo-io/go-utils/errutils.AggregateErrs\n\t/go/pkg/mod/github.com/solo-io/go-utils@v0.21.24/errutils/aggregate_errs.go:19\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/syncer/setup.RunGlooWithExtensions.func6\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.11.0-beta11/projects/gloo/pkg/syncer/setup/setup_syncer.go:668"}

update: if we don't run the route replacement sanitizer, then we don't have this issue:

return xdsSnapshot, reports.ValidateStrict()
(i.e., return the xDS snapshot and a nil error here)

update: old config is also stuck; e.g., after deleting the service but before rolling the pods, apply:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - exact: /all-pets
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore-8080
            namespace: gloo-system
    - matchers:
      - exact: /all-pets2
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore2-8080
            namespace: gloo-system
    - matchers:
      - exact: /all-pets3
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore3-8080
            namespace: gloo-system
---
apiVersion: v1
kind: Service
metadata:
  name: petstore3
  namespace: default
  labels:
    service: petstore
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: petstore
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore3-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore3
    serviceNamespace: default
    servicePort: 8080

@kdorosh kdorosh mentioned this issue Apr 7, 2022
@kdorosh
Copy link
Contributor Author

kdorosh commented Apr 8, 2022

Also highly relevant to the initial ask here of persisting xDS config: this is/was made much harder because we did not follow "accept interfaces, return structs" (https://bryanftan.medium.com/accept-interfaces-return-structs-in-go-d4cab29a301b)

we may want to investigate a refactor to make the implementation more future-proof
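To illustrate the principle from the linked article in this context (a hypothetical sketch; `SnapshotCache`, `memCache`, and `persistingCache` are made-up names, not Gloo's actual types): if the syncer accepted a small interface rather than a concrete cache struct, a persisting decorator could be slotted in without refactoring the call sites.

```go
package main

import "fmt"

// SnapshotCache is the small interface consumers would accept.
type SnapshotCache interface {
	SetSnapshot(node string, snap string) error
}

// memCache is a concrete in-memory implementation (the "return structs" side).
type memCache struct{ snaps map[string]string }

func NewMemCache() *memCache { return &memCache{snaps: map[string]string{}} }

func (c *memCache) SetSnapshot(node, snap string) error {
	c.snaps[node] = snap
	return nil
}

// persistingCache decorates any SnapshotCache with a persistence hook,
// without the consumers of the interface having to change.
type persistingCache struct {
	inner   SnapshotCache
	persist func(node, snap string) error
}

func (p *persistingCache) SetSnapshot(node, snap string) error {
	if err := p.inner.SetSnapshot(node, snap); err != nil {
		return err
	}
	return p.persist(node, snap)
}

// Sync accepts the interface, so either implementation works unchanged.
func Sync(c SnapshotCache, node, snap string) error {
	return c.SetSnapshot(node, snap)
}

func main() {
	persisted := map[string]string{}
	cache := &persistingCache{
		inner:   NewMemCache(),
		persist: func(node, snap string) error { persisted[node] = snap; return nil },
	}
	_ = Sync(cache, "gateway-proxy", "snapshot-v1")
	fmt.Println(persisted["gateway-proxy"]) // snapshot-v1
}
```

Had the code accepted an interface like this from the start, the snapshot-persistence feature discussed in this issue would have been a decorator rather than a refactor.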

@kdorosh
Copy link
Contributor Author

kdorosh commented Apr 13, 2022

This may still be desirable to do depending on how hard it is to rewrite gateway translation to never fail once the gloo and gateway pods merge; fyi @elcasteel @sam-heilbron @nfuden

the code I wrote has been pushed to these branches

@kdorosh kdorosh removed their assignment Apr 13, 2022

github-actions bot commented Jun 2, 2024

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

@github-actions github-actions bot added the stale Issues that are stale. These will not be prioritized without further engagement on the issue. label Jun 2, 2024
@DuncanDoyle DuncanDoyle added Area: Stability Issues related to stability of the product, engineering, tech debt no stalebot This issue won't be closed by stalebot even after inactivity. labels Jul 10, 2024