Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement thin vertical slice of system-generated notifications #13537

Open
wants to merge 5 commits into
base: dk/system-notifications-database
Choose a base branch
from

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Jun 11, 2024

NOTE: this is a stacked PR, please review #13536 as well.

This introduces a new subsystem into Coder to enable notifications (see milestone). In this PR, I have implemented a very thin vertical slice of this functionality:

  • immutable system-generated notifications only (later we will expand this to operator-managed notifications)
  • SMTP and webhook delivery only
    • SMTP auth & TLS is not yet enabled; will be added subsequently
    • additional notification methods to come later as well

This functionality is also hidden behind an experiment.


The main entities in this package are:

Notifier is responsible for:

  • periodically dequeuing pending notification_messages entries
  • preparing them for delivery
  • delivering them
  • informing the Manager of successful/failed deliveries

Manager is a controller which is responsible for:

  • facilitating the enqueuing of new messages
  • starting up queue consumers (Notifiers)
  • periodically updating the store with delivery updates
  • managing its own and the Notifiers' lifecycles (run/shutdown)

The queue has been designed to be safe for concurrent use. This can either mean multiple coderd instances running side-by-side and/or each coderd having multiple queue consumers.

Copy link
Contributor Author

dannykopping commented Jun 11, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @dannykopping and the rest of your teammates on Graphite Graphite

@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch 2 times, most recently from a1b4e2b to d2a947a Compare June 11, 2024 12:52
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 1cb776a to 25636ec Compare June 11, 2024 13:15
@dannykopping dannykopping force-pushed the dk/system-notifications-lib branch 2 times, most recently from dc359de to 2e16d76 Compare June 11, 2024 13:40
@dannykopping dannykopping changed the title feat: system-generated notifications feat: implement thin vertical slice of system-generated notifications Jun 11, 2024
@dannykopping dannykopping marked this pull request as ready for review June 11, 2024 14:13
Comment on lines +56 to +57
"github.com/coder/coder/v2/coderd/database/dbauthz"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move with the rest of the coder/coder imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange how the formatter doesn't do this for us; will do 👍

cli/server.go Show resolved Hide resolved
cli/server.go Show resolved Hide resolved
Comment on lines +330 to +376
--notifications-dispatch-timeout duration, $CODER_NOTIFICATIONS_DISPATCH_TIMEOUT (default: 1m0s)
How long to wait while a notification is being sent before giving up.

--notifications-fetch-interval duration, $CODER_NOTIFICATIONS_FETCH_INTERVAL (default: 15s)
How often to query the database for queued notifications.

--notifications-lease-count int, $CODER_NOTIFICATIONS_LEASE_COUNT (default: 10)
How many notifications a notifier should lease per fetch interval.

--notifications-lease-period duration, $CODER_NOTIFICATIONS_LEASE_PERIOD (default: 2m0s)
How long a notifier should lease a message. This is effectively how
long a notification is 'owned' by a notifier, and once this period
expires it will be available for lease by another notifier. Leasing is
important in order for multiple running notifiers to not pick the same
messages to deliver concurrently. This lease period will only expire
if a notifier shuts down ungracefully; a dispatch of the notification
releases the lease.

--notifications-max-send-attempts int, $CODER_NOTIFICATIONS_MAX_SEND_ATTEMPTS (default: 5)
The upper limit of attempts to send a notification.

--notifications-method string, $CODER_NOTIFICATIONS_METHOD (default: smtp)
Which delivery method to use (available options: 'smtp', 'webhook').

--notifications-retry-interval duration, $CODER_NOTIFICATIONS_RETRY_INTERVAL (default: 5m0s)
The minimum time between retries.

--notifications-store-sync-buffer-size int, $CODER_NOTIFICATIONS_STORE_SYNC_BUFFER_SIZE (default: 50)
The notifications system buffers message updates in memory to ease
pressure on the database. This option controls how many updates are
kept in memory. The lower this value the lower the change of state
inconsistency in a non-graceful shutdown - but it also increases load
on the database. It is recommended to keep this option at its default
value.

--notifications-store-sync-interval duration, $CODER_NOTIFICATIONS_STORE_SYNC_INTERVAL (default: 2s)
The notifications system buffers message updates in memory to ease
pressure on the database. This option controls how often it
synchronizes its state with the database. The shorter this value the
lower the change of state inconsistency in a non-graceful shutdown -
but it also increases load on the database. It is recommended to keep
this option at its default value.

--notifications-worker-count int, $CODER_NOTIFICATIONS_WORKER_COUNT (default: 2)
How many workers should be processing messages in the queue; increase
this count if notifications are not being processed fast enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like a lot of these will end up never being touched and could probably just be hard-coded to sensible defaults. However since you've already exposed the knobs, I'm happy to leave them Just In Case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we should cut this down to size. Exposed knobs have a cost in terms of perceived complexity in our coder server -h output and docs. There isn't a good reason for operators to be mucking around with things like worker counts, sync intervals, buffer sizes, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. It did feel a bit icky exposing all of these.
I mainly did this to document the various inputs (and encouraged operators to not change certain settings), but I'll move that documentation inline instead.

coderd/database/dbmem/dbmem.go Show resolved Hide resolved
coderd/notifications/types/payload.go Show resolved Hide resolved
// display and delivery.
type Labels map[string]string

func (l Labels) GetStrict(k string) (string, bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: GetOk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not loving either name, TBH.
Maybe TryGet? GetSafe? GetIfExists?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only reason I suggested GetOk is due to having seen similar verbiage elsewhere: https://pkg.go.dev/github.com/hashicorp/terraform/helper/schema#ResourceData.GetOk

Feel free to ignore this, it's just a nit and non-blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, prior art does make me feel a bit better about GetOk 👍
Thanks

Comment on lines +30 to +35
select {
case ok := <-manager.enqueued:
require.True(t, ok)
case <-time.After(testutil.WaitShort):
t.Fatalf("timed out")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: testutil.RequireRecvCtx

)

// GoTemplate attempts to substitute the given payload into the given template using Go's templating syntax.
// TODO: memoize templates for memory efficiency?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential follow-up: add a benchmark test for sending notifications

coderd/notifications/provider.go Show resolved Hide resolved
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 1c4046c to 538bc42 Compare June 12, 2024 09:08
Comment on lines +56 to +57
"github.com/coder/coder/v2/coderd/database/dbauthz"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange how the formatter doesn't do this for us; will do 👍

cli/server.go Show resolved Hide resolved
cli/server.go Show resolved Hide resolved
Comment on lines +330 to +376
--notifications-dispatch-timeout duration, $CODER_NOTIFICATIONS_DISPATCH_TIMEOUT (default: 1m0s)
How long to wait while a notification is being sent before giving up.

--notifications-fetch-interval duration, $CODER_NOTIFICATIONS_FETCH_INTERVAL (default: 15s)
How often to query the database for queued notifications.

--notifications-lease-count int, $CODER_NOTIFICATIONS_LEASE_COUNT (default: 10)
How many notifications a notifier should lease per fetch interval.

--notifications-lease-period duration, $CODER_NOTIFICATIONS_LEASE_PERIOD (default: 2m0s)
How long a notifier should lease a message. This is effectively how
long a notification is 'owned' by a notifier, and once this period
expires it will be available for lease by another notifier. Leasing is
important in order for multiple running notifiers to not pick the same
messages to deliver concurrently. This lease period will only expire
if a notifier shuts down ungracefully; a dispatch of the notification
releases the lease.

--notifications-max-send-attempts int, $CODER_NOTIFICATIONS_MAX_SEND_ATTEMPTS (default: 5)
The upper limit of attempts to send a notification.

--notifications-method string, $CODER_NOTIFICATIONS_METHOD (default: smtp)
Which delivery method to use (available options: 'smtp', 'webhook').

--notifications-retry-interval duration, $CODER_NOTIFICATIONS_RETRY_INTERVAL (default: 5m0s)
The minimum time between retries.

--notifications-store-sync-buffer-size int, $CODER_NOTIFICATIONS_STORE_SYNC_BUFFER_SIZE (default: 50)
The notifications system buffers message updates in memory to ease
pressure on the database. This option controls how many updates are
kept in memory. The lower this value the lower the change of state
inconsistency in a non-graceful shutdown - but it also increases load
on the database. It is recommended to keep this option at its default
value.

--notifications-store-sync-interval duration, $CODER_NOTIFICATIONS_STORE_SYNC_INTERVAL (default: 2s)
The notifications system buffers message updates in memory to ease
pressure on the database. This option controls how often it
synchronizes its state with the database. The shorter this value the
lower the change of state inconsistency in a non-graceful shutdown -
but it also increases load on the database. It is recommended to keep
this option at its default value.

--notifications-worker-count int, $CODER_NOTIFICATIONS_WORKER_COUNT (default: 2)
How many workers should be processing messages in the queue; increase
this count if notifications are not being processed fast enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. It did feel a bit icky exposing all of these.
I mainly did this to document the various inputs (and encouraged operators to not change certain settings), but I'll move that documentation inline instead.

coderd/database/dbmem/dbmem.go Show resolved Hide resolved
coderd/notifications/dispatch/webhook.go Show resolved Hide resolved

eg.Go(func() error {
// Every interval, collect the messages in the channels and bulk update them in the database.
tick := time.NewTicker(m.cfg.StoreSyncInterval.Value())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool idea, I'll look into it.

coderd/notifications/provider.go Show resolved Hide resolved
// display and delivery.
type Labels map[string]string

func (l Labels) GetStrict(k string) (string, bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not loving either name, TBH.
Maybe TryGet? GetSafe? GetIfExists?

coderd/notifications/types/payload.go Show resolved Hide resolved
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
…ightly

Signed-off-by: Danny Kopping <danny@coder.com>
Comment on lines +42 to +43
// TODO: implement auth
// TODO: implement TLS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I was looking for the field to enter credentials


var d net.Dialer
// Outer context has a deadline (see CODER_NOTIFICATIONS_DISPATCH_TIMEOUT).
conn, err = d.DialContext(ctx, "tcp", fmt.Sprintf("%s:%s", smarthost, smarthostPort))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it connect & disconnect everytime there is a notification to send?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. This is how Alertmanager does it currently.
We could establish multiple connections and reuse them if SMTP supports it, which I suspect it would.

</div>
<div style="text-align: center; padding: 10px 0; border-top: 1px solid #34495E; margin-top: 20px; color: #BDC3C7;">
<!-- TODO: dynamic copyright -->
&copy; 2024 Coder. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dynamic year + version for bonus points ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd also probably need the custom "App Name" here when whitelabeled.


// Prepare request.
// Outer context has a deadline (see CODER_NOTIFICATIONS_DISPATCH_TIMEOUT).
req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewBuffer(m))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in theory HTTP method could be configurable too. Otherwise it is POST-JSON-dispatcher.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think POST is the most idiomatic method to use for this.
Which other(s) would you expect folks to want to use?

import "github.com/google/uuid"

// Workspaces.
var TemplateWorkspaceDeleted = uuid.MustParse("'f517da0b-cdc9-410f-ab89-a86107c420ed'") // ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: interesting, why ' '?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect when I copied this from the db table it added the single quotes. Good catch 👍

@@ -2233,6 +2456,7 @@ const (
ExperimentAutoFillParameters Experiment = "auto-fill-parameters" // This should not be taken out of experiments until we have redesigned the feature.
ExperimentMultiOrganization Experiment = "multi-organization" // Requires organization context for interactions, default org is assumed.
ExperimentCustomRoles Experiment = "custom-roles" // Allows creating runtime custom roles
ExperimentNotifications Experiment = "notifications"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be worth adding a comment

@@ -327,6 +327,68 @@ can safely ignore these settings.
Minimum supported version of TLS. Accepted values are "tls10",
"tls11", "tls12" or "tls13".

NOTIFICATIONS OPTIONS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking loud: maybe you can start with max 3 options and consider others to be hardcoded for now?

// The intermediary SMTP host through which emails are sent (host:port).
Smarthost serpent.HostPort `json:"smarthost" typescript:",notnull"`
// The hostname identifying the SMTP server.
Hello serpent.String `json:"hello" typescript:",notnull"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello or EHLO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More human-readable, mimicks what Alertmanager does.


// Labels represents the metadata defined in a notification message, which will be used to augment the notification
// display and delivery.
type Labels map[string]string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add extra unit tests for basic operations on Labels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good idea.

"golang.org/x/xerrors"
)

// TODO: this is (mostly) a copy of coderd/parameter/renderer.go; unify them?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR we use it to render plaintext descriptions. As long as there aren't any hidden corner cases, you can unify them 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I changed was adding parser.HardLineBreak to the HTML output.
Cool, I'll unify.

Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping dannykopping force-pushed the dk/system-notifications-database branch from 538bc42 to 438bcba Compare June 12, 2024 10:51
notifierMu sync.Mutex

handlers *HandlerRegistry
helpers map[string]any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be template.FuncMap. I realize the underlying type is map[string]any but if you look at the definition of template.FuncMap there is a comment explaining what it really has to be

// Manager maintains a group of notifiers: these consume the queue of notification messages in the store.
//
// Notifiers dequeue messages from the store _CODER_NOTIFICATIONS_LEASE_COUNT_ at a time and concurrently "dispatch" these messages, meaning they are
// sent by their respective methods (email, webhook, etc).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are concurrently dispatching the messages, what's the point of having multiple notifiers? Or are you saying each notifier dispatches sequentially but the notifiers are running concurrently?

"github.com/coder/coder/v2/coderd/database"
)

type HandlerRegistry struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels unnecessary. It's just a map that returns an error instead of a found bool, and you wrap the error again anyway the one time you call Resolve().

Using a plain map[database.NotificationMethod]Handler in the Manager would be clearer.

)

var (
singleton Enqueuer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very strongly against this Singleton pattern.

  1. It will make it a nightmare to test that other parts of the code send notifications when they should
  2. It obscures side effects when you look at the state and signatures of other objects.

// Enqueue queues a notification message for later delivery.
// Messages will be dequeued by a notifier later and dispatched.
func (m *Manager) Enqueue(ctx context.Context, userID, templateID uuid.UUID, labels types.Labels, createdBy string, targets ...uuid.UUID) (*uuid.UUID, error) {
payload, err := m.buildPayload(ctx, userID, templateID, labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to compute the payload before the message goes into the database?

If you didn't need to compute the payload here, then you wouldn't need a Manager to enqueue notifications, we could have a very lightweight "enqueuer" that just needs a database.Store. I think that would alleviate some of the plumbing that motivated you to have a singleton.

But, even if we do need to compute the payload prior to enqueuing, I think Manager should be split into enqueueing and dispatching. No consumer of this package needs to do both things at the same time, and they don't share any state.

var eg errgroup.Group
for i := 0; i < notifiers; i++ {
eg.Go(func() error {
m.notifierMu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so m.notifiers gets used here and in Stop(), so presumably we are worried about what happens if Stop() is called concurrently with this.

First of all, Stop() never takes this lock.

Secondly, Stop() stops all the notifiers that have been added, but if its running concurrently with this, there is nothing that prevents more notifiers from getting added that never get stopped.

func newNotifier(ctx context.Context, cfg codersdk.NotificationsConfig, id int, log slog.Logger, db Store, hr *HandlerRegistry) *notifier {
return &notifier{
id: id,
ctx: ctx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context is set, but never read

// In the case of backpressure (i.e. the success/failure channels are full because the database is slow),
// and this caused delivery timeout (CODER_NOTIFICATIONS_DISPATCH_TIMEOUT), we can't append any more updates to
// the channels otherwise this, too, will block.
if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how success/failure channels being full can cause a delivery timeout, given that you only attempt to write to the channel after calling deliver

return nil
}

var eg errgroup.Group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we go through a lot of effort piping these errors up from the delivery function, then collecting them in an error group, and the end result is that the call of this function just logs the error.

Instead, we should just log the errors down in the delivery functions and not return errors at all.

Comment on lines +1528 to +1530
system.NotifyWorkspaceDeleted(ctx, workspace.OwnerID, workspace.Name, reason, "provisionerdserver",
// Associate this notification with all the related entities.
workspace.ID, workspace.OwnerID, workspace.TemplateID, workspace.OrganizationID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these ID's always passed with a workspace notification?

Would it make sense to have the argument be database.Workspace and then the system package know to pull out the various fields of a workspace? If there is a circular dependency, one thing we do with RBAC is define an interface that almost all database types implement in modelmethods.

I have not read through all the code, this just feels susceptible to missing fields if we add a NotifyWorkspaceCreated or other verbs.

@github-actions github-actions bot added the stale This issue is like stale bread. label Jun 20, 2024
@github-actions github-actions bot closed this Jun 23, 2024
@johnstcn johnstcn reopened this Jun 23, 2024
@johnstcn johnstcn removed the stale This issue is like stale bread. label Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants