-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement thin vertical slice of system-generated notifications #13537
base: dk/system-notifications-database
Are you sure you want to change the base?
feat: implement thin vertical slice of system-generated notifications #13537
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @dannykopping and the rest of your teammates on |
a1b4e2b
to
d2a947a
Compare
1cb776a
to
25636ec
Compare
dc359de
to
2e16d76
Compare
d8f51ce
to
284b781
Compare
"github.com/coder/coder/v2/coderd/database/dbauthz" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move with the rest of the coder/coder imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange how the formatter doesn't do this for us; will do 👍
--notifications-dispatch-timeout duration, $CODER_NOTIFICATIONS_DISPATCH_TIMEOUT (default: 1m0s) | ||
How long to wait while a notification is being sent before giving up. | ||
|
||
--notifications-fetch-interval duration, $CODER_NOTIFICATIONS_FETCH_INTERVAL (default: 15s) | ||
How often to query the database for queued notifications. | ||
|
||
--notifications-lease-count int, $CODER_NOTIFICATIONS_LEASE_COUNT (default: 10) | ||
How many notifications a notifier should lease per fetch interval. | ||
|
||
--notifications-lease-period duration, $CODER_NOTIFICATIONS_LEASE_PERIOD (default: 2m0s) | ||
How long a notifier should lease a message. This is effectively how | ||
long a notification is 'owned' by a notifier, and once this period | ||
expires it will be available for lease by another notifier. Leasing is | ||
important in order for multiple running notifiers to not pick the same | ||
messages to deliver concurrently. This lease period will only expire | ||
if a notifier shuts down ungracefully; a dispatch of the notification | ||
releases the lease. | ||
|
||
--notifications-max-send-attempts int, $CODER_NOTIFICATIONS_MAX_SEND_ATTEMPTS (default: 5) | ||
The upper limit of attempts to send a notification. | ||
|
||
--notifications-method string, $CODER_NOTIFICATIONS_METHOD (default: smtp) | ||
Which delivery method to use (available options: 'smtp', 'webhook'). | ||
|
||
--notifications-retry-interval duration, $CODER_NOTIFICATIONS_RETRY_INTERVAL (default: 5m0s) | ||
The minimum time between retries. | ||
|
||
--notifications-store-sync-buffer-size int, $CODER_NOTIFICATIONS_STORE_SYNC_BUFFER_SIZE (default: 50) | ||
The notifications system buffers message updates in memory to ease | ||
pressure on the database. This option controls how many updates are | ||
kept in memory. The lower this value the lower the change of state | ||
inconsistency in a non-graceful shutdown - but it also increases load | ||
on the database. It is recommended to keep this option at its default | ||
value. | ||
|
||
--notifications-store-sync-interval duration, $CODER_NOTIFICATIONS_STORE_SYNC_INTERVAL (default: 2s) | ||
The notifications system buffers message updates in memory to ease | ||
pressure on the database. This option controls how often it | ||
synchronizes its state with the database. The shorter this value the | ||
lower the change of state inconsistency in a non-graceful shutdown - | ||
but it also increases load on the database. It is recommended to keep | ||
this option at its default value. | ||
|
||
--notifications-worker-count int, $CODER_NOTIFICATIONS_WORKER_COUNT (default: 2) | ||
How many workers should be processing messages in the queue; increase | ||
this count if notifications are not being processed fast enough. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like a lot of these will end up never being touched and could probably just be hard-coded to sensible defaults. However since you've already exposed the knobs, I'm happy to leave them Just In Case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think we should cut this down to size. Exposed knobs have a cost in terms of perceived complexity in our coder server -h
output and docs. There isn't a good reason for operators to be mucking around with things like worker counts, sync intervals, buffer sizes, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. It did feel a bit icky exposing all of these.
I mainly did this to document the various inputs (and encouraged operators to not change certain settings), but I'll move that documentation inline instead.
// display and delivery. | ||
type Labels map[string]string | ||
|
||
func (l Labels) GetStrict(k string) (string, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: GetOk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not loving either name, TBH.
Maybe TryGet
? GetSafe
? GetIfExists
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only reason I suggested GetOk
is due to having seen similar verbiage elsewhere: https://pkg.go.dev/github.com/hashicorp/terraform/helper/schema#ResourceData.GetOk
Feel free to ignore this, it's just a nit and non-blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, prior art does make me feel a bit better about GetOk
👍
Thanks
select { | ||
case ok := <-manager.enqueued: | ||
require.True(t, ok) | ||
case <-time.After(testutil.WaitShort): | ||
t.Fatalf("timed out") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: testutil.RequireRecvCtx
) | ||
|
||
// GoTemplate attempts to substitute the given payload into the given template using Go's templating syntax. | ||
// TODO: memoize templates for memory efficiency? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential follow-up: add a benchmark test for sending notifications
1c4046c
to
538bc42
Compare
"github.com/coder/coder/v2/coderd/database/dbauthz" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange how the formatter doesn't do this for us; will do 👍
--notifications-dispatch-timeout duration, $CODER_NOTIFICATIONS_DISPATCH_TIMEOUT (default: 1m0s) | ||
How long to wait while a notification is being sent before giving up. | ||
|
||
--notifications-fetch-interval duration, $CODER_NOTIFICATIONS_FETCH_INTERVAL (default: 15s) | ||
How often to query the database for queued notifications. | ||
|
||
--notifications-lease-count int, $CODER_NOTIFICATIONS_LEASE_COUNT (default: 10) | ||
How many notifications a notifier should lease per fetch interval. | ||
|
||
--notifications-lease-period duration, $CODER_NOTIFICATIONS_LEASE_PERIOD (default: 2m0s) | ||
How long a notifier should lease a message. This is effectively how | ||
long a notification is 'owned' by a notifier, and once this period | ||
expires it will be available for lease by another notifier. Leasing is | ||
important in order for multiple running notifiers to not pick the same | ||
messages to deliver concurrently. This lease period will only expire | ||
if a notifier shuts down ungracefully; a dispatch of the notification | ||
releases the lease. | ||
|
||
--notifications-max-send-attempts int, $CODER_NOTIFICATIONS_MAX_SEND_ATTEMPTS (default: 5) | ||
The upper limit of attempts to send a notification. | ||
|
||
--notifications-method string, $CODER_NOTIFICATIONS_METHOD (default: smtp) | ||
Which delivery method to use (available options: 'smtp', 'webhook'). | ||
|
||
--notifications-retry-interval duration, $CODER_NOTIFICATIONS_RETRY_INTERVAL (default: 5m0s) | ||
The minimum time between retries. | ||
|
||
--notifications-store-sync-buffer-size int, $CODER_NOTIFICATIONS_STORE_SYNC_BUFFER_SIZE (default: 50) | ||
The notifications system buffers message updates in memory to ease | ||
pressure on the database. This option controls how many updates are | ||
kept in memory. The lower this value the lower the change of state | ||
inconsistency in a non-graceful shutdown - but it also increases load | ||
on the database. It is recommended to keep this option at its default | ||
value. | ||
|
||
--notifications-store-sync-interval duration, $CODER_NOTIFICATIONS_STORE_SYNC_INTERVAL (default: 2s) | ||
The notifications system buffers message updates in memory to ease | ||
pressure on the database. This option controls how often it | ||
synchronizes its state with the database. The shorter this value the | ||
lower the change of state inconsistency in a non-graceful shutdown - | ||
but it also increases load on the database. It is recommended to keep | ||
this option at its default value. | ||
|
||
--notifications-worker-count int, $CODER_NOTIFICATIONS_WORKER_COUNT (default: 2) | ||
How many workers should be processing messages in the queue; increase | ||
this count if notifications are not being processed fast enough. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. It did feel a bit icky exposing all of these.
I mainly did this to document the various inputs (and encouraged operators to not change certain settings), but I'll move that documentation inline instead.
|
||
eg.Go(func() error { | ||
// Every interval, collect the messages in the channels and bulk update them in the database. | ||
tick := time.NewTicker(m.cfg.StoreSyncInterval.Value()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool idea, I'll look into it.
// display and delivery. | ||
type Labels map[string]string | ||
|
||
func (l Labels) GetStrict(k string) (string, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not loving either name, TBH.
Maybe TryGet
? GetSafe
? GetIfExists
?
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
…ightly Signed-off-by: Danny Kopping <danny@coder.com>
// TODO: implement auth | ||
// TODO: implement TLS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I was looking for the field to enter credentials
|
||
var d net.Dialer | ||
// Outer context has a deadline (see CODER_NOTIFICATIONS_DISPATCH_TIMEOUT). | ||
conn, err = d.DialContext(ctx, "tcp", fmt.Sprintf("%s:%s", smarthost, smarthostPort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it connect & disconnect everytime there is a notification to send?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does. This is how Alertmanager does it currently.
We could establish multiple connections and reuse them if SMTP supports it, which I suspect it would.
</div> | ||
<div style="text-align: center; padding: 10px 0; border-top: 1px solid #34495E; margin-top: 20px; color: #BDC3C7;"> | ||
<!-- TODO: dynamic copyright --> | ||
© 2024 Coder. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: dynamic year + version for bonus points ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd also probably need the custom "App Name" here when whitelabeled.
|
||
// Prepare request. | ||
// Outer context has a deadline (see CODER_NOTIFICATIONS_DISPATCH_TIMEOUT). | ||
req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewBuffer(m)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in theory HTTP method could be configurable too. Otherwise it is POST-JSON-dispatcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think POST
is the most idiomatic method to use for this.
Which other(s) would you expect folks to want to use?
import "github.com/google/uuid" | ||
|
||
// Workspaces. | ||
var TemplateWorkspaceDeleted = uuid.MustParse("'f517da0b-cdc9-410f-ab89-a86107c420ed'") // ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: interesting, why ' '
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect when I copied this from the db table it added the single quotes. Good catch 👍
@@ -2233,6 +2456,7 @@ const ( | |||
ExperimentAutoFillParameters Experiment = "auto-fill-parameters" // This should not be taken out of experiments until we have redesigned the feature. | |||
ExperimentMultiOrganization Experiment = "multi-organization" // Requires organization context for interactions, default org is assumed. | |||
ExperimentCustomRoles Experiment = "custom-roles" // Allows creating runtime custom roles | |||
ExperimentNotifications Experiment = "notifications" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: might be worth adding a comment
@@ -327,6 +327,68 @@ can safely ignore these settings. | |||
Minimum supported version of TLS. Accepted values are "tls10", | |||
"tls11", "tls12" or "tls13". | |||
|
|||
NOTIFICATIONS OPTIONS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking loud: maybe you can start with max 3 options and consider others to be hardcoded for now?
// The intermediary SMTP host through which emails are sent (host:port). | ||
Smarthost serpent.HostPort `json:"smarthost" typescript:",notnull"` | ||
// The hostname identifying the SMTP server. | ||
Hello serpent.String `json:"hello" typescript:",notnull"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello or EHLO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More human-readable, mimicks what Alertmanager does.
|
||
// Labels represents the metadata defined in a notification message, which will be used to augment the notification | ||
// display and delivery. | ||
type Labels map[string]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add extra unit tests for basic operations on Labels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea.
"golang.org/x/xerrors" | ||
) | ||
|
||
// TODO: this is (mostly) a copy of coderd/parameter/renderer.go; unify them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR we use it to render plaintext descriptions. As long as there aren't any hidden corner cases, you can unify them 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I changed was adding parser.HardLineBreak
to the HTML output.
Cool, I'll unify.
Signed-off-by: Danny Kopping <danny@coder.com>
284b781
to
9fd8527
Compare
538bc42
to
438bcba
Compare
notifierMu sync.Mutex | ||
|
||
handlers *HandlerRegistry | ||
helpers map[string]any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be template.FuncMap
. I realize the underlying type is map[string]any
but if you look at the definition of template.FuncMap
there is a comment explaining what it really has to be
// Manager maintains a group of notifiers: these consume the queue of notification messages in the store. | ||
// | ||
// Notifiers dequeue messages from the store _CODER_NOTIFICATIONS_LEASE_COUNT_ at a time and concurrently "dispatch" these messages, meaning they are | ||
// sent by their respective methods (email, webhook, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are concurrently dispatching the messages, what's the point of having multiple notifiers? Or are you saying each notifier dispatches sequentially but the notifiers are running concurrently?
"github.com/coder/coder/v2/coderd/database" | ||
) | ||
|
||
type HandlerRegistry struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels unnecessary. It's just a map
that returns an error
instead of a found bool
, and you wrap the error again anyway the one time you call Resolve()
.
Using a plain map[database.NotificationMethod]Handler
in the Manager
would be clearer.
) | ||
|
||
var ( | ||
singleton Enqueuer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very strongly against this Singleton pattern.
- It will make it a nightmare to test that other parts of the code send notifications when they should
- It obscures side effects when you look at the state and signatures of other objects.
// Enqueue queues a notification message for later delivery. | ||
// Messages will be dequeued by a notifier later and dispatched. | ||
func (m *Manager) Enqueue(ctx context.Context, userID, templateID uuid.UUID, labels types.Labels, createdBy string, targets ...uuid.UUID) (*uuid.UUID, error) { | ||
payload, err := m.buildPayload(ctx, userID, templateID, labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to compute the payload before the message goes into the database?
If you didn't need to compute the payload here, then you wouldn't need a Manager
to enqueue notifications, we could have a very lightweight "enqueuer" that just needs a database.Store
. I think that would alleviate some of the plumbing that motivated you to have a singleton.
But, even if we do need to compute the payload prior to enqueuing, I think Manager should be split into enqueueing and dispatching. No consumer of this package needs to do both things at the same time, and they don't share any state.
var eg errgroup.Group | ||
for i := 0; i < notifiers; i++ { | ||
eg.Go(func() error { | ||
m.notifierMu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so m.notifiers
gets used here and in Stop()
, so presumably we are worried about what happens if Stop()
is called concurrently with this.
First of all, Stop()
never takes this lock.
Secondly, Stop()
stops all the notifiers that have been added, but if its running concurrently with this, there is nothing that prevents more notifiers from getting added that never get stopped.
func newNotifier(ctx context.Context, cfg codersdk.NotificationsConfig, id int, log slog.Logger, db Store, hr *HandlerRegistry) *notifier { | ||
return ¬ifier{ | ||
id: id, | ||
ctx: ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context is set, but never read
// In the case of backpressure (i.e. the success/failure channels are full because the database is slow), | ||
// and this caused delivery timeout (CODER_NOTIFICATIONS_DISPATCH_TIMEOUT), we can't append any more updates to | ||
// the channels otherwise this, too, will block. | ||
if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how success/failure channels being full can cause a delivery timeout, given that you only attempt to write to the channel after calling deliver
return nil | ||
} | ||
|
||
var eg errgroup.Group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we go through a lot of effort piping these errors up from the delivery function, then collecting them in an error group, and the end result is that the call of this function just logs the error.
Instead, we should just log the errors down in the delivery functions and not return errors at all.
system.NotifyWorkspaceDeleted(ctx, workspace.OwnerID, workspace.Name, reason, "provisionerdserver", | ||
// Associate this notification with all the related entities. | ||
workspace.ID, workspace.OwnerID, workspace.TemplateID, workspace.OrganizationID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these ID's always passed with a workspace notification?
Would it make sense to have the argument be database.Workspace
and then the system
package know to pull out the various fields of a workspace? If there is a circular dependency, one thing we do with RBAC is define an interface that almost all database types implement in modelmethods
.
I have not read through all the code, this just feels susceptible to missing fields if we add a NotifyWorkspaceCreated
or other verbs.
NOTE: this is a stacked PR, please review #13536 as well.
This introduces a new subsystem into Coder to enable notifications (see milestone). In this PR, I have implemented a very thin vertical slice of this functionality:
This functionality is also hidden behind an experiment.
The main entities in this package are:
Notifier
is responsible for:notification_messages
entriesManager
of successful/failed deliveriesManager
is a controller which is responsible for:Notifier
s)Notifier
s' lifecycles (run/shutdown)The queue has been designed to be safe for concurrent use. This can either mean multiple
coderd
instances running side-by-side and/or eachcoderd
having multiple queue consumers.