Question about revisions and auto-compaction in etcd3 store. #7472
And another question: how could I know that auto-compaction works? I saw no log entries about auto-compaction, though there are messages when compacting manually.
How does auto-compaction work? Does it take the latest revision and kill all previous ones?
Also txns, if the txn completes a put/delete. Aside from that, the revision will also increase if a lease is revoked while it is attached to keys. The rule is that if the operation changes the key-value space, it'll increase the revision.
The change represented by the revisions is incremental (e.g., events that would be sent over watches); internally, writing a new revision means writing the changes to the b+tree, keyed by the incremented revision. If a revision is given to…
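To make the revision semantics concrete, here is a minimal sketch, assuming a reachable endpoint at localhost:2379 and the coreos-era import path: puts bump the store revision, and an old revision stays readable until it is compacted away.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Every put changes the key-value space, so the store revision increases.
	put1, err := cli.Put(ctx, "foo", "bar")
	if err != nil {
		log.Fatal(err)
	}
	put2, err := cli.Put(ctx, "foo", "baz")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(put1.Header.Revision, put2.Header.Revision) // e.g. n and n+1

	// Reads don't change the key-value space; an older revision remains
	// readable until a compaction removes it.
	old, err := cli.Get(ctx, "foo", clientv3.WithRev(put1.Header.Revision))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(old.Kvs[0].Value)) // "bar": the delta was stored, not a full copy
}
```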
There will be a log message. It's on an hourly basis.
It compacts starting from the revision of the last auto-compaction interval. By default, this is the revision from an hour in the past.
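In client terms, that hourly pass amounts to roughly the following sketch; `compactLikeAutoCompaction` and `lastHourRev` are hypothetical names, and the server does the equivalent itself when started with the `--auto-compaction-retention=1` flag.

```go
import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// compactLikeAutoCompaction sketches what the hourly auto-compaction pass
// amounts to: compact the keyspace up to a revision observed one interval ago.
// lastHourRev is assumed to have been recorded from any response header an
// hour earlier.
func compactLikeAutoCompaction(cli *clientv3.Client, lastHourRev int64) error {
	_, err := cli.Compact(context.Background(), lastHourRev)
	return err
}
```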
No. Defrag is too expensive. Usually there's no need to defrag the db; new writes will reclaim the pages freed by compaction.
Thanks for the reply!
Actually, we got some strange problems in production - we occasionally hit the 2 GB storage limit. Unfortunately, the guys who used my library turned off logging, so I can't tell exactly what was going on. Facepalm here :) And the strangest thing: we have 6 DCs, and the problem occurred in 4 of them. With the same codebase, of course! o_O So in 2 DCs the storage is stable, ~2.5 MB with ~10K revisions for almost 2 months. I would appreciate it very much if you could have a look at this piece of code and tell me if I'm doing anything wrong.

```go
// RegisterValues validates registration data, runs the keepalive loop and blocks until the context is finished.
func (p *provider) RegisterValues(ctx context.Context, kvs ...discovery.KV) error {
	if err := validateRegistrationData(kvs...); err != nil {
		return err
	}
	defer p.logger.Warningf("register loop exit")
	for {
		minInterval := time.After(registerRetryInterval)
		leaseID, err := p.doRegister(ctx, kvs...)
		select {
		case <-ctx.Done():
			p.revoke(leaseID)
			return nil
		default:
		}
		p.logger.Warningf("could not register: %s", err)
		<-minInterval
	}
}

// doRegister runs all put operations with the lease TTL and waits on the lease keepalive channel
func (p *provider) doRegister(ctx context.Context, kvs ...discovery.KV) (clientv3.LeaseID, error) {
	resp, err := p.leaser.Grant(ctx, int64(defaultKeyTTL.Seconds()))
	if err != nil {
		return 0, fmt.Errorf("lease grant error: %s", err)
	}
	leaseID := resp.ID
	ops := p.prepareOps(leaseID, kvs...)
	if _, err := p.kvAPI.Txn(ctx).Then(ops...).Commit(); err != nil {
		return 0, fmt.Errorf("lease %d transaction error: %s", leaseID, err)
	}
	keepAlives, err := p.leaser.KeepAlive(ctx, leaseID)
	if err != nil {
		// a nearly impossible situation, but better to handle it - revoke the lease
		go p.revoke(leaseID)
		return 0, fmt.Errorf("lease %d keep alive error: %s", leaseID, err)
	}
	// wait on the keepAlives channel
	for range keepAlives {
	}
	// it's not really an error if ctx is canceled - this situation should be handled in the caller
	return leaseID, fmt.Errorf("keepalive timeout, TTL reached")
}

// revoke revokes the lease; all the keys are deleted at the same moment
func (p *provider) revoke(leaseID clientv3.LeaseID) {
	if _, err := p.leaser.Revoke(context.Background(), leaseID); err != nil {
		p.logger.Warningf("can't revoke leaseID: %d", leaseID)
	}
}
```
@mwf What version of etcd? There were some problems with the lease client receiving unknown server errors. If that's what's happening here, that code will call…
@heyitsanthony The etcd server is 3.0.15 and the client lib is 3.1.0 (maybe some services have a 3.0.x client, but I doubt that). Omg, it looks like we got exactly this error. It makes sense then: the Txn was OK, the values were written, then KeepAlive failed and the lease was revoked.
So why have you added such behaviour? If it's expected that the Lease client can ruin its internal state and stop functioning, that should be well documented and added to the release notes! And if nothing can be done and the lease client can possibly crash - should we create a new lease client from the etcd client? Should ErrKeepAliveHalted be checked for that case, and should we call Close() after that?
Unfortunately, I can't return an error from RegisterValues if KeepAlive fails - it's meant that all errors are handled inside the retry loop, so we ensure the values will be set once the problems are gone.
This was present in 3.0.x; it has always worked like that since the beginning (hence no release note). It's just that the error was very rare. If it's still happening in 3.1.0, it's still getting some error that's not being translated correctly. It would be helpful to know what error it's receiving. The reasoning is that if it gets an unknown error from the server, then there's no way to reason about what actually went wrong, so it's treated as fatal. This could potentially be revisited in the future by extending the client lease keepalive response to include an error and just totally removing the error return from the…
That would be the workaround, yes.
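A rough sketch of that workaround, under the assumption that the application can re-grant the lease and re-put its keys; `reRegister` and the `register` callback are hypothetical glue, not part of the etcd API.

```go
import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// reRegister sketches the workaround: when the keepalive channel closes
// because the lease client's internal loop halted, discard that lease client
// and build a fresh one with clientv3.NewLease. register is a hypothetical
// callback that re-grants the lease and re-puts the keys.
func reRegister(ctx context.Context, cli *clientv3.Client,
	register func(clientv3.Lease) (clientv3.LeaseID, <-chan *clientv3.LeaseKeepAliveResponse, error)) {
	for ctx.Err() == nil {
		leaser := clientv3.NewLease(cli)
		if _, ch, err := register(leaser); err == nil {
			for range ch {
				// healthy: consume keepalive responses until the channel closes
			}
		}
		// The halted lease client keeps its failed state; close it and retry
		// with a brand-new one instead of reusing it.
		leaser.Close()
		time.Sleep(time.Second) // back off before re-registering
	}
}
```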
I'm talking about returning the error. You fixed the keepalive freeze by introducing the new error and assuming the user should handle it. It's new behaviour, and not a word in the release notes. And it really looks like a dirty workaround :) Keeping a failed state inside is bad practice. And it actually comes out that Grant works OK, the lease is given, but KeepAlive fails just because of some corrupted state o__O
And KeepAliveOnce() keeps working (https://godoc.org/github.com/coreos/etcd/clientv3#ErrKeepAliveHalted), though the main Lease doc https://godoc.org/github.com/coreos/etcd/clientv3#Lease says that…
And the best part :) There is no example of handling ErrKeepAliveHalted in the KeepAlive example o___O https://godoc.org/github.com/coreos/etcd/clientv3#Lease
So my point is: maybe the internal loop's fatal failure is not new behaviour, but the error handling is. And now I don't know what the best workaround is - to handle ErrKeepAliveHalted and recreate the Lease client, or just to use pure KeepAliveOnce() in a self-made keepalive loop, like the sketch below.
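For comparison, the self-made loop mentioned above could look roughly like this; the 3-second refresh interval is an illustrative assumption and must stay well under the lease TTL.

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepAliveLoop is the self-made alternative: refresh the lease with
// KeepAliveOnce, which keeps working even after the background keepalive
// loop has halted.
func keepAliveLoop(ctx context.Context, leaser clientv3.Lease, id clientv3.LeaseID) error {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if _, err := leaser.KeepAliveOnce(ctx, id); err != nil {
				return fmt.Errorf("keepalive once failed for lease %d: %s", id, err)
			}
		}
	}
}
```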
I'll be glad to help you! If I find one, I'll report it immediately :)
Looking forward to seeing it in the future! It's the better option, and it won't require fataling the lease loop, so the lease client can continue to operate. If you need any help with the documentation or examples - tell me, I can add them.
BTW, what do you think - maybe it's some problem between the 3.0.15 server and the 3.1.0 client?
Tear down the lease client and create a new one like I already suggested.
As far as I can tell from the git logs, no errors were removed, but…
Thanks, Anthony! I'll let you know if I find the unexpected error.
Hi, guys! I've got several questions.
Are put and delete operations the only ones that increase the revision history?
And I'm wondering: is every revision a full copy of all the data in the etcd3 store, or does it contain incremental updates?