
Question about revisions and auto-compaction in etcd3 store. #7472

Closed
mwf opened this issue Mar 10, 2017 · 10 comments

mwf commented Mar 10, 2017

Hi, guys! I've got several questions.

Are put and delete operations the only ones that increase the revision history?

And I'm wondering: is every revision a full copy of all the data in the etcd3 store, or does it contain incremental updates?


mwf commented Mar 10, 2017

And another question: how can I tell that auto-compaction is working?

I saw no log entries about auto-compaction, though there are messages when compacting manually.

store.index: compact 5491000
finished scheduled compaction at 5491000 (took 2m35.34360828s)

How does auto-compaction work? Does it take the latest revision and kill all the previous ones?
Does it defrag?

mwf changed the title from "Question about revisions in etcd3 store." to "Question about revisions and auto-compaction in etcd3 store." on Mar 10, 2017
heyitsanthony (Contributor) commented:

Are put and delete operations the only ones that increase the revision history?

Also txns if the txn completes a put/delete. Aside from that, the revision will also increase if a lease is revoked while it is attached to keys. The rule is that if the operation changes the key-value space, it'll increase the revision.

And I'm wondering: is every revision a full copy of all the data in the etcd3 store, or does it contain incremental updates?

The change represented by the revisions is incremental (e.g., events that would be sent over watches); internally, writing a new revision means writing the changes to the b+tree, keyed by the incremented revision. If a revision is given to Range, it's like a view of the keyspace at the point in time that revision was committed.
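For example, a minimal sketch (not from this issue; the endpoint, key, and values are made up) showing that each put bumps the revision and that an older revision can still be read as a point-in-time view:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// each successful Put increments the store-wide revision by one
	r1, err := cli.Put(ctx, "foo", "v1")
	if err != nil {
		log.Fatal(err)
	}
	r2, err := cli.Put(ctx, "foo", "v2")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(r1.Header.Revision, r2.Header.Revision) // second is one higher

	// reading with WithRev returns the keyspace as of that older revision,
	// so this still sees "v1" (until that revision is compacted away)
	old, err := cli.Get(ctx, "foo", clientv3.WithRev(r1.Header.Revision))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(old.Kvs[0].Value))
}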

And another question: how can I tell that auto-compaction is working?

There will be a log message. It's on an hourly basis.

How does auto-compaction work? Does it take the latest revision and kill all the previous ones?

It compacts starting from the revision of the last auto-compaction interval. By default, this is the revision from an hour in the past.

Does it defrag?

No. Defrag is too expensive. Usually there's no need to defrag the db; new writes will reclaim the pages freed by compaction.
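For reference, auto-compaction is enabled on the server with the --auto-compaction-retention flag (in hours). The sketch below (untested; the endpoint and key name are made up) shows the equivalent manual steps from the client:

// assumes the usual imports: context and github.com/coreos/etcd/clientv3
func compactAndDefrag(ctx context.Context, cli *clientv3.Client, endpoint string) error {
	// any read carries the current store revision in its response header
	resp, err := cli.Get(ctx, "some-key")
	if err != nil {
		return err
	}
	// drop all key history older than the current revision
	if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
		return err
	}
	// defrag is a separate, expensive step; as noted above it is usually
	// unnecessary because new writes reuse the pages freed by compaction
	_, err = cli.Defragment(ctx, endpoint)
	return err
}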


mwf commented Mar 11, 2017

Thanks for the reply!

There will be a log message. It's on an hourly basis.
It compacts starting from the revision of the last auto-compaction interval. By default, this is the revision from an hour in the past.

Strange, we did not notice any log message for that on our production servers :( I'll try to check it locally myself.

Actually, we got some strange problems in production: we occasionally hit the 2 GB storage limit.
+----------------+-----------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT    |       ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+-----------------+---------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | 8717efedd865ff3 | 3.0.15  | 2.1 GB  | false     | 1841      | 1143730396 |
+----------------+-----------------+---------+---------+-----------+-----------+------------+
With ~7 million revisions of the DB.
The funniest thing is, I can't understand how it could possibly happen.
First I thought some guys from our company had tried to use it like etcd2, making PUTs from time to time as keepalives.
But no, the core reason is "flapping" values. I saw ~20 revisions per second, creating (with a lease) and revoking the same values O_O

Unfortunately, the guys who used my library turned off logging, so I can't tell exactly what was going on. Facepalm here :)

And the strangest thing: we have 6 DCs, and the problem occurred in 4 of them. With the same codebase, of course! o_O In the other 2 DCs the storage has been stable, ~2.5 MB with ~10K revisions for almost 2 months.
Maybe there are some issues with cluster stability or something like that, because I know there can be problems when some endpoints are unavailable.

I would appreciate it very much if you could have a look at this piece of code and tell me if I'm doing anything wrong.

// RegisterValues validates registration data, runs keepalive loop and blocks until context is finished.
func (p *provider) RegisterValues(ctx context.Context, kvs ...discovery.KV) error {
	if err := validateRegistrationData(kvs...); err != nil {
		return err
	}

	defer p.logger.Warningf("register loop exit")
	for {
		minInterval := time.After(registerRetryInterval)

		leaseID, err := p.doRegister(ctx, kvs...)
		select {
		case <-ctx.Done():
			p.revoke(leaseID)
			return nil
		default:
		}
		p.logger.Warningf("could not register: %s", err)

		<-minInterval
	}
}

// doRegister runs all put operations with Lease TTL and waits on lease keepalive channel
func (p *provider) doRegister(ctx context.Context, kvs ...discovery.KV) (clientv3.LeaseID, error) {
	resp, err := p.leaser.Grant(ctx, int64(defaultKeyTTL.Seconds()))
	if err != nil {
		return 0, fmt.Errorf("lease grant error: %s", err)
	}
	leaseID := resp.ID

	ops := p.prepareOps(leaseID, kvs...)
	if _, err := p.kvAPI.Txn(ctx).Then(ops...).Commit(); err != nil {
		return 0, fmt.Errorf("lease %d transaction error: %s", leaseID, err)
	}

	keepAlives, err := p.leaser.KeepAlive(ctx, leaseID)
	if err != nil {
		// almost unreal situation, but better to handle it - revoke the lease
		go p.revoke(leaseID)
		return 0, fmt.Errorf("lease %d keep alive error: %s", leaseID, err)
	}

	// wait on keepAlives channel
	for range keepAlives {
	}

	// it's not really an error if ctx is canceled - this situation should be handled in the caller
	return leaseID, fmt.Errorf("keepalive timeout, TTL reached")
}

// revoke revokes the lease, all the keys are deleted at the same moment
func (p *provider) revoke(leaseID clientv3.LeaseID) {
	if _, err := p.leaser.Revoke(context.Background(), leaseID); err != nil {
		p.logger.Warningf("can't revoke leaseID: %d", leaseID)
	}
}


heyitsanthony commented Mar 11, 2017

@mwf What version of etcd? There were some problems with the lease client receiving unknown server errors. If KeepAlive receives an unknown error code, it will treat that as fatal (it expects Unavailable or Internal in order to retry) and halt. If there's an error from KeepAlive it's because it's halted and the lease client will continually return that error.

If that's what's happening here, that code will call Txn, see err != nil on KeepAlive, then retry on every call to doRegister; RegisterValues should probably return an error if KeepAlive returns an error.
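Roughly something like this (an untested sketch against the snippet above; errKeepAliveFailed is a made-up sentinel that doRegister would return, unwrapped, from its KeepAlive error branch):

// assumes the "errors" import in addition to what the snippet already uses
var errKeepAliveFailed = errors.New("lease keepalive halted")

func (p *provider) RegisterValues(ctx context.Context, kvs ...discovery.KV) error {
	if err := validateRegistrationData(kvs...); err != nil {
		return err
	}

	defer p.logger.Warningf("register loop exit")
	for {
		minInterval := time.After(registerRetryInterval)

		leaseID, err := p.doRegister(ctx, kvs...)
		select {
		case <-ctx.Done():
			p.revoke(leaseID)
			return nil
		default:
		}

		// the lease client's keepalive loop is dead: stop retrying, since
		// every retry just writes keys under a lease nobody will refresh
		if err == errKeepAliveFailed {
			return err
		}
		p.logger.Warningf("could not register: %s", err)

		<-minInterval
	}
}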


mwf commented Mar 11, 2017

@heyitsanthony The etcd server is 3.0.15 and the client lib is 3.1.0 (maybe some services have a 3.0.x client, but I doubt it).

Omg, it looks like we got exactly this error. It makes sense then: the Txn was OK and the values were written, then KeepAlive failed and the lease was revoked.

So, why did you add such behaviour?
When I developed this code, there were no such situations with the 3.0.x client and KeepAlive never returned errors... An unrecoverable error is a very strange pattern!
It's totally unclear that an error from KeepAlive can ruin the whole Lease client o___O What's the problem with recovering the leaser?

If it's possible for the Lease client to ruin its internal state and stop functioning, it should be well documented and added to the release notes!

And if nothing can be done and the lease client can possibly crash, should we create a new lease client from the etcd client? Should ErrKeepAliveHalted be checked in that case, and should we call Close() after that?

Unfortunately I can't return the error from RegisterValues if KeepAlive fails; the idea is that all errors are handled inside the retry loop, so we ensure the values will be set once the problems are gone.


heyitsanthony commented Mar 11, 2017

When I developed this code, there were no such situations with the 3.0.x client and KeepAlive never returned errors... An unrecoverable error is a very strange pattern!

This was present in 3.0.x; it has always worked like that since the beginning (hence no release note). It's just that the error was very rare. If it's still happening in 3.1.0, it's still getting some error that isn't being translated correctly. It would be helpful to know what error it's receiving.

The reasoning is that if it gets an unknown error from the server then there's no way to reason about what actually went wrong, so it's treated as fatal.

This could potentially be revisited in the future by extending the client lease keepalive response to include an error and just totally remove returning an error from the KeepAlive() call itself. This would be closer to how Watch acts and is probably the way it should have been done in the first place.

And if nothing can be done and the lease client can possibly crash, should we create a new lease client from the etcd client? Should ErrKeepAliveHalted be checked in that case, and should we call Close() after that?

That would be the workaround, yes.
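For example (an untested sketch; the helper name is made up, the rest is the clientv3 API):

// assumes the usual imports: context and github.com/coreos/etcd/clientv3
func keepAliveWithRebuild(ctx context.Context, cli *clientv3.Client, lease clientv3.Lease, id clientv3.LeaseID) (clientv3.Lease, <-chan *clientv3.LeaseKeepAliveResponse, error) {
	ch, err := lease.KeepAlive(ctx, id)
	if _, halted := err.(clientv3.ErrKeepAliveHalted); halted {
		// the internal keepalive loop died on an unexpected error and this
		// Lease instance won't recover, so close it and build a fresh one
		lease.Close()
		lease = clientv3.NewLease(cli)
		ch, err = lease.KeepAlive(ctx, id)
	}
	return lease, ch, err
}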


mwf commented Mar 12, 2017

This was present in 3.0.x; it has always worked like that since the beginning (hence no release note). It's just that the error was very rare.

I'm talking about returning the error.
https://github.com/coreos/etcd/blob/v3.0.17/clientv3/lease.go#L167
3.0.x doesn't return ErrKeepAliveHalted.
It was introduced in 3.1.0 in #6922.

You fixed the keep-alive freeze by introducing a new error and assuming the user should handle it. It's new behaviour, and there's not a word about it in the release notes. And it really looks like a dirty workaround :) Keeping a failed state inside is bad practice.

And it actually turns out that Grant works OK and the lease is granted, but KeepAlive fails just because of some corrupted state o__O And KeepAliveOnce() keeps working (https://godoc.org/github.com/coreos/etcd/clientv3#ErrKeepAliveHalted), though the main Lease doc (https://godoc.org/github.com/coreos/etcd/clientv3#Lease) says that "In most of the cases, Keepalive should be used instead of KeepAliveOnce".

And the best part :) There's no example of handling ErrKeepAliveHalted in the KeepAlive example o___O https://godoc.org/github.com/coreos/etcd/clientv3#Lease
How could users possibly know that it should be handled? :)

So my point is: maybe the fatal failure of the internal loop is not new behaviour, but the error handling is.
And BTW, it actually means that the Client's embedded Lease may also fail unrecoverably! So it can't be used in real life.

And now I don't know what the best workaround is: to handle ErrKeepAliveHalted and recreate the Lease client, or just to use pure KeepAliveOnce() in a self-made keep-alive loop.
What are the pros and cons? Manual KeepAliveOnce may add some overhead if done carelessly. Does recreating the Lease client add any significant overhead?
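By a self-made keep-alive loop I mean roughly this (untested sketch; the interval choice of TTL/3 is arbitrary):

// assumes the usual imports: context, time and github.com/coreos/etcd/clientv3
func keepAliveLoop(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID, ttl time.Duration) error {
	ticker := time.NewTicker(ttl / 3)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// refresh the lease manually; if the lease has already expired
			// server-side, this returns an error and we give up
			if _, err := lease.KeepAliveOnce(ctx, id); err != nil {
				return err
			}
		}
	}
}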

If it's still happening in 3.1.0, it's still getting some error that isn't being translated correctly. It would be helpful to know what error it's receiving.

I'll be glad to help you! If I find one, I'll report it immediately :)

This could potentially be revisited in the future by extending the client lease keepalive response to include an error and just totally remove returning an error from the KeepAlive() call itself. This would be closer to how Watch acts and is probably the way it should have been done in the first place.

Looking forward to seeing it in the future! It's a better option, and it won't require making the lease loop fatal, so the lease client can continue to operate.

So if you need any help with the documentation or examples, tell me, I can add them.


mwf commented Mar 12, 2017

BTW, what do you think: maybe it's some problem between the 3.0.15 server and the 3.1.0 client?
For example, have you dropped any errors in the 3.1 release that supposedly never occur, but that 3.0.15 may still send?

heyitsanthony (Contributor) commented:

And now I don't know what the best workaround is: to handle ErrKeepAliveHalted and recreate the Lease client, or just to use pure KeepAliveOnce() in a self-made keep-alive loop. What are the pros and cons? Manual KeepAliveOnce may add some overhead if done carelessly. Does recreating the Lease client add any significant overhead?

Tear down the lease client and create a new one like I already suggested.

For example, have you dropped any errors in the 3.1 release that supposedly never occur, but that 3.0.15 may still send?

As far as I can tell from the git logs, no errors were removed, but the invalid auth token error changed from "etcdserver: invalid auth token" to "auth: invalid auth token".


mwf commented Mar 13, 2017

Thanks, Anthony!

I'll let you know if I find the unexpected error.
