
double hashing implementation #1

Open · wants to merge 23 commits into master
Conversation

noot

@noot noot commented Aug 30, 2022

  • calculate and use sha256 hash of content multihash before advertising/discovering
  • run golangci-lint and fix lint errors

@noot noot marked this pull request as ready for review September 1, 2022 17:04
@willscott

Cool, this is a good first chunk.

  • Deciding on a migration story is going to be important in figuring out what additional complexity is needed here
  • The stored PeerID should be encrypted by the MH in the dht record


@guillaumemichel guillaumemichel left a comment


Great progress! This looks all good to me!

Concerning the isHashed variable, I guess that setting it to true or false will depend on the migration that we choose to go with.

Did you already run some test cases with isHashed = true in lookup.go, or any test case that 1. provides some content to the DHT using double hashing and 2. retrieves this content (from another peer)?

lookup.go Outdated
// TODO: I can break the interface! return []peer.ID
lookupRes, err := dht.runLookupWithFollowup(ctx, key,

const isHashed = false


How can the host determine whether the requested key is hashed or not?

Author


the recipient of the query will receive Message_FIND_NODE, Message_GET_VALUE, or Message_GET_PROVIDERS (see handlers.go). For the first two, the key is not hashed, but for Message_GET_PROVIDERS the key will be hashed.
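The dispatch described above can be sketched as a small switch on the message type. This is a minimal illustration only; the type and function names here (MessageType, keyIsHashed) are hypothetical stand-ins for the real pb message types in handlers.go.

```go
package main

import "fmt"

// MessageType stands in for the DHT protobuf message types
// mentioned above (simplified names, not the real pb constants).
type MessageType int

const (
	FindNode MessageType = iota
	GetValue
	GetProviders
)

// keyIsHashed sketches how a handler could decide whether the
// incoming key is double-hashed: in this design only the
// provider path (GetProviders) carries hashed keys.
func keyIsHashed(t MessageType) bool {
	return t == GetProviders
}

func main() {
	fmt.Println(keyIsHashed(FindNode))     // false
	fmt.Println(keyIsHashed(GetProviders)) // true
}
```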

query.go Outdated
q := &query{
id: uuid.New(),
key: target,
key: target, // TODO change this if hashed?


I think this doesn't need to be changed: if isHashed, target should already be the hash

routing.go Outdated
@@ -288,7 +288,8 @@ func (dht *IpfsDHT) getValues(ctx context.Context, key string, stopQuery chan st
go func() {
defer close(valCh)
defer close(lookupResCh)
lookupRes, err := dht.runLookupWithFollowup(ctx, key,
const isHashed = false


Why is isHashed set to false here?

Author


we decided on hashing only the key for the AddProviders/GetProviders path, as it's what IPFS uses specifically to find content providers. Get/Put keys are only used for IPNS as far as I know

@noot
Author

noot commented Sep 29, 2022

@guillaumemichel yes, the existing unit tests that use Provide and GetProviders... should cover the double hashing case, for example TestProvides and TestProvidesMany.

@@ -777,10 +777,14 @@ func (dht *FullRT) Provide(ctx context.Context, key cid.Cid, brdcst bool) (err e
return fmt.Errorf("invalid cid: undefined")
}
keyMH := key.Hash()
logger.Debugw("providing", "cid", key, "mh", internal.LoggableProviderRecordBytes(keyMH))
mhHash := internal.Sha256Multihash(keyMH)


do you think it might be a good idea to add a salt for this hash? that way there's another layer of protection that could be used by the provider?
but I can see the other side where, if we want to make this easily reproducible, it would be better to leave out the additional step (-:


If required, we could add a constant salt (e.g. to specify the purpose of the hash).

However, the secret to access the content is already the CID; adding a random salt means that knowledge of both the CID and the salt is required to find the content. If we add a random salt here, it would have to be shared along with the CID, so it wouldn't add much security (in other words, it would only make the secret longer).


We should consider adding a constant salt (or prefix) to the spec and here - so that the derived hashes are in a known specific domain that doesn't easily have collisions with other things.

lookup.go Outdated
// TODO: I can break the interface! return []peer.ID
lookupRes, err := dht.runLookupWithFollowup(ctx, key,

const isHashed = false


will we add a check to see if it's hashed later on?

Author


yeah, it's checked in query.go line 152

Author


actually, this got removed since the double hash is now a multihash.Multihash

func(ctx context.Context, p peer.ID) ([]*peer.AddrInfo, error) {
// For DHT query command
routing.PublishQueryEvent(ctx, &routing.QueryEvent{
Type: routing.SendingQuery,
ID: p,
})

provs, closest, err := dht.protoMessenger.GetProviders(ctx, p, key)
provs, closest, err := dht.protoMessenger.GetProviders(ctx, p, multihash.Multihash(mhHash[:])) // TODO: this isn't a multihash anymore, change GetProvider param type instead?


is it worth having a multihash code for these? doublehashed-sha256 or similar and encode the derived hash as a multihash?

Author


yep, updated to use DBL_SHA2_256

type Hash [32]byte

func Sha256Multihash(mh multihash.Multihash) Hash {
	return sha256.Sum256(mh)
}


This should be a derived hash, so e.g. add a custom prefix like sha256.Sum256(append([]byte("CR_DOUBLEHASH"), mh...))

Author


updated!



func Sha256Multihash(mh multihash.Multihash) multihash.Multihash {
prefix := []byte("CR_DOUBLEHASH")
mh, err := multihash.Sum(append(prefix, mh...), multihash.DBL_SHA2_256, 30)


is there a reason we're saying this has a length of 30?

Author


yeah, since the multihash automatically prepends the code and length of the data, the returned multihash is of length 32. If I set the length to 32 (which I initially tried), the returned multihash is of length 34, which then causes a panic in routingTable.NearestPeers when it calls CommonPrefixLen to compare the multihash with the local peer ID, since the lengths are mismatched.

the other workaround I thought of is slicing off the first 2 bytes of the multihash / only using the digest, but I'm not sure if that would have unintended consequences. any thoughts on which method would be better?


func Encode(buf []byte, code uint64) ([]byte, error) {
	// FUTURE: this function always causes heap allocs... but when used, this value is almost always going to be appended to another buffer (either as part of CID creation, or etc) -- should this whole function be rethought and alternatives offered?
	newBuf := make([]byte, varint.UvarintSize(code)+varint.UvarintSize(uint64(len(buf)))+len(buf))
	n := varint.PutUvarint(newBuf, code)
	n += varint.PutUvarint(newBuf[n:], uint64(len(buf)))

	copy(newBuf[n:], buf)
	return newBuf, nil
}

The Encode() function basically takes the digest and appends it to the code, hence returning something of the format: [code, digest]. The code and length prefix together are represented by 2 bytes.

We don't want the digest to start with a constant, as this digest determines the location of the provider records in the DHT. Having a common prefix (code) on all digests would mean that all Double Hashed Provider Records are allocated to a restricted region of the keyspace. We don't need to have the code in the output. We could update the code as follows:

mh, err := multihash.Sum(append(prefix, mh...), multihash.DBL_SHA2_256, KEYSIZE)
if err != nil {
	// this shouldn't ever happen
	panic(err)
}
// don't return the code prefix, only return the last KEYSIZE bytes
return mh[len(mh)-KEYSIZE:]

KEYSIZE should be a constant value defining the DHT key size. Ideally this constant already exists somewhere in the DHT implementation, or it is passed down by kubo. Its value should be 32 for now, but if the DHT key size were to change, this piece of code shouldn't need to be updated; updating the constant should be enough.

Author

@noot noot Oct 19, 2022


@guillaumemichel that works, but the issue is that the function now returns a multihash.Multihash that isn't actually a multihash (since the code is gone). The purpose of using multihash.Sum, from what I remember, was that we could mark the hash as type DBL_SHA2_256, whereas previously it just used sha256 (which is effectively the same as doing multihash.Sum and taking the last 32 bytes). Should I just change the function back to a sha256 sum? I agree we shouldn't have all the hashes share the same first 2 bytes.


Yes, it should be fine to change the function back to sha256
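The agreed approach above (plain SHA-256 over a constant domain-separation prefix plus the raw multihash bytes) could look like the following sketch; the function name and the []byte parameter are simplifications of the multihash.Multihash-based signature in the PR.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// sha256Multihash sketches the final derivation discussed above:
// SHA-256 over a constant prefix plus the raw multihash bytes,
// yielding a 32-byte DHT key with no constant code prefix.
func sha256Multihash(mh []byte) [32]byte {
	prefix := []byte("CR_DOUBLEHASH")
	return sha256.Sum256(append(prefix, mh...))
}

func main() {
	key := sha256Multihash([]byte("example-multihash-bytes"))
	fmt.Println(len(key)) // 32
}
```

Because the prefix is folded into the hash input rather than prepended to the output, the resulting keys remain uniformly distributed over the keyspace.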

fullrt/dht.go Outdated

// add self locally
dht.ProviderManager.AddProvider(ctx, keyMH, peer.AddrInfo{ID: dht.h.ID()})
err = dht.ProviderManager.AddProvider(ctx, keyMH, peer.AddrInfo{ID: dht.h.ID()})


Shouldn't it be mhHash instead of keyMH?

Author


yep, updated!

@guillaumemichel

@noot we can now probably get rid of this abomination in go-libp2p-kbucket. IIUC it is the result of a preimage brute force to find a key that would fit every single small bucket.

It isn't required anymore with double hashing, as one doesn't request the preimage of a kad key but the kad key itself. We can simply forge random keys that fit inside the bucket instead of taking the ones resulting from the brute force.
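Forging a random key for a given bucket amounts to generating random bytes and forcing the first n bits to match the bucket's prefix. This is a hypothetical sketch (randomKeyWithPrefix is not an existing go-libp2p-kbucket function), assuming 32-byte kad keys:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// randomKeyWithPrefix forges a random 32-byte kad key whose first
// nbits match the given prefix key, as one could do to generate a
// lookup target inside a specific bucket without any brute force.
func randomKeyWithPrefix(prefix [32]byte, nbits int) [32]byte {
	var key [32]byte
	rand.Read(key[:]) // fill with random bytes

	// Copy the leading nbits from prefix into key, bit by bit.
	for i := 0; i < nbits; i++ {
		byteIdx, bitIdx := i/8, 7-uint(i%8)
		mask := byte(1) << bitIdx
		key[byteIdx] = (key[byteIdx] &^ mask) | (prefix[byteIdx] & mask)
	}
	return key
}

func main() {
	var local [32]byte // e.g. the local peer's kad key (zeros here)
	k := randomKeyWithPrefix(local, 12)
	// The first 12 bits of k now match `local`; the rest is random.
	fmt.Printf("%08b %08b\n", k[0], k[1])
}
```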
