-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Embeddings: allow max embedding counts to be configurable #51326
Conversation
Codenotify: Notifying subscribers in CODENOTIFY files for diff 931d6ad...80f2576.
|
@@ -158,7 +162,7 @@ func embedFiles( | |||
) | |||
for _, file := range files { | |||
// This is a fail-safe measure to prevent producing an extremely large index for large repositories. | |||
if len(index.RowMetadata) >= maxEmbeddingVectors { | |||
if statsEmbeddedChunkCount >= maxEmbeddingVectors { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a small unrelated bug fix that I discovered while writing the test. Basically, we were using the index size to check whether we should stop, but the index isn't updated on every iteration because we batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that making this configurable improved testability too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
@@ -158,7 +162,7 @@ func embedFiles( | |||
) | |||
for _, file := range files { | |||
// This is a fail-safe measure to prevent producing an extremely large index for large repositories. | |||
if len(index.RowMetadata) >= maxEmbeddingVectors { | |||
if statsEmbeddedChunkCount >= maxEmbeddingVectors { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that making this configurable improved testability too.
"maxCodeEmbeddingsPerRepo": { | ||
"description": "The maximum number of embeddings for code files to generate per repo", | ||
"type": "integer", | ||
"minimum": 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny comment: maybe it's a little confusing that 0 is a legal value, but it means using the default (a large number)? We could use a default value of -1 instead to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, default values aren't actually supported with our JSONSchema generator, so we have to rely on the Go zero-values, which means using 0 as the sentinel. It's worth looking into adding support to the generator, but probably not right now.
This adds the `maxCodeEmbeddingsPerRepo` and `maxTextEmbeddingsPerRepo` config options to our embeddings config. This will allow us to adjust the max to support larger monorepos without pushing a new patch release. As part of this change, I also included a small refactor that puts all `EmbedRepo` options into a struct because the function args were getting a little overwhelming. (cherry picked from commit 35ad0f9)
This adds the
maxCodeEmbeddingsPerRepo
andmaxTextEmbeddingsPerRepo
config options to our embeddings config. This will allow us to adjust the max to support larger monorepos without pushing a new patch release.As part of this change, I also included a small refactor that puts all
EmbedRepo
options into a struct because the function args were getting a little overwhelming.Test plan
Added a test and manually tested that the stats we spit out reflect what I'd expect.