Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embeddings: allow max embedding counts to be configurable #51326

Merged
merged 5 commits into from
May 2, 2023

Conversation

camdencheek
Copy link
Member

@camdencheek camdencheek commented May 1, 2023

This adds the maxCodeEmbeddingsPerRepo and maxTextEmbeddingsPerRepo config options to our embeddings config. This will allow us to adjust the max to support larger monorepos without pushing a new patch release.

As part of this change, I also included a small refactor that puts all EmbedRepo options into a struct because the function args were getting a little overwhelming.

Test plan

Added a test and manually tested that the stats we spit out reflect what I'd expect.

@cla-bot cla-bot bot added the cla-signed label May 1, 2023
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented May 1, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 931d6ad...80f2576.

Notify File(s)
@efritz enterprise/cmd/worker/internal/embeddings/repo/handler.go

@@ -158,7 +162,7 @@ func embedFiles(
)
for _, file := range files {
// This is a fail-safe measure to prevent producing an extremely large index for large repositories.
if len(index.RowMetadata) >= maxEmbeddingVectors {
if statsEmbeddedChunkCount >= maxEmbeddingVectors {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small unrelated bug fix that I discovered while writing the test. Basically, we were using the index size to check whether we should stop, but the index isn't updated on every iteration because we batch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that making this configurable improved testability too.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@@ -158,7 +162,7 @@ func embedFiles(
)
for _, file := range files {
// This is a fail-safe measure to prevent producing an extremely large index for large repositories.
if len(index.RowMetadata) >= maxEmbeddingVectors {
if statsEmbeddedChunkCount >= maxEmbeddingVectors {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that making this configurable improved testability too.

"maxCodeEmbeddingsPerRepo": {
"description": "The maximum number of embeddings for code files to generate per repo",
"type": "integer",
"minimum": 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny comment: maybe it's a little confusing that 0 is a legal value, but it means using the default (a large number)? We could use a default value of -1 instead to avoid this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, default values aren't actually supported with our JSONSchema generator, so we have to rely on the Go zero-values, which means using 0 as the sentinel. It's worth looking into adding support to the generator, but probably not right now.

@camdencheek camdencheek merged commit 35ad0f9 into main May 2, 2023
@camdencheek camdencheek deleted the cc/configurable-max branch May 2, 2023 16:30
github-actions bot pushed a commit that referenced this pull request May 2, 2023
This adds the `maxCodeEmbeddingsPerRepo` and `maxTextEmbeddingsPerRepo`
config options to our embeddings config. This will allow us to adjust
the max to support larger monorepos without pushing a new patch release.

As part of this change, I also included a small refactor that puts all
`EmbedRepo` options into a struct because the function args were getting
a little overwhelming.

(cherry picked from commit 35ad0f9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants