Embeddings: allow max embedding counts to be configurable #51326

camdencheek · 2023-05-01T23:18:56Z

This adds the maxCodeEmbeddingsPerRepo and maxTextEmbeddingsPerRepo config options to our embeddings config. This will allow us to adjust the max to support larger monorepos without pushing a new patch release.

As part of this change, I also included a small refactor that puts all EmbedRepo options into a struct because the function args were getting a little overwhelming.
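For readers who have not opened the diff, the refactor can be pictured roughly as below. This is a minimal sketch, not the code from this PR: the field names, the `summarize` helper, and the example values are all hypothetical, and only the idea of bundling the `EmbedRepo` arguments (including the two new maxima) into one struct comes from the description above.

```go
package main

import "fmt"

// EmbedRepoOpts is a hypothetical stand-in for the options struct described
// in the PR. The field names are illustrative, not copied from the source.
type EmbedRepoOpts struct {
	RepoName          string
	Revision          string
	MaxCodeEmbeddings int // assumed to be fed by maxCodeEmbeddingsPerRepo
	MaxTextEmbeddings int // assumed to be fed by maxTextEmbeddingsPerRepo
}

// summarize is a hypothetical helper that takes the whole struct, so adding
// another option later does not change the function signature.
func summarize(opts EmbedRepoOpts) string {
	return fmt.Sprintf("%s@%s code<=%d text<=%d",
		opts.RepoName, opts.Revision, opts.MaxCodeEmbeddings, opts.MaxTextEmbeddings)
}

func main() {
	opts := EmbedRepoOpts{
		RepoName:          "example/monorepo",
		Revision:          "HEAD",
		MaxCodeEmbeddings: 5000000,
		MaxTextEmbeddings: 1000000,
	}
	fmt.Println(summarize(opts))
}
```

Named fields also make call sites self-describing, which matters once several arguments share the same type.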

Test plan

Added a test and manually tested that the stats we spit out reflect what I'd expect.

sourcegraph-bot · 2023-05-01T23:20:41Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 931d6ad...80f2576.

Notify	File(s)
@efritz	enterprise/cmd/worker/internal/embeddings/repo/handler.go

camdencheek · 2023-05-01T23:23:41Z

enterprise/internal/embeddings/embed/embed.go

@@ -158,7 +162,7 @@ func embedFiles(
 	)
 	for _, file := range files {
 		// This is a fail-safe measure to prevent producing an extremely large index for large repositories.
-		if len(index.RowMetadata) >= maxEmbeddingVectors {
+		if statsEmbeddedChunkCount >= maxEmbeddingVectors {


This is a small unrelated bug fix that I discovered while writing the test. Basically, we were using the index size to check whether we should stop, but the index isn't updated on every iteration because we batch.
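To make the batching issue concrete, here is a simplified, self-contained sketch. It is not the real `embedFiles` loop: it iterates over chunks rather than files, and `embedInBatches`, its parameters, and the slices standing in for `index.RowMetadata` and `statsEmbeddedChunkCount` are assumptions made for illustration. It compares the old check (index length) with the new one (running counter).

```go
package main

import "fmt"

// embedInBatches accepts chunks until the limit is reached and returns how
// many chunks ended up in the index. With useCounter=false it mimics the old
// check against the index length, which only grows when a batch is flushed.
func embedInBatches(totalChunks, batchSize, limit int, useCounter bool) int {
	var index []int // stands in for index.RowMetadata
	var batch []int // chunks waiting to be flushed into the index
	accepted := 0   // stands in for statsEmbeddedChunkCount

	flush := func() {
		index = append(index, batch...)
		batch = batch[:0]
	}

	for i := 0; i < totalChunks; i++ {
		seen := len(index) // old check: stale between flushes
		if useCounter {
			seen = accepted // new check: updated on every iteration
		}
		if seen >= limit {
			break
		}
		batch = append(batch, i)
		accepted++
		if len(batch) >= batchSize {
			flush()
		}
	}
	flush() // flush whatever is left in the final partial batch
	return len(index)
}

func main() {
	fmt.Println(embedInBatches(100, 8, 10, true))  // prints 10
	fmt.Println(embedInBatches(100, 8, 10, false)) // prints 16
}
```

With a batch size of 8 and a limit of 10, the index-length check only notices the limit after the second flush, so 16 chunks are indexed instead of 10.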


jtibshirani

Looks good!

jtibshirani · 2023-05-01T23:42:57Z

enterprise/internal/embeddings/embed/embed.go

@@ -158,7 +162,7 @@ func embedFiles(
 	)
 	for _, file := range files {
 		// This is a fail-safe measure to prevent producing an extremely large index for large repositories.
-		if len(index.RowMetadata) >= maxEmbeddingVectors {
+		if statsEmbeddedChunkCount >= maxEmbeddingVectors {


It's nice that making this configurable improved testability too.

jtibshirani · 2023-05-01T23:45:16Z

schema/site.schema.json

+        "maxCodeEmbeddingsPerRepo": {
+          "description": "The maximum number of embeddings for code files to generate per repo",
+          "type": "integer",
+          "minimum": 0


Tiny comment: it may be a little confusing that 0 is a legal value but actually means "use the default" (a large number). We could use a default value of -1 instead to avoid this.

camdencheek · 2023-05-02

Unfortunately, our JSONSchema generator doesn't actually support default values, so we have to rely on the Go zero-values, which means using 0 as the sentinel. It's worth looking into adding support to the generator, but probably not right now.
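The zero-value sentinel pattern discussed here can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the `embeddingsConfig` struct, the `effectiveMax` and `parseMax` helpers, and the default of 1,000,000 are hypothetical (the thread says only that the default is "a large number"). Only the `maxCodeEmbeddingsPerRepo` key and the "0 means default" behavior come from the discussion.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical default; the actual value is not stated in this thread.
const defaultMaxCodeEmbeddingsPerRepo = 1_000_000

// embeddingsConfig is a hypothetical slice of the generated config type.
// A missing key and an explicit 0 both decode to the Go zero value.
type embeddingsConfig struct {
	MaxCodeEmbeddingsPerRepo int `json:"maxCodeEmbeddingsPerRepo"`
}

// effectiveMax treats 0 as "not set" and substitutes the fallback.
// Negative values are assumed to be rejected earlier by the schema's
// "minimum": 0 constraint, so they are not handled here.
func effectiveMax(configured, fallback int) int {
	if configured == 0 {
		return fallback
	}
	return configured
}

// parseMax decodes a config fragment and resolves the effective limit.
func parseMax(raw string) (int, error) {
	var cfg embeddingsConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return 0, err
	}
	return effectiveMax(cfg.MaxCodeEmbeddingsPerRepo, defaultMaxCodeEmbeddingsPerRepo), nil
}

func main() {
	for _, raw := range []string{
		`{}`,
		`{"maxCodeEmbeddingsPerRepo": 0}`,
		`{"maxCodeEmbeddingsPerRepo": 5000000}`,
	} {
		limit, err := parseMax(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(limit)
	}
}
```

The first two inputs resolve to the same limit, which is the trade-off raised above: with a plain `int` field, an omitted key cannot be told apart from an explicit 0.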


camdencheek added 4 commits May 1, 2023 16:22

9821069 extract EmbedRepoOpts
d34a640 add max embeddings to opts
f27a54c move max embeddings into config
c3f4e30 add test

cla-bot bot added the cla-signed label May 1, 2023

80f2576 rename const to default


camdencheek added the backport 5.0 label May 1, 2023

camdencheek requested a review from jtibshirani May 1, 2023 23:27

jtibshirani approved these changes May 1, 2023


camdencheek merged commit 35ad0f9 into main May 2, 2023

camdencheek deleted the cc/configurable-max branch May 2, 2023 16:30

github-actions bot mentioned this pull request May 2, 2023

[Backport 5.0] Embeddings: allow max embedding counts to be configurable #51369

Merged

camdencheek added backported-to-5.0 and removed backport 5.0 labels May 16, 2023
