Use TopK node for KNN #1324

Closed
wjones127 opened this issue Sep 27, 2023 · 1 comment · Fixed by #2535
Labels: arrow (Apache Arrow related issues), enhancement (New feature or request), performance, priority: high (Issues that are high priority (for LanceDb, the organization))

Comments

@wjones127
Contributor

Right now, to perform KNN, we compute the top k for each batch, concatenate all the results, and get the top k from the concatenated results. If there are a lot of batches, this can lead to an OOM error.

KNN is essentially Project(distance) -> TopK(k=k, order_by=distance), so we might just want to use the DataFusion nodes and build upon them.
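As a rough illustration of the Project(distance) -> TopK shape, here is a minimal sketch using only the Rust standard library. It is not DataFusion's TopK node and not Lance's code: `Candidate`, `TopK`, `l2_squared`, and `knn` are hypothetical names, batches are modeled as plain vectors of `(row_id, vector)` pairs rather than Arrow record batches, and squared L2 is assumed as the distance. The point is that a bounded max-heap keeps at most k candidates in memory no matter how many batches are scanned.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical candidate row: distance to the query plus a row id.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    distance: f32,
    row_id: u64,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // f32::total_cmp gives floats a total order, so the heap is well-defined.
        self.distance
            .total_cmp(&other.distance)
            .then_with(|| self.row_id.cmp(&other.row_id))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Bounded max-heap: the worst candidate kept so far sits on top.
struct TopK {
    k: usize,
    heap: BinaryHeap<Candidate>,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    fn push(&mut self, c: Candidate) {
        if self.heap.len() < self.k {
            self.heap.push(c);
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Replace the current worst only if the new candidate is closer.
            if c < *worst {
                *worst = c;
            }
        }
    }

    /// Returns the kept candidates in ascending distance order.
    fn finish(self) -> Vec<Candidate> {
        self.heap.into_sorted_vec()
    }
}

/// "Project(distance)" step, assuming squared L2 as the metric.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Streams over batches, feeding projected distances into the bounded heap.
fn knn(batches: &[Vec<(u64, Vec<f32>)>], query: &[f32], k: usize) -> Vec<Candidate> {
    let mut top_k = TopK::new(k);
    for batch in batches {
        for (row_id, vector) in batch {
            top_k.push(Candidate {
                distance: l2_squared(vector, query),
                row_id: *row_id,
            });
        }
    }
    top_k.finish()
}
```

Calling `knn(&batches, &query, k)` returns at most k candidates sorted by ascending distance. Unlike the concatenate-then-sort approach, the intermediate state stays at k entries instead of growing with the number of batches.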

There is a tracking issue upstream in DataFusion: apache/datafusion#7195
Also there is a drafted PR for an optimized TopK node: apache/datafusion#7250

We could complete that PR and use it to implement an optimized KNN query plan.

wjones127 added the enhancement, arrow, and performance labels Sep 27, 2023
westonpace added the priority: high label Jan 29, 2024
wjones127 self-assigned this Jan 29, 2024
changhiskhan assigned eddyxu and unassigned wjones127 Feb 9, 2024
@wjones127
Contributor Author

Also handle here:

// TODO: Use a heap sort to get the top-k.

3 participants