Retrieval
A Heterogeneous Benchmark for Information Retrieval. Easy to use; evaluate your models across 15+ diverse IR datasets.
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
The original implementation of the EMNLP 2020 paper "AmbigQA: Answering Ambiguous Open-domain Questions"
ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of …
🔍 AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your da…
A cloud-native vector database, storage for next generation AI applications
☁️ Build multimodal AI applications with cloud-native stack
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.
MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage/document ranking
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Code for the baselines from the NeurIPS 2021 paper "DUE: End-to-End Document Understanding Benchmark"
Language-Agnostic SEntence Representations
Provides a common interface to many IR ranking datasets.
Open-Source Information Retrieval Courses @ TU Wien
A curated list of awesome resources related to Semantic Search 🔎 and Semantic Similarity tasks.
Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results
NBoost is a scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different platforms (e.g. Elasticsearch)
A lightning-fast search API that fits effortlessly into your apps, websites, and workflow
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
typeahead.js is a fast and fully-featured autocomplete library