RetriEval

RetriEval is a benchmarking suite designed for Billion-scale Vector Search workloads. It's primarily used to benchmark in-process Search Engines on CPUs and GPUs, like USearch, FAISS, and cuVS, but it also reuses the same profiling logic for standalone databases like Qdrant, Weaviate, and Redis. It reads the plain input format standardized by the BigANN benchmark and aims for reproducible measurements: shuffled parallel construction, incremental recall curves, normalized metrics, and machine-readable reports that capture everything from machine topology to indexing hyper-parameters.

PubChem MACCS — 168-bit binary, Hamming · calibrated at 10M, same config at 100M

| Engine | Config | N | Recall @ 10 | Add/s | Search/s | Memory | Duration |
|---|---|---|---|---|---|---|---|
| USearch | M=32, ef=128/64 | 10M | 0.9696 | 36,347 | 35,767 | 4.7 GB | 5.0m |
| | | 100M | 0.8438 | 35,080 | 38,432 | 40.8 GB | 54.6m |
| FAISS | M=64, ef=40/16 | 10M | 0.9661 | 95,230 | 293,795 | 7.6 GB | 1.5m |
| | | 100M | | | | ≥ 63 GB | killed at 9h |

SIFT — 128D u8, L2 · iso-recall baseline at ≥ 99 % recall@10

| Engine | Config | N | Recall @ 10 | Add/s | Search/s | Memory | Duration |
|---|---|---|---|---|---|---|---|
| USearch | M=16, ef=128/256 | 10M | 0.9938 | 35,405 | 80,729 | 4.4 GB | 4.8m |
| | | 100M | 0.9833 | 39,831 | 75,808 | 53.7 GB | 48.7m |
| FAISS | M=16, ef=128/256 | 10M | 0.9952 | 26,374 | 38,278 | 5.9 GB | 5.6m |
| | | 100M | | | | ≥ 46 GB | killed at 9h |

Microsoft Turing-ANNS — 100D f32, L2 · iso-recall baseline at ≥ 99 % recall@10

| Engine | Config | N | Recall @ 10 | Add/s | Search/s | Memory | Duration |
|---|---|---|---|---|---|---|---|
| USearch | M=48, ef=768/384, f32 | 10M | 0.9929 | 8,532 | 12,331 | 13.0 GB | 18.3m |
| | | 100M | 0.9929 | 6,646 | 10,398 | 139.6 GB | 4h 1m |
| | M=48, ef=768/384, bf16 | 10M | 0.9929 | 10,496 | 16,940 | 10.9 GB | 14.1m |
| | | 100M | 0.9931 | 8,564 | 14,772 | 105.2 GB | 3h 1m |
| | M=48, ef=768/384, f16 | 10M | 0.9929 | 10,969 | 20,246 | 10.9 GB | 13.5m |
| | | 100M | 0.9930 | 8,807 | 15,412 | 105.2 GB | 2h 54m |
| | M=48, ef=768/384, e5m2 | 10M | 0.9919 | 10,526 | 20,534 | 9.8 GB | 13.5m |
| | | 100M | 0.9924 | 7,368 | 13,227 | 88.0 GB | 3h 15m |
| | M=48, ef=768/384, e4m3 | 10M | 0.9930 | 7,353 | 12,106 | 9.8 GB | 19.4m |
| | M=48, ef=768/384, e3m2 | 10M | 0.9728 | 10,398 | 18,022 | 9.8 GB | 13.3m |
| | M=48, ef=768/384, e2m3 | 10M | 0.7941 | 10,935 | 21,313 | 9.8 GB | 13.2m |
| FAISS | M=48, ef=768/384, f32 | 10M | 0.9944 | 7,491 | 16,486 | 14.1 GB | 20.6m |
| | M=48, ef=768/384, bf16 | 10M | 0.9944 | 3,800 | 10,391 | 12.1 GB | 39.4m |
| | M=48, ef=768/384, f16 | 10M | 0.9944 | 2,545 | 10,032 | 12.1 GB | 1h 1m |

Benchmarks were conducted on a dual-socket Intel Xeon 6 system with 192 logical threads. USearch v2.25 was compared to FAISS v1.12.0 (static, via faiss-sys 0.7.0). Both engines used the native input quantization type — no rescaling in either.

The recommended methodology is to parameter-sweep different configuration options to achieve comparable recall between search backends on a given dataset. Once the behavior is confirmed on a small 1M–10M subset, 100M–1B and larger benchmarks can be run to validate scaling curves.
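A calibration sweep on one of the 10M subsets below might look like this — a sketch only; the dataset paths follow the Meta BIGANN — SIFT recipe further down, and the parameter grid is illustrative rather than recommended:

retri-eval-usearch \
    --vectors datasets/sift_10M/base.10M.u8bin \
    --queries datasets/sift_10M/query.public.10K.u8bin \
    --neighbors datasets/sift_10M/groundtruth.public.10K.ibin \
    --data-type i8 --metric l2 \
    --connectivity 16,32 \
    --expansion-add 128,256 \
    --expansion-search 64,128,256 \
    --output results/sift_10M_sweep

Once a configuration reaches the target recall on the small subset, the same flags can be replayed on the 100M or 1B slice to check the scaling curve.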

Quick Start

Install the default retri-eval-usearch binary:

cargo install --path .

Fetch the Unum Wiki 1M dataset — ~400 MB of vectors, queries, and ground truth:

mkdir -p datasets/wiki_1M && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/wiki_1M/

Run a sweep over three quantizations and write JSON reports under results/:

retri-eval-usearch \
    --vectors datasets/wiki_1M/base.1M.fbin \
    --queries datasets/wiki_1M/query.public.100K.fbin \
    --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin \
    --data-type f32,f16,i8 \
    --metric ip \
    --output results/

Generate plots from the results:

uv run scripts/plot.py results/ --output-dir plots/

Backends

Search Engines

| Backend | Parallelism | Quantization | Metrics |
|---|---|---|---|
| USearch | ForkUnion | f64, f32, bf16, f16, e5m2, e4m3, e3m2, e2m3, i8, u8, b1 | ip, l2, cos, hamming, ... |
| FAISS | OpenMP | f32, f16, bf16, u8, i8, b1 | ip, l2 |
| cuVS | CUDA | f32, f16, i8, u8 | l2, ip, cos |
  • USearch: Input is passed directly in the specified type. --data-type selects both the input interpretation and the internal quantization.
  • FAISS: Input is always f32. --data-type selects the internal scalar quantizer (SQfp16, SQbf16, SQ8_direct, etc.).
  • cuVS: Currently benchmarks with f32. CAGRA natively supports f32, f16, i8, u8 for build.
retri-eval-usearch --data-type bf16 --metric l2 ...
retri-eval-faiss --data-type f16 --metric l2 ...
retri-eval-cuvs --metric l2 ...

Vector Databases

Server-side quantization is managed by the database engine, not the benchmark. Binary quantization is deterministic sign(x) per dim, and scalar quantization is per-dim min/max — neither trains a codebook, so both stay inside the "no learned logic" constraint the rest of the benchmark holds for the native backends. Product quantization is deliberately excluded everywhere.
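As an illustration of that math, here is a minimal Rust sketch of deterministic per-dimension binary and scalar quantization — the idea only, not any engine's actual implementation:

/// Deterministic binary quantization: one bit per dimension, set if x >= 0.
fn binary_quantize(vector: &[f32]) -> Vec<u8> {
    let mut bits = vec![0u8; (vector.len() + 7) / 8];
    for (i, &x) in vector.iter().enumerate() {
        if x >= 0.0 {
            bits[i / 8] |= 1 << (i % 8);
        }
    }
    bits
}

/// Per-dimension min/max scalar quantization to u8 — no codebook is trained.
fn scalar_quantize(vector: &[f32], mins: &[f32], maxs: &[f32]) -> Vec<u8> {
    vector
        .iter()
        .zip(mins.iter().zip(maxs.iter()))
        .map(|(&x, (&lo, &hi))| {
            let scale = if hi > lo { 255.0 / (hi - lo) } else { 0.0 };
            ((x - lo) * scale).clamp(0.0, 255.0).round() as u8
        })
        .collect()
}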

| Backend | Client | Docker Image | Metrics | Wire dtype sweep | Server-side quantization |
|---|---|---|---|---|---|
| Qdrant | qdrant-client, gRPC | qdrant/qdrant:v1.17.1 | ip, l2, cos, manhattan | f32, f16, u8 | none, binary, scalar |
| Redis | redis, RESP | redis:8.6 | ip, l2, cos | f32, f64, f16, bf16, u8, i8 | |
| Weaviate | weaviate-community, REST | semitechnologies/weaviate:1.36.10 | ip, l2, cos | f32 only | none, binary |
| LanceDB | lancedb, in-process, Arrow | — | ip, l2, cos | f32 only | — (IVF-bucketed only) ¹ |

¹ LanceDB's Rust client — lancedb 0.27 — exposes graph-based search only via IvfHnswPq / IvfHnswSq, both IVF-bucketed and PQ k-means-trained. No pure-HNSW variant is offered, so this benchmark leaves LanceDB on plain f32 + L2/IP/Cos until upstream adds one. Hamming is only available on IvfFlat, outside our graph path.

Redis 8.x is required for i8, u8, f16, and bf16 — the older redis/redis-stack images on Redis 7.4 reject those four types at FT.CREATE. Qdrant server-side Float16 and Uint8 accept f32 upserts and convert on ingest, so the wire payload we send is unchanged. Weaviate stores only f32 internally; the wire dtype sweep there is intentionally a single-option list.


Each backend is behind its own feature flag. Build only what you need:

cargo build --release --features usearch-backend    # USearch
cargo build --release --features faiss-backend      # FAISS
cargo build --release --features qdrant-backend     # Qdrant
cargo build --release --features redis-backend      # Redis
cargo build --release --features lancedb-backend    # LanceDB
cargo build --release --features weaviate-backend   # Weaviate
cargo build --release --features cuvs-backend       # cuVS

Or combine multiple:

cargo build --release --features usearch-backend,faiss-backend,qdrant-backend

CLI Reference

Each backend is a separate binary. Common flags shared by all:

--vectors <PATH|GLOB>      # Base vectors (.fbin, .u8bin, .i8bin, .b1bin)
--queries <PATH|GLOB>      # Query vectors
--neighbors <PATH|GLOB>    # Ground-truth neighbors (.ibin)
--keys <PATH|GLOB>         # Optional keys file (.i32bin)
--epochs <N>               # Measurement steps (dataset split into N parts, default: 10)
--no-shuffle               # Disable random insertion order (shuffle is on by default)
--output <DIR>             # Output directory for JSON result files (omit for progress-only)
--index <PATH>             # Persisted index handle. If the path exists, the run skips the
                           # add phase, loads, and search-only-runs; otherwise the run
                           # builds, then saves to that path. Requires a single-config sweep.
                           # USearch / FAISS / cuVS only.
--dimensions <LIST>        # Matryoshka truncations to evaluate (e.g. 128,256,512,1024).
                           # Empty → use the file's native dim. Each value must be ≤ native;
                           # for `.b1bin` files each must be a multiple of 8.

--vectors / --queries / --neighbors / --keys accept shell glob patterns (*, ?, […]). Matched shards are natural-sorted (shard_2.fbin before shard_10.fbin) and validated for matching dim and scalar format — useful for multi-shard datasets like USearchWiki.
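For example, a two-phase run over a hypothetical sharded dataset — first building and saving the index, then re-running search-only against the persisted file (the paths are placeholders; --index requires a single-config sweep):

# First run: builds from all shards, then saves the index to the given path
retri-eval-usearch \
    --vectors 'datasets/wiki_sharded/base.shard_*.fbin' \
    --queries datasets/wiki_sharded/query.fbin \
    --neighbors datasets/wiki_sharded/groundtruth.ibin \
    --data-type f16 --metric ip \
    --index indexes/wiki_f16.usearch \
    --output results/wiki_sharded

# Second run: the index path already exists, so the add phase is skipped and only search runs
retri-eval-usearch \
    --vectors 'datasets/wiki_sharded/base.shard_*.fbin' \
    --queries datasets/wiki_sharded/query.fbin \
    --neighbors datasets/wiki_sharded/groundtruth.ibin \
    --data-type f16 --metric ip \
    --index indexes/wiki_f16.usearch \
    --output results/wiki_sharded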

retri-eval-usearch additionally supports comma-separated sweeps:

--data-type <LIST>         # f32, f16, bf16, e5m2, e4m3, e3m2, e2m3, i8, u8, b1
--metric <LIST>            # ip, l2, cos, hamming, jaccard, sorensen, pearson, haversine, divergence
--connectivity <LIST>      # HNSW M parameter (default: 0 = auto)
--expansion-add <LIST>     # expansion factor during indexing (default: 0 = auto)
--expansion-search <LIST>  # expansion factor during search (default: 0 = auto)
--shards <LIST>            # Index shards (default: 2)
--threads <LIST>           # Thread count (default: available cores)

retri-eval-cuvs — requires --features cuvs-backend and an NVIDIA GPU:

--data-type <LIST>                 # f32, f16, u8                  (default: f32)
--metric <LIST>                    # l2, ip, cos (default: l2)
--graph-degree <LIST>              # CAGRA output graph degree (default: 32)
--intermediate-graph-degree <LIST> # CAGRA intermediate graph degree (default: 64)
--itopk-size <LIST>                # Search-time intermediate results (default: 64)
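A sweep over the CAGRA parameters could look like this, reusing the Turing-ANNS 10M recipe below for data (illustrative values, not a tuned configuration):

retri-eval-cuvs \
    --vectors datasets/turing_10M/base.10M.fbin \
    --queries datasets/turing_10M/query.public.100K.fbin \
    --neighbors datasets/turing_10M/groundtruth.public.100K.ibin \
    --metric l2 \
    --graph-degree 32,64 \
    --intermediate-graph-degree 64,128 \
    --itopk-size 64,128 \
    --output results/turing_10M_cuvs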

retri-eval-qdrant extends the common flags with:

--data-type <LIST>      # f32, f16, u8                  (default: f32)
--quantization <LIST>   # none, binary, scalar          (default: none)
--metric <LIST>         # ip, l2, cos, manhattan        (default: l2)

retri-eval-redis extends the common flags with:

--data-type <LIST>      # f32, f64, f16, bf16, u8, i8   (default: f32)
--metric <LIST>         # ip, l2, cos                   (default: l2)

retri-eval-weaviate extends the common flags with:

--quantization <LIST>   # none, binary                  (default: none)
--metric <LIST>         # ip, l2, cos                   (default: l2)
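As a concrete Tier 2 invocation, a Qdrant sweep over wire dtypes and server-side quantization against the Wiki 1M dataset from the Quick Start might be (illustrative, the container is pulled and started automatically):

retri-eval-qdrant \
    --vectors datasets/wiki_1M/base.1M.fbin \
    --queries datasets/wiki_1M/query.public.100K.fbin \
    --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin \
    --data-type f32,f16,u8 \
    --quantization none,binary,scalar \
    --metric ip \
    --output results/wiki_1M_qdrant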

Observability

Hardware Counters on Linux

Wall-clock throughput and peak RSS are always recorded in the JSON report. For deeper attribution — "how many cycles did construction spend in cache misses vs searching?" — build with --features perf-counters. On Linux this pulls perf-event2 and wraps the index.add and index.search loops inside src/bench.rs::run with system-wide hardware counters, populating eight new optional fields on each StepEntry:

cycles_add / instructions_add / cache_misses_add / branch_misses_add
cycles_search / instructions_search / cache_misses_search / branch_misses_search

Fields are Option<u64> with skip_serializing_if = "Option::is_none", so reports from runs without the feature are byte-identical to the pre-feature schema.
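In outline, the counter fields look like the fragment below — a sketch of only the added fields, not the full StepEntry definition:

use serde::Serialize;

#[derive(Serialize)]
struct StepEntry {
    // ... throughput, recall, and memory fields elided ...
    #[serde(skip_serializing_if = "Option::is_none")]
    cycles_add: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    instructions_add: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    cache_misses_add: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    branch_misses_add: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    cycles_search: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    instructions_search: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    cache_misses_search: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    branch_misses_search: Option<u64>,
}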

sudo sysctl -w kernel.perf_event_paranoid=-1   # once per host
ulimit -n 65536                                 # see RLIMIT note below
cargo build --release --features usearch-backend,perf-counters

retri-eval-usearch \
    --vectors datasets/pubchem_maccs/base.115627267.b1bin \
    --queries datasets/pubchem_maccs/query.10000.b1bin \
    --neighbors datasets/pubchem_maccs/groundtruth.10000.ibin \
    --data-type b1 --metric hamming --output results/pubchem_maccs

Scope is system-wide per-CPU — pid == -1, cpu == i, one counter group per online CPU, summed at read. This is the only way to cover every ForkUnion pool thread, because per-process inherit(true) would miss workers spawned before the counter was enabled. Trade-off: on shared hosts the numbers include other tenants' activity; on a dedicated box this is exactly what you want.

Permissions require CAP_PERFMON or CAP_SYS_ADMIN, or relaxed paranoia via kernel.perf_event_paranoid ≤ 0. Without either, PerfCounters::new returns EACCES; the bench prints "perf counters: unavailable …; running without" and completes normally with the counter fields absent.

RLIMIT_NOFILE matters: each CPU opens six file descriptors — a no-op leader fd plus five hardware counters. At 192 CPUs that's 1,152 fds, above the default ulimit -n 1024 on most distros. Bump it per shell with ulimit -n 65536 or system-wide via /etc/security/limits.conf before running. Without the bump you'll get EMFILE around the 170th CPU's group.

Hardware-counter support is Linux-only. On macOS, Windows, or BSD, Cargo simply does not pull perf-event2 into the dep graph — the dependency line is gated behind [target.'cfg(target_os = "linux")']. The module falls back to a stub whose PerfCounters::new returns Unsupported. Enabling the feature on a non-Linux target compiles cleanly and runs as if it were disabled — you still get the JSON, just without counter fields.

External perf stat Sidecar

When profiling an already-compiled binary, or when you want OS-level metrics alongside hardware counters, run perf stat and mpstat directly alongside the bench instead of rebuilding with --features perf-counters.

sudo apt install linux-tools-common linux-tools-generic sysstat
sudo sysctl -w kernel.perf_event_paranoid=-1

mpstat 1 > results/cohere_en/mpstat.txt &        # 1 Hz all-core utilization
perf stat -a -e cycles,instructions,cache-references,cache-misses,\
LLC-load-misses,branch-misses,context-switches,cpu-migrations,page-faults \
    --output results/cohere_en/perf.txt -- \
    retri-eval-usearch \
        --vectors datasets/cohere_en/base.41488110.b1bin \
        --queries datasets/cohere_en/query.10000.b1bin \
        --neighbors datasets/cohere_en/groundtruth.10000.ibin \
        --data-type b1 --metric hamming \
        --output results/cohere_en
kill %1

This covers the whole process lifetime including dataset-load and ground-truth I/O rather than just the add/search loops, useful for spotting cost outside the measured regions.

Memory Consumption Tracking

StepEntry.memory_bytes is populated per step by asking the backend what it's currently using. The mechanism depends on the backend:

| Backend | How memory_bytes is measured |
|---|---|
| In-process — USearch, FAISS, cuVS | The engine exposes its internal allocator or index.size() API, giving the exact index footprint excluding the dataset mmap. USearch: index.memory_usage(). FAISS: index.stats().indexed_vectors * sizeof. |
| Tier 2 Docker — Qdrant, Redis, Weaviate | docker stats --no-stream --format '{{.MemUsage}}' is sampled per step against the running container and parsed into bytes. This includes the whole engine process, not just the index, so it's an overcount. |
| LanceDB — in-process, Arrow IPC | Filesystem-backed; memory_bytes reports the table's on-disk size from fs::metadata, not RSS. |

The peak memory line printed at the end of a run is steps.iter().map(|s| s.memory_bytes).max(). Process-wide peak RSS — the kernel's accounting of everything including mmapped datasets — is available via getrusage(RUSAGE_SELF) but is not currently reported in the JSON. On the wishlist if you want mmap cost separated out.
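If you need that number today, it is one libc call away — a minimal sketch of reading peak RSS, not something the current report emits:

/// Peak resident set size of the current process, in bytes.
/// On Linux, getrusage reports ru_maxrss in KiB.
fn peak_rss_bytes() -> std::io::Result<u64> {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(usage.ru_maxrss as u64 * 1024)
}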

Tier 2 Backend Docker Lifecycle

Tier 2 backends — Qdrant, Redis, Weaviate — don't run in-process. They run as Docker containers the benchmark spawns and tears down automatically. src/docker.rs wraps bollard, the async Docker API client, and does:

  1. Pull — runs docker pull qdrant/qdrant:vX.Y.Z or equivalent if the image isn't cached locally.
  2. Run — creates the container with port bindings and environment variables from the compose file at docker/<backend>.yml, then starts it.
  3. Wait for ready — polls an HTTP health endpoint such as /healthz or /health with 500 ms intervals until the backend accepts connections, or a configurable timeout fires.
  4. Run the benchmark against the container.
  5. Stop and remove the container regardless of success or failure — RAII-style via ContainerHandle::Drop.
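Step 3 boils down to a bounded polling loop against the health endpoint; a simplified sketch of that idea with reqwest and tokio, not the exact src/docker.rs code:

use std::time::{Duration, Instant};

/// Poll an HTTP health endpoint every 500 ms until it answers 2xx or the timeout fires.
async fn wait_for_ready(url: &str, timeout: Duration) -> Result<(), String> {
    let client = reqwest::Client::new();
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(response) = client.get(url).send().await {
            if response.status().is_success() {
                return Ok(());
            }
        }
        if Instant::now() >= deadline {
            return Err(format!("{url} did not become ready within {timeout:?}"));
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}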

Per-step memory for these backends comes from the Docker stats API. memory_bytes reflects the container's resident set including the engine process, its heap, page cache attributed to it, and so on. Overcount compared to just-the-index, but it's the honest picture of what the engine costs to run.

Requirements: Docker daemon accessible over Unix socket or TCP, images pullable by the current user. On systems where the Docker daemon runs as root, either add your user to the docker group or run the benchmark with sudo.

Output Format

One JSON file per backend configuration, written to --output <dir>. Files are auto-named <backend>-<hash>.json.

{
  "machine": { "cpu_model": "Intel Xeon 6776P", "physical_cores": 96, ... },
  "dataset": { "vectors_path": "...", "vectors_count": 10000000, "dimensions": 100, ... },
  "config": { "backend": "usearch", "data_type": "f32", "metric": "l2", "connectivity": 16, ... },
  "steps": [
    {
      "vectors_indexed": 1000000,
      "add_elapsed": 12.3,
      "add_throughput": 81300,
      "memory_bytes": 412000000,
      "search_elapsed": 0.45,
      "search_throughput": 222000,
      "recall_at_1": 0.0942,
      "recall_at_10": 0.2815,
      "ndcg_at_10": 0.1847,
      "recall_at_1_normalized": 0.9420,
      "recall_at_10_normalized": 0.9512,
      "ndcg_at_10_normalized": 0.8470
    }
  ]
}
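Because the schema is flat and stable, the reports are easy to slice with standard tools — for example, extracting the recall curve from every USearch report in a results directory with jq:

jq '{config: .config.data_type, curve: [.steps[] | {vectors_indexed, recall_at_10}]}' results/usearch-*.json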

Project Structure

Cargo.toml
src/
    bench.rs                # Library root: Backend trait, types, BenchState, benchmark loop
    dataset.rs              # Memory-mapped .fbin/.ibin loading (zero-copy)
    eval.rs                 # Recall@K, NDCG@K
    output.rs               # Report types, JSON writer, machine info
    docker.rs               # Docker container lifecycle (Tier 2 backends)
    usearch.rs              # retri-eval-usearch binary
    faiss.rs                # retri-eval-faiss binary
    cuvs.rs                 # retri-eval-cuvs binary
    qdrant.rs               # retri-eval-qdrant binary
    redis.rs                # retri-eval-redis binary
    lancedb.rs              # retri-eval-lancedb binary
    weaviate.rs             # retri-eval-weaviate binary
    generate.rs             # retri-generate — synthetic dataset generator with GT
    perf_counters.rs        # Linux perf_event_open wrapper for hardware counters
docker/
    qdrant.yml              # Docker compose for Qdrant
    redis.yml               # Docker compose for Redis
    weaviate.yml            # Docker compose for Weaviate
scripts/
    plot.py                 # JSON results → PNG plots (Plotly, runnable via uv)
    download_molecules.rs   # retri-download-molecules binary (--features download)
    download_cohere.rs      # retri-download-cohere binary (--features download)

Datasets

The BigANN benchmark is a good starting point if you are looking for large collections of high-dimensional vectors. Those often come with precomputed ground-truth neighbors, which is handy for recall evaluation. The datasets below are grouped by scale; only configurations with matching ground truth support recall evaluation.

Most datasets ship as one file per role (base / queries / ground-truth), but larger ones — like USearchWiki — are split across many .fbin shards. RetriEval accepts shell glob patterns on --vectors / --queries / --neighbors / --keys, so a sharded dataset reads exactly like a single-file one: pass --vectors 'base.shard_*.fbin', quoted so the shell doesn't expand it. Matched shards are natural-sorted (shard_2.fbin before shard_10.fbin) and validated for consistent dimensionality and scalar format; per-row stride and recall metrics are unchanged versus the single-file path.

~1M Scale — Development & Testing

| Dataset | Scalar Type | Dimensions | Metric | Base Size | Ground Truth |
|---|---|---|---|---|---|
| Unum UForm Wiki | f32 | 256 | IP | 1 GB | 100K queries, yes |
| Unum UForm Creative Captions | f32 | 256 | IP | 3 GB | 3M queries, yes |
| Arxiv with E5 | f32 | 768 | IP | 6 GB | 2M queries, yes |

~10M Scale

| Dataset | Scalar Type | Dimensions | Metric | Base Size | Ground Truth |
|---|---|---|---|---|---|
| Meta BIGANN — SIFT | u8 | 128 | L2 | 1.2 GB | 10K queries, yes |
| Microsoft Turing-ANNS | f32 | 100 | L2 | 3.7 GB | 100K queries, yes |
| Cohere Wiki EN | b1 | 1024 | Hamming | 5.3 GB | self-sampled ¹ |

¹ Binary fingerprint and embedding sources ship vectors but no ground truth. The retri-download-molecules and retri-download-cohere binaries behind --features download fetch the Parquet shards from S3 and Hugging Face, extract the bit-packed column straight into .b1bin, sample queries with a fixed seed, and compute exact brute-force Hamming top-K using NumKong's SIMD kernels.
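Conceptually, that exact ground truth is just XOR + popcount against every base row followed by a top-K selection; a scalar Rust sketch of the idea (the actual kernels are SIMD and batched):

/// Hamming distance between two bit-packed fingerprints of equal byte length.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}

/// Exact top-k nearest base rows for one query, by brute force.
fn top_k(query: &[u8], base: &[Vec<u8>], k: usize) -> Vec<(usize, u32)> {
    let mut scored: Vec<(usize, u32)> = base
        .iter()
        .enumerate()
        .map(|(i, row)| (i, hamming(query, row)))
        .collect();
    scored.sort_by_key(|&(_, d)| d);
    scored.truncate(k);
    scored
}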

~100M Scale

| Dataset | Scalar Type | Dimensions | Metric | Base Size | Ground Truth |
|---|---|---|---|---|---|
| Meta BIGANN — SIFT | u8 | 128 | L2 | 12 GB | 10K queries, yes |
| Microsoft Turing-ANNS | f32 | 100 | L2 | 37 GB | 100K queries, yes |
| Microsoft SpaceV | i8 | 100 | L2 | 9.3 GB | 30K queries, yes |
| Unum WikiVerse ² | f16 | 128–4096 ³ | Cos/IP | 95–505 GB | pipeline pending |
| USearchMolecules PubChem MACCS | b1 | 168 | Hamming | 2.4 GB | self-sampled ¹ |
| USearchMolecules PubChem ECFP4 | b1 | 2048 | Hamming | 29 GB | self-sampled ¹ |

² WikiVerse uses .f16bin (u32 rows + u32 cols + f16 values), which RetriEval does not yet read — adding f16 to Dataset::load's extension match is a small follow-up.
³ Per-model: nomic-embed 768, arctic-embed/Qwen3 1024, e5-mistral 4096; ColBERT-style multi-vector (128d/token) needs the deferred multi-vector plan.

~1B Scale

| Dataset | Scalar Type | Dimensions | Metric | Base Size | Ground Truth |
|---|---|---|---|---|---|
| Meta BIGANN — SIFT | u8 | 128 | L2 | 119 GB | 10K queries, yes |
| Microsoft Turing-ANNS | f32 | 100 | L2 | 373 GB | 100K queries, yes |
| Microsoft SpaceV | i8 | 100 | L2 | 93 GB | 30K queries, yes |
| Yandex Text-to-Image | f32 | 200 | Cos | 750 GB | 100K queries, yes |
| Yandex Deep | f32 | 96 | L2 | 358 GB | 10K queries, yes |
| USearchMolecules GDB-13 MACCS | b1 | 168 | Hamming | 21 GB | self-sampled ¹ |
| USearchMolecules GDB-13 ECFP4 | b1 | 2048 | Hamming | 250 GB | self-sampled ¹ |
| USearchMolecules Enamine REAL MACCS | b1 | 168 | Hamming | 127 GB | self-sampled ¹ |
| USearchMolecules Enamine REAL ECFP4 | b1 | 2048 | Hamming | 1.55 TB | self-sampled ¹ |

Unum UForm Wiki

Image-and-text embeddings from the UForm small multimodal model, projected to a shared 256d space. Bench against IP since UForm is L2-normalised at training time.

1M — f32, 256d, IP, ~1 GB
mkdir -p datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/wiki_1M/
retri-eval-usearch \
    --vectors datasets/wiki_1M/base.1M.fbin \
    --queries datasets/wiki_1M/query.public.100K.fbin \
    --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin \
    --data-type f32,f16,i8 --metric ip \
    --output results/wiki_1M

Unum UForm Creative Captions

Conceptual Captions image embeddings from the same UForm model as Wiki. Ground truth was computed offline by shuffling the base set as queries and recording the top-100 IP neighbors per row — see scripts/compute_unum_orphan_gt.py.

3M — f32, 256d, IP, ~3 GB
mkdir -p datasets/cc_3M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-cc-3m/resolve/main/base.fbin -P datasets/cc_3M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-cc-3m/resolve/main/query.fbin -P datasets/cc_3M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-cc-3m/resolve/main/groundtruth.ibin -P datasets/cc_3M/
retri-eval-usearch \
    --vectors datasets/cc_3M/base.fbin \
    --queries datasets/cc_3M/query.fbin \
    --neighbors datasets/cc_3M/groundtruth.ibin \
    --data-type f32,bf16,f16,i8 --metric ip \
    --output results/cc_3M

Arxiv with E5

Arxiv abstracts embedded with the intfloat/e5-base model. Same offline GT recipe as Creative Captions: shuffled base as queries, top-100 IP neighbors.

2M — f32, 768d, IP, ~6 GB
mkdir -p datasets/arxiv_2M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/base.fbin -P datasets/arxiv_2M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/query.fbin -P datasets/arxiv_2M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/groundtruth.ibin -P datasets/arxiv_2M/
retri-eval-usearch \
    --vectors datasets/arxiv_2M/base.fbin \
    --queries datasets/arxiv_2M/query.fbin \
    --neighbors datasets/arxiv_2M/groundtruth.ibin \
    --data-type f32,bf16,f16,i8 --metric ip \
    --output results/arxiv_2M

Meta BIGANN — SIFT

Billion-scale SIFT descriptors from Meta. No pre-sliced subset base files exist, so the recipes use range requests against the single 1B file, followed by an in-place header patch to update the vector count. The byte range is the 8-byte header (u32 count + u32 dimensions) plus count × 128 bytes of u8 data — 1,280,000,008 bytes for 10M, hence Range: bytes=0-1280000007. Pre-computed ground truth is available for the 10M and 100M subsets.

10M — u8, 128d, L2, ~1.2 GB
mkdir -p datasets/sift_10M/ && \
    wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -P datasets/sift_10M/ && \
    wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/GT_10M/bigann-10M -O datasets/sift_10M/groundtruth.public.10K.ibin && \
    wget --header="Range: bytes=0-1280000007" \
        https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin \
        -O datasets/sift_10M/base.10M.u8bin && \
    python3 -c "
import struct
with open('datasets/sift_10M/base.10M.u8bin', 'r+b') as f:
    f.write(struct.pack('I', 10_000_000))
"
retri-eval-usearch \
    --vectors datasets/sift_10M/base.10M.u8bin \
    --queries datasets/sift_10M/query.public.10K.u8bin \
    --neighbors datasets/sift_10M/groundtruth.public.10K.ibin \
    --data-type f32,f16,i8 --metric l2 \
    --output results/sift_10M
100M — u8, 128d, L2, ~12 GB
mkdir -p datasets/sift_100M/ && \
    wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -P datasets/sift_100M/ && \
    wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/GT_100M/bigann-100M -O datasets/sift_100M/groundtruth.public.10K.ibin && \
    wget --header="Range: bytes=0-12800000007" \
        https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin \
        -O datasets/sift_100M/base.100M.u8bin && \
    python3 -c "
import struct
with open('datasets/sift_100M/base.100M.u8bin', 'r+b') as f:
    f.write(struct.pack('I', 100_000_000))
"
retri-eval-usearch \
    --vectors datasets/sift_100M/base.100M.u8bin \
    --queries datasets/sift_100M/query.public.10K.u8bin \
    --neighbors datasets/sift_100M/groundtruth.public.10K.ibin \
    --data-type f32,f16,i8 --metric l2 \
    --epochs 20 --output results/sift_100M

Microsoft Turing-ANNS

373 GB of f32 vectors with 100 dimensions at full 1B scale. Subsets follow the same range-request + header-patch recipe as BIGANN. Pre-computed ground truth is available for 1M, 10M, and 100M.

1M — f32, 100d, L2, ~400 MB
mkdir -p datasets/turing_1M/ && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
        -O datasets/turing_1M/query.public.100K.fbin && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-1M \
        -O datasets/turing_1M/groundtruth.public.100K.ibin && \
    wget --header="Range: bytes=0-400000007" \
        https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
        -O datasets/turing_1M/base.1M.fbin && \
    python3 -c "
import struct
with open('datasets/turing_1M/base.1M.fbin', 'r+b') as f:
    f.write(struct.pack('I', 1_000_000))
"
retri-eval-usearch \
    --vectors datasets/turing_1M/base.1M.fbin \
    --queries datasets/turing_1M/query.public.100K.fbin \
    --neighbors datasets/turing_1M/groundtruth.public.100K.ibin \
    --data-type f32,bf16,f16,i8 --metric l2 \
    --output results/turing_1M
10M — f32, 100d, L2, ~3.7 GB
mkdir -p datasets/turing_10M/ && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
        -O datasets/turing_10M/query.public.100K.fbin && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-10M \
        -O datasets/turing_10M/groundtruth.public.100K.ibin && \
    wget --header="Range: bytes=0-4000000007" \
        https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
        -O datasets/turing_10M/base.10M.fbin && \
    python3 -c "
import struct
with open('datasets/turing_10M/base.10M.fbin', 'r+b') as f:
    f.write(struct.pack('I', 10_000_000))
"
retri-eval-usearch \
    --vectors datasets/turing_10M/base.10M.fbin \
    --queries datasets/turing_10M/query.public.100K.fbin \
    --neighbors datasets/turing_10M/groundtruth.public.100K.ibin \
    --data-type f32,bf16,f16,i8 --metric l2 \
    --output results/turing_10M
100M — f32, 100d, L2, ~37 GB
mkdir -p datasets/turing_100M/ && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
        -O datasets/turing_100M/query.public.100K.fbin && \
    wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-100M \
        -O datasets/turing_100M/groundtruth.public.100K.ibin && \
    wget --header="Range: bytes=0-40000000007" \
        https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
        -O datasets/turing_100M/base.100M.fbin && \
    python3 -c "
import struct
with open('datasets/turing_100M/base.100M.fbin', 'r+b') as f:
    f.write(struct.pack('I', 100_000_000))
"
retri-eval-usearch \
    --vectors datasets/turing_100M/base.100M.fbin \
    --queries datasets/turing_100M/query.public.100K.fbin \
    --neighbors datasets/turing_100M/groundtruth.public.100K.ibin \
    --data-type f32,bf16,f16,i8 --metric l2 \
    --epochs 20 --output results/turing_100M

Microsoft SpaceV

Web-search embeddings already quantised to int8 at the source. A 100M subset is mirrored on Hugging Face; the original 1B lives on AWS S3.

100M — i8, 100d, L2, ~9.3 GB
mkdir -p datasets/spacev_100M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/base.100M.i8bin -P datasets/spacev_100M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/query.30K.i8bin -P datasets/spacev_100M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.i32bin -P datasets/spacev_100M/
retri-eval-usearch \
    --vectors datasets/spacev_100M/base.100M.i8bin \
    --queries datasets/spacev_100M/query.30K.i8bin \
    --neighbors datasets/spacev_100M/groundtruth.30K.i32bin \
    --data-type f32,f16,i8 --metric l2 \
    --epochs 20 --output results/spacev_100M

Yandex Deep

Image embeddings extracted from the GoogLeNet penultimate layer. Only the full 1B is included here — the smaller subsets duplicate the same distribution at scales already covered by other datasets.

1B — f32, 96d, L2, ~358 GB
mkdir -p datasets/deep_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.1B.fbin -P datasets/deep_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/query.public.10K.fbin -P datasets/deep_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/groundtruth.public.10K.ibin -P datasets/deep_1B/

Yandex Text-to-Image

Cross-modal text-and-image embeddings benchmarked under cosine similarity.

1M — f32, 200d, Cos, ~750 MB
mkdir -p datasets/t2i/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1M.fbin -P datasets/t2i/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/t2i/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/t2i/
retri-eval-usearch \
    --vectors datasets/t2i/base.1M.fbin \
    --queries datasets/t2i/query.public.100K.fbin \
    --neighbors datasets/t2i/groundtruth.public.100K.ibin \
    --data-type f32,bf16,f16,i8 --metric cos \
    --output results/t2i_1M
1B — f32, 200d, Cos, ~750 GB
mkdir -p datasets/t2i_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1B.fbin -P datasets/t2i_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/t2i_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/t2i_1B/

USearchMolecules

A corpus of small molecules with pre-computed binary fingerprints at four widths: MACCS 166 bits, PubChem 881 bits, ECFP4 2048 bits, FCFP4 2048 bits. Three subsets are hosted on AWS Open Data as Parquet shards: PubChem at 115M molecules, GDB-13 at 977M, and Enamine REAL at 6.04B. Natural fit for Hamming and Jaccard benchmarks since the vectors are genuinely binary rather than quantised floats.

The retri-download-molecules binary fetches the requested fingerprint column directly into .b1bin, samples queries with a fixed seed, and computes brute-force Hamming top-K ground truth. Use --limit N to take a subset and --source {pubchem,gdb13,enamine} to pick the scale.

PubChem 115M MACCS — b1, 168 bits, Hamming, ~2.4 GB
cargo install --path . --features download
retri-download-molecules \
    --source pubchem --fingerprint maccs \
    --query-count 10000 --neighbors 10 \
    --output datasets/pubchem_maccs/
retri-eval-usearch \
    --vectors datasets/pubchem_maccs/base.115627267.b1bin \
    --queries datasets/pubchem_maccs/query.10000.b1bin \
    --neighbors datasets/pubchem_maccs/groundtruth.10000.ibin \
    --data-type b1 --metric hamming,jaccard \
    --output results/pubchem_maccs
PubChem 115M ECFP4 — b1, 2048 bits, Hamming, ~29 GB
retri-download-molecules \
    --source pubchem --fingerprint ecfp4 \
    --query-count 10000 --neighbors 10 \
    --output datasets/pubchem_ecfp4/
retri-eval-usearch \
    --vectors datasets/pubchem_ecfp4/base.115627267.b1bin \
    --queries datasets/pubchem_ecfp4/query.10000.b1bin \
    --neighbors datasets/pubchem_ecfp4/groundtruth.10000.ibin \
    --data-type b1 --metric hamming \
    --output results/pubchem_ecfp4
GDB-13 977M MACCS — b1, 168 bits, Hamming, ~21 GB
retri-download-molecules \
    --source gdb13 --fingerprint maccs \
    --query-count 10000 --neighbors 10 \
    --output datasets/gdb13_maccs/
Enamine REAL 6.04B MACCS — b1, 168 bits, Hamming, ~127 GB
retri-download-molecules \
    --source enamine --fingerprint maccs \
    --query-count 10000 --neighbors 10 \
    --output datasets/enamine_maccs/

Substitute --fingerprint ecfp4 for the 2048-bit variant, which multiplies the base-file size by roughly 12× at each scale. Ground-truth time dominates at billion scale; set --batch-size explicitly if you have a lot of RAM and want larger query batches.

Cohere Wikipedia Multilingual

247M Wikipedia paragraphs embedded with Cohere Embed v3 and bit-packed into 1024-bit emb_ubinary columns at 128 bytes per vector. The dataset also ships text metadata — title, paragraph body, URL — alongside the vectors. --with-text extracts them into aligned newline-delimited files for downstream semantic-search demos.

English subset 41.5M — b1, 1024 bits, Hamming, ~5.3 GB
retri-download-cohere \
    --language en \
    --query-count 10000 --neighbors 10 \
    --output datasets/cohere_en/
retri-eval-usearch \
    --vectors datasets/cohere_en/base.41488110.b1bin \
    --queries datasets/cohere_en/query.10000.b1bin \
    --neighbors datasets/cohere_en/groundtruth.10000.ibin \
    --data-type b1 --metric hamming \
    --output results/cohere_en

FAISS binary indexes via IndexBinaryHNSW also work — pass --data-type b1, and the metric is Hamming by construction.

retri-eval-faiss \
    --vectors datasets/cohere_en/base.41488110.b1bin \
    --queries datasets/cohere_en/query.10000.b1bin \
    --neighbors datasets/cohere_en/groundtruth.10000.ibin \
    --data-type b1 --metric hamming \
    --output results/cohere_en_faiss

Unum WikiVerse

Multi-model embedding dataset built on HuggingFaceFW/finewiki — 61.5M articles across 325 languages, embedded by five models (Qwen3-Embedding-0.6B 1024d, GTE-ModernColBERT-v1 128d/token, Snowflake arctic-embed-l-v2.0 1024d, nomic-embed-text-v1.5 768d, e5-mistral-7b-instruct 4096d). Each .f16bin shard is row-aligned with the source FineWiki parquet — directory layout is <model>/<lang>wiki/<group>_<shard>.{body,title}.f16bin, mirroring FineWiki 1:1. The full corpus is 95-505 GB depending on the model; ColBERT-style embeddings reach 6.2 TB at one vector per token.

Two prerequisites are still pending on the RetriEval side: Dataset::load doesn't yet recognize the .f16bin extension (a small follow-up — same header layout as .fbin, swap f32 for f16 in ScalarFormat), and the ColBERT model needs the deferred multi-vector plan. The dense models will work as soon as f16 lands; the example below assumes that, plus the existing GLOB support for sharded inputs.

English subset, Qwen3-Embedding-0.6B — f16, 1024d, ~13 GB
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/unum-cloud/WikiVerse datasets/wikiverse/
cd datasets/wikiverse
hf download unum-cloud/WikiVerse \
    --repo-type dataset \
    --include "qwen3-embedding-0.6b/enwiki/*.body.f16bin"
cd ../..
retri-eval-usearch \
    --vectors 'datasets/wikiverse/qwen3-embedding-0.6b/enwiki/*.body.f16bin' \
    --queries datasets/wikiverse/qwen3-embedding-0.6b/enwiki/000_00000.body.f16bin \
    --data-type f16 --metric cos \
    --output results/wikiverse_en_qwen3

The --vectors glob picks up every English shard in natural-sort order; queries reuse one shard until the official query/GT split lands upstream.
