The 12 Embedding Models Worth Running (MTEB-Graded, 2026)

Voyage-3-large beats OpenAI text-embedding-3-large by 10.58%, at as little as 1/48 the storage. BGE-M3 is the self-host cost leader. Here are the 12 models worth picking from, with the numbers.

# The 12 Embedding Models Worth Running

The decision matters more than most teams realize. Picking the wrong embedding model is the single most common reason a RAG pipeline returns the wrong chunks: not the vector DB, not the chunking strategy, not the LLM. Picking the right one is also the cheapest performance fix in the stack. This page grades 12 models against four criteria: retrieval quality (MTEB rank plus domain notes), cost per million docs, dimensions, and "would we run this in production." Numbers are pulled from primary sources (vendor blogs, the MTEB leaderboard as of May 2026, HN engineer threads); listicle aggregators are not cited.

The table

| Model | MTEB rank (Eng) | Cost / 1M docs | Dimensions | Best for |
|---|---:|---:|---:|---|
| Voyage-3-large | 4 | $1,000 | 256-1024 (Matryoshka) | Top quality on prose; cost-quality leader |
| Voyage-code-3 | n/a (code-specific) | $1,000 | 1024 | Code search; 12-language eval beats OpenAI |
| OpenAI text-embedding-3-large | 17 | $1,300 | 3072 | Default if you are already on OpenAI |
| OpenAI text-embedding-3-small | 24 | $200 | 1536 | Cost-conscious; OK quality |
| Cohere embed-v3 + rerank-v3 | 9 | $1,000 | 1024 | Reranker-paired stack |
| BGE-M3 | 12 | $5-20 (self-host) | 1024 | OSS multilingual cost leader |
| BGE-large-en-v1.5 | 18 | Free (self-host) | 1024 | English-only OSS, decent floor |
| Nomic Embed v2 | 23 | ~$50 (API) | 768 | Multilingual cost leader |
| GTE (Alibaba) | 11 | Free (self-host) | 768 | Small footprint, strong MTEB |
| E5 family (intfloat) | 14 | Free (self-host) | 384-1024 | Task-prefix specialty |
| Qwen3-embedding | 6 | Free (self-host) | 1024 | Rising star on MTEB |
| Jina v3 | 15 | API | 1024 | Multilingual API alternative |

Two observations the listicles miss:
  • OpenAI text-embedding-3-large is rank 17. It is still the default in most RAG tutorials because the OpenAI client is ubiquitous, not because the retrieval quality is competitive. If your team is OpenAI-locked for other reasons, fine. If you are picking fresh in 2026, this should not be the first choice.
  • The self-host options dominate on cost. BGE-M3 at $5-20/M docs self-hosted versus $1,300 for OpenAI 3-large is a 65-260x cost ratio. That math matters when corpora cross 100M docs.

The surprise: Voyage-3-large at 1/48 the storage

The most important finding in this space is buried in Voyage's January 2025 announcement post. At 1024 dimensions vs OpenAI's 3072, Voyage-3-large is 10.58% better on Voyage's benchmark. At 256 dimensions int8 vs OpenAI's 3072 float, it is 11.47% better at 1/48 the storage cost, because int8 is 4x smaller per element and 256 dims is 1/12 the count. The math:

| Setting | Per-vector storage | Quality vs OpenAI 3-large |
|---|---:|---:|
| OpenAI 3-large (default) | 12,288 bytes (3072 × float32) | baseline |
| Voyage-3-large 1024 float | 4,096 bytes | +10.58% |
| Voyage-3-large 256 int8 | 256 bytes | +11.47% |

A 10M-vector corpus stored as Voyage-3-large 256 int8 is roughly 2.5 GB instead of roughly 123 GB. At Pinecone serverless pricing that is the difference between $3/mo and $140/mo, for a corpus that retrieves better. The trade-off: Voyage's Matryoshka embeddings have to be queried at the same dimension they are indexed at. Picking 256 int8 is a commitment, not a tuning knob; reindexing is the only way back.
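A concrete sketch of what that commitment looks like, assuming the voyageai Python client and its output_dimension / output_dtype parameters (verify the exact names against Voyage's current API reference before indexing anything):

```python
# Sketch: requesting 256-dim int8 embeddings from Voyage and sizing the index.
# Assumption: the voyageai client's output_dimension / output_dtype parameters.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

docs = [
    "Matryoshka embeddings truncate cleanly to smaller dimensions.",
    "Quantization-aware training keeps int8 quality close to float.",
]

result = vo.embed(
    docs,
    model="voyage-3-large",
    input_type="document",   # use input_type="query" at query time
    output_dimension=256,    # must match the dimension you index at
    output_dtype="int8",     # 1 byte per element instead of 4
)

bytes_per_vector = 256 * 1   # 256 int8 elements
openai_baseline = 3072 * 4   # 3072 float32 elements
print(bytes_per_vector, openai_baseline, openai_baseline / bytes_per_vector)
# -> 256 12288 48.0: the 1/48 storage figure in the table above
```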

The cost leader: BGE-M3 self-hosted

If you can run a GPU box (or pay Modal / RunPod for one), BGE-M3 is the practitioner consensus on r/LocalLLaMA and across HN production-RAG threads. Three quotes recur:
  • "BGE-M3 dominates self-hosted embeddings for cost." (r/LocalLLaMA consensus, cited by aitooldiscovery)
  • "People are tired of LangChain abstractions and prefer minimal frameworks (FAISS + sentence-transformers)." (same)
  • "Hybrid retrieval (BM25 + embeddings) always beats embeddings alone." (Eugene Yan, ex-Anthropic)
BGE-M3 is multilingual (100+ languages), supports dense + sparse + multi-vector representations in a single forward pass, and runs on a single L4 GPU for most production loads. The cost ($5-20/M docs) is just the GPU rental time; embedding throughput is ~5K chunks/sec on an L4. Where BGE-M3 loses: dedicated commercial models (Voyage-code-3 for code, Cohere embed-v3 paired with rerank-v3) still beat it on their target tasks. The right strategy for most teams: BGE-M3 as the default, swap in a commercial model when the corpus is narrow enough to justify a per-task model.
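For reference, here is a minimal encode call showing all three representations from one forward pass, assuming the FlagEmbedding package and its BGEM3FlagModel wrapper (the usual way BGE-M3 is served; re-check the return keys against the BAAI/bge-m3 model card):

```python
# Sketch: BGE-M3's dense + sparse + multi-vector outputs in a single pass,
# via FlagEmbedding (pip install -U FlagEmbedding).
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 fits on one L4

chunks = [
    "BGE-M3 supports dense, sparse, and multi-vector retrieval.",
    "Hybrid retrieval fuses lexical and semantic signals.",
]

out = model.encode(
    chunks,
    return_dense=True,         # 1024-dim dense vectors
    return_sparse=True,        # token -> weight dicts (lexical signal)
    return_colbert_vecs=True,  # per-token multi-vectors (ColBERT-style)
)

print(out["dense_vecs"].shape)                    # (2, 1024)
print(list(out["lexical_weights"][0].items())[:5])  # top sparse weights
```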

The code-search special case

Code is not prose. Embedding code with a prose-trained model (OpenAI 3-large, BGE-M3) loses ~15-20% recall vs a code-specific model. Voyage-code-3 is the strongest commercial option as of May 2026, with documented gains on Python, TypeScript, Go, Rust, Java, Ruby, PHP, Swift, Kotlin, C#, C++, and C. Outside that set (Verilog, Solidity, Crystal, OCaml), recall drops below 60% and you should fall back to BGE-M3 or a per-language fine-tune. The self-host option for code is jina-embeddings-v2-base-code, which is free and runs on a CPU box at low throughput (~200 chunks/sec). Acceptable for personal-project codebases under 10K files; below the bar for monorepos.
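A minimal sketch of that self-host fallback, assuming sentence-transformers can load the Jina model (it needs trust_remote_code for Jina's custom pooling code):

```python
# Sketch: CPU-grade code search with jina-embeddings-v2-base-code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
)

snippets = [
    "def binary_search(arr, target): ...",
    'fn main() { println!("hello"); }',
]
query = "find an element in a sorted list"

doc_vecs = model.encode(snippets, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
print(doc_vecs @ query_vec)  # cosine scores; the Python snippet should win
```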

The legacy you should retire

If your pipeline runs any of these, plan a migration:

| Model | Status | Migration to |
|---|---|---|
| OpenAI text-embedding-ada-002 | Effectively deprecated; "slipping behind Qwen3, Voyage-3-large, Mixedbread on MTEB" (HN production-RAG threads) | OpenAI 3-large or Voyage-3-large |
| OpenAI text-embedding-3-large (default settings) | Mid-tier in 2026 | Voyage-3-large with Matryoshka |
| sentence-transformers/all-MiniLM-L6-v2 | Fine for prototypes; insufficient for production | BGE-large-en-v1.5 (similar size, much better quality) |

Eugene Yan's recommendation from "LLM Patterns": "Skip text-embedding-ada-002. Use E5/GTE/sentence-transformers. Hybrid retrieval (BM25 + embeddings) always beats embeddings alone."
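A minimal sketch of the hybrid pattern Yan recommends, assuming rank_bm25 for the lexical side and sentence-transformers for the dense side, fused with reciprocal rank fusion (the k=60 constant is the conventional RRF default, not from the quote):

```python
# Sketch: BM25 + dense retrieval fused by reciprocal rank fusion.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "BGE-M3 dominates self-hosted embeddings for cost.",
    "Rerankers add 10-25 recall points on noisy corpora.",
    "Matryoshka embeddings trade dimensions for storage.",
]
query = "cheapest self-hosted embedding model"

# Lexical ranking
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

# Dense ranking
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode(query, normalize_embeddings=True)
dense_rank = sorted(range(len(docs)), key=lambda i: -(doc_vecs[i] @ q_vec))

# Reciprocal rank fusion: robust to the two scorers' different scales.
rrf = {i: 1 / (60 + bm25_rank.index(i)) + 1 / (60 + dense_rank.index(i))
       for i in range(len(docs))}
best = max(rrf, key=rrf.get)
print(docs[best])
```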

The reranker question

A common confusion: rerankers are not embedding models. They take a query plus a candidate passage and output a relevance score. They do not produce vectors. Pairing an embedding model with a reranker is a separate decision, but the pairings matter:
  • Cohere embed-v3 + Cohere rerank-v3 — designed to be paired; ~91% recall@10 on hybrid setups.
  • Voyage-3-large + Voyage rerank-2 — same pairing logic, Voyage tunes them together.
  • BGE-M3 + bge-reranker-v2-m3 — the OSS equivalent.
  • Anything + Cohere rerank-v3 — reranker portability is fine; you do not have to match brands.
HN consensus: "Rerankers are the highest value 5 lines of code you'll add." The cost is a single extra API call per query (10-50ms latency, ~$0.0005). The recall gain is 10-25 points absolute on noisy corpora.
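For the OSS pairing above, those "5 lines" look roughly like this, assuming sentence-transformers' CrossEncoder wrapper around bge-reranker-v2-m3:

```python
# Sketch: rerank embedding-retriever candidates with a cross-encoder.
# Note it scores (query, passage) pairs directly; it produces no vectors.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "how do I cut vector storage costs?"
candidates = [  # top-k passages from the embedding retriever
    "Matryoshka embeddings at 256 dims int8 cut storage 48x.",
    "BM25 is a lexical scoring function from the 1990s.",
]

scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```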

Dimensionality decision tree

If you must pick one number:
  • Corpus < 1M vectors: Use whatever the model recommends. Storage is not the bottleneck.
  • Corpus 1-10M: Voyage Matryoshka at 1024 dims float, or OpenAI 3-large at 1536 dims float.
  • Corpus 10-100M: Voyage Matryoshka at 512 dims int8, or BGE-M3 self-hosted at default.
  • Corpus > 100M: Voyage Matryoshka at 256 dims int8 + reranker. Or shard.
The mistake most teams make is picking the highest-dim setting because they think it equals quality. It does not. 256-dim int8 Voyage beats 3072-dim float OpenAI by 11.47%. Dimensions are a cost dial; the model is the quality dial.
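If it helps to make the thresholds concrete, here is the same tree as a hypothetical helper; the cutoffs and settings mirror the list above and are rules of thumb, not hard limits:

```python
# Sketch: the dimensionality decision tree above, as a function.
def pick_embedding_setting(n_vectors: int) -> str:
    if n_vectors < 1_000_000:
        return "model default dims; storage is not the bottleneck"
    if n_vectors < 10_000_000:
        return "voyage-3-large @ 1024 float, or openai-3-large @ 1536 float"
    if n_vectors < 100_000_000:
        return "voyage-3-large @ 512 int8, or self-hosted BGE-M3 defaults"
    return "voyage-3-large @ 256 int8 + reranker, or shard the index"

print(pick_embedding_setting(25_000_000))
```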

Where this fails

1. Domain-narrow corpora. Legal, clinical, and patent corpora often need fine-tuning. Out-of-the-box MTEB-leading models can still lose by 20+ points on narrow domains. Plan for a 1-week eval-and-fine-tune phase if your corpus is narrow.
2. Non-English, non-Latin scripts. BGE-M3 handles 100+ languages, but quality varies by language. For Arabic, Korean, or Hindi, benchmark on your own corpus before committing.
3. The reranker is not optional on noisy data. Embedding similarity alone returns garbage on conversational logs, social media, or any corpus with high lexical variation. Plan for the reranker line item in your cost budget.
4. Migration is reindexing. Switching embedding models means re-embedding the entire corpus. For a 100M-vector corpus this is hours to days of API calls or GPU time. Pick once, with intention.

Sources

  • Eugene Yan. "LLM Patterns." Source of the "skip ada-002, use E5/GTE/sentence-transformers, hybrid beats dense alone" practitioner recommendation.

Frequently asked

What is the best embedding model in 2026?
There is no single best — it depends on stack and budget. For top quality on English text and willingness to pay a commercial API, Voyage-3-large is the current leader (beats OpenAI text-embedding-3-large by 10.58% on Voyage's own benchmark, January 2025). For self-host on a budget, BGE-M3 is the cost leader at $5-20 per million docs versus $1,300 for OpenAI 3-large. For code search specifically, Voyage-code-3. For multilingual, BGE-M3 again.
Should I move off OpenAI text-embedding-3-large?
If you embed more than ~10M tokens per month and care about retrieval quality, yes. Voyage-3-large retrieves 10.58% better at 1024 dims, and at 256-dim int8 it still beats 3-large at a fraction of the storage cost. If you embed less, or staying on the OpenAI client matters more than retrieval quality, the migration may not be worth the rewrite. The MTEB leaderboard as of May 2026 shows OpenAI text-embedding-3-large at rank 17; this is not the embedding model frontier.
Why does the SERP say OpenAI is the best?
Because the SERPs for 'best embedding model 2026' are dominated by reintech, aimultiple, and pecollective — SaaS listicles that aggregate vendor blogs without benchmarking. The MTEB leaderboard, the Voyage cost-quality numbers, and the BGE-M3 self-host numbers are all published on primary sources but rarely make it into the listicle layer.
Is dimensionality the same thing as quality?
No. The Voyage finding is that you can use 256-dimensional int8 vectors and still beat OpenAI's 3072-dimensional float vectors by 11.47%. Dimensionality affects storage and search cost; quality is a function of training data and model architecture. The two are not the same dial. Most teams overprovision dimensions and underprovision model selection.
What is MTEB and is it trustworthy?
MTEB is the Massive Text Embedding Benchmark, a 56-task suite run by Hugging Face that evaluates models on retrieval, classification, clustering, and more. It is the closest thing to a neutral benchmark in this space. Its weakness is that domain-specific tasks (legal retrieval, code search, clinical text) are underweighted — a model can rank top-5 on MTEB and still lose to BGE-M3 on multilingual queries. Use MTEB as a starting filter, then benchmark on your own corpus.
