RAG with Claude in 2026 looks nothing like RAG in 2023. Back then, every byte of context cost real money, every prompt had to be hand-trimmed, and "retrieval-augmented generation" was mostly a way to squeeze a 4K window. By 2026, Claude Sonnet 4.5 and Opus 4.7 ship with 1M-token context, prompt caching that drops repeat-context costs by up to 90%, and native citations that tie every claim back to a source span. The constraints have moved.
The hard part is no longer "how do I fit my docs in the prompt." It is "how do I retrieve the right 20K tokens out of 50M, re-rank them, and stream a grounded answer with citations the user can click." That requires a real pipeline: chunking, embeddings, a vector store, hybrid retrieval, a re-ranker, a generation step with cached system prompts, and an eval loop that tells you when retrieval quality is regressing.
This tutorial walks through the full stack end-to-end. You will get architecture, library choices, working TypeScript and Python snippets for pgvector, retrieval and re-ranking patterns, a Claude generation call with prompt caching and citations enabled, an eval setup, and the cost knobs that matter. By the end you will have a production-ready RAG system that costs a fraction of a 2024 build and answers with citations users actually trust.
Architecture overview
Every production RAG system in 2026 follows the same six-stage pipeline. The names are old; the implementations are not.
[Source docs] → [Chunk] → [Embed] → [Store] → [Retrieve] → [Re-rank] → [Generate]
↑ ↓
└───── [Eval loop] ────────┘
- Chunk: split source documents into semantic units (300–800 tokens). Avoid fixed-size splits.
- Embed: convert chunks to vectors using Voyage, OpenAI, or Cohere embedding models.
- Store: persist vectors + metadata in pgvector, Pinecone, Weaviate, Qdrant, or Turbopuffer.
- Retrieve: top-k nearest neighbor search, often combined with BM25 keyword search (hybrid).
- Re-rank: rescore top 50–100 candidates with a cross-encoder or LLM rerank to surface top 5–10.
- Generate: pass re-ranked chunks to Claude with a cached system prompt; return answer + citations.
Two cross-cutting concerns sit on top: an eval loop (Ragas, Langfuse, or Braintrust) that measures retrieval and answer quality on a frozen test set, and an observability layer that tracks token cost, cache hit rate, and latency per request. Skip either of these and you will be flying blind within a week of launch.
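As a rough illustration of the observability side, the sketch below turns per-request logs into a cache hit rate and a latency percentile. The `RequestLog` shape and both function names are hypothetical. The three token fields are assumed to be copied from the `usage` object of a Messages API response (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`).

```typescript
// Hypothetical per-request log record; token fields are copied from the API `usage` object.
type RequestLog = {
  inputTokens: number;         // uncached input tokens
  cacheReadTokens: number;     // input tokens served from the prompt cache
  cacheCreationTokens: number; // input tokens written to the cache
  latencyMs: number;
};

// Share of all input tokens that were served from the cache (0 when there is no traffic).
function cacheHitRate(logs: RequestLog[]): number {
  let read = 0;
  let total = 0;
  for (const l of logs) {
    read += l.cacheReadTokens;
    total += l.inputTokens + l.cacheReadTokens + l.cacheCreationTokens;
  }
  return total === 0 ? 0 : read / total;
}

// Nearest-rank percentile, e.g. percentile(latencies, 95) for p95.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const sampleLogs: RequestLog[] = [
  // First request writes the cache, second one reads it.
  { inputTokens: 500, cacheReadTokens: 0, cacheCreationTokens: 1500, latencyMs: 900 },
  { inputTokens: 500, cacheReadTokens: 1500, cacheCreationTokens: 0, latencyMs: 400 },
];
const hitRate = cacheHitRate(sampleLogs);
const p95 = percentile(sampleLogs.map((l) => l.latencyMs), 95);
```

A hit rate that drops after a deploy usually means the cached prefix changed, so it is worth charting next to cost per request.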
In 2026, the biggest architectural shift is that Claude's 1M-token window means you no longer need to retrieve aggressively for small corpora. If your knowledge base is under ~500K tokens, you can often stuff the whole thing into a cached system prompt and skip retrieval entirely. For anything larger, a real RAG pipeline still wins on cost and latency.
Step 1: Choose your stack
The stack matters more than the model. Here is the 2026 decision matrix.
Vector store
| Store | Best for | Pricing model |
|---|---|---|
| pgvector (Postgres + extension) | Teams already on Postgres, <50M vectors, hybrid SQL filters | Self-host or Supabase / Neon |
| Pinecone | Managed, low ops, serverless, multi-tenant SaaS | $0.096/M reads, $4/M writes (serverless) |
| Weaviate | Hybrid search out of the box, GraphQL API | Self-host or Weaviate Cloud |
| Qdrant | Rust performance, payload filtering, on-prem friendly | Self-host or Qdrant Cloud |
| Turbopuffer | Cold-tier object-store backed, cheap for >100M vectors | Pay-per-query, very cheap at scale |
Default recommendation: pgvector on Supabase or Neon. You already have Postgres for app data. One database means one backup story, one access-control layer, and SQL joins between vectors and your business tables. Move to Pinecone or Turbopuffer only when you hit real scale pain.
Embedding model
| Model | Dim | Strength | Cost (per 1M tokens) |
|---|---|---|---|
| voyage-3-large | 1024 | Best retrieval quality 2026, Anthropic-recommended | $0.18 |
| voyage-3-lite | 512 | 80% of large quality, 9x cheaper | $0.02 |
| text-embedding-3-large (OpenAI) | 3072 | Strong baseline, broad ecosystem | $0.13 |
| embed-english-v3.0 (Cohere) | 1024 | Solid quality, good multilingual variant | $0.10 |
Default: voyage-3-large for English-heavy production. Switch to voyage-multilingual-2 if you serve French, Arabic, or Spanish users. Embedding quality sets the ceiling on retrieval quality, so do not cheap out here.
Re-ranker
Use Cohere's rerank-v3.5 or Voyage's rerank-2. Both score 100 candidates in under 200ms and lift retrieval precision by 15–30 points on most benchmarks. If you cannot use a third-party reranker, fall back to an LLM rerank call to Claude Haiku. It works, but it costs more per query.
Orchestration
Skip LangChain unless you have a strong reason. In 2026, most teams ship RAG with a thin TypeScript or Python wrapper plus the Anthropic SDK. If you need agentic retrieval (multi-hop, query rewriting, tool use), look at LangGraph or build directly on the Claude Agent SDK. We cover orchestration choices in detail in our best RAG frameworks roundup.
Step 2: Ingest + chunk
Bad chunking is the #1 cause of bad retrieval. Fixed 512-token splits across paragraph boundaries will destroy your recall. Use semantic chunking that respects document structure.
// chunk.ts
import { encode, decode } from "gpt-tokenizer";

export type Chunk = {
  id: string;
  docId: string;
  text: string;
  tokens: number;
  metadata: { section?: string; sourceUrl: string; position: number };
};

const MAX_TOKENS = 600;
const OVERLAP_TOKENS = 80;

export function semanticChunk(
  docId: string,
  markdown: string,
  sourceUrl: string,
): Chunk[] {
  // Split on H2/H3 headings first to preserve semantic boundaries
  const sections = markdown.split(/\n(?=#{2,3}\s)/);
  const chunks: Chunk[] = [];
  let position = 0;

  // Append one chunk and advance the running position counter
  const pushChunk = (text: string, section?: string) => {
    const trimmed = text.trim();
    if (!trimmed) return;
    chunks.push({
      id: `${docId}:${position}`,
      docId,
      text: trimmed,
      // gpt-tokenizer counts OpenAI tokens; treat the number as an estimate for other models
      tokens: encode(trimmed).length,
      metadata: { section, sourceUrl, position },
    });
    position++;
  };

  for (const section of sections) {
    const headingMatch = section.match(/^#{2,3}\s+(.+)/);
    const sectionTitle = headingMatch?.[1]?.trim();

    // If section fits, keep it whole
    if (encode(section).length <= MAX_TOKENS) {
      pushChunk(section, sectionTitle);
      continue;
    }

    // Otherwise split on paragraphs with overlap.
    // A single paragraph longer than MAX_TOKENS still becomes one oversized chunk.
    const paragraphs = section.split(/\n\n+/);
    let buffer = "";
    let bufferTokens = 0;
    for (const p of paragraphs) {
      const pTokens = encode(p).length;
      if (bufferTokens + pTokens > MAX_TOKENS && buffer) {
        pushChunk(buffer, sectionTitle);
        // Carry the last OVERLAP_TOKENS tokens forward so boundary context survives
        const overlap = decode(encode(buffer).slice(-OVERLAP_TOKENS));
        buffer = `${overlap}\n\n${p}`;
        bufferTokens = encode(buffer).length;
      } else {
        buffer = buffer ? `${buffer}\n\n${p}` : p;
        bufferTokens += pTokens; // approximate: ignores the separator tokens
      }
    }
    pushChunk(buffer, sectionTitle);
  }
  return chunks;
}
Key principles:
- Respect structure: never split mid-heading, mid-table, mid-code-block.
- 600 tokens is the sweet spot for `voyage-3-large`. Bigger chunks dilute embeddings; smaller chunks fragment context.
- Always overlap by 10–15%. The overlap rescues context that falls on a chunk boundary.
- Attach metadata. `sourceUrl`, `section`, and `position` are mandatory. You will need them at re-rank and citation time.
For PDFs use unpdf or pdfplumber; for HTML use @mozilla/readability to strip chrome before chunking; for code, chunk by AST node, not lines.
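The "never split mid-code-block" rule is easy to break when paragraphs are split on blank lines, because fenced code often contains blank lines of its own. The helper below is a minimal sketch of one way to handle that; `splitBlocks` is a hypothetical name, and it only understands backtick and tilde fences, not tables or lists.

```typescript
// Split markdown into paragraph-level blocks, keeping fenced code blocks whole.
function splitBlocks(markdown: string): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;

  // Emit the lines collected so far as one block (if any).
  const flush = () => {
    const text = current.join("\n").trim();
    if (text) blocks.push(text);
    current = [];
  };

  for (const line of markdown.split("\n")) {
    // A fence line starts with three backticks or three tildes.
    const isFenceLine = /^\s*(`{3}|~{3})/.test(line);
    if (isFenceLine) {
      if (!inFence) flush(); // an opening fence starts a new block
      current.push(line);
      inFence = !inFence;
      if (!inFence) flush(); // a closing fence ends the block
      continue;
    }
    if (!inFence && line.trim() === "") {
      flush(); // a blank line outside a fence ends a paragraph
      continue;
    }
    current.push(line); // blank lines inside a fence are kept
  }
  flush();
  return blocks;
}

const FENCE = "`".repeat(3);
const sampleDoc = [
  "Intro paragraph.",
  "",
  FENCE + "ts",
  "const a = 1;",
  "",
  "const b = 2;",
  FENCE,
  "",
  "Outro.",
].join("\n");
const sampleBlocks = splitBlocks(sampleDoc);
```

The resulting blocks can stand in for the paragraphs a chunker accumulates, so a code sample is either carried whole into a chunk or starts a new one.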
Step 3: Embed + store
Postgres + pgvector is the default. Below is the schema and ingestion code.
-- migration: enable extension + create table
create extension if not exists vector;

create table doc_chunks (
  id          text primary key,
  doc_id      text not null,
  text        text not null,
  embedding   vector(1024) not null,  -- voyage-3-large
  metadata    jsonb not null default '{}',
  tokens      int not null,
  created_at  timestamptz default now()
);

-- HNSW index for fast cosine similarity
create index doc_chunks_embedding_idx
  on doc_chunks
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

-- For hybrid search: tsvector for BM25-style keyword
alter table doc_chunks add column text_search tsvector
  generated always as (to_tsvector('english', text)) stored;

create index doc_chunks_text_search_idx on doc_chunks using gin (text_search);
// embed.ts
import { VoyageAIClient } from "voyageai";
import postgres from "postgres";
import type { Chunk } from "./chunk";

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! });
const sql = postgres(process.env.DATABASE_URL!);

export async function embedAndStore(chunks: Chunk[]) {
  // Batch embed (Voyage accepts up to 128 inputs per call)
  for (let i = 0; i < chunks.length; i += 128) {
    const batch = chunks.slice(i, i + 128);
    const res = await voyage.embed({
      input: batch.map((c) => c.text),
      model: "voyage-3-large",
      inputType: "document", // CRITICAL: use "query" at search time
    });

    const rows = batch.map((c, idx) => {
      // Fail loudly rather than store a row without a vector
      const embedding = res.data?.[idx]?.embedding;
      if (!embedding) throw new Error(`Missing embedding for chunk ${c.id}`);
      return {
        id: c.id,
        doc_id: c.docId,
        text: c.text,
        embedding: `[${embedding.join(",")}]`, // pgvector text format
        metadata: c.metadata,
        tokens: c.tokens,
      };
    });

    // Upsert so re-ingesting a document overwrites its old chunks
    await sql`
      insert into doc_chunks ${sql(rows)}
      on conflict (id) do update set
        text = excluded.text,
        embedding = excluded.embedding,
        metadata = excluded.metadata,
        tokens = excluded.tokens
    `;
  }
}
Two non-obvious things matter. First, use inputType: "document" at ingest and "query" at search time. Voyage embeds queries and documents asymmetrically, and skipping this costs you 5–10 recall points. Second, prefer HNSW over IVFFlat. It gives a better speed-recall trade-off for most corpora under 100M vectors, at the price of slower index builds and more memory.
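For intuition about what the `vector_cosine_ops` index in the schema is ordering by: pgvector's `<=>` operator returns cosine distance, which is one minus cosine similarity. The sketch below reproduces that arithmetic in plain TypeScript. The function names are illustrative and nothing here calls the database.

```typescript
// Cosine similarity: 1 for same direction, 0 for orthogonal, -1 for opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; this returns NaN in that case.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance as used for nearest-neighbor ordering: smaller means closer.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

const sameDirection = cosineDistance([1, 0], [2, 0]);
const orthogonal = cosineDistance([1, 0], [0, 3]);
const opposite = cosineDistance([1, 0], [-1, 0]);
```

Because only direction matters, vector length has no effect on the ranking, and the distance ranges from 0 (identical direction) to 2 (opposite).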
Step 4: Retrieve
Pure vector search misses exact keyword matches (product codes, error strings, proper nouns). Pure BM25 misses semantic paraphrase. Hybrid search wins almost every benchmark in 2026.
// retrieve.ts
import { VoyageAIClient } from "voyageai";
import postgres from "postgres";

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! });
const sql = postgres(process.env.DATABASE_URL!);

export async function hybridSearch(query: string, k = 50) {
  // 1. Embed the query
  const embRes = await voyage.embed({
    input: [query],
    model: "voyage-3-large",
    inputType: "query",
  });
  const embedding = embRes.data?.[0]?.embedding;
  if (!embedding) throw new Error("Voyage returned no embedding for the query");
  const queryVec = `[${embedding.join(",")}]`;

  // 2. Vector + BM25-style keyword search in a single SQL statement with RRF fusion
  const results = await sql`
    with vector_hits as (
      select id, text, doc_id, metadata,
             1 - (embedding <=> ${queryVec}::vector) as score,
             row_number() over (order by embedding <=> ${queryVec}::vector) as rank
      from doc_chunks
      order by embedding <=> ${queryVec}::vector
      limit ${k}
    ),
    keyword_hits as (
      select id, text, doc_id, metadata,
             ts_rank(text_search, plainto_tsquery('english', ${query})) as score,
             row_number() over (
               order by ts_rank(text_search, plainto_tsquery('english', ${query})) desc
             ) as rank
      from doc_chunks
      where text_search @@ plainto_tsquery('english', ${query})
      -- Without this ORDER BY the LIMIT would keep arbitrary matches, not the best ones
      order by score desc
      limit ${k}
    )
    -- Reciprocal Rank Fusion
    select id, text, doc_id, metadata,
           sum(1.0 / (60 + rank)) as rrf_score
    from (
      select * from vector_hits
      union all
      select * from keyword_hits
    ) combined
    group by id, text, doc_id, metadata
    order by rrf_score desc
    limit ${k}
  `;
  return results;
}
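Reciprocal Rank Fusion (RRF) with a smoothing constant of 60 is the standard 2026 fusion algorithm. The constant is the 60 in 1 / (60 + rank) and is unrelated to the candidate count k passed to hybridSearch. RRF outperforms weighted score blending because it uses only ranks, so it does not require normalizing across two very different score distributions.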
Retrieve 50 candidates here, not 5. The re-ranker in the next step needs candidate diversity to do its job.
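To make the fusion step concrete, here is a small in-memory sketch of Reciprocal Rank Fusion over two ranked ID lists. `rrfFuse` and the sample IDs are made up for illustration; the SQL query does the same arithmetic inside Postgres.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists.
// Each list contributes 1 / (c + rank) to an ID's score, with 1-based ranks.
function rrfFuse(rankings: string[][], c = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // matches row_number(), which starts at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const vectorRanking = ["a", "b", "c"];  // best-first from vector search
const keywordRanking = ["c", "a", "d"]; // best-first from keyword search
const fused = rrfFuse([vectorRanking, keywordRanking]);
const fusedOrder = fused.map((r) => r.id).join(","); // "a,c,b,d"
```

Chunk `a` wins because it ranks near the top of both lists (1/61 + 1/62), while `b` and `d` appear in only one list each and fall to the bottom.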
Step 5: Re-rank
A cross-encoder reranker reads (query, chunk) together and outputs a scalar relevance score. This is dramatically more accurate than the bi-encoder embeddings used at retrieval time, because the model attends across query and document jointly.
// rerank.ts
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

export async function rerank(query: string, candidates: any[], topN = 8) {
  // Score every (query, chunk) pair with the cross-encoder and keep the best topN
  const res = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: candidates.map((c) => c.text),
    topN,
  });
  // r.index points back into the candidates array passed in above
  return res.results.map((r) => ({
    ...candidates[r.index],
    rerank_score: r.relevanceScore,
  }));
}
If you cannot send data to Cohere, use Voyage's reranker (rerank-2) or do an LLM rerank with Claude Haiku:
const prompt = `Rate each passage 0-10 for relevance to: "${query}"\n\n` +
candidates.map((c, i) => `[${i}] ${c.text}`).join("\n\n");
// Parse JSON scores back, sort descending, take top 8
Re-ranking typically lifts answer accuracy by 10–25 points. Skip it only if latency budget is sub-300ms per query.
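The LLM rerank fallback leaves the "parse JSON scores back, sort descending, take top 8" step as a comment. One way it might look is sketched below. It assumes the prompt asked the model to reply with a JSON array such as `[{"index": 0, "score": 7}]`, which is an assumption and not something the snippet above specifies; `applyLlmScores` and the types are hypothetical.

```typescript
type Candidate = { id: string; text: string };
type Scored = Candidate & { rerank_score: number };

// Turn a raw model reply into the topN candidates, highest score first.
// Returns [] when no usable JSON is found so the caller can fall back to the original order.
function applyLlmScores(candidates: Candidate[], raw: string, topN = 8): Scored[] {
  // Models sometimes wrap JSON in prose, so grab the outermost [...] span.
  const match = raw.match(/\[[\s\S]*\]/);
  if (!match) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];

  const scored: Scored[] = [];
  for (const item of parsed) {
    const entry = item as { index?: unknown; score?: unknown } | null;
    const index = entry?.index;
    const score = entry?.score;
    // Skip entries that point outside the candidate list or carry a bad score.
    if (typeof index !== "number" || !Number.isInteger(index)) continue;
    if (index < 0 || index >= candidates.length) continue;
    if (typeof score !== "number" || Number.isNaN(score)) continue;
    scored.push({ ...candidates[index], rerank_score: score });
  }
  return scored.sort((a, b) => b.rerank_score - a.rerank_score).slice(0, topN);
}

const sampleCandidates: Candidate[] = [
  { id: "a", text: "Refund policy" },
  { id: "b", text: "Shipping times" },
  { id: "c", text: "Annual plan refunds" },
];
const sampleReply =
  'Scores: [{"index":0,"score":3},{"index":2,"score":9},{"index":7,"score":10}]';
const reranked = applyLlmScores(sampleCandidates, sampleReply, 2);
```

Validating indices matters here because a model can hallucinate a passage number that was never in the prompt, as the third entry in the sample reply does.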
Step 6: Generate with Claude
This is where 2026 RAG diverges hard from 2023 RAG. Three features change everything: prompt caching, citations, and 1M-token context.
// generate.ts
import Anthropic from "@anthropic-ai/sdk";
import { hybridSearch } from "./retrieve";
import { rerank } from "./rerank";

const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

const SYSTEM_PROMPT = `You are a precise retrieval-grounded assistant. Answer the user's question using ONLY the provided documents. If the documents do not contain the answer, say so. Always cite. Keep answers under 6 sentences unless asked for more.`;

export async function answerWithRag(query: string) {
  const candidates = await hybridSearch(query, 50);
  const top = await rerank(query, candidates, 8);

  // Build "document" content blocks; Claude returns citations against these
  const documents = top.map((c, i) => ({
    type: "document" as const,
    source: {
      type: "text" as const,
      media_type: "text/plain" as const,
      data: c.text,
    },
    title: c.metadata.section ?? `Source ${i + 1}`,
    context: c.metadata.sourceUrl,
    citations: { enabled: true },
  }));

  const response = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        // Cache the system prompt. Prompts shorter than the model's minimum
        // cacheable length (assumed here to be 1,024 tokens on Sonnet-class
        // models; check the current docs) are processed without caching, so this
        // pays off once the prompt carries real instructions or examples.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: [
          ...documents,
          { type: "text", text: `Question: ${query}` },
        ],
      },
    ],
  });

  // Each text block in response.content can carry a `citations` array
  // that ties the block's text to spans in the source documents
  return response;
}
What is happening here:
- `cache_control: { type: "ephemeral" }` on the system prompt makes repeat queries 90% cheaper on cached tokens (5-minute TTL, or 1 hour with the extended cache TTL).
- `citations: { enabled: true }` tells Claude to attach structured citations to the text blocks it returns. Each citation includes the source document index and the exact character range of the cited text. You can render these as clickable footnotes in your UI without parsing brackets out of prose.
- `document` content blocks are the 2026-native way to feed retrieved context. They work better than embedding chunks in a text block because the citations engine knows the document boundaries.
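As a sketch of the footnote rendering, the helper below walks a response's content blocks and appends a numbered marker after each cited span. The local types cover only the fields used here (`cited_text`, `document_index`, `document_title`), and `renderWithFootnotes` plus the `Source` shape are hypothetical; treat it as a starting point, not the SDK's own types.

```typescript
// Minimal local types for the parts of a Messages API response used below.
type Citation = {
  type: string; // "char_location" for plain-text documents
  cited_text: string;
  document_index: number;
  document_title?: string | null;
};
type ContentBlock = { type: string; text?: string; citations?: Citation[] | null };
type Source = { sourceUrl: string }; // same order as the document blocks in the request

function renderWithFootnotes(
  content: ContentBlock[],
  sources: Source[],
): { text: string; footnotes: string[] } {
  const footnotes: string[] = [];
  const numberByDoc = new Map<number, number>(); // document_index -> footnote number
  let text = "";

  for (const block of content) {
    if (block.type !== "text" || !block.text) continue;
    text += block.text;

    const seenInBlock = new Set<number>(); // avoid "[1][1]" after one span
    for (const c of block.citations ?? []) {
      let n = numberByDoc.get(c.document_index);
      if (n === undefined) {
        n = footnotes.length + 1;
        numberByDoc.set(c.document_index, n);
        const url = sources[c.document_index]?.sourceUrl ?? "unknown source";
        footnotes.push(`[${n}] ${c.document_title ?? "Untitled"} (${url})`);
      }
      if (!seenInBlock.has(n)) {
        text += `[${n}]`;
        seenInBlock.add(n);
      }
    }
  }
  return { text, footnotes };
}

const sampleContent: ContentBlock[] = [
  { type: "text", text: "Refunds are available " },
  {
    type: "text",
    text: "within 30 days",
    citations: [
      { type: "char_location", cited_text: "30 days from purchase", document_index: 1, document_title: "Refund policy" },
    ],
  },
  { type: "text", text: "." },
];
const sampleSources: Source[] = [
  { sourceUrl: "https://example.com/pricing" },
  { sourceUrl: "https://example.com/refunds" },
];
const rendered = renderWithFootnotes(sampleContent, sampleSources);
```

Keeping `sources` in the same order as the document blocks sent in the request is what lets `document_index` map back to a URL.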
For corpora under 500K tokens, you can skip retrieval entirely and put the whole knowledge base in the cached system prompt. The first request pays full input cost plus a cache-write premium; every subsequent request within the cache TTL pays 10% for the cached tokens. For very small, very high-traffic knowledge bases this beats a real RAG pipeline on both cost and latency.
Step 7: Eval
You cannot improve what you do not measure. Build the eval harness on day one.
# eval.py: minimal Ragas setup
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# rag_pipeline(question) -> str and retrieved_chunks_for_question (a list of
# chunk strings) are placeholders for your own pipeline; they are not defined here.

# 1. Freeze a 100-question test set with ground-truth answers + relevant doc IDs
test_set = Dataset.from_list([
    {
        "question": "What is the refund window for annual plans?",
        "ground_truth": "30 days from purchase date.",
        "answer": rag_pipeline("What is the refund window for annual plans?"),
        "contexts": retrieved_chunks_for_question,  # list[str]
    },
    # ... 99 more
])

# 2. Run all four metrics
results = evaluate(
    test_set,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)
Four numbers to watch every time you change the pipeline:
- Context precision: of the chunks you retrieved, how many were actually relevant. Low score → improve retrieval or re-ranking.
- Context recall: of the truly relevant chunks, how many did you retrieve. Low score → improve chunking, embeddings, or top-k.
- Faithfulness: does the generated answer match the retrieved context. Low score → tighten the system prompt or switch model.
- Answer relevancy: does the answer actually address the question. Low score → query rewriting or prompt issue.
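Ragas scores these metrics with an LLM judge. When the frozen test set also records which chunk IDs are relevant, a cheaper set-based approximation of the two retrieval metrics can run on every commit. The sketch below shows that simplified version; it is not Ragas's own definition, and the function names are illustrative.

```typescript
// Fraction of retrieved chunks that are labeled relevant.
function contextPrecision(retrieved: string[], relevant: Set<string>): number {
  if (retrieved.length === 0) return 0;
  const hits = retrieved.filter((id) => relevant.has(id)).length;
  return hits / retrieved.length;
}

// Fraction of labeled-relevant chunks that were retrieved.
function contextRecall(retrieved: string[], relevant: Set<string>): number {
  if (relevant.size === 0) return 1; // nothing to find, so nothing was missed
  const got = new Set(retrieved);
  let hits = 0;
  relevant.forEach((id) => {
    if (got.has(id)) hits++;
  });
  return hits / relevant.size;
}

const retrievedIds = ["c1", "c2", "c3", "c4"];
const relevantIds = new Set(["c2", "c4", "c9"]);
const precision = contextPrecision(retrievedIds, relevantIds); // 2 of 4 retrieved are relevant
const recall = contextRecall(retrievedIds, relevantIds);       // 2 of 3 relevant were retrieved
```

Low precision with high recall points at the re-ranker; the reverse points at chunking, embeddings, or top-k, which matches the guidance in the list above.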
For production observability, pair Ragas with Langfuse or Braintrust to track these metrics on a sample of real traffic, not just a frozen test set. Drift detection on context precision is how you catch a stale index before users complain.
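A drift check can be as small as comparing the latest scores against a stored baseline. The sketch below assumes scores on a 0–1 scale and uses a 5-point default threshold, in line with the alerting rule in the FAQ; `findRegressions` and the metric names are illustrative.

```typescript
type Metrics = Record<string, number>; // metric name -> score on a 0-1 scale

// List every baseline metric that is missing or has dropped by more than maxDropPoints.
function findRegressions(
  baseline: Metrics,
  current: Metrics,
  maxDropPoints = 5,
): string[] {
  const regressions: string[] = [];
  for (const name of Object.keys(baseline)) {
    const now = current[name];
    if (now === undefined) {
      regressions.push(`${name}: missing from current run`);
      continue;
    }
    const dropPoints = (baseline[name] - now) * 100; // convert to percentage points
    if (dropPoints > maxDropPoints) {
      regressions.push(`${name}: dropped ${dropPoints.toFixed(1)} points`);
    }
  }
  return regressions;
}

const baselineScores: Metrics = { context_precision: 0.82, faithfulness: 0.9 };
const latestScores: Metrics = { context_precision: 0.74, faithfulness: 0.88 };
const alerts = findRegressions(baselineScores, latestScores);
```

Here only context precision is flagged: it fell 8 points, while faithfulness moved 2 points and stays under the threshold.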
Common pitfalls
These are the issues we have seen on every RAG audit. Fix them once and you save weeks.
- Fixed-size chunking across structure. Splitting mid-table, mid-code-block, or mid-list destroys retrieval. Always chunk on semantic boundaries.
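- Same embedding for query and document. Voyage and Cohere expose separate query and document input types. Use them. OpenAI's text-embedding-3 models take no such parameter, so there is nothing to set there.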
- No re-ranking. Bi-encoder retrieval alone caps quality around 65% precision@5 for hard questions. Re-rank to 80%+.
- No hybrid search. Pure vector search will miss product codes and proper nouns. Add BM25.
- Stuffing the prompt with 50 chunks. More context is not better — irrelevant chunks confuse the model. Re-rank to 5–10.
- No citations. If users cannot click through to source, they will not trust the system. Use Claude's native citations.
- No eval harness. You will regress retrieval quality the first time you change the chunker. A frozen test set tells you immediately.
- Caching everything in DB row format. Embeddings are not human-readable, so store them as `vector(N)`, not JSON arrays of floats.
Cost optimization
A naive RAG pipeline on Claude Sonnet 4.5 costs around $0.012 per query. With the optimizations below, the same pipeline lands at $0.0015 — an 8x reduction.
- Prompt cache your system prompt. Cached tokens cost 10% of base. If your system prompt is 2K tokens, you save $0.005 per query immediately.
- Cache the document block too when serving common queries. The same top-8 retrieved chunks repeat across similar questions; mark them ephemeral.
- Use Haiku for re-ranking and query rewriting. Reserve Sonnet/Opus for the final generation call. Haiku 4.5 handles rerank prompts at roughly a third of Sonnet's per-token price.
- Cap `max_tokens` aggressively. Set 512 for support answers, 1024 for explanations, 2048+ only when you actually need long output. Most teams overpay here.
- Sample, do not log every request. Send 5% of production traffic to your eval pipeline, not 100%. Storage and re-eval costs add up fast.
- Re-embed only changed documents. Use content-hash deduplication on chunk text — a typo fix in a 500-page doc should not trigger 5,000 re-embeddings.
- Pick `voyage-3-lite` for first-stage retrieval and rescore the shortlist with `voyage-3-large` embeddings. Or use Cohere rerank and skip the second embed call entirely.
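For the "re-embed only changed documents" point, a minimal sketch of content-hash deduplication follows. It assumes you keep a SHA-256 hash per stored chunk (for example in an extra column, which the schema in this tutorial does not include); `contentHash` and `chunksToReembed` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type ChunkInput = { id: string; text: string };

// Hash the chunk text after light normalization, so line-ending or
// trailing-whitespace changes alone do not trigger a re-embed.
function contentHash(text: string): string {
  const normalized = text.replace(/\r\n/g, "\n").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Keep only chunks whose text hash is not already stored.
// Matching on the hash rather than the chunk ID means a chunk that merely
// shifted position in the document is not embedded again.
function chunksToReembed(chunks: ChunkInput[], storedHashes: Set<string>): ChunkInput[] {
  return chunks.filter((c) => !storedHashes.has(contentHash(c.text)));
}

const storedHashes = new Set([contentHash("alpha"), contentHash("beta")]);
const freshChunks: ChunkInput[] = [
  { id: "doc:0", text: "alpha" },
  { id: "doc:1", text: "beta, with the typo fixed" },
  { id: "doc:2", text: "beta\r\n" },
];
const changed = chunksToReembed(freshChunks, storedHashes);
```

Only `doc:1` needs a new embedding here. The unchanged chunks still need their `id` and `position` rows updated if they moved, but that is a cheap SQL write rather than an embedding call.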
For most production RAG systems, prompt caching alone is the difference between "this scales" and "this is bankrupting us." Set it up before you ship.
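The savings arithmetic can be checked with a few lines of code. The sketch below assumes an input price of $3 per million tokens for Sonnet, the 10% cache-read rate quoted above, and a 1.25x cache-write premium on the first request; all three numbers are assumptions to verify against current pricing, and `cachedPrefixCost` is a hypothetical helper.

```typescript
type Pricing = {
  inputPerMTok: number;         // USD per million input tokens (assumed)
  cacheReadMultiplier: number;  // cached reads as a fraction of base price
  cacheWriteMultiplier: number; // cache writes as a multiple of base price
};

// Cost of sending the same prefix on every request, with and without caching.
// Assumes every request after the first lands inside the cache TTL.
function cachedPrefixCost(prefixTokens: number, requests: number, p: Pricing) {
  if (requests < 1) throw new Error("requests must be at least 1");
  const base = (prefixTokens / 1_000_000) * p.inputPerMTok; // one uncached pass
  const uncached = base * requests;
  const cached =
    base * p.cacheWriteMultiplier +               // first request writes the cache
    base * p.cacheReadMultiplier * (requests - 1); // the rest read from it
  return { uncached, cached, savedPerRequest: (uncached - cached) / requests };
}

const assumedPricing: Pricing = {
  inputPerMTok: 3,
  cacheReadMultiplier: 0.1,
  cacheWriteMultiplier: 1.25,
};
// A 2K-token system prompt over 1,000 requests.
const estimate = cachedPrefixCost(2000, 1000, assumedPricing);
```

Under these assumptions the uncached prefix costs $6.00 per 1,000 requests and the cached one about $0.61, which is where the roughly $0.005 saved per query comes from. With sparse traffic the cache expires between requests and the saving shrinks accordingly.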
Building a RAG system end-to-end is genuinely hard — not the prototype, but the part where it stays accurate as your corpus grows, your team ships features, and your traffic shifts. At AY Automate we ship production RAG systems on Claude every week, from internal knowledge bases to customer-facing copilots, with full eval harnesses, observability, and cost guardrails wired in from day one. If you want help designing the architecture, picking the stack, or auditing an existing pipeline, book a free consultation — we will walk through your use case and tell you exactly what to build (and what to skip).
FAQ
Do I need RAG if Claude has a 1M-token context window?
Sometimes no. If your knowledge base fits comfortably under ~500K tokens and your traffic is bursty, you can put the whole corpus in a cached system prompt and skip retrieval. For larger corpora, frequently changing content, or strict latency budgets, RAG still wins on cost and speed.
Which embedding model should I use in 2026?
voyage-3-large is the strongest general-purpose embedding model right now and is Anthropic's recommended pairing with Claude. Use voyage-multilingual-2 if you serve non-English content. OpenAI's text-embedding-3-large is a solid alternative if you are already in the OpenAI ecosystem.
Pinecone or pgvector?
Default to pgvector. You get vectors next to your business data, one auth model, SQL joins, and no extra vendor. Move to Pinecone or Turbopuffer when you cross ~50M vectors or need multi-region replication you do not want to operate yourself.
How do I get citations from Claude?
Pass retrieved chunks as document content blocks and set citations: { enabled: true } on each one. Claude returns the answer as text blocks with structured citations attached, each carrying the cited text and its exact character span in your source documents. Render them as clickable footnotes in your UI.
How much does a production RAG system cost to run?
With prompt caching, hybrid search, and aggressive max_tokens caps, a typical support-style RAG query lands at $0.001–$0.003 on Claude Sonnet 4.5. Without those optimizations, the same query costs $0.01–$0.03. The optimizations are not optional at scale.
What is the difference between RAG and an agent?
RAG is single-shot: retrieve → generate → return. An agent loops: it can rewrite the query, retrieve multiple times, call tools, and self-correct. For most knowledge-base questions RAG is enough. For multi-step research or transactional workflows you want an agent — see our breakdown in Claude API vs Claude Code.
Which framework should I use to orchestrate this?
For straightforward RAG, a thin TypeScript or Python wrapper over the Anthropic SDK is enough — frameworks add more weight than value. If you need agentic retrieval or multi-step workflows, LangGraph or the Claude Agent SDK are the strongest 2026 choices. Our best RAG frameworks post compares them in depth.
How do I evaluate retrieval quality?
Freeze a 100-question test set with ground-truth answers and relevant document IDs. Run Ragas (or Braintrust / Langfuse) on every pipeline change. Track context precision, context recall, faithfulness, and answer relevancy. Alert when any metric drops more than 5 points from baseline.
Walid founded AY Automate to help businesses ship AI workflows that actually move revenue. He leads strategy and oversees every client engagement end-to-end.