22 June 2026/15 min read

How to Build a RAG System with Claude in 2026

RAG with Claude in 2026 looks nothing like RAG in 2023. With 1M-token context, prompt caching that cuts costs by 90%, and native citations, the build pattern has shifted. This guide walks through the full pipeline: chunking, embeddings, pgvector, retrieval, re-ranking, generation with Claude, and evaluation.

Author: Boulanouar Walid, Founder & CEO

RAG with Claude in 2026 looks nothing like RAG in 2023. Back then, every byte of context cost real money, every prompt had to be hand-trimmed, and "retrieval-augmented generation" was mostly a way to squeeze a 4K window. By 2026, Claude Sonnet 4.5 and Opus 4.7 ship with 1M-token context, prompt caching that drops repeat-context costs by up to 90%, and native citations that tie every claim back to a source span. The constraints have moved.

The hard part is no longer "how do I fit my docs in the prompt." It is "how do I retrieve the right 20K tokens out of 50M, re-rank them, and stream a grounded answer with citations the user can click." That requires a real pipeline: chunking, embeddings, a vector store, hybrid retrieval, a re-ranker, a generation step with cached system prompts, and an eval loop that tells you when retrieval quality is regressing.

This tutorial walks through the full stack end-to-end. You will get architecture, library choices, working TypeScript and Python snippets for pgvector, retrieval and re-ranking patterns, a Claude generation call with prompt caching and citations enabled, an eval setup, and the cost knobs that matter. By the end you will have a production-ready RAG system that costs a fraction of a 2024 build and answers with citations users actually trust.

Architecture overview

Every production RAG system in 2026 follows the same six-stage pipeline. The names are old; the implementations are not.

[Source docs] → [Chunk] → [Embed] → [Store] → [Retrieve] → [Re-rank] → [Generate]
                                                  ↑                          ↓
                                                  └───── [Eval loop] ────────┘
  • Chunk: split source documents into semantic units (300–800 tokens). Avoid fixed-size splits.
  • Embed: convert chunks to vectors using Voyage, OpenAI, or Cohere embedding models.
  • Store: persist vectors + metadata in pgvector, Pinecone, Weaviate, Qdrant, or Turbopuffer.
  • Retrieve: top-k nearest neighbor search, often combined with BM25 keyword search (hybrid).
  • Re-rank: rescore top 50–100 candidates with a cross-encoder or LLM rerank to surface top 5–10.
  • Generate: pass re-ranked chunks to Claude with a cached system prompt; return answer + citations.

Two cross-cutting concerns sit on top: an eval loop (Ragas, Langfuse, or Braintrust) that measures retrieval and answer quality on a frozen test set, and an observability layer that tracks token cost, cache hit rate, and latency per request. Skip either of these and you will be flying blind within a week of launch.
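
As a rough illustration of the cache hit rate metric mentioned above, the sketch below derives it from the usage object that the Messages API returns with each response. The helper name `cacheHitRate` and the local `Usage` type are hypothetical; the three token fields mirror the usage fields Anthropic documents for prompt caching, where `input_tokens` counts only the uncached part of the prompt.

```typescript
// Local mirror of the usage fields relevant to caching (assumed shape).
type Usage = {
  input_tokens: number; // uncached input tokens
  cache_creation_input_tokens?: number | null; // tokens written to the cache
  cache_read_input_tokens?: number | null; // tokens served from the cache
};

// Share of all input tokens on one request that were read from the cache.
export function cacheHitRate(usage: Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

Logging this value per request next to latency makes it easy to spot a prompt change that silently breaks caching.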

In 2026, the biggest architectural shift is that Claude's 1M-token window means you no longer need to retrieve aggressively for small corpora. If your knowledge base is under ~500K tokens, you can often stuff the whole thing into a cached system prompt and skip retrieval entirely. For anything larger, a real RAG pipeline still wins on cost and latency.

Step 1: Choose your stack

The stack matters more than the model. Here is the 2026 decision matrix.

Vector store

Store | Best for | Pricing model
pgvector (Postgres + extension) | Teams already on Postgres, <50M vectors, hybrid SQL filters | Self-host or Supabase / Neon
Pinecone | Managed, low ops, serverless, multi-tenant SaaS | $0.096/M reads, $4/M writes (serverless)
Weaviate | Hybrid search out of the box, GraphQL API | Self-host or Weaviate Cloud
Qdrant | Rust performance, payload filtering, on-prem friendly | Self-host or Qdrant Cloud
Turbopuffer | Cold-tier object-store backed, cheap for >100M vectors | Pay-per-query, very cheap at scale

Default recommendation: pgvector on Supabase or Neon. You already have Postgres for app data. One database means one backup story, one access-control layer, and SQL joins between vectors and your business tables. Move to Pinecone or Turbopuffer only when you hit real scale pain.

Embedding model

Model | Dim | Strength | Cost (per 1M tokens)
voyage-3-large | 1024 | Best retrieval quality 2026, Anthropic-recommended | $0.18
voyage-3-lite | 512 | 80% of large quality, about 9x cheaper at these prices | $0.02
text-embedding-3-large (OpenAI) | 3072 | Strong baseline, broad ecosystem | $0.13
embed-english-v3.0 (Cohere) | 1024 | Solid quality, good multilingual variant | $0.10

Default: voyage-3-large for English-heavy production. Switch to voyage-multilingual-2 if you serve French, Arabic, or Spanish users. The embeddings are what determine ceiling retrieval quality — do not cheap out here.

Re-ranker

Use Cohere's rerank-v3.5 or Voyage's rerank-2. Both score 100 candidates in under 200ms and lift retrieval precision by 15–30 points on most benchmarks. If you cannot use a third-party reranker, fall back to an LLM rerank call to Claude Haiku. It works, but it costs more per query.

Orchestration

Skip LangChain unless you have a strong reason. In 2026, most teams ship RAG with a thin TypeScript or Python wrapper plus the Anthropic SDK. If you need agentic retrieval (multi-hop, query rewriting, tool use), look at LangGraph or build directly on the Claude Agent SDK. We cover orchestration choices in detail in our best RAG frameworks roundup.

Step 2: Ingest + chunk

Bad chunking is the #1 cause of bad retrieval. Fixed 512-token splits across paragraph boundaries will destroy your recall. Use semantic chunking that respects document structure.

// chunk.ts
import { encode } from "gpt-tokenizer";

export type Chunk = {
  id: string;
  docId: string;
  text: string;
  tokens: number;
  metadata: { section?: string; sourceUrl: string; position: number };
};

const MAX_TOKENS = 600;
const OVERLAP_TOKENS = 80;

export function semanticChunk(
  docId: string,
  markdown: string,
  sourceUrl: string,
): Chunk[] {
  // Split on H2/H3 headings first to preserve semantic boundaries
  const sections = markdown.split(/\n(?=#{2,3}\s)/);
  const chunks: Chunk[] = [];
  let position = 0;

  for (const section of sections) {
    const headingMatch = section.match(/^#{2,3}\s+(.+)/);
    const sectionTitle = headingMatch?.[1]?.trim();

    // If section fits, keep it whole
    const tokens = encode(section).length;
    if (tokens <= MAX_TOKENS) {
      chunks.push({
        id: `${docId}:${position}`,
        docId,
        text: section.trim(),
        tokens,
        metadata: { section: sectionTitle, sourceUrl, position },
      });
      position++;
      continue;
    }

    // Otherwise split on paragraphs with overlap
    const paragraphs = section.split(/\n\n+/);
    let buffer = "";
    let bufferTokens = 0;

    for (const p of paragraphs) {
      const pTokens = encode(p).length;
      if (bufferTokens + pTokens > MAX_TOKENS && buffer) {
        chunks.push({
          id: `${docId}:${position}`,
          docId,
          text: buffer.trim(),
          tokens: bufferTokens,
          metadata: { section: sectionTitle, sourceUrl, position },
        });
        position++;
        // Carry overlap
        const overlap = buffer.split(/\n\n/).slice(-1)[0] ?? "";
        buffer = overlap + "\n\n" + p;
        bufferTokens = encode(buffer).length;
      } else {
        buffer = buffer ? `${buffer}\n\n${p}` : p;
        bufferTokens += pTokens;
      }
    }
    if (buffer) {
      chunks.push({
        id: `${docId}:${position}`,
        docId,
        text: buffer.trim(),
        tokens: bufferTokens,
        metadata: { section: sectionTitle, sourceUrl, position },
      });
      position++;
    }
  }
  return chunks;
}

Key principles:

  • Respect structure: never split mid-heading, mid-table, mid-code-block.
  • 600 tokens is the sweet spot for voyage-3-large. Bigger chunks dilute embeddings; smaller chunks fragment context.
  • Always overlap by 10–15%. The overlap rescues context that falls on a chunk boundary.
  • Attach metadata. sourceUrl, section, and position are mandatory. You will need them at re-rank and citation time.

For PDFs use unpdf or pdfplumber; for HTML use @mozilla/readability to strip chrome before chunking; for code, chunk by AST node, not lines.
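
One way to honor the "never split mid-code-block" principle is to split on blank lines only when outside a fenced block. The sketch below is a minimal, dependency-free illustration; `splitBlocks` is a hypothetical helper, not part of the chunker above, and it assumes fences are written with three backticks or three tildes.

```typescript
// Three backticks, built at runtime so this sample stays easy to embed.
const FENCE = "`".repeat(3);

// Split markdown into paragraph-level blocks, keeping fenced code whole.
export function splitBlocks(markdown: string): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;

  const flush = () => {
    const text = current.join("\n").trim();
    if (text) blocks.push(text);
    current = [];
  };

  for (const line of markdown.split("\n")) {
    const trimmed = line.trimStart();
    if (trimmed.startsWith(FENCE) || trimmed.startsWith("~~~")) {
      inFence = !inFence; // toggles on both the opening and closing fence
    }
    if (!inFence && line.trim() === "") {
      flush(); // a blank line outside a fence ends the current block
    } else {
      current.push(line); // blank lines inside a fence stay in the block
    }
  }
  flush();
  return blocks;
}
```

Feeding these blocks into the paragraph loop, in place of a plain split on blank lines, would keep a code sample with internal blank lines in a single chunk.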

Step 3: Embed + store

Postgres + pgvector is the default. Below is the schema and ingestion code.

-- migration: enable extension + create table
create extension if not exists vector;

create table doc_chunks (
  id text primary key,
  doc_id text not null,
  text text not null,
  embedding vector(1024) not null,  -- voyage-3-large
  metadata jsonb not null default '{}',
  tokens int not null,
  created_at timestamptz default now()
);

-- HNSW index for fast cosine similarity
create index doc_chunks_embedding_idx
  on doc_chunks
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

-- For hybrid search: tsvector for BM25-style keyword
alter table doc_chunks add column text_search tsvector
  generated always as (to_tsvector('english', text)) stored;

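create index doc_chunks_text_search_idx on doc_chunks using gin (text_search);

// embed.ts
import { VoyageAIClient } from "voyageai";
import postgres from "postgres";
// Chunk is the type defined in chunk.ts (it must be exported there)
import type { Chunk } from "./chunk";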

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! });
const sql = postgres(process.env.DATABASE_URL!);

export async function embedAndStore(chunks: Chunk[]) {
  // Batch embed (Voyage accepts up to 128 inputs per call)
  for (let i = 0; i < chunks.length; i += 128) {
    const batch = chunks.slice(i, i + 128);
    const res = await voyage.embed({
      input: batch.map((c) => c.text),
      model: "voyage-3-large",
      inputType: "document",  // CRITICAL: use "query" at search time
    });

    const rows = batch.map((c, idx) => ({
      id: c.id,
      doc_id: c.docId,
      text: c.text,
      embedding: `[${res.data[idx].embedding.join(",")}]`,
      metadata: c.metadata,
      tokens: c.tokens,
    }));

    await sql`
      insert into doc_chunks ${sql(rows)}
      on conflict (id) do update set
        text = excluded.text,
        embedding = excluded.embedding,
        metadata = excluded.metadata,
        tokens = excluded.tokens
    `;
  }
}

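Two non-obvious things matter. First, use inputType: "document" at ingest and "query" at search time: Voyage prepends a different prompt for each, and mixing them up can cost you 5–10 recall points. Second, prefer HNSW over IVFFlat: per the pgvector documentation, HNSW gives a better speed-recall tradeoff at query time, at the price of slower index builds and higher memory use, and unlike IVFFlat it can be created before the table holds any data.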

Step 4: Retrieve

Pure vector search misses exact keyword matches (product codes, error strings, proper nouns). Pure BM25 misses semantic paraphrase. Hybrid search wins almost every benchmark in 2026.

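// retrieve.ts
import { VoyageAIClient } from "voyageai";
import postgres from "postgres";

// Same clients as embed.ts; in a real codebase, share them from one module
const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! });
const sql = postgres(process.env.DATABASE_URL!);

export async function hybridSearch(query: string, k = 50) {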
  // 1. Embed the query
  const embRes = await voyage.embed({
    input: [query],
    model: "voyage-3-large",
    inputType: "query",
  });
  const queryVec = `[${embRes.data[0].embedding.join(",")}]`;

  // 2. Vector + BM25 in a single SQL with RRF fusion
  const results = await sql`
    with vector_hits as (
      select id, text, doc_id, metadata,
        1 - (embedding <=> ${queryVec}::vector) as score,
        row_number() over (order by embedding <=> ${queryVec}::vector) as rank
      from doc_chunks
      order by embedding <=> ${queryVec}::vector
      limit ${k}
    ),
    keyword_hits as (
      select id, text, doc_id, metadata,
        ts_rank(text_search, plainto_tsquery('english', ${query})) as score,
        row_number() over (
          order by ts_rank(text_search, plainto_tsquery('english', ${query})) desc
        ) as rank
      from doc_chunks
      where text_search @@ plainto_tsquery('english', ${query})
      -- without this ORDER BY the LIMIT keeps arbitrary rows, not the best ones
      order by score desc
      limit ${k}
    )
    -- Reciprocal Rank Fusion
    select id, text, doc_id, metadata,
      sum(1.0 / (60 + rank)) as rrf_score
    from (
      select * from vector_hits
      union all
      select * from keyword_hits
    ) combined
    group by id, text, doc_id, metadata
    order by rrf_score desc
    limit ${k}
  `;
  return results;
}

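Reciprocal Rank Fusion (RRF) is the standard 2026 fusion algorithm: each chunk scores the sum of 1 / (60 + rank) over the result lists it appears in. The constant 60 is RRF's smoothing term (often written k in the literature) and is unrelated to the k = 50 candidate limit passed to hybridSearch. RRF outperforms weighted score blending because it uses only ranks, so it does not require normalizing across two very different score distributions.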
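
To make the fusion step concrete outside SQL, here is a small in-memory sketch of Reciprocal Rank Fusion. The function name `rrfFuse` and its signature are illustrative assumptions; the formula is the same 1 / (60 + rank) sum used in the query above, with ranks starting at 1.

```typescript
// Fuse several ranked lists of chunk ids into one list, best first.
export function rrfFuse(
  lists: string[][],
  c = 60, // smoothing constant, same value as in the SQL query
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      const rank = i + 1; // 1-based, matching row_number() in SQL
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// "b" appears in both lists, so it outranks "a" even though "a" is first in one.
const fused = rrfFuse([["a", "b", "c"], ["b", "d"]]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

A chunk found by both retrievers accumulates two terms, which is why agreement between vector and keyword search is rewarded without comparing their raw scores.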

Retrieve 50 candidates here, not 5. The re-ranker in the next step needs candidate diversity to do its job.

Step 5: Re-rank

A cross-encoder reranker reads (query, chunk) together and outputs a scalar relevance score. This is dramatically more accurate than the bi-encoder embeddings used at retrieval time, because the model attends across query and document jointly.

// rerank.ts
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

export async function rerank(query: string, candidates: any[], topN = 8) {
  const res = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: candidates.map((c) => c.text),
    topN,
  });
  return res.results.map((r) => ({
    ...candidates[r.index],
    rerank_score: r.relevanceScore,
  }));
}

If you cannot send data to Cohere, use Voyage's reranker (rerank-2) or do an LLM rerank with Claude Haiku:

const prompt = `Rate each passage 0-10 for relevance to: "${query}"\n\n` +
  candidates.map((c, i) => `[${i}] ${c.text}`).join("\n\n");
// Parse JSON scores back, sort descending, take top 8
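
The parsing step hinted at in the comment above could look like the following sketch. It assumes the model was instructed to reply with a JSON array of `{ "index": number, "score": number }` objects; that reply format and the helper name `parseRerankScores` are assumptions for illustration, not something the Anthropic API enforces.

```typescript
// Attach model-provided scores to candidates and keep the best topN.
export function parseRerankScores<T extends object>(
  candidates: T[],
  modelOutput: string,
  topN = 8,
): Array<T & { rerank_score: number }> {
  const parsed: unknown = JSON.parse(modelOutput); // throws on invalid JSON
  if (!Array.isArray(parsed)) {
    throw new Error("expected a JSON array of scores");
  }

  const scored: Array<T & { rerank_score: number }> = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const { index, score } = item as { index?: unknown; score?: unknown };
    // Skip entries that are malformed or point outside the candidate list
    if (typeof index !== "number" || typeof score !== "number") continue;
    if (!Number.isInteger(index) || index < 0 || index >= candidates.length) continue;
    scored.push({ ...candidates[index], rerank_score: score });
  }

  return scored.sort((a, b) => b.rerank_score - a.rerank_score).slice(0, topN);
}
```

Dropping invalid indices rather than throwing keeps one hallucinated entry from failing the whole query; whether that tolerance is right depends on your error budget.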

Re-ranking typically lifts answer accuracy by 10–25 points. Skip it only if latency budget is sub-300ms per query.

Step 6: Generate with Claude

This is where 2026 RAG diverges hard from 2023 RAG. Three features change everything: prompt caching, citations, and 1M-token context.

// generate.ts
import Anthropic from "@anthropic-ai/sdk";
import { hybridSearch } from "./retrieve";
import { rerank } from "./rerank";

const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

export async function answerWithRag(query: string) {
  const candidates = await hybridSearch(query, 50);
  const top = await rerank(query, candidates, 8);

  // Build "documents" content blocks — Claude returns citations against these
  const documents = top.map((c, i) => ({
    type: "document" as const,
    source: {
      type: "text" as const,
      media_type: "text/plain" as const,
      data: c.text,
    },
    title: c.metadata.section ?? `Source ${i + 1}`,
    context: c.metadata.sourceUrl,
    citations: { enabled: true },
  }));

  const response = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },  // cache the system prompt
      },
    ],
    messages: [
      {
        role: "user",
        content: [
          ...documents,
          { type: "text", text: `Question: ${query}` },
        ],
      },
    ],
  });

  // response.content is a list of text blocks; cited blocks carry a `citations` array tying them to documents
  return response;
}

const SYSTEM_PROMPT = `You are a precise retrieval-grounded assistant. Answer the user's question using ONLY the provided documents. If the documents do not contain the answer, say so. Always cite. Keep answers under 6 sentences unless asked for more.`;

What is happening here:

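  • cache_control: { type: "ephemeral" } marks the prompt prefix up to that block as cacheable. Cache reads are billed at 10% of the base input price, with a 5-minute TTL by default or 1 hour with the extended TTL option. Cache writes cost more than base input, and prompts below the model's minimum cacheable length (1,024 tokens for Sonnet 4.5 in Anthropic's documentation) are not cached at all, so the short SYSTEM_PROMPT above only benefits once it grows to include detailed instructions, examples, or stable reference material.
  • citations: { enabled: true } tells Claude to return structured citations. Each cited text block in the response carries a citations array with the source document index, the quoted text, and the exact character range. You can render these as clickable footnotes in your UI without parsing brackets out of prose.
  • document content blocks are the 2026-native way to feed retrieved context. They work better than embedding chunks in a text block because the citations engine knows the boundaries.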
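
To illustrate the footnote rendering mentioned in the list above, the sketch below walks a response's content blocks and appends a marker after each cited span. The local types mirror the documented shape of plain-text citations (`cited_text`, `document_index`), but they are declared here as assumptions rather than imported from the SDK, and `renderWithFootnotes` is a hypothetical helper.

```typescript
type Citation = {
  type: string; // e.g. "char_location" for plain-text documents
  cited_text: string;
  document_index: number;
};
type ContentBlock = { type: string; text?: string; citations?: Citation[] | null };
type Footnote = { n: number; url: string; quote: string };

// Build display text with [n] markers plus a footnote list, one per document.
export function renderWithFootnotes(
  content: ContentBlock[],
  sourceUrls: string[], // same order as the document blocks sent to Claude
): { answer: string; footnotes: Footnote[] } {
  const footnotes: Footnote[] = [];
  const numberByDoc = new Map<number, number>();
  let answer = "";

  for (const block of content) {
    if (block.type !== "text" || !block.text) continue;
    answer += block.text;
    for (const c of block.citations ?? []) {
      let n = numberByDoc.get(c.document_index);
      if (n === undefined) {
        n = footnotes.length + 1;
        numberByDoc.set(c.document_index, n);
        footnotes.push({ n, url: sourceUrls[c.document_index] ?? "", quote: c.cited_text });
      }
      answer += `[${n}]`;
    }
  }
  return { answer, footnotes };
}
```

In the pipeline above, `sourceUrls` would be `top.map((c) => c.metadata.sourceUrl)`, which preserves the order of the document blocks.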

For corpora under 500K tokens, you can skip retrieval entirely and put the whole knowledge base in the cached system prompt. The first request pays the cache-write price, which is somewhat above the base input price; every subsequent request within the cache TTL pays 10% of base for those tokens. For very small, very high-traffic knowledge bases this beats a real RAG pipeline on both cost and latency.

Step 7: Eval

You cannot improve what you do not measure. Build the eval harness on day one.

# eval.py — minimal Ragas setup
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

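# 1. Freeze a 100-question test set with ground-truth answers + relevant doc IDs
# rag_pipeline() and retrieved_chunks_for_question are placeholders for your own
# pipeline call and its retrieved chunk texts (a list of strings per question)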
test_set = Dataset.from_list([
    {
        "question": "What is the refund window for annual plans?",
        "ground_truth": "30 days from purchase date.",
        "answer": rag_pipeline("What is the refund window for annual plans?"),
        "contexts": retrieved_chunks_for_question,
    },
    # ... 99 more
])

# 2. Run all four metrics
results = evaluate(
    test_set,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)

Four numbers to watch every time you change the pipeline:

  • Context precision: of the chunks you retrieved, how many were actually relevant. Low score → improve retrieval or re-ranking.
  • Context recall: of the truly relevant chunks, how many did you retrieve. Low score → improve chunking, embeddings, or top-k.
  • Faithfulness: does the generated answer match the retrieved context. Low score → tighten the system prompt or switch model.
  • Answer relevancy: does the answer actually address the question. Low score → query rewriting or prompt issue.

For production observability, pair Ragas with Langfuse or Braintrust to track these metrics on a sample of real traffic, not just a frozen test set. Drift detection on context precision is how you catch a stale index before users complain.

Common pitfalls

These are the issues we have seen on every RAG audit. Fix them once and you save weeks.

  • Fixed-size chunking across structure. Splitting mid-table, mid-code-block, or mid-list destroys retrieval. Always chunk on semantic boundaries.
  • Same embedding for query and document. Voyage, Cohere, and OpenAI all expose asymmetric modes. Use them.
  • No re-ranking. Bi-encoder retrieval alone caps quality around 65% precision@5 for hard questions. Re-rank to 80%+.
  • No hybrid search. Pure vector search will miss product codes and proper nouns. Add BM25.
  • Stuffing the prompt with 50 chunks. More context is not better — irrelevant chunks confuse the model. Re-rank to 5–10.
  • No citations. If users cannot click through to source, they will not trust the system. Use Claude's native citations.
  • No eval harness. You will regress retrieval quality the first time you change the chunker. A frozen test set tells you immediately.
  • Storing embeddings as JSON. A JSON array of floats cannot use pgvector's indexes or distance operators. Store embeddings in a vector(N) column instead.

Cost optimization

A naive RAG pipeline on Claude Sonnet 4.5 costs around $0.012 per query. With the optimizations below, the same pipeline lands at $0.0015 — an 8x reduction.

  • Prompt cache your system prompt. Cached tokens cost 10% of base. If your system prompt is 2K tokens, you save $0.005 per query immediately.
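  • Cache the document block too when serving common queries. The same top-8 retrieved chunks repeat across similar questions; mark the last document block ephemeral. The cache only hits when the chunks arrive in the same order, because matching is on the exact prompt prefix.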
  • Use Haiku for re-ranking and query rewriting. Reserve Sonnet/Opus for the final generation call. Haiku 4.5 handles rerank prompts at a fraction of the cost (about a third of Sonnet 4.5's per-token list price).
  • Cap max_tokens aggressively. Set 512 for support answers, 1024 for explanations, 2048+ only when you actually need long output. Most teams overpay here.
  • Sample, do not log every request. Send 5% of production traffic to your eval pipeline, not 100%. Storage and re-eval costs add up fast.
  • Re-embed only changed documents. Use content-hash deduplication on chunk text — a typo fix in a 500-page doc should not trigger 5,000 re-embeddings.
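  • Pick voyage-3-lite for retrieval and let a second stage restore precision: either re-score the top candidates with voyage-3-large embeddings, or use Cohere rerank and skip the second embed call entirely. Note that voyage-3-lite vectors are 512-dimensional, so the embedding column must be vector(512) rather than the vector(1024) used above.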
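
The content-hash deduplication from the list above can be sketched as follows. It assumes you persist one hash per chunk id next to the embedding (for example in a hypothetical `content_hash` column); the helper names `contentHash` and `chunksToReembed` are illustrative.

```typescript
import { createHash } from "node:crypto";

type HashedChunk = { id: string; text: string };

// Stable fingerprint of a chunk's text.
export function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Return only the chunks whose text changed (or that are new) since last ingest.
export function chunksToReembed(
  chunks: HashedChunk[],
  storedHashes: Map<string, string>, // chunk id -> hash saved at last ingest
): HashedChunk[] {
  return chunks.filter((c) => storedHashes.get(c.id) !== contentHash(c.text));
}
```

Because chunk ids in this guide are position-based, inserting a paragraph early in a document shifts later ids and forces those chunks to be re-embedded; keying the lookup on the hash itself would avoid that, at the cost of a slightly different schema.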

For most production RAG systems, prompt caching alone is the difference between "this scales" and "this is bankrupting us." Set it up before you ship.
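
The savings arithmetic behind the prompt-caching figure can be checked with a few lines. The base price of $3 per million input tokens is an assumption chosen because it reproduces the article's "about $0.005 saved on a 2K-token system prompt"; substitute the current list price for your model. Cache-write surcharges are ignored here for simplicity.

```typescript
// Assumed price in USD per million input tokens; verify against current pricing.
const BASE_INPUT_PER_MTOK = 3.0;
// Cached reads are billed at 10% of base, as stated in the list above.
const CACHE_READ_MULTIPLIER = 0.1;

export function cachedPromptSavings(promptTokens: number): {
  uncached: number;
  cached: number;
  saved: number;
} {
  const uncached = (promptTokens / 1_000_000) * BASE_INPUT_PER_MTOK;
  const cached = uncached * CACHE_READ_MULTIPLIER;
  return { uncached, cached, saved: uncached - cached };
}

// 2K-token system prompt: $0.006 uncached vs $0.0006 cached per query
const { saved } = cachedPromptSavings(2_000);
console.log(saved.toFixed(4)); // 0.0054
```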


Building a RAG system end-to-end is genuinely hard — not the prototype, but the part where it stays accurate as your corpus grows, your team ships features, and your traffic shifts. At AY Automate we ship production RAG systems on Claude every week, from internal knowledge bases to customer-facing copilots, with full eval harnesses, observability, and cost guardrails wired in from day one. If you want help designing the architecture, picking the stack, or auditing an existing pipeline, book a free consultation — we will walk through your use case and tell you exactly what to build (and what to skip).

FAQ

Do I need RAG if Claude has a 1M-token context window?

Sometimes no. If your knowledge base fits comfortably under ~500K tokens and your traffic is bursty, you can put the whole corpus in a cached system prompt and skip retrieval. For larger corpora, frequently changing content, or strict latency budgets, RAG still wins on cost and speed.

Which embedding model should I use in 2026?

voyage-3-large is the strongest general-purpose embedding model right now and is Anthropic's recommended pairing with Claude. Use voyage-multilingual-2 if you serve non-English content. OpenAI's text-embedding-3-large is a solid alternative if you are already in the OpenAI ecosystem.

Pinecone or pgvector?

Default to pgvector. You get vectors next to your business data, one auth model, SQL joins, and no extra vendor. Move to Pinecone or Turbopuffer when you cross ~50M vectors or need multi-region replication you do not want to operate yourself.

How do I get citations from Claude?

Pass retrieved chunks as document content blocks and set citations: { enabled: true } on each one. Claude returns the answer as text blocks, and each cited block carries a citations array with the document index, the quoted text, and exact character offsets into your source document. Render them as clickable footnotes in your UI.

How much does a production RAG system cost to run?

With prompt caching, hybrid search, and aggressive max_tokens caps, a typical support-style RAG query lands at $0.001–$0.003 on Claude Sonnet 4.5. Without those optimizations, the same query costs $0.01–$0.03. The optimizations are not optional at scale.

What is the difference between RAG and an agent?

RAG is single-shot: retrieve → generate → return. An agent loops: it can rewrite the query, retrieve multiple times, call tools, and self-correct. For most knowledge-base questions RAG is enough. For multi-step research or transactional workflows you want an agent — see our breakdown in Claude API vs Claude Code.

Which framework should I use to orchestrate this?

For straightforward RAG, a thin TypeScript or Python wrapper over the Anthropic SDK is enough — frameworks add more weight than value. If you need agentic retrieval or multi-step workflows, LangGraph or the Claude Agent SDK are the strongest 2026 choices. Our best RAG frameworks post compares them in depth.

How do I evaluate retrieval quality?

Freeze a 100-question test set with ground-truth answers and relevant document IDs. Run Ragas (or Braintrust / Langfuse) on every pipeline change. Track context precision, context recall, faithfulness, and answer relevancy. Alert when any metric drops more than 5 points from baseline.
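
A minimal sketch of the "alert on a 5-point drop" rule, assuming metric scores are fractions between 0 and 1 as Ragas reports them. The function name `findRegressions` and the shape of the score maps are hypothetical; how you load the baseline and send the alert is up to your tooling.

```typescript
type Scores = Record<string, number>; // metric name -> score in [0, 1]

// List the metrics that fell more than maxDropPoints below baseline.
export function findRegressions(
  baseline: Scores,
  current: Scores,
  maxDropPoints = 5,
): string[] {
  return Object.keys(baseline).filter((metric) => {
    const now = current[metric];
    if (now === undefined) return true; // a missing metric counts as a regression
    return (baseline[metric] - now) * 100 > maxDropPoints;
  });
}

const regressed = findRegressions(
  { context_precision: 0.82, faithfulness: 0.9 },
  { context_precision: 0.75, faithfulness: 0.88 },
);
console.log(regressed); // [ 'context_precision' ]
```

Running this in CI after each eval run gives a simple pass or fail gate on pipeline changes.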

About the Author
Boulanouar Walid
Founder & CEO

Walid founded AY Automate to help businesses ship AI workflows that actually move revenue. He leads strategy and oversees every client engagement end-to-end.
