40 AI Engineer Interview Questions

AI engineer interview questions in 2026 test 5 areas: LLM fundamentals (tokens, sampling, context windows), prompt engineering, RAG and vector databases, multi-agent systems, and production ops. The signal you want is specific: does this candidate know what a token costs at scale, what an eval harness actually looks like, why their RAG retrieval is failing, and how to debug a multi-agent loop that silently drifted off the plan?

The role itself has changed. An AI engineer hired in 2026 is not filling the same role as one hired in 2023. The "ML engineer who fine-tunes models" archetype has fragmented into 4 sub-disciplines: prompt and eval specialists, RAG infrastructure builders, multi-agent orchestrators, and LLMOps engineers who manage real production cost budgets. The candidate pool has expanded, the title has been diluted, and most resumes now claim "LLM experience" because someone called an OpenAI API once.

The hard part is filtering people who can ship reliable, cost-bounded AI systems from people who built a demo, posted it on X, and called it production. This guide gives you 40 questions across 5 sections covering fundamentals, prompt engineering, RAG, multi-agent systems, and production ops, each with strong and weak sample answers, plus a take-home rubric and scoring matrix. Use it to run a screen that surfaces real builders. If you'd rather skip hiring entirely and rent a pre-vetted team, AY Automate's AI agent development service is one phone call away.

The interview structure that works

A 4-stage funnel keeps the loop tight without burning senior eng time:

30-min recruiter screen. Resume gut-check. Has the candidate shipped at least one LLM feature to more than 100 real users? If no, end here.
60-min technical screen. 6-8 questions from Sections 1-3 below. Live, voice-based. No coding. You are testing reasoning, not LeetCode.
Take-home (2-4 hours, paid). A small RAG-plus-agent task with an explicit eval requirement. See the Bonus section.
Final loop (90 min total). Take-home walkthrough, system design, behavioral. 2 interviewers max.

Total interviewer time per candidate: about 3 hours across the three live stages, plus the 2-4 hour take-home on the candidate's side. If you are spending 8+ hours per candidate and still hiring wrong, the bottleneck is your question quality, not your funnel length. See our guide to hiring AI engineers for the full sourcing playbook.

Section 1: Fundamentals

These are the table-stakes questions. A senior candidate should clear all 8 in under 12 minutes.

Q1. What is the difference between a transformer's attention mechanism and a recurrent network's hidden state?

Strong: Attention computes a weighted sum over all positions in parallel using query/key/value projections; recurrence carries information one step at a time through a hidden state. Attention scales with sequence length squared but parallelizes; recurrence is linear-time but sequential. That parallelism is why transformers won.

Weak: "Transformers are newer and better." No mention of parallelism, no mention of the quadratic cost.

Q2. What is a token, and why does it matter for cost?

Strong: A token is a sub-word unit produced by a tokenizer, typically a byte-pair encoding (BPE) scheme such as the ones OpenAI ships in its tiktoken library. A 1,000-word English document is roughly 1,300 tokens. Cost matters because frontier models are billed per input and output token, and a sloppy prompt can 5x your bill before you notice.

Weak: "A token is a word." Wrong. Production candidates know the difference.
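
To make the cost point concrete, here is a minimal sketch of the arithmetic. The 1.3 tokens-per-word ratio comes from the answer above; the prices ($3 and $15 per million input and output tokens) and the traffic numbers are placeholders for illustration, not any provider's actual rates.

```python
# Rough monthly cost estimate for one LLM feature.
# All prices and volumes below are illustrative placeholders.
TOKENS_PER_WORD = 1.3  # rule of thumb for English text


def estimate_tokens(word_count: int) -> int:
    """Approximate token count from a word count."""
    return round(word_count * TOKENS_PER_WORD)


def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 days: int = 30) -> float:
    """Cost in USD, with prices quoted per 1M tokens."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_request * requests_per_day * days


# Same feature, same output length, two prompt sizes.
lean = monthly_cost(10_000, estimate_tokens(300), 200, 3.0, 15.0)
bloated = monthly_cost(10_000, estimate_tokens(3000), 200, 3.0, 15.0)
print(f"lean prompt: ${lean:,.0f}/month, bloated prompt: ${bloated:,.0f}/month")
```

With these placeholder numbers, growing the prompt from 300 to 3,000 words takes the bill from about $1,251 to about $4,410 a month, roughly 3.5x, with no change in what the model outputs.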

Q3. Explain temperature, top-p, and top-k in one breath.

Strong: Temperature divides the logits before softmax: higher means flatter, more random. Top-p (nucleus) samples from the smallest set of tokens whose cumulative probability reaches p. Top-k samples from the k highest-probability tokens. You usually tune temperature or top-p, not both, and drop temperature toward 0 when the output must be repeatable or structured.

Weak: Confuses temperature with top-p, or says "they all control randomness" without distinguishing them.
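
A candidate who really understands the three knobs can sketch them on a whiteboard. The version below is one such sketch in plain Python over a toy 3-token vocabulary; the function names are illustrative and real inference engines do this on tensors, but the logic is the same.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out everything not in keep and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)           # baseline distribution
flatter = softmax(logits, 2.0)    # higher temperature spreads probability out
```

The distinction to listen for: temperature reshapes the whole distribution, while top-k and top-p cut its tail and leave the relative odds of the surviving tokens unchanged.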

Q4. What is a context window, and what breaks when you hit it?

Strong: The maximum number of tokens (prompt plus generated output) the model can process in one request. Exceed it and either the API rejects the request or your framework quietly drops earlier tokens to make it fit, which silently corrupts answers in long agent loops. Mitigations: summarization, sliding window, RAG, or long-context models (Gemini 2.5 at 1M tokens, Claude at 200K and up).

Weak: "The model just errors out." Sometimes, but the dangerous failure mode is silent truncation.
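
The sliding-window mitigation is simple enough to sketch. The version below keeps the system prompt, keeps as many of the most recent turns as fit, and reports how many turns it dropped so the truncation is never silent. The whitespace token counter is a deliberate stand-in; in practice you would plug in your model's tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(system: str, turns: list[str], budget: int) -> tuple[list[str], int]:
    """Return (kept_turns, dropped_count) so the caller can log the truncation."""
    remaining = budget - count_tokens(system)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break  # everything older than this turn is dropped too
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept, len(turns) - len(kept)


history = ["a b c d e f g h i j", "k l m n o", "p q r s t"]
kept, dropped = fit_to_window("You are helpful", history, budget=14)
```

Returning the dropped count is the design choice that matters here: logging or alerting on it turns the dangerous failure mode into a visible one.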

Q5. When would you fine-tune vs. when would you use RAG?

Strong: Fine-tune for style, format adherence, and narrow task specialization. RAG for knowledge that changes, is large, or is proprietary. They are not substitutes; production systems use both.

Weak: "Fine-tuning is always better." False since 2024.

Q6. What is hallucination and what are the 3 primary causes?

Strong: Output that is fluent but factually wrong. Causes: (1) the model is filling in gaps from training data, (2) the retrieved context is missing or wrong, (3) the prompt forces a confident answer when the model should refuse.

Weak: "It happens randomly." No, it has identifiable causes.

Q7. Walk me through what happens between user input and model output.

Strong: Tokenize → embed → forward pass through transformer layers (attention + FFN) → logits for the next token → sample one token → append it and repeat until a stop token or length limit → detokenize → stream. Production wraps this with system prompt assembly, guardrails, and cost logging.

Weak: "The model thinks and gives an answer." Hand-wave.

Q8. What is the difference between an embedding and a completion?

Strong: An embedding is a fixed-size vector representation of input text, used for similarity search and clustering. A completion is generated output text. Different endpoints, different cost profiles, different use cases.

Weak: Conflates them or thinks embeddings are generated by the same model as completions.

Section 2: LLM & Prompt Engineering

This section separates people who copied prompts from Twitter from people who have iterated on prompts under production load.

Q9. Walk me through how you'd design a prompt for a customer support classifier with 12 intent classes.

Strong: Start with a system prompt that defines the role, lists the 12 classes with one-line definitions, gives 2-3 few-shot examples covering edge cases, requires structured JSON output, and includes a fallback "other" class. Iterate against a labeled eval set of 200+ examples.

Weak: "Just ask the model to classify it." No structure, no eval, no fallback.
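
The "structured JSON output plus fallback class" part of the strong answer can be sketched as follows. The four labels are a hypothetical subset of the 12 intents, and the prompt wording is only an example; the point is that anything the model returns outside the contract lands in "other" instead of crashing the pipeline.

```python
import json

# Hypothetical subset of the 12 intent classes.
INTENTS = ["billing", "refund", "login_issue", "other"]


def build_system_prompt(intents: list[str]) -> str:
    """Assemble a system prompt that lists the allowed labels."""
    lines = [
        "You are a customer support ticket classifier.",
        'Reply with JSON only, in the form {"intent": "<label>"}.',
        "Allowed labels:",
    ]
    lines += [f"- {name}" for name in intents]
    return "\n".join(lines)


def parse_intent(raw_output: str, intents: list[str], fallback: str = "other") -> str:
    """Treat model output as untrusted: malformed or unknown maps to fallback."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    intent = data.get("intent")
    return intent if intent in intents else fallback
```

In a real system the one-line definitions and few-shot examples would be appended to the prompt, and `parse_intent` would run inside the eval loop over the labeled set.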

Q10. What is chain-of-thought prompting and when does it help vs. hurt?

Strong: Asking the model to reason step-by-step before answering. Helps on multi-step reasoning, math, and ambiguous classification. Hurts on simple retrieval tasks (adds latency and cost for no accuracy gain) and on structured-output tasks, where free-form reasoning pollutes the JSON unless the schema gives it a dedicated field.

Weak: "It always makes the model smarter." Not on simple tasks.

Q11. How do you handle prompt injection attacks?

Strong: Defense in depth: system prompt isolation, input sanitization, structured tool calling instead of free-form output, output validation, and treating LLM output as untrusted user input downstream. No single silver bullet.

Weak: "Tell the model to ignore malicious instructions." This does not work and indicates the candidate has not read recent attack literature.

Q12. Few-shot vs zero-shot: when do you pick which?

Strong: Zero-shot for simple, unambiguous tasks where adding examples just adds cost. Few-shot when the task has a specific format, edge cases, or stylistic requirements the model needs to see. Test both; sometimes one-shot beats few-shot.

Weak: "Few-shot is always better." More tokens, more cost, not always better.

Q13. What is constrained decoding and when do you need it?

Strong: Forcing the model to output only tokens that match a grammar: JSON schema, regex, or context-free grammar. Needed when downstream code parses the output and a single malformed character breaks the pipeline. Tools: OpenAI structured outputs and Outlines enforce a schema; plain JSON mode only guarantees syntactically valid JSON.

Weak: Has never heard of it. In 2026, this is a red flag for anyone claiming production LLM experience.

Q14. Explain prompt caching and when it matters.

Strong: Providers cache the prefix of prompts to skip recomputation on repeated calls. For long system prompts or RAG-heavy applications, this can cut the cost of the cached input tokens by 50-90%, depending on the provider. You structure prompts so the static part comes first.

Weak: Does not know prompt caching exists. Indicates no production cost optimization experience.

Q15. How would you reduce a $5,000/month LLM bill by 50% without degrading quality?

Strong: Routing: send simple queries to a cheap model (Haiku, GPT-4o-mini, Gemini Flash) and only escalate complex ones to a frontier model. Add prompt caching, compress system prompts, trim retrieved context to top-k=3 instead of top-k=10, and add an eval harness so you can verify quality stayed flat.

Weak: "Use a cheaper model." Sure, but without an eval harness, you do not know if quality dropped.

Q16. What is the difference between a chat completion and a tool call?

Strong: Chat completion returns text. A tool call returns a structured object, {name, arguments}, that your code executes, then you feed the result back into the model. Tool calling is how agents act on the world.

Weak: Treats them as the same thing.
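
The "your code executes, then you feed the result back" half of the answer looks roughly like this. The message shape and the `get_weather` tool are hypothetical and not tied to any provider's API; each SDK has its own field names for the same idea.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather API."""
    return f"Sunny in {city}"


# Registry of tools the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(call: dict) -> dict:
    """Run one {name, arguments} call and build the message to return to the model."""
    name = call.get("name")
    if name not in TOOLS:
        return {"role": "tool", "name": name, "content": f"error: unknown tool {name!r}"}
    try:
        raw_args = call.get("arguments", "{}")
        # Arguments often arrive as a JSON string; accept a dict as well.
        args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
        result = TOOLS[name](**args)
    except (TypeError, ValueError) as exc:
        # Bad JSON or wrong parameters: report it so the model can retry.
        return {"role": "tool", "name": name, "content": f"error: {exc}"}
    return {"role": "tool", "name": name, "content": result}


reply = execute_tool_call({"name": "get_weather", "arguments": '{"city": "Tokyo"}'})
```

Note that errors go back to the model as tool results instead of raising, which is what lets an agent recover from a malformed call.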

Section 3: RAG & Vector Databases

RAG is where most candidates fall apart. The demos are easy. Production RAG is hard.

Q17. Walk me through a production RAG pipeline end to end.

Strong: Ingest → chunk (with overlap and metadata) → embed → store in vector DB with metadata index → query embedding → hybrid search (vector + BM25) → rerank top-k → assemble context → LLM call → cite sources → eval the retrieval and generation separately.

Weak: "Embed the docs and query them." Missing chunking strategy, reranking, hybrid search, and evals: the parts that make production RAG actually work.

Q18. What is your chunking strategy and why?

Strong: Depends on the corpus. For technical docs, semantic chunking by heading with 200-token overlap. For chat logs, by conversation turn. For long PDFs, recursive chunking with a 512-1024 token target. Always include parent-doc metadata.

Weak: "Fixed 1000-character chunks." Works for demos, breaks on real documents.
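
As a baseline for comparison, here is a minimal sketch of fixed-size chunking with overlap and parent-doc metadata. It operates on a pre-tokenized list (whitespace words in the demo, as a stand-in for real tokens); heading-aware or recursive chunking would replace the windowing step, not the metadata.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int, doc_id: str) -> list[dict]:
    """Split tokens into overlapping windows, tagging each with its parent doc."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append({
            "doc_id": doc_id,   # parent-doc metadata for citations and filtering
            "start": start,     # offset, useful for stitching neighbors back together
            "tokens": tokens[start:start + size],
        })
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_with_overlap(words, size=4, overlap=2, doc_id="readme.md")
```

Each chunk shares its first two tokens with the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.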

Q19. Vector DB choice: pgvector vs. Pinecone vs. Qdrant vs. Weaviate. When do you pick which?

Strong: pgvector if you already run Postgres and your corpus is under 10M vectors. Qdrant for self-hosted high-throughput with filtering. Pinecone for managed scale with low ops budget. Weaviate for hybrid search and built-in modules. The choice is mostly about ops cost and existing infra, not raw recall.

Weak: "Pinecone is the best." Marketing answer.

Q20. What is a reranker and when should you add one?

Strong: A cross-encoder model that re-scores the top-N retrieved chunks against the query. Add it when retrieval recall is good but precision is bad, meaning the right answer is in the top 50 but not the top 5. Cohere Rerank, BGE Reranker, and Voyage rerankers are the common picks.

Weak: "Same as a vector search." No, it is a second stage.

Q21. How do you evaluate a RAG system?

Strong: 2 layers. Retrieval: recall@k, MRR, hit rate, on a labeled query→doc set. Generation: faithfulness (answer grounded in context), answer relevance, context relevance. Tools: Ragas, TruLens, custom LLM-as-judge with calibration.

Weak: "Eyeball some outputs." Not eval, that is vibes.
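
The retrieval layer of that answer is cheap to implement, which is why "we never measured it" is hard to excuse. A minimal sketch of recall@k and MRR over a labeled query→doc set follows; the document IDs are made up for the demo.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict:
    """Average both metrics over (retrieved, relevant) pairs."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two labeled queries: ranked results and the set of truly relevant docs.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4", "d8"}),
]
metrics = evaluate(runs, k=2)
```

Faithfulness and relevance on the generation side need a judge model or human labels, so they are scored separately from these retrieval numbers.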

Q22. What is hybrid search and why does it help?

Strong: Combining dense vector search (semantic) with sparse keyword search (BM25). Vector search misses exact-match cases like product codes, names, and acronyms. BM25 catches those. Weighted fusion of the two usually improves recall; 10-30% is a typical range, but measure it on your own corpus before quoting a number.

Weak: Has never heard of BM25.
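
One way to do the weighted fusion is sketched below. It assumes you already have per-document scores from each retriever; the min-max normalization and the `alpha` weight are choices made for this example (reciprocal rank fusion is a common alternative), and the scores are invented.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> list[str]:
    """Rank docs by alpha * dense + (1 - alpha) * sparse; a missing score counts as 0."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    # Sort by fused score, breaking ties by doc id for stable output.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Query contains an exact product code: BM25 finds the SKU page, vectors do not.
dense_scores = {"faq": 0.82, "guide": 0.80, "sku_page": 0.40}
bm25_scores = {"sku_page": 12.0, "faq": 3.0}
ranking = hybrid_rank(dense_scores, bm25_scores, alpha=0.3)
```

With the weight tilted toward BM25, the exact-match page rises to the top even though vector search ranked it last. The right `alpha` is an empirical question for your eval set.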

Q23. How do you handle stale data in a RAG pipeline?

Strong: Incremental ingestion with change-data-capture from the source, TTL on chunks, scheduled re-embedding on schema changes, and metadata filters that exclude documents older than X for time-sensitive queries.

Weak: "Re-embed the whole corpus weekly." Works at small scale, falls over at production scale.

Q24. What is the "lost in the middle" problem and how do you mitigate it?

Strong: LLMs attend more to the start and end of long contexts than the middle. Mitigations: keep retrieved context short (5-10 chunks max), put the most relevant chunk last and the next best first, use a reranker, and prefer models with stronger long-context attention like Claude 3.5+ or Gemini 2.5.

Weak: Has not encountered it. Means they have not run a real long-context RAG system.
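
The reordering mitigation fits in a few lines. This sketch assumes the chunks arrive sorted from most to least relevant (for example, from a reranker) and pushes the weakest ones toward the middle of the context; the function name is illustrative.

```python
def order_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the strongest chunks at the edges and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even positions (0 = most relevant) go to the back, odd ones to the front.
        (front if i % 2 else back).append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
ordered = order_for_long_context(ranked)
print(ordered)  # ['c2', 'c4', 'c5', 'c3', 'c1']
```

The most relevant chunk ends up last, the second best first, and the least relevant sits in the middle where the model is most likely to skim over it.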

Section 4: Multi-Agent Systems & Orchestration

This is the 2026 frontier. Most candidates have built single-call demos. Few have shipped multi-step agents that survive contact with production.

Q25. Walk me through how an agent is different from a chatbot.

Strong: A chatbot generates one response per turn. An agent loops: observe, plan, act with a tool, observe the result, decide whether to continue. State persists across turns. Failure modes are different: infinite loops, drift from the plan, tool errors that need recovery.

Weak: "An agent uses tools." Surface-level.

Q26. What is the ReAct pattern and what are its weaknesses?

Strong: Reason-Act-Observe loop where the model reasons in natural language, calls a tool, observes the result, and reasons again. Weaknesses: the prompt grows with every step, so total tokens processed grow roughly quadratically with step count; the model can hallucinate tool calls; and it has no global plan, so it drifts on multi-step tasks.

Weak: Knows the name but cannot articulate failure modes.

Q27. How would you design a multi-agent system to research and write a market analysis report?

Strong: Orchestrator agent that plans subtasks, delegates to specialist agents (search, summarize, fact-check, draft), and a critic agent that reviews before final output. State stored in a shared scratchpad. Hard stop on token budget and step count. LangGraph or Claude Agent SDK for the topology.

Weak: "One agent that does everything." Will blow context and drift.

Q28. What is the difference between LangGraph, CrewAI, and the Claude Agent SDK?

Strong: LangGraph is a stateful graph framework: explicit nodes and edges, good for complex topologies. CrewAI is role-based with simpler abstractions: fast to prototype, harder to debug at scale. Claude Agent SDK is closer to the metal, with first-class tool use and built-in subagent patterns. Pick based on team familiarity and observability needs.

Weak: "They are all the same." They are not.

Q29. How do you prevent an agent loop from running forever?

Strong: Hard limits on step count, token budget, and wall-clock time. Cycle detection on repeated tool calls with the same arguments. A judge agent that checks "are we making progress" every N steps. Manual interrupt hooks.

Weak: "Set max_iterations." Necessary but not sufficient.
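
The hard limits and cycle detection from the strong answer can live in one small guard object that the agent loop consults after every step. This is a sketch with made-up thresholds and names; the progress-judging agent and interrupt hooks are left out.

```python
import json
import time
from collections import Counter
from typing import Optional


class LoopGuard:
    """Tracks steps, tokens, wall-clock time, and repeated tool calls."""

    def __init__(self, max_steps: int, max_tokens: int, max_seconds: float,
                 max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_repeats = max_repeats
        self.steps = 0
        self.tokens = 0
        self.started = time.monotonic()
        self.seen = Counter()

    def check(self, tool_name: str, arguments: dict, tokens_used: int) -> Optional[str]:
        """Record one step; return a stop reason, or None to keep going."""
        self.steps += 1
        self.tokens += tokens_used
        # Identical tool + arguments is the cheapest signal of a stuck loop.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "cycle detected"
        if self.steps > self.max_steps:
            return "step limit exceeded"
        if self.tokens > self.max_tokens:
            return "token budget exceeded"
        if time.monotonic() - self.started > self.max_seconds:
            return "time limit exceeded"
        return None
```

In the agent loop you would call `guard.check(...)` after each tool call and break out, with the reason logged to the trace, as soon as it returns anything other than `None`.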

Q30. What is human-in-the-loop and where do you add it?

Strong: Pausing the agent at high-stakes steps (sending an email, spending money, modifying production data) and surfacing the proposed action to a human for approval. Add it wherever the cost of a wrong action exceeds the cost of waiting for a human.

Weak: "At the end." Wrong; by then the damage is done.

Q31. How do you debug an agent that silently produces wrong results?

Strong: Trace every step: tool inputs, tool outputs, reasoning text, token counts. Replay the trace in a dev environment. Add assertions on intermediate state. Tools: LangSmith, Langfuse, Arize, OpenTelemetry with GenAI semantic conventions.

Weak: "Look at the final output." Not a debugging strategy.

Q32. What is agent memory and what are the patterns?

Strong: Short-term: the running context. Long-term: vector store of past interactions, summarized periodically. Episodic: structured logs of past task runs. Procedural: learned skills stored as reusable prompts. Pick the minimum that solves your use case.

Weak: "Just stuff everything in the context window." Will blow cost and degrade quality.

Section 5: Production / Ops / Evaluation

This section separates the engineers from the prototypers. Senior candidates should breeze through this.

Q33. What does your observability stack look like for an LLM application?

Strong: Structured logs with trace IDs, token counts per call, latency p50/p95/p99, tool call traces, prompt version, model version, eval scores. Stack: OpenTelemetry + Langfuse or LangSmith + Datadog or Grafana. Alert on cost spikes and quality regressions.

Weak: "We log requests and responses." Necessary, nowhere near sufficient.

Q34. How do you run a canary deployment for a prompt change?

Strong: Route 5% of traffic to the new prompt, log eval scores and user feedback for 24-48 hours, auto-rollback if quality drops by more than X%, gradually ramp to 100%. Treat prompts like code with versioning, PR review, and CI evals.

Weak: "Push it and watch." Cowboy.
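
Two pieces of that answer are easy to sketch: sticky traffic assignment and the rollback check. The hashing scheme, the salt, and the relative-drop rule below are assumptions for illustration; your rollback threshold ("X%") and eval metric are whatever your team has agreed on.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "prompt-v2") -> bool:
    """Deterministically assign a user to the canary so they stay in one arm."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100


def should_rollback(baseline_score: float, canary_score: float, max_drop_pct: float) -> bool:
    """True when the canary's eval score fell by more than the allowed percentage."""
    if baseline_score <= 0:
        return False
    drop_pct = (baseline_score - canary_score) / baseline_score * 100
    return drop_pct > max_drop_pct


# Roughly 5% of users see the new prompt; the same user always gets the same arm.
canary_users = [u for u in (f"user-{i}" for i in range(10_000)) if in_canary(u, 5)]
```

Hashing on user ID instead of picking randomly per request keeps each user's experience consistent and makes the two arms comparable. Changing the salt reshuffles the assignment for the next experiment.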

Q35. What is an eval harness and what does yours include?

Strong: A set of test inputs with expected outputs (or reference answers), graded by deterministic checks where possible and LLM-as-judge where not. Runs on every prompt change. Includes regression set, edge cases, and adversarial cases. Calibrated against human labels on a sample.

Weak: "We have some test prompts." Not an eval harness.

Q36. How do you handle PII in an LLM pipeline?

Strong: Detection (Presidio, Amazon Comprehend) at ingest, redaction or tokenization before sending to the provider, opt-out of provider training, regional routing for data residency, audit logs of every PII handling event. If you handle healthcare data, get a BAA signed with the provider and read it.

Weak: "We trust the provider." Not a strategy.

Q37. Walk me through your cost monitoring.

Strong: Per-tenant token tracking, per-feature cost attribution, daily cost alerts, monthly forecasts, automatic throttling above budget thresholds. Dashboard showing cost per user, cost per request, and cost per outcome (e.g., cost per resolved ticket).

Weak: "We check the OpenAI dashboard." Reactive, not proactive.
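
Per-tenant tracking and per-feature attribution do not need a vendor product to get started. The sketch below is an in-memory version with placeholder prices and a hypothetical daily budget; a production version would persist the counters and reset them on a schedule.

```python
from collections import defaultdict


class CostTracker:
    """Accumulates LLM spend per tenant and per feature (prices per 1M tokens)."""

    def __init__(self, input_price_per_m: float, output_price_per_m: float,
                 daily_budget_usd: float):
        self.input_price = input_price_per_m
        self.output_price = output_price_per_m
        self.daily_budget = daily_budget_usd
        self.by_tenant = defaultdict(float)
        self.by_feature = defaultdict(float)

    def record(self, tenant: str, feature: str, input_tokens: int, output_tokens: int) -> float:
        """Attribute the cost of one call and return it in USD."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.by_tenant[tenant] += cost
        self.by_feature[feature] += cost
        return cost

    def should_throttle(self, tenant: str) -> bool:
        """True once a tenant's accumulated spend exceeds the daily budget."""
        return self.by_tenant.get(tenant, 0.0) > self.daily_budget


tracker = CostTracker(input_price_per_m=1.0, output_price_per_m=2.0, daily_budget_usd=1.0)
first_call = tracker.record("acme", "search", 500_000, 100_000)
```

Calling `record` from the same wrapper that makes the LLM request is what turns cost from a monthly surprise into a per-request metric.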

Q38. What is LLM-as-judge and what are its known biases?

Strong: Using a model to grade another model's outputs. Biases: position bias (prefers the first answer), length bias (prefers longer), self-preference (a model prefers its own outputs). Mitigations: randomize position, control for length, use a different model family as judge, calibrate against human labels.

Weak: "It works fine." It does not without calibration.
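
A simple control for position bias is to ask the judge twice with the answers swapped and only trust verdicts that agree. This is a stricter variant of randomizing position. The `judge` callable is a stand-in for your LLM call, assumed here to return "first" or "second"; the two stub judges exist only to show the behavior.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (question, first, second) -> "first" | "second"


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" when the verdict flips with position."""
    verdict_ab = judge(question, a, b)
    verdict_ba = judge(question, b, a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return "tie"  # inconsistent verdicts are treated as no signal


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub that always prefers whatever is shown first."""
    return "first"


def length_biased_judge(question: str, first: str, second: str) -> str:
    """Stub that always prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"
```

The swap catches the position-biased stub (it yields "tie") but not the length-biased one, which picks the longer answer both times. That is why length control and calibration against human labels are separate mitigations.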

Q39. How do you handle a model deprecation announcement?

Strong: Inventory every call site, run the eval harness against the replacement model, identify regressions, retune prompts where needed, canary deploy, migrate by the deadline. Have a vendor-abstraction layer so you can swap providers without rewriting application code.

Weak: "Switch when the email comes." You will have outages.

Q40. What is your incident playbook when an LLM provider has an outage?

Strong: Multi-provider fallback (router pattern), graceful degradation (cached answers, template responses), status page integration, on-call runbook with the exact CLI commands to flip traffic. Tested quarterly.

Weak: "Wait for it to come back." Acceptable if you are a hobby project.
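
The router pattern with graceful degradation can be sketched as a chain: try each provider in order, then a cache, then a template. The provider callables below are stubs standing in for real SDK clients, and the template text is a placeholder.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (name, function that completes a prompt)


def complete_with_fallback(prompt: str, providers: Sequence[Provider],
                           cache: dict[str, str],
                           template: str = "We are having trouble right now. Please try again shortly.") -> tuple[str, str]:
    """Return (source, text), where source says which layer answered."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            continue  # in production: log the error and emit a metric here
    if prompt in cache:
        return "cache", cache[prompt]
    return "template", template


def primary_down(prompt: str) -> str:
    """Stub simulating a provider outage."""
    raise TimeoutError("provider unavailable")


def secondary_up(prompt: str) -> str:
    """Stub simulating a healthy backup provider."""
    return f"answer: {prompt}"


source, text = complete_with_fallback(
    "hi", [("primary", primary_down), ("secondary", secondary_up)], cache={})
```

Returning the source alongside the text lets dashboards show how much traffic is being served by fallbacks, which is also the signal to test in the quarterly drill.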

Bonus: Behavioral & Take-home ideas

Behavioral questions worth asking:

"Tell me about a prompt change you made that regressed production. How did you catch it, how did you fix it?"
"Describe an LLM bill you reduced. What was the before and after, and what specifically changed?"
"Walk me through an agent failure mode you debugged. What did the traces show?"
"What is one thing the AI ecosystem got wrong in the last 12 months, and what did you do about it?"

Take-home prompts (2-4 hours, paid at market rate):

RAG over a small corpus. Given a folder of 50 markdown docs, build a Q&A system. Deliverables: working code, an eval set of 20 questions with expected answers, a measured retrieval recall@5 and answer faithfulness score, and a one-page README on tradeoffs.
Agent with one tool. Build an agent that uses a weather API to answer multi-step travel questions ("Should I bring an umbrella for my Tokyo trip next Tuesday?"). Deliverables: working code, trace logs of 3 runs, and a paragraph on failure modes.
Eval harness from scratch. Given a prompt and 30 sample inputs/outputs, build an automated eval that scores faithfulness and relevance. Deliverables: the harness, calibration against the provided human labels, and a write-up on observed LLM-as-judge biases.

Score take-homes on correctness, code quality, eval rigor, and the README. The README is the most diagnostic part: it shows how the candidate thinks about tradeoffs.

Scoring rubric

Dimension	1: Reject	3: Junior pass	5: Senior hire
Fundamentals	Confuses tokens and words, no temperature understanding	Knows the basics, struggles on context windows	Explains tokenization, sampling, and context tradeoffs unprompted
Prompt engineering	Copies prompts from Twitter	Iterates with eval, knows few-shot tradeoffs	Has shipped prompt caching, routing, and structured outputs in production
RAG	Embed-and-query demo only	Chunks correctly, knows about reranking	Has built hybrid search, evals retrieval separately, debugs precision vs recall
Multi-agent	Single LLM call only	Has used LangChain or LangGraph for simple flows	Has shipped multi-agent systems with HITL, observability, and budget guards
Production / ops	"We log responses"	Has dashboards, basic alerts	Full eval harness, canary deploys, cost forecasting, incident runbooks
Communication	Hand-wavy, jargon-heavy	Clear but shallow	Specific, names tradeoffs, admits limitations

Hire at average 4.0+. Reject below average 3.0. The 3.0-4.0 zone is where you discuss in the debrief and bias toward "no" if you have other candidates.
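
If you track scores in a spreadsheet or ATS export, the decision rule above reduces to a small function. This sketch assumes one integer score from 1 to 5 per dimension, with the six dimensions from the rubric; the dictionary keys are illustrative.

```python
DIMENSIONS = (
    "fundamentals", "prompt_engineering", "rag",
    "multi_agent", "production_ops", "communication",
)


def decide(scores: dict[str, int]) -> tuple[float, str]:
    """Average the six rubric scores and map the result to a decision."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    average = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    if average >= 4.0:
        return average, "hire"
    if average < 3.0:
        return average, "reject"
    return average, "debrief"  # 3.0-4.0: discuss, bias toward "no"


candidate = {
    "fundamentals": 5, "prompt_engineering": 4, "rag": 5,
    "multi_agent": 3, "production_ops": 4, "communication": 4,
}
average, decision = decide(candidate)
```

Requiring every dimension to be scored keeps a strong RAG interview from masking the fact that nobody asked about production ops.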

Hire or rent the team

Running a 4-stage AI engineer interview loop takes 40+ hours of senior eng time per hire, and the 2026 market is competitive: base salaries for production-ready AI engineers in the US are now $220k-$340k before equity (we break this down in our AI engineer salary guide for 2026). If you need shipping speed instead of headcount, AY Automate runs a pre-vetted AI agent development team that integrates as an embedded squad (Claude Code, Claude Agent SDK, LangGraph, and production RAG infrastructure) on a flat monthly fee. We have shipped multi-agent systems for Series A through Fortune 500 clients and we are happy to debate the build-vs-rent calculus on a 30-minute consultation call.

FAQ

What is an AI engineer in 2026?

An engineer who designs, builds, and operates LLM-powered features in production. The role spans prompt engineering, RAG, multi-agent orchestration, evals, and LLMOps. It is distinct from "ML engineer" (model training) and "data scientist" (analysis). Most teams need one AI engineer per 2-4 product engineers.

How is an AI engineer different from a prompt engineer?

A prompt engineer optimizes prompts. An AI engineer owns the entire system around the prompt: retrieval, orchestration, evals, observability, cost. Prompt engineering is one skill inside the AI engineer role. Hiring a "prompt engineer" in 2026 usually means you actually need an AI engineer.

Should AI engineers know how to fine-tune models?

Nice to have, not required. Most production AI work in 2026 is prompt engineering plus RAG plus orchestration. Fine-tuning matters for narrow style or format adherence and for cost optimization at scale. If your role description says "must have fine-tuning experience" but you have not measured whether fine-tuning is needed, you are over-specifying and shrinking your candidate pool.

How long should the take-home be?

2 to 4 hours, always paid at market hourly rate. Anything longer is disrespectful and filters out senior candidates with options. Anything shorter does not give you enough signal on code quality and eval rigor.

Are LeetCode questions useful for AI engineer roles?

Mostly no. The job is system design, prompt iteration, and debugging traces, not implementing a B-tree from scratch. Use one easy data-structures question if you must, then move on to the AI-specific sections.

What red flags should I watch for in 2026?

Candidates who only name one provider (OpenAI lock-in is a sign of shallow exposure), candidates who cannot explain a single eval they have run, candidates who treat agents as magic instead of as bounded loops with known failure modes, and candidates who have never optimized an LLM bill.

Should we hire a full-time AI engineer or use an agency?

Hire full-time if AI is a core product surface you will keep iterating on for 2+ years. Use an AI agent development agency if you need to ship a specific system in 90 days, if you cannot compete on comp with FAANG, or if you want production observability and evals built in from day one without spending 6 months learning the stack. Many teams use both: agency to ship v1 fast, full-time hire to own it long-term.

What if a candidate is strong on RAG but weak on agents (or vice versa)?

That is normal in 2026. The role has fragmented faster than people can cross-train. Hire for the dimension you need most right now and either pair them with a complementary engineer or budget 3-6 months of ramp time on the weaker axis. To generate a complete job description for the role you are hiring (including the interview questions above), use the AI Job Description Generator.

About the Author

Taha

AI Engineer

Taha builds and ships custom AI agents and workflow automations for AY Automate clients across SaaS, finance, and professional services.