Book a Free Strategy Call
Skip the read — talk to Walid in 30 min.
Free strategy call. We map your AI engineering team, you keep the notes.
Onboarding an AI engineer in 2026 is not the same as onboarding a backend hire from 2022. The first week is no longer "clone the repo, read the wiki, ship a small bug fix." It is "get authenticated to three model providers, pull down the eval harness, get read-only access to a production trace store, and reproduce a regression on a frozen dataset." If your onboarding still looks like a generic engineering ramp, your new hire will spend their first three weeks waiting on access tickets instead of shipping agents.
The job has changed. AI engineers do not just write code — they write prompts, build eval suites, tune retrievers, monitor token spend, debug non-deterministic failures, and ship features that depend on third-party model availability. Their daily loop is prompt → eval → trace → fix → re-eval, not code → test → deploy. Onboarding has to mirror that loop. That means model access on day one, an eval harness they can actually run locally, sandboxed access to anonymized production traces, and a senior engineer who can pair on prompt debugging — not just code review.
This is the playbook we use at AY Automate when we onboard AI engineers into client teams and into our own AI agent development practice. It covers the pre-day-1 checklist, a day-by-day Week 1 and Week 2 plan, clean 30/60/90-day milestones, a mentor + buddy structure that actually scales, and the KPIs that predict whether the hire will be a top performer or a 6-month miss. If you have not yet hired, pair this with our how to hire AI engineers guide and our AI engineer interview questions — onboarding starts during the loop, not on day one.
Pre-day-1 checklist
Everything on this list needs to be done before the new hire opens their laptop on Monday morning. If any of it slips into week one, you have already lost three days.
Hardware and machine setup
- 16-inch MacBook Pro M3 Pro or M4 (32GB RAM minimum). AI engineers run local models, vector DBs, and eval harnesses concurrently — 16GB is not enough.
- External monitor (27" minimum), keyboard, mouse, headset shipped to home address one week before start.
- Pre-installed: Cursor or VS Code, Claude Code CLI, Docker Desktop, Ollama, Python 3.12 + uv, Node 22 + pnpm, gh CLI, direnv, 1Password CLI.
- FileVault enabled, MDM enrolled, OS up to date, admin account provisioned.
Accounts and identity
- Google Workspace / Microsoft 365 mailbox, calendar invites sent for week-one meetings.
- SSO into GitHub, Slack, Linear / Jira, Notion / Confluence, PagerDuty / Opsgenie, Sentry, Datadog or Grafana.
- 1Password or Bitwarden vault provisioned with team-shared items.
- VPN or Tailscale enrollment, with access to non-prod environments only.
Model and infra access
- Anthropic Console seat (Claude Sonnet 4.5, Opus 4.5, Haiku 4.5).
- OpenAI org membership with GPT-5 and o3 access at minimum.
- Google AI Studio / Vertex seat for Gemini 2.5 Pro and 3.0 Flash.
- A scoped API key per provider, stored in 1Password, rotated quarterly.
- Vector DB credentials (Pinecone, Weaviate, Qdrant, or pgvector) — read-write on dev, read-only on prod.
- LangSmith, Langfuse, Helicone, or Braintrust seat for trace and eval review.
- OpenRouter or AI Gateway account if the team multiplexes providers.
Code and documentation
- Repo read access on day 1, write access after first PR is reviewed.
- Eval harness repo cloned and runnable locally with
make evalorpnpm eval. - A "first-week reading list" of 8–12 docs: architecture overview, prompt style guide, eval methodology, incident runbook, on-call rotation, data handling policy.
- A frozen sample of anonymized production traces (PII-scrubbed) the new hire can replay.
Calendar pre-loaded
- Day 1: HR + manager 1:1, buddy intro, lunch with the team.
- Day 2–5: 30-min shadow sessions with each team member.
- Week 2: first PR review, first eval run, first incident review observation.
If you cannot tick every item above, push the start date by a week. A delayed start is cheaper than three wasted weeks.
Week 1: Codebase + product orientation
Week 1 is not for shipping. It is for building a mental model — of the product, the architecture, the eval methodology, and the team. Resist the urge to assign a "starter ticket" on day 2.
Day 1 — Identity, environment, first run
- Morning: HR, equipment unboxing, password reset, 1Password import, SSO checks.
- Manager 1:1 (60 min): role expectations, 30/60/90 plan walkthrough, success metrics, team norms.
- Buddy intro (30 min): the buddy handles all "stupid questions" for the first month — not the manager.
- Afternoon: clone the main monorepo, run
make bootstrap, get a "hello world" prompt running against Claude Sonnet 4.5 from the local dev environment. - End-of-day check-in: did they get a model response back? If no, fix it before tomorrow.
Day 2 — Product walkthrough
- 90-min product demo from a senior engineer or PM: every user-facing surface, every agent, every workflow.
- Self-paced: walk through the production app as a real user. Take notes on every confusing UX moment.
- 1:1 with the head of product: roadmap, why this quarter's bets, what is explicitly not being built.
Day 3 — Architecture deep-dive
- 2-hour whiteboard session with a staff engineer: data flow, prompt routing layer, retrieval layer, eval pipeline, observability stack.
- Read the top 5 ADRs (architecture decision records).
- Trace one real user request end-to-end through the logs.
Day 4 — Eval methodology
- Walkthrough of the eval harness: what is graded, by whom, on what cadence.
- Run the full eval suite locally. Note runtime, cost, and pass rates.
- Read the last 3 eval regression reports.
Day 5 — Shadow + retro
- Shadow a live customer support escalation or an on-call incident.
- 30-min retro with the manager: what is confusing, what is missing, what to fix next week.
- File the first PR — a documentation fix or a typo in a prompt template. Goal is to exercise the PR pipeline, not to ship value.
By end of Week 1, the new hire should be able to draw the system architecture on a whiteboard, explain how the eval harness scores a response, and name every teammate.
Week 2: First production fix or eval
Week 2 is the first real shipping week. Scope is small, guardrails are heavy, and the goal is to learn the production pipeline by moving through it — not by reading about it.
Pick the right first ticket
The first ticket should be:
- Touchable in 2–3 days, not 2 weeks.
- Owned by the new hire's eventual area (retrieval, agents, evals, infra).
- Visible enough that shipping it matters, small enough that failure is recoverable.
- Reversible — feature-flagged or behind a low-traffic path.
Good examples: add a new eval case for a known failure mode, fix a prompt regression flagged in last week's trace review, add a retry with exponential backoff to a flaky tool call, add observability tags to an under-instrumented agent step.
Bad examples: refactor the prompt router, design a new agent, swap the vector DB.
Guardrails for week 2
- All prompt changes go through the eval harness before merge — no exceptions.
- All retriever changes require a documented A/B on a frozen eval set.
- All production model calls stay behind feature flags for at least 48 hours.
- No direct database writes against prod. No API key access to prod model endpoints. No write access to the vector DB prod cluster.
Pair, do not solo
The buddy pairs for at least 90 minutes per day during week 2. Most failures in AI engineering are not code failures — they are prompt subtlety, retriever drift, or eval-grader bias. Pairing surfaces these in real time.
End of week 2 deliverable
- One PR merged to main with an eval delta documented (e.g. "added 12 cases to the support-agent eval suite, pass rate dropped from 91% to 84% as expected — the new cases cover a known gap").
- One short loom or written walkthrough explaining the change.
- A retro doc: what was surprising, what was harder than expected, what to fix in the onboarding.
30-day milestone: first agent or feature in prod
By day 30, the hire should have shipped one meaningful feature or one new agent capability to production behind a feature flag, with eval coverage, observability, and a documented rollback plan.
What "meaningful" looks like in practice:
- A new tool added to an existing agent, with 10+ eval cases and a measured pass rate.
- A new retrieval strategy A/B-tested on a frozen dataset and rolled out to 10% of traffic.
- A latency or cost optimization with a documented before/after.
- A new eval grader or judge for a previously un-measured behavior.
The 30-day review
A 60-minute structured review with the manager and the buddy:
- Walk through every PR merged.
- Walk through the eval deltas — improvements and regressions.
- Walk through observability dashboards the hire built or owns.
- Identify one thing the hire is now the team expert on.
- Identify one gap to close in the next 30 days.
If by day 30 the hire has not merged a single PR, the onboarding has failed. The problem is almost always access, scope, or pairing — not the hire.
60-day milestone: own a system + on-call rotation
By day 60, the hire should own a system end-to-end and be in the on-call rotation.
System ownership means
- They are the named owner in the service catalog.
- They review every PR that touches the system.
- They maintain the eval suite for that system.
- They write the runbook for incidents in that system.
- They present the system's metrics in the weekly review.
On-call readiness
Before joining the on-call rotation:
- Shadow at least 3 real incidents.
- Run a "GameDay" exercise: a senior engineer triggers a simulated outage (rate limit from a provider, vector DB timeout, prompt regression in production) and the new hire works the runbook.
- Read every post-mortem from the last 6 months.
- Pair on-call for one full week before holding the pager solo.
60-day expectations
- Two to three production features shipped, with eval coverage and observability.
- One incident handled solo or as primary, with a written post-mortem.
- One documentation contribution that another engineer has referenced unprompted.
- One eval methodology improvement (new grader, new dataset, new replay tooling).
90-day milestone: lead an initiative
By day 90, the hire is no longer "new." They lead one initiative end-to-end: scoping, eval design, architecture, build, ship, and post-launch tuning.
What "lead an initiative" looks like
- A net-new agent or feature with clear product impact.
- A multi-week refactor that improves eval pass rates by a measurable amount.
- A platform improvement — eval harness v2, trace replay tooling, prompt versioning system.
- A migration — provider swap, model upgrade, retrieval layer rewrite.
The initiative is theirs to scope, defend, build, and own. The manager reviews progress weekly. The buddy is now a peer, not a teacher.
The 90-day review
A 90-minute review with the manager, the buddy, and one cross-functional partner (product, design, or infra). Walks through:
- The initiative — outcome, eval impact, learnings.
- The system they own — health, gaps, plans.
- The on-call performance — incidents handled, runbooks improved.
- Career conversation — what they want to learn in the next quarter, what to ship next.
If the 90-day review goes well, the new hire is now a senior member of the team in everything but tenure. If it does not, the problem is almost always scope or environment, not skill.
Mentor + buddy structure
AI engineering onboarding fails when there is one mentor doing five jobs. Split the role.
The manager — owns the 30/60/90 plan, the calibration, the career conversation, and the "is this working" call. Does not pair on code or prompts.
The buddy — a peer engineer, ideally 1–2 years into the team. Owns the daily "stupid questions" channel, pairs at least 90 minutes per day in week 1–2, and tapers down to 2–3 hours per week by day 30. Buddy rotation is 90 days — long enough to build a relationship, short enough not to burn the buddy out.
The technical mentor — a staff engineer in the new hire's eventual area. Owns architecture review, prompt review, and the "is the design sound" call. Meets weekly for the first 60 days.
The product partner — a PM or designer who pairs the new hire on understanding the user. Often skipped, always a mistake.
Four people, four jobs. None of them is the manager doing everything.
Onboarding KPIs that actually matter
The wrong KPIs: "ramp time," "days to first commit," "lines of code." All vanity, all gameable.
The right KPIs:
- Eval delta per PR — does the hire move the eval pass rate, and in which direction? Track every merged PR.
- PR throughput by week — week-over-week trend matters more than absolute count. Week 4 should be 3–5x week 1.
- Documentation contributions — net-new docs, runbook updates, ADRs authored or reviewed. AI engineers who do not write are AI engineers who do not scale.
- Time-to-first-eval-improvement — days between start date and the first PR that measurably improves an eval suite.
- Incident response score — for on-call hires, average time-to-mitigation and post-mortem quality.
- Peer review score — at day 60 and day 90, ask 3 peers a 5-point rubric on collaboration, quality, and judgment.
- Manager confidence trend — manager rates "would I want to ship a critical feature with this person leading?" weekly. The trend matters more than the absolute value.
Track these in a simple Notion or Linear dashboard. Review monthly. Adjust scope if any of them stalls.
What to give them on day 1
A single page in Notion or Confluence, titled "Your First Day," that contains:
- Laptop already imaged and shipped.
- 1Password vault link with every account credential.
- SSO logins verified (GitHub, Slack, Linear, Notion, PagerDuty, Sentry, Datadog).
- Anthropic, OpenAI, and Google AI Studio API keys scoped to dev, stored in 1Password.
- Vector DB credentials (Pinecone or pgvector) — read-write on dev, read-only on prod.
- LangSmith, Langfuse, Helicone, or Braintrust seat with trace read access.
- Eval harness repo cloned,
make evalworking locally before lunch. - Frozen sample of 500 anonymized production traces, downloadable from a private S3 bucket.
- A scoped GitHub PAT for CI workflows, rotated every 90 days.
- A printed (or PDF) one-pager with: the system architecture diagram, the eval methodology, the on-call rotation, the data handling policy, and the names + roles of every teammate.
- Two calendar holds: 1:1 with the manager at 10am, lunch with the buddy at noon.
If any of this is missing on day 1, escalate. The first 8 hours set the tone for the next 90 days.
Ready to onboard your next AI engineer well?
If you are scaling an AI team in 2026, the bottleneck is rarely hiring — it is ramping. We have onboarded AI engineers into AY Automate and into client teams across SaaS, fintech, and ops automation. We build the eval harness, the trace replay tooling, the prompt versioning system, and the runbooks that make this playbook executable. If you want help designing the onboarding for your team, or if you want us to embed an AI engineer who already runs this playbook from day one, see our AI agent development service, our how to hire AI engineers guide, and our AI engineer interview questions. Or book a consultation and we will walk through your current onboarding flow in 30 minutes.
FAQ
How long should AI engineer onboarding take in 2026?
Plan for a full 90 days to "fully ramped," with clear 30 and 60 day milestones along the way. Anyone who promises 2 weeks is selling. Senior hires can ship in week 2, but ownership and on-call readiness take 60 days minimum.
Is AI engineer onboarding really different from backend engineer onboarding?
Yes. The daily loop is prompt → eval → trace → fix → re-eval, not code → test → deploy. That means model access, eval harnesses, trace replay tooling, and anonymized production data have to be wired up before day one, not requested in week three.
Do I need a dedicated buddy or can the manager handle it?
Split the roles. The manager owns calibration and career; the buddy owns daily pairing and "stupid questions." When the same person does both, the new hire stops asking questions by week three.
What if the new hire has not used Claude Code or the Claude Agent SDK before?
That is fine for most senior hires — strong engineering judgment transfers. Add a week of guided exercises: build a small agent end-to-end with the Claude Agent SDK, run it through your eval harness, ship it behind a flag. Reference our AI engineer interview questions — if they passed those, they will pick up the SDK in days.
Should the new hire be on-call in their first 30 days?
No. Shadow at least three incidents, run a GameDay simulation, and pair on-call for a full week before holding the pager solo. Most teams ramp into on-call between day 45 and day 60.
How do I know if onboarding is failing?
Three signals: no PR merged by day 14, no eval delta by day 30, no clear system ownership by day 60. If you see two of the three, the problem is almost always access, scope, or pairing — not the hire. Fix the environment before evaluating the person.
What is the single biggest mistake teams make onboarding AI engineers?
Treating it like backend onboarding. They underspec model access, skip the eval harness walkthrough, give too-vague first tickets, and use vanity KPIs like "days to first commit." Use eval delta per PR and PR throughput trend instead.
Should I onboard contractors and full-time hires differently?
Yes. Contractors get a narrower scope, a fixed initiative, and read-only production access for the duration. Full-time hires get the full 30/60/90 with system ownership and on-call. If you need help designing the contractor version, our AI agent development team can embed senior engineers who already run this playbook — book a consultation to scope it.
Book a Free Strategy Call
Building this in production?
Walid runs a 30-min call to map your AI engineering team. Free, no slides.

Taha builds and ships custom AI agents and workflow automations for AY Automate clients across SaaS, finance, and professional services.
