

Day 10 / 30 Days of Claude Code

gstack: How YC's CEO Engineers with Claude Code

Garry Tan's 8 cognitive modes for Claude Code. A full toolkit that treats AI like an engineering org with distinct roles, not a single chatbot doing everything at once.
8 skill modes
50+ browser commands
10x parallel sessions (Conductor)
“Planning is not review. Review is not shipping. Founder taste is not engineering rigor. Each requires a fundamentally different cognitive stance.”
Garry Tan, gstack README

The 8 modes: the full engineering org
Each mode is a separate skill file living in .claude/skills/gstack/. One job per file. One persona per session. No blending. No confusion.
/plan-ceo-review: Founder Mode, 10-star product thinking
/plan-eng-review: Engineering Manager, lock architecture
/review: Paranoid Staff Engineer, hunt bugs
/ship: Release Engineer, deterministic deploy
/browse: QA Engineer, persistent browser daemon
/qa: QA Lead, diff-aware test matrix
/setup-browser-cookies: Session Manager, auth state persistence
/retro: Engineering Manager, metrics + retrospective

Who this is for
You already use Claude Code heavily and want consistent, high-rigor workflows instead of one mushy generic mode. You have felt the pain of the AI trying to plan and code simultaneously, or reviewing its own work while writing it. gstack solves that by giving each cognitive job its own isolated persona, like hiring specialists instead of asking one generalist to do everything.
Why this matters
Most people dump everything into one prompt. Garry treats each mode as a separate cognitive stance, like switching between CEO, staff engineer, release engineer, and QA lead. The AI performs dramatically better because it knows exactly what job it has right now. Not “help me with my project.” Instead: “you are a paranoid staff engineer hunting for production incidents before they happen.”
Modes 1 + 2

Plan: CEO + Engineering Review

Two distinct planning modes. The founder rethinks the problem. The engineer locks the architecture. Never at the same time.
Mode 1: /plan-ceo-review
Founder Mode
The 10-star product test
CEO review does not build what was asked. It rethinks the problem from first principles and finds the dream version, the “10-star product,” then works backwards to what is shippable.

The core question: “If we could give the user a magical experience with zero constraints, what would that look like?” Then: “What is the 8-star version we can actually build?”
“Instead of building a literal photo upload feature, CEO mode identifies the REAL feature: helping sellers create listings that sell, product ID, web enrichment, title/description drafting, hero image suggestion. The user asked for upload. They needed a listing creation engine.”
gstack /plan-ceo-review example
Three scope stances
Every CEO review ends with one of three explicit stances:

EXPANSION: Dream ambitious. The feature request is actually too small. Expand scope because the opportunity is bigger than what was asked.

HOLD SCOPE: Bulletproof the current plan. The scope is right, now make it airtight. Fill gaps, tighten edges, add safety nets.

REDUCTION: Surgical minimum. Cut everything that is not essential to the core value proposition. Ship the smallest thing that proves the idea.
The 9 Prime Directives
Every plan produced by CEO review must satisfy all nine. These are non-negotiable architectural constraints:
1. Zero silent failures — every error path handled, every edge case addressed
2. Named errors — no generic “something went wrong.” Every failure has identity
3. Data shadow paths — what happens when data is missing, partial, stale, or corrupted
4. Interaction edge cases — double-click, back button, tab switching, network drop
5. Observability first-class — logging, metrics, traces built in from day one
6. Mandatory diagrams — no plan is complete without a visual architecture
7. Written deferrals — if something is out of scope, write it down explicitly
8. Six-month optimization — will this still make sense in 6 months?
9. Permission to restart — if the plan does not work, kill it and start over
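
Directives 1 and 2 are concrete enough to show in code. A minimal TypeScript sketch of the pattern they demand, with hypothetical names (Listing, callEnrichmentApi, and the error class are illustrative, not gstack code):

type Listing = { id: string; title: string };
declare function callEnrichmentApi(id: string): Promise<Listing>; // hypothetical

// Directive 2: the failure has an identity, not "something went wrong".
class EnrichmentTimeoutError extends Error {
  constructor(readonly listingId: string) {
    super(`Enrichment timed out for listing ${listingId}`);
    this.name = "EnrichmentTimeoutError";
  }
}

async function enrichListing(listingId: string): Promise<Listing> {
  try {
    return await callEnrichmentApi(listingId);
  } catch (err) {
    // Directive 1: no silent failures. Log, then rethrow with a name.
    console.error("enrichment_failed", { listingId, err });
    throw new EnrichmentTimeoutError(listingId);
  }
}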
/plan-ceo-review
$ claude /plan-ceo-review

# Reading: git log, TODO.md, CLAUDE.md, .gstack/context

SCOPE STANCE: EXPANSION
User asked for: photo upload for product listings
10-star version: full listing creation engine

Architecture diagram: (mermaid output)
Prime directive check: 9/9 satisfied
Estimated complexity: 3 sprints
Deferred items: 2 (written to .gstack/deferrals.md)

Awaiting founder approval before engineering review.

Mode 2: /plan-eng-review
Engineering Manager Mode
Lock the architecture before writing code
Engineering review takes the CEO-approved plan and turns it into a buildable architecture. The key insight: “LLMs get way more complete when you force them to draw the system.” Diagramming is not optional, it is the mechanism that prevents architectural drift.

This mode outputs architecture diagrams, sequence diagrams, state machines, and test matrices. It forces the model to think about system boundaries, data flow, failure modes, and trust boundaries, all before a single line of implementation.
Outputs
Architecture diagram: component boundaries, data stores, external services

Sequence diagrams: every user flow, every API call chain

State machines: for anything with lifecycle (orders, auth, sessions); see the sketch after this list

Test matrix: what to test, at which layer, with which strategy
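
The state machine output is worth a concrete example. A minimal sketch of a listing lifecycle, assuming illustrative states (the example run below generates a "listing lifecycle" machine but does not publish it):

type ListingState = "draft" | "enriching" | "review" | "published" | "archived";

// Legal transitions only; anything else is a named error.
const transitions: Record<ListingState, ListingState[]> = {
  draft: ["enriching"],
  enriching: ["review", "draft"], // enrichment failure falls back to draft
  review: ["published", "draft"],
  published: ["archived"],
  archived: [],
};

function assertTransition(from: ListingState, to: ListingState): void {
  if (!transitions[from].includes(to)) {
    throw new Error(`IllegalListingTransition: ${from} -> ${to}`);
  }
}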
Focus areas
System boundaries: where does our code end and third-party begin?

Data flow: how does data move, transform, and persist?

Failure modes: what happens when each dependency fails?

Trust boundaries: where does user input cross into trusted execution?
/plan-eng-review
$ claude /plan-eng-review

# Consuming CEO-approved plan

Generating architecture diagram...
Generating sequence diagram: upload flow
Generating sequence diagram: enrichment pipeline
Generating state machine: listing lifecycle

Trust boundary identified: user upload -> image processing
Failure mode: enrichment API timeout (added retry + fallback)
Data flow gap: no cache invalidation on listing update

Test matrix: 14 unit / 8 integration / 3 e2e
Architecture locked. Ready for implementation.
Why two planning modes instead of one
CEO mode asks “are we building the right thing?” Engineering mode asks “are we building it the right way?” When you combine these into one prompt, the AI either skips the strategic rethink (jumping straight to architecture) or never locks down the technical details (staying at a high level). Splitting them produces dramatically better output for both.
Mode 3

Review: Paranoid Staff Engineer

The reviewer who has been burned before. Hunting for the production incident before it happens.
Cognitive stance: Paranoid
“I want the model imagining the production incident before it happens. Not suggesting better variable names, finding the bug that pages you at 3am.”
Garry Tan on /review design
What it hunts for
N+1 queries
Stale reads across replicas
Race conditions in concurrent paths
Concurrency bugs in shared state
Bad trust boundaries (user input reaching eval/exec)
SQL injection vectors
Missing database indexes on hot paths
Broken invariants across transactions
Retry logic that amplifies failures
Silent catch blocks that swallow errors
LLM output passed to dangerous functions
Auth bypasses in middleware chains
Unbounded loops or recursion
Missing rate limiting on public endpoints
Pass 1: Critical
Blocks the merge
SQL injection and query safety
LLM trust boundary violations
Authentication and authorization bypasses
Silent failure paths (empty catch blocks)
Data corruption risks (race conditions, partial writes)
Security vulnerabilities (XSS, CSRF, SSRF)

If any Pass 1 issue exists, the review fails. No exceptions.
Pass 2: Informational
Improves the code
Naming and readability suggestions
Performance optimization opportunities
Missing indexes on non-critical paths
Code duplication that could be extracted
Style inconsistencies
Documentation gaps

Pass 2 issues are suggestions. They never block a merge.

Greptile Integration
Auto-triage PR comments with learning
gstack integrates with Greptile for automated PR comment triage. Every comment is classified as either a valid issue or a false positive. The system learns over time:

Auto-triage: Greptile scans the PR, generates comments, then gstack's review mode distinguishes valid issues from noise.

False positive learning: When a comment is marked as a false positive, it gets stored in .gstack/greptile-history.md. Future reviews auto-skip similar patterns. The system gets less noisy over time instead of more.
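
A hypothetical sketch of what that filtering could look like, assuming a line-per-pattern history file and a simplified comment shape (neither is Greptile's or gstack's actual format):

type PRComment = { id: number; body: string };

// Patterns confirmed as false positives, one "- pattern" line each
// (the file path is from the article; the format is assumed).
const history = await Bun.file(".gstack/greptile-history.md").text();
const knownFalsePositives = history
  .split("\n")
  .filter((line) => line.startsWith("- "))
  .map((line) => line.slice(2).trim());

// Keep only comments that do not match a known false-positive pattern.
function triage(comments: PRComment[]): PRComment[] {
  return comments.filter(
    (c) => !knownFalsePositives.some((pattern) => c.body.includes(pattern)),
  );
}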
/review
$ claude /review

# Pass 1: Critical (blocks merge)

CRITICAL [SQL-INJECT] Unsanitized user input in query builder
-> src/db/listings.ts:47 — req.query.sort passed directly to ORDER BY
-> Fix: allowlist sort columns, reject unknown values

CRITICAL [LLM-TRUST] LLM output used in dynamic import()
-> src/services/enrichment.ts:112 — model response drives module loading
-> Fix: validate against known module allowlist before import

CRITICAL [SILENT-FAIL] Empty catch block swallows payment error
-> src/payments/process.ts:203 — catch(e) {} with no logging or rethrow
-> Fix: log error, emit metric, rethrow or return explicit failure

# Pass 2: Informational (suggestions only)

INFO [NAMING] `data` at line 89 -> consider `userProfile`
INFO [INDEX] listings.created_at could use an index (used in ORDER BY)

# Greptile auto-triage
Greptile comments: 7 total
Valid issues: 3 (matched Pass 1 findings)
Skipped: 2 (known false positives from history)
Noise: 2 (style-only, moved to Pass 2)

RESULT: 3 critical issues — merge blocked
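
The fix recommended for the first finding is worth spelling out. A minimal sketch of the allowlist approach, with illustrative column names:

// Allowlist sort columns; never interpolate user input into ORDER BY.
const SORTABLE_COLUMNS = new Set(["created_at", "price", "title"]);

function safeSortColumn(requested: string): string {
  if (!SORTABLE_COLUMNS.has(requested)) {
    // Reject unknown values explicitly: a named error, per the directives.
    throw new Error(`InvalidSortColumn: ${requested}`);
  }
  return requested;
}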
Paranoid review vs polite review
Most AI code reviews are polite. They say “you might want to consider...” gstack's review mode is adversarial by design. It assumes every input is hostile, every external call will fail, every user will do the unexpected thing. The separation matters because when the AI is also trying to be helpful (writing code, suggesting features), it pulls its punches on criticism. Isolating the review stance produces significantly harder, more useful feedback.
Mode 4

Ship: Release Engineer

Fully automated, non-interactive deployment. Same input, same output, every time. No “are you sure?” prompts.
Cognitive stance: Deterministic
The 9-step release process
The key rule: “User's /ship command means proceed automatically without confirmation.” Every step runs sequentially. If any step fails, the entire process halts with a clear error.
Step 1 Preflight checks — clean working tree, all tests pass, no uncommitted changes
Step 2 Merge main — pull latest main, resolve conflicts if any
Step 3 Run tests — full test suite, fail fast on first error
Step 4 Eval suites — run evaluation benchmarks if defined
Step 5 Pre-landing review — automated /review pass on final diff
Step 6 Greptile triage — auto-triage any PR comments, skip known false positives
Step 7 Version bump — automatic based on change size (see rules below)
Step 8 Changelog — auto-generate from commit history
Step 9 Commit + Push + PR — bisectable commits, push, open PR
Auto version bumping
Size-based, not ceremony-based
MICRO: Less than 50 lines changed. Bump patch automatically. No questions asked.

PATCH: 50+ lines changed. Bump patch automatically.

MINOR / MAJOR: Only when explicitly requested by the user. The model never decides a minor or major bump on its own.

This removes the “what version should this be?” ceremony from every release.
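
The rule is simple enough to express directly. A sketch under the thresholds above (semver parsing simplified; not gstack's actual implementation):

function bumpVersion(
  current: string,
  linesChanged: number,
  requested?: "minor" | "major",
): string {
  const [major, minor, patch] = current.split(".").map(Number);
  if (requested === "major") return `${major + 1}.0.0`;
  if (requested === "minor") return `${major}.${minor + 1}.0`;
  // MICRO (<50 lines) and PATCH (50+) differ only in label; both bump patch.
  const label = linesChanged < 50 ? "MICRO" : "PATCH";
  console.log(`${label} bump: ${linesChanged} lines changed`);
  return `${major}.${minor}.${patch + 1}`;
}

bumpVersion("1.4.2", 83); // "1.4.3", matching the /ship example below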
Bisectable commits
Every commit is a logical unit
Ship mode does not create one giant squash commit. It splits changes into bisectable units by category:

Infrastructure: config, dependencies, tooling
Models / Schema: database changes, type definitions
Controllers / Routes: business logic, API changes
VERSION + CHANGELOG: always the last commit

This means git bisect actually works when something breaks.
/ship
$ claude /ship

[1/9] Preflight: clean tree, no uncommitted changes
[2/9] Merging main: already up to date
[3/9] Tests: 47 passed, 0 failed (12.3s)
[4/9] Evals: baseline maintained (no regression)
[5/9] Pre-landing review: 0 critical issues
[6/9] Greptile: 2 comments, both known false positives (skipped)
[7/9] Version: 1.4.2 -> 1.4.3 (PATCH, 83 lines changed)
[8/9] Changelog: 3 entries added to CHANGELOG.md

[9/9] Splitting commits:
abc1234 infra: update playwright to 1.42
def5678 feat: listing enrichment pipeline
ghi9012 fix: race condition in upload handler
jkl3456 chore: bump VERSION to 1.4.3, update CHANGELOG

Pushed to origin/feature/listing-enrichment
PR #142 opened: “Listing enrichment pipeline”
Ship complete. Zero manual steps.
Shipping is a script, not a ceremony
When deployment requires human decisions, “what version?”, “should I squash?”, “did you update the changelog?”, it becomes slow and error-prone. gstack makes shipping fully deterministic. The commit history contains all the information needed. The model reads the diff, decides the version bump, splits the commits, generates the changelog, and pushes. Same input, same output, every time.
Modes 5 + 6 + 7

Browse + QA + Cookies

A compiled browser daemon, diff-aware testing, and persistent session management. The largest and most technically ambitious part of gstack.
Mode 5: /browse, QA Engineer
Architecture
Compiled Bun binary + persistent Chromium daemon
The browser tool is a compiled binary (~58MB) built on Playwright and Bun. It runs a persistent Chromium daemon, not a headless screenshot tool that launches and dies on every command.

100-200ms response time on hot commands vs 3-5 seconds cold start on every action with typical browser MCP tools.

Session state carries across commands: cookies, localStorage, sessionStorage, and authentication all persist. Navigate to /dashboard, click a button, check the result, all without re-authenticating.
The Ref System: why @e1, @e2 instead of CSS selectors
gstack does not use CSS selectors or XPath to target elements. Instead, it assigns stable refs like @e1, @e2 to interactive elements on the page. Three reasons CSS selectors fail in practice:

CSP blocks DOM injection: many production apps have Content Security Policies that prevent injecting attributes or scripts into the page. Refs are assigned server-side by the daemon, not injected into the DOM.

Framework hydration strips attributes: React, Next.js, and similar frameworks hydrate the DOM on load, stripping any custom attributes added before hydration completes. Refs are stable because they live in the daemon's element map, not the DOM.

Shadow DOM is unreachable: CSS selectors cannot pierce shadow DOM boundaries. The ref system can, because it uses Playwright's internal element handles which work regardless of DOM encapsulation.
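
gstack's internals are not published in the article, but the mechanism it describes, a daemon-side ref map backed by Playwright element handles, can be sketched. Everything below is illustrative:

import type { Page, ElementHandle } from "playwright";

// Refs live in the daemon's memory, not the DOM, so CSP, hydration,
// and shadow DOM boundaries never invalidate them.
const refs = new Map<string, ElementHandle<SVGElement | HTMLElement>>();

async function indexPage(page: Page): Promise<string[]> {
  refs.clear();
  const handles = await page.$$("button, a, input, select, textarea");
  return handles.map((handle, i) => {
    const ref = `@e${i + 1}`;
    refs.set(ref, handle);
    return ref;
  });
}

async function clickRef(ref: string): Promise<void> {
  const handle = refs.get(ref);
  if (!handle) throw new Error(`UnknownRef: ${ref}`); // named error
  await handle.click();
}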
50+ commands organized by category
Navigate
goto, back, forward, reload, wait_for

Read
get_text, get_html, get_attribute, get_value, get_title, get_url

Snapshot
screenshot, full_page_screenshot, element_screenshot, pdf

Interact
click, dblclick, fill, select, check, uncheck, press, type, hover, focus, drag

Inspect
get_elements, get_refs, count, is_visible, is_enabled, is_checked, bounding_box

Visual
compare_screenshot, visual_diff, highlight, annotate

Compare
diff_text, diff_html, diff_screenshot, assert_text, assert_visible

Tabs
new_tab, switch_tab, close_tab, list_tabs
Security model
Localhost-only: the daemon only accepts connections from 127.0.0.1

Bearer token auth: every request requires a token generated at daemon startup

No plaintext cookies: session cookies are stored encrypted, never written to disk in plaintext
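
A minimal Bun sketch of the first two rules; the port, endpoint behavior, and header handling are assumptions for illustration:

// Generated once at daemon startup; the client must echo it back.
const TOKEN = crypto.randomUUID();

Bun.serve({
  hostname: "127.0.0.1", // localhost-only: never bound to the network
  port: 9222,
  fetch(req) {
    if (req.headers.get("authorization") !== `Bearer ${TOKEN}`) {
      return new Response("unauthorized", { status: 401 });
    }
    return new Response("ok"); // a real daemon would dispatch the command here
  },
});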
Zero token overhead
gstack's browser uses plain text stdout. The result of “get page text” is just the text.

Compare with MCP browser tools: every call includes full JSON schemas, tool definitions, and structured responses. A simple “get page text” costs 10x more tokens via MCP.

Over a 20-command QA session, that overhead compounds into tens of thousands of tokens saved (the deep dive below puts numbers on it).
/browse
$ claude /browse

Daemon running (pid 4821, port 9222)
Chromium: v124.0.6367.91
Session: authenticated as admin@app.com

> goto http://localhost:3000/dashboard
Loaded in 340ms, 23 interactive elements found

> get_refs
@e1 button “Create Listing”
@e2 input “Search...”
@e3 link “Settings”
@e4 button “Export CSV”
... (19 more)

> click @e1
Clicked “Create Listing” (94ms)
Navigation: /dashboard -> /listings/new
12 new interactive elements

> fill @e5 “Vintage Camera Collection”
Filled input “Title” with “Vintage Camera Collection” (47ms)

> screenshot
Saved: .gstack/screenshots/listings-new-001.png

Mode 6: /qa, QA Lead
4 testing modes
Diff-aware (2-5 min): Reads the git diff, maps changed files to affected features, tests only what was touched. The default mode for active development. (The mapping is sketched in code after this list.)

Full (5-15 min): Complete test suite across all categories. Run before major releases or after large refactors.

Quick (30 sec): Smoke test. Hit the critical paths only: homepage loads, auth works, main feature functions. Useful for rapid iteration.

Regression (baseline comparison): Compares current state against a stored baseline. Detects visual regressions, broken links, new console errors, and performance degradation that were not present in the baseline.
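
The diff-to-feature mapping in the default mode is the interesting mechanism. A hypothetical sketch; the mapping table reuses the example files from the transcript below, and gstack's real logic is not published:

// Changed files come from `git diff --name-only`; the table maps
// them to routes or features to retest (entries are illustrative).
const fileToFeature: Record<string, string> = {
  "listings/create.tsx": "/listings/new",
  "api/enrichment.ts": "enrichment pipeline",
  "components/ImageUpload.tsx": "upload widget",
};

function affectedFeatures(changedFiles: string[]): string[] {
  const hits = changedFiles
    .map((file) => Object.entries(fileToFeature).find(([path]) => file.endsWith(path)))
    .filter((entry): entry is [string, string] => entry !== undefined)
    .map(([, feature]) => feature);
  return [...new Set(hits)]; // dedupe: several files can hit one feature
}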
Health score: 8 weighted categories
Every QA run produces a single health score (0-10) computed from weighted category scores:
Functional 20% — Do features work as expected?
Console 15% — Any JS errors, warnings, or failed network requests?
Accessibility 15% — ARIA labels, keyboard nav, screen reader compat
UX 15% — Loading states, error messages, empty states, responsiveness
Links 10% — Dead links, broken anchors, 404s
Visual 10% — Layout shifts, overflow, z-index issues, visual regressions
Performance 10% — Load time, bundle size, render blocking resources
Content 5% — Placeholder text, lorem ipsum, missing copy, typos
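
The weighting translates directly into a score function. A minimal sketch using the weights above:

// Category scores are 0-10; weights are from the list above and sum to 1.
const weights = {
  functional: 0.2, console: 0.15, accessibility: 0.15, ux: 0.15,
  links: 0.1, visual: 0.1, performance: 0.1, content: 0.05,
};

function healthScore(scores: Record<keyof typeof weights, number>): number {
  let total = 0;
  for (const category of Object.keys(weights) as (keyof typeof weights)[]) {
    total += scores[category] * weights[category];
  }
  return Math.round(total * 10) / 10; // one decimal, e.g. 7.4 / 10.0
}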
“Repro is everything. Every issue needs at least one screenshot. No exceptions.”
gstack /qa mode directive
/qa diff-aware
$ claude /qa

Mode: diff-aware (3 files changed)
Mapping: listings/create.tsx -> /listings/new
Mapping: api/enrichment.ts -> enrichment pipeline
Mapping: components/ImageUpload.tsx -> upload widget

# Testing /listings/new
Functional: form submits correctly
Functional: validation fires on empty title
Console: TypeError at ImageUpload.tsx:34 (file > 10MB)
UX: loading spinner shows during enrichment
Visual: image preview overflows container at 320px width
Accessibility: form labels present, tab order correct

# Screenshots captured: 4
# Report saved: .gstack/qa-reports/2026-03-16-diff.md

Health score: 7.4 / 10.0
1 critical (console error on large upload)
1 warning (responsive overflow)

Mode 7: /setup-browser-cookies
Session Manager
Persistent authentication for browser sessions
Before you can /browse an authenticated app, you need session state. The /setup-browser-cookies mode handles this: it logs in through the real login flow, captures the resulting cookies and localStorage, and stores them encrypted for reuse across future /browse and /qa sessions.

This means you run /setup-browser-cookies once per environment, and every subsequent /browse session starts already authenticated. No re-entering credentials. No token expiry surprises mid-QA.
Mode 8

Retro: Engineering Manager

Real metrics, pattern detection, per-contributor analysis, and historical tracking. The retrospective that feeds the next planning cycle.
Cognitive stance: Analytical
What it computes
Total commits this period
Lines of code added / removed
Test-to-code ratio
Average PR size
Fix ratio (fixes vs features)
Coding session patterns (when do you ship most?)
Hotspot files (most frequently changed)
Shipping streaks (consecutive days with commits)
Per-contributor analysis
Focus score (how scattered vs concentrated is the work?)
Output structure
Your Week: Deep-dive into the current period. What shipped, what slipped, what patterns emerged. Commits, LOC, test ratio, and fix ratio for the period.

Per-teammate breakdown: For each contributor: what they shipped, one specific praise (tied to a real commit), and one growth opportunity (tied to a real pattern). Not generic “great job”, specific “the retry logic in commit abc1234 was excellent because it handles partial failures.”

Team wins: Aggregate wins for the team. Shipping streaks, test coverage improvements, zero-downtime deployments.

Top 3 habits: The three patterns that had the most positive impact this period. These become priorities to reinforce in the next cycle.
Historical tracking
Every retro saves a JSON snapshot to .context/retros/. These accumulate over time, enabling trend analysis.

Week-over-week delta: are you shipping more or less? Is your fix ratio improving or degrading? Is test coverage trending up?

The retro mode reads previous snapshots and highlights changes, not just current state.
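
A hypothetical sketch of that delta computation, assuming a snapshot shape based on the metrics listed earlier:

// Snapshot shape assumed from the metrics the retro computes.
type RetroSnapshot = { commits: number; testRatio: number; fixRatio: number };

async function weekOverWeek(current: RetroSnapshot, prevPath: string) {
  const prev: RetroSnapshot = await Bun.file(prevPath).json();
  return {
    commits: current.commits - prev.commits,
    testRatio: +(current.testRatio - prev.testRatio).toFixed(2),
    fixRatio: +(current.fixRatio - prev.fixRatio).toFixed(2),
  };
}

// weekOverWeek({ commits: 34, testRatio: 0.31, fixRatio: 0.18 },
//   ".context/retros/2026-w10.json") // -> { commits: 8, ... }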
Compare mode
Period-over-period delta analysis. Compare this week vs last week, this sprint vs last sprint, or this month vs last month.

Positive deltas get called out as wins. Negative deltas get flagged with specific recommendations.

The goal: the retro output becomes the input for the next /plan-ceo-review. Recurring failures become priorities. Shipping streaks become habits to protect.
/retro
$ claude /retro

Period: Mar 10 - Mar 16, 2026
Analyzing: 34 commits, 3 contributors

# YOUR WEEK
Commits: 34 (+8 vs last week)
LOC: +1,247 / -389
Test ratio: 0.31 (target: 0.25) above target
Fix ratio: 0.18 (3 fixes / 14 features) healthy
Shipping streak: 5 days
Focus score: 0.72 (concentrated on listing feature)

# HOTSPOTS
src/listings/create.tsx — 12 changes (refactor candidate)
src/api/enrichment.ts — 8 changes (stabilizing)

# PER-CONTRIBUTOR
@garry — Praise: retry logic in abc1234 handles partial failures cleanly
@garry — Growth: 3 commits lacked test coverage (upload handler)

# TOP 3 HABITS TO REINFORCE
1. Bisectable commits (every commit builds and passes tests)
2. Test-first on API routes (0 regressions this week)
3. Written deferrals (4 items deferred with clear rationale)

# DELTA vs LAST WEEK
Commits: +8 (31%)
Test ratio: +0.04
Fix ratio: +0.06 (slightly more fixes, monitor)

Snapshot saved: .context/retros/2026-w11.json
Every retro feeds the next plan
Most teams do retros as standalone meetings that produce action items nobody tracks. In gstack, the retro produces structured data that the plan mode consumes. Hotspot files become refactoring priorities. Low test ratios become coverage targets. Shipping streaks become habits to protect. The loop closes automatically: retro output is plan input.
Deep Dive

How gstack is built

The technical decisions behind gstack: why Bun, why a daemon, why plain text over MCP, and how it all fits together.
Why Bun runtime
gstack's browser tool is a compiled Bun binary (~58MB). Four reasons Bun was chosen over Node.js:

Compiled binaries: bun build --compile produces a single executable. No node_modules, no runtime dependency, no version conflicts. Copy the binary and it works.

Native SQLite: Bun includes SQLite as a first-class built-in. The browser daemon uses it for session state, cookie storage, and ref persistence. No external database process needed.

Native TypeScript: No transpilation step. Write TypeScript, run TypeScript. Faster iteration during development, simpler build pipeline.

Lightweight HTTP: Bun's HTTP server is significantly faster than Express or Fastify for the simple request/response pattern the daemon uses.
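
The SQLite point is concrete: bun:sqlite ships with the runtime. A minimal example of the kind of usage described (the schema is illustrative, not gstack's actual one):

import { Database } from "bun:sqlite";

// No external database process: the daemon opens a local file directly.
const db = new Database(".gstack/sessions.db");
db.run("CREATE TABLE IF NOT EXISTS cookies (domain TEXT, payload BLOB)");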
Daemon model
Persistent state + lifecycle management
The browser daemon starts when the first /browse command runs and stays alive across commands. This is fundamentally different from tools that launch a browser per command.

Persistent state: Cookies, localStorage, sessionStorage, DOM state, scroll position, all survive between commands. You navigate to a page, fill a form, click submit, and verify the result page, all as separate commands with shared state.

Lifecycle management: The daemon tracks its own PID. If Claude Code exits, the daemon cleans up. If the daemon crashes, the CLI detects it and restarts on next command.

Multi-workspace safety: Each workspace gets its own daemon instance on a unique port. Running 10 Claude Code sessions via Conductor means 10 isolated Chromium instances with zero cross-contamination.
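
A hypothetical sketch of the lifecycle bookkeeping described above; the PID file path and cleanup strategy are assumptions:

import { existsSync, readFileSync, writeFileSync, rmSync } from "node:fs";

const PID_FILE = ".gstack/daemon.pid"; // path is an assumption

function daemonAlive(): boolean {
  if (!existsSync(PID_FILE)) return false;
  const pid = Number(readFileSync(PID_FILE, "utf8"));
  try {
    process.kill(pid, 0); // signal 0 checks existence without killing
    return true;
  } catch {
    return false; // stale PID file: the daemon crashed
  }
}

// On startup, record our PID; on exit, clean it up.
writeFileSync(PID_FILE, String(process.pid));
process.on("exit", () => rmSync(PID_FILE, { force: true }));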
Plain text CLI over MCP
Why gstack rejected the MCP approach
Most browser automation tools for Claude Code use MCP (Model Context Protocol). gstack deliberately chose plain text CLI instead. The reason is token efficiency.
Token cost comparison: “get page text” command

Chrome MCP ~3,200 tokens (JSON schema + tool def + structured response)
Playwright MCP ~2,800 tokens (similar overhead, slightly leaner schema)
gstack CLI ~280 tokens (plain text command + plain text response)

Over a 20-command QA session:

Chrome MCP ~64,000 tokens
Playwright MCP ~56,000 tokens
gstack CLI ~5,600 tokens

gstack saves ~90% of tokens on browser operations.
The insight: every MCP call includes full JSON schemas, tool definitions, and structured responses. When you just need the text content of a page, all that structure is pure overhead. gstack's approach is simpler: plain text in, plain text out, piped through Bash tool calls.
Crash recovery
Chromium crashes are inevitable. gstack handles them gracefully:

Server-side: If Chromium crashes, the daemon server exits cleanly (exit code 1). It does not hang or zombie.

Client-side: The CLI detects the dead daemon on the next command and automatically restarts it. The user sees a “daemon restarted” message and the command retries.

State recovery: Session cookies are persisted to SQLite, so re-authentication is automatic after a crash. The user loses in-page state (form inputs, scroll position) but not session state (auth, preferences).
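
A hypothetical sketch of the client-side recovery loop; the endpoint, binary invocation, and retry delay are all assumptions:

// If the daemon is gone, restart it once and retry the command.
async function runCommand(cmd: string): Promise<string> {
  try {
    return await send(cmd);
  } catch {
    console.error("daemon restarted"); // the message the article describes
    Bun.spawn(["gstack-browser", "--daemon"]); // binary name and flag assumed
    await new Promise((resolve) => setTimeout(resolve, 1000)); // startup grace
    return send(cmd); // retry once
  }
}

async function send(cmd: string): Promise<string> {
  // Endpoint shape is an assumption; fetch rejects if the daemon is down.
  const res = await fetch("http://127.0.0.1:9222/cmd", { method: "POST", body: cmd });
  return res.text();
}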

File structure
The complete gstack layout inside a Claude Code project:
.claude/skills/gstack/
|-- plan-ceo-review.md # Founder mode — 10-star product thinking
|-- plan-eng-review.md # Engineering manager — architecture lock
|-- review.md # Paranoid staff engineer — code review
|-- ship.md # Release engineer — deterministic deploy
|-- browse.md # QA engineer — browser daemon commands
|-- qa.md # QA lead — diff-aware test matrix
|-- setup-browser-cookies.md # Session manager — auth persistence
|-- retro.md # Engineering manager — retrospective

.claude/skills/gstack/bin/
|-- gstack-browser-darwin-arm64 # Compiled binary (macOS ARM)
|-- gstack-browser-darwin-x64 # Compiled binary (macOS Intel)
|-- gstack-browser-linux-x64 # Compiled binary (Linux)

.gstack/ # Runtime state (gitignored)
|-- greptile-history.md # False positive learning
|-- deferrals.md # Written deferrals from planning
|-- qa-reports/ # QA run reports with screenshots
|-- screenshots/ # Browser screenshots
|-- sessions.db # SQLite — encrypted cookie storage

.context/retros/ # Historical retro snapshots
|-- 2026-w10.json
|-- 2026-w11.json
Conductor integration
10 simultaneous Claude Code sessions
gstack is designed to work with Conductor, a tool for running multiple Claude Code sessions in parallel. You can run 10 simultaneous sessions, each with its own gstack mode active:

Session 1: /plan-ceo-review on feature A
Session 2: /review on PR #140
Session 3: /browse on staging environment
Session 4: /qa on feature B
...

Each session gets its own isolated Chromium instance (unique port, unique PID, unique session state). There is zero cross-contamination between sessions. The daemon model makes this possible: each workspace tracks its own browser lifecycle independently.
Takeaways

How to use this in your own setup

Six principles from gstack you can apply today, plus installation and a real workflow example.
Installation
Clone the repo and the skills are ready to use immediately:
# Clone into your project's .claude/skills/
git clone https://github.com/garrytan/gstack .claude/skills/gstack

# The skill files are now available as slash commands:
claude /plan-ceo-review
claude /plan-eng-review
claude /review
claude /ship
claude /browse
claude /qa
claude /setup-browser-cookies
claude /retro

# For browser commands, the compiled binary auto-detects your platform
# No additional dependencies needed

1. Mode separation is everything
One skill file = one cognitive mode. Do not mix planning and coding. Do not mix review and shipping. Each mode has a different mental posture: strategic, paranoid, systematic, deterministic, analytical. Keeping them separate makes AI output dramatically better because the model knows exactly what job it has right now.
2. The 10-star product test
Before building, ask: "What would the 10-star version of this look like?" Then work backwards to what is shippable. This prevents the most common failure mode in AI-assisted development: building exactly what was asked instead of what was needed. The user asked for photo upload. They needed a listing creation engine.
3. Paranoid review beats polite review
Most AI code reviews are polite. They suggest improvements. gstack's review mode is adversarial. It assumes every input is hostile, every external call will fail, every user will do the unexpected thing. The two-pass system (Critical vs Informational) means real security issues are never buried under style suggestions. Polite reviews miss the bugs that page you at 3am.
4. Browser as first-class citizen
Persistent state via daemon architecture. 100-200ms hot starts instead of 3-5 second cold boots. The ref system (@e1, @e2) instead of fragile CSS selectors. Plain text CLI instead of token-heavy MCP. If your AI cannot interact with your app the way a user does, your QA has a blind spot. gstack's browser tool eliminates that blind spot while using 90% fewer tokens.
5. Deterministic shipping eliminates ceremony
Version bumps, commit splitting, changelog generation, all derived from commit metadata and diff size. No human decisions at ship time. The 9-step process runs automatically: preflight, merge, test, eval, review, triage, version, changelog, push. Same input, same output, every time. If you need a human to click "deploy," you have not automated enough.
6. Close the loop: retro feeds plan
The output of /retro becomes the input of the next /plan-ceo-review. Hotspot files become refactoring priorities. Low test ratios become coverage targets. Shipping streaks become habits to protect. Written deferrals become backlog items. Historical JSON snapshots enable trend analysis. The system improves itself every cycle.

Real workflow: one feature through all modes
Here is how a single feature flows through gstack from idea to shipped:
1. /plan-ceo-review
User asks for “photo upload.” CEO mode identifies the real feature:
a listing creation engine. Scope stance: EXPANSION.
9 prime directives checked. Founder approves.

2. /plan-eng-review
Architecture locked: 3 sequence diagrams, 1 state machine,
trust boundaries mapped, test matrix defined.
14 unit / 8 integration / 3 e2e tests planned.

3. [Implementation happens here: regular Claude Code]

4. /review
Paranoid staff engineer finds 2 critical issues:
unsanitized input in query builder, silent catch in payment flow.
Both fixed before proceeding.

5. /browse + /qa
Browser daemon tests the real UI. QA finds a console error
on large file uploads and a responsive overflow. Both fixed.
Health score: 9.2 / 10.0

6. /ship
9-step process: preflight passes, tests pass, Greptile clean,
version bump 1.4.2 -> 1.4.3, 4 bisectable commits, PR opened.

7. /retro
Retro notes: listing creation module is a hotspot (12 changes).
Recommend: extract enrichment pipeline into its own module next cycle.
This becomes a priority in the next /plan-ceo-review.

Built by Walid Boulanouar, 30 days of Claude Code, Day 10