Karpathy's Autoresearch, Ported to Marketing, Security, and Product
Andrej Karpathy's autoresearch repo has 80,700 stars and 630 lines of code. It was built to optimize ML training runs. The pattern — define a metric, define a constraint, let the model iterate overnight — generalizes to almost any task with a measurable outcome. We ported it to landing-page copy, STRIDE security audits, pricing-page A/B variants, and three other domains.
On March 7, 2026, Andrej Karpathy posted a tweet that did 8.6M views and kicked off a small movement:
"I packaged up the 'autoresearch' project into a new self-contained minimal repo."
The repo is 630 lines. It has 80,700 stars at time of writing. The pattern — give a model a metric to improve, a constraint not to break, and an iteration budget; let it run until the metric stops improving or the budget exhausts — is one of the most reusable ideas in the agentic-engineering toolkit. The four mature Claude Code ports (uditgoenka 4.4k⭐, wanshuiyin ARIS 9k⭐, drivelineresearch, Maleick) all riff on the same loop.
Karpathy's domain was ML — optimize a training run, find an architecture that learns faster, prune a model without losing accuracy. The community has not fully absorbed how generally the pattern applies. Any task with a mechanical metric and a stable constraint is an autoresearch task. This post takes the loop, names six domains where we've gotten useful results, and shows what does and doesn't transfer.
The pattern, in 4 sentences
The autoresearch loop:
- Modify — generate a candidate change.
- Verify — run the metric against the candidate.
- Keep or discard — if the metric improved without breaking the constraint, keep; else revert.
- Repeat — until the metric plateaus or the budget runs out (a minimal sketch in code follows this list).
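In code, the whole loop fits on one screen. A minimal sketch, where propose, score, passes_constraint, and iteration_cost are the pieces you supply per domain (none of these names come from Karpathy's repo), and the plateau limit follows the 10-consecutive-iteration heuristic discussed in the FAQ below:

```python
from typing import Callable, Tuple, TypeVar

C = TypeVar("C")  # a candidate: copy variant, audit report, test file, ...

def autoresearch(
    seed: C,
    propose: Callable[[C], C],               # Modify: generate a candidate change
    score: Callable[[C], float],             # Verify: run the metric
    passes_constraint: Callable[[C], bool],  # Verify: the thing that must not break
    iteration_cost: float,                   # rough $ per iteration, for the budget cap
    budget_usd: float = 5.0,
    plateau_limit: int = 10,
) -> Tuple[C, float]:
    best, best_score = seed, score(seed)
    spent, stale = 0.0, 0
    while spent < budget_usd and stale < plateau_limit:
        spent += iteration_cost
        candidate = propose(best)
        if not passes_constraint(candidate):  # constraint broken: discard
            stale += 1
            continue
        s = score(candidate)
        if s > best_score:                    # Keep
            best, best_score, stale = candidate, s, 0
        else:                                 # Discard
            stale += 1
    return best, best_score
```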
What autoresearch needs
Two preconditions. Get these wrong and the loop spins on noise.
A mechanical metric. Something a script can compute. Validation loss, test pass rate, page-load time, conversion rate, false-positive count on a security scanner. Subjective "is this good" scoring by another LLM works but slows the loop down and adds judgment variance — only use LLM-as-judge metrics when you have nothing better, and stabilize the judge prompt before running (see /topic/llm-as-judge).

A stable constraint. The candidate must not break X. For Karpathy: don't break training stability, don't crash on the validation set. For marketing: don't change the legal claims. For security: don't false-positive on the existing benign-traffic corpus. Without a constraint the loop will find the metric maximum by violating something you care about — Goodhart's law.

If you can't name both, your task isn't autoresearch-shaped. That's fine — most tasks aren't. Stop here and use a different tool.
Domain 1 — Landing page conversion copy
Metric: simulated click-through rate from a deterministic LLM judge (consistent prompt, same model version, judge sees only headline+subhead+CTA, not the rest of the page). Constraint: all factual claims must match the source-of-truth product brief; no superlatives outside the brief's approved list; brand voice score ≥ 8/10 from a separate judge prompt.

We ran this on a B2B SaaS landing hero across 200 iterations on claude-opus-4-7 with a budget cap of $12. The starting headline scored 4.1/10 simulated CTR. The 200th iteration scored 7.8/10. We then A/B tested the top 3 candidates against the original on real traffic; the winner lifted CTR 23% over the human-written baseline.
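A possible shape for that judge, assuming the Anthropic Python SDK; the prompt wording and the score parsing below are illustrative, not taken from the post:

```python
# A deterministic-as-possible CTR judge: frozen prompt, pinned model version,
# temperature 0, and the judge sees only the hero copy, never the full page.
import re
import anthropic

JUDGE_PROMPT = (
    "You are scoring a landing-page hero for click-through likelihood.\n"
    "Score 1-10. Reply with the number only.\n\n"
    "Headline: {headline}\nSubhead: {subhead}\nCTA: {cta}"
)

client = anthropic.Anthropic()

def judge_ctr(headline: str, subhead: str, cta: str,
              model: str = "claude-opus-4-7") -> float:
    msg = client.messages.create(
        model=model,
        max_tokens=10,
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            headline=headline, subhead=subhead, cta=cta)}],
    )
    return float(re.search(r"\d+(\.\d+)?", msg.content[0].text).group())
```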
What worked: the constraint blocked the model from drifting into "AI-powered comprehensive solution"-tier copy because we explicitly listed the words to avoid. What did not work: without the brand voice judge, the model converged on technically-effective but tonally-off copy ("HACK YOUR PIPELINE NOW") that the marketing team would have killed in review.
Watchouts: the LLM judge is the weak link. Stabilize the judge prompt before the loop; run the same candidate through the judge 10 times and check variance. If variance > 0.5 points, the judge is too noisy for autoresearch — fix the prompt first.
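The pre-flight check itself is a few lines. A sketch that measures spread as standard deviation, with judge_score standing in for whatever judge function the loop uses:

```python
# Pre-flight check: score one fixed candidate N times and measure spread.
# If the judge can't agree with itself within ~0.5 points, fix the prompt
# before burning budget on the loop.
from statistics import mean, stdev

def judge_is_stable(judge_score, candidate, runs: int = 10,
                    max_spread: float = 0.5) -> bool:
    scores = [judge_score(candidate) for _ in range(runs)]
    print(f"mean={mean(scores):.2f} stdev={stdev(scores):.2f} scores={scores}")
    return stdev(scores) <= max_spread
```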
Domain 2 — STRIDE security audit on a new endpoint
Metric: number of distinct categorized findings against a 6-category STRIDE rubric (Spoofing, Tampering, Repudiation, Information Disclosure, DoS, Elevation of Privilege). Constraint: all findings must reference specific code locations; no findings repeated; severity classification matches the team's existing severity rubric (not the model's invented one).

The loop generates candidate audit reports, each iteration looking for findings the previous ones missed. We ran it overnight against a new payments endpoint (1,200 lines of TypeScript), starting from an empty report and a "find what's wrong" prompt. The first iteration found 8 categorized findings, all real. Iteration 47 found 23, two of which were actual bugs that the original human review had missed (a missing tenant isolation check on a webhook handler and a sequencing race on idempotency key writes).
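One way to make that metric mechanical is to have each iteration emit its findings as structured JSON and count the distinct, properly categorized ones. The schema below (category, location, title fields) is our convention for this sketch, not something the ports prescribe:

```python
# Metric for the audit loop: count distinct, properly categorized findings.
import json

STRIDE = {"Spoofing", "Tampering", "Repudiation",
          "Information Disclosure", "DoS", "Elevation of Privilege"}

def distinct_findings(report_path: str) -> int:
    with open(report_path) as f:
        findings = json.load(f)
    seen = set()
    for finding in findings:
        if finding.get("category") not in STRIDE:
            continue                 # invented categories don't score
        if not finding.get("location"):
            continue                 # constraint: must point at specific code
        key = (finding["category"], finding["location"],
               finding["title"].strip().lower())
        seen.add(key)                # crude dedup; repeated findings don't score
    return len(seen)
```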
What worked: the metric (more distinct categorized findings) gave the model a clear incentive to keep looking instead of generating its first report and stopping. What did not work: iterations 48-80 produced a long tail of false positives — the model started inventing categories or rephrasing existing findings to score higher. We capped the loop at the point of diminishing returns.
Watchouts: validate every finding against the code by hand. Autoresearch is a hypothesis-generation machine here, not a verification one. The value is "we found 2 bugs we wouldn't have caught"; the cost is "we have 38 false positives to dismiss."
Domain 3 — Pricing-page variant generation
Metric: number of distinct value propositions tested per tier, scored against a "willingness-to-pay signal" prompt from a separate judge model. Constraint: must preserve the actual prices, the actual feature list, and the legal microcopy. Tiers must remain three.

This one surprised us. The constraint was so tight that the model had few degrees of freedom — but it still generated 60 distinct framings of the same product across 180 iterations. The top three were materially different from the human baseline and we ran them in real A/B tests; one lifted free→paid conversion by 11%, one was a wash, one slightly underperformed.
What did not work: the model wanted to add a fourth tier ("Enterprise — contact us") every time despite the constraint. We had to make the constraint check explicit (parse the rendered output, count tiers, reject anything not exactly three). Constraint enforcement is the hard part; the metric is usually the easy part.
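The explicit check is short. A sketch using BeautifulSoup, with the .pricing-tier selector standing in for whatever your template actually renders:

```python
# Constraint checker for the pricing page: parse the rendered candidate and
# reject anything that isn't exactly three tiers.
from bs4 import BeautifulSoup

def exactly_three_tiers(rendered_html: str) -> bool:
    soup = BeautifulSoup(rendered_html, "html.parser")
    tiers = soup.select(".pricing-tier")  # selector is illustrative
    return len(tiers) == 3  # the model's fourth "Enterprise" tier fails here
```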
Domain 4 — Product changelog summarization (consistency loop)
Metric: semantic-distance score between auto-generated changelog entries and the team's house style (measured by embedding similarity to the last 30 hand-written entries). Constraint: all merged PRs in the window must be represented; no fabricated changes; commit SHA referenced for each entry.

We had the model generate the weekly changelog. The first iteration was generic. By iteration 12 the style was visually indistinguishable from the human author's. By iteration 30 it was over-fitted (specific quirks of the author's punctuation reproduced too exactly). We landed on iteration 17 in practice.
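A sketch of both halves: the style metric as embedding similarity against the house corpus, and the SHA-coverage constraint. The embedding model named here is just an example; any sentence-embedding model works.

```python
# Style metric: mean cosine similarity between each candidate entry and the
# centroid of the last 30 hand-written entries.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice

def style_score(candidate_entries: list[str], house_entries: list[str]) -> float:
    cand = _model.encode(candidate_entries, normalize_embeddings=True)
    house = _model.encode(house_entries, normalize_embeddings=True).mean(axis=0)
    house /= np.linalg.norm(house)
    return float((cand @ house).mean())

def all_prs_represented(candidate_text: str, merged_shas: list[str]) -> bool:
    # Constraint: every merged PR's commit SHA must appear in the changelog.
    return all(sha[:8] in candidate_text for sha in merged_shas)
```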
What worked: this is a task where the metric and constraint are both rigid and the model converges fast. We now run this loop weekly and it costs about $0.40 per changelog.
Domain 5 — Test suite expansion for an existing code base
Metric: branch coverage delta on the codebase. Constraint: every new test must pass; no flaky tests (run each candidate three times, reject if non-deterministic); no test mocks the system under test.

This one is the autoresearch sweet spot — pure mechanical metric, hard constraint, clear iteration objective. We ran it against a legacy Rails codebase with 47% branch coverage. After 8 hours, coverage was at 71%. Of the 340 new tests, we kept 290 (the other 50 hit the no-trivial-test heuristic — "asserts that 2+2=4 via the calculator service" type filler). The kept tests caught two real bugs that had been latent.
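The flaky-test constraint is easy to mechanize: run each candidate spec several times and reject on any failure. A sketch assuming an RSpec project driven via subprocess; adjust the command for your stack:

```python
# Constraint checker for new tests: run the candidate spec three times and
# reject it unless every run passes.
import subprocess

def is_stable_passing_test(spec_path: str, runs: int = 3) -> bool:
    for _ in range(runs):
        result = subprocess.run(
            ["bundle", "exec", "rspec", spec_path],
            capture_output=True, text=True,
        )
        if result.returncode != 0:  # a single failure rejects the candidate
            return False
    return True
```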
Watchouts: the model loves to write tests for the easy paths first. If you want coverage on the hard paths, set the metric narrowly (coverage delta on src/billing/, not the whole tree).
Domain 6 — SEO meta description optimization across 200 pages
Metric: CTR proxy from a deterministic judge (judge sees only title+meta description, scores 1-10 against a generic "would I click this from a SERP result" prompt). Constraint: keyword density of the primary keyword stays in 0.5-2.0% of the meta description; character count under 158; brand name appears.

We ran this against a 200-URL backlog of pages with non-zero impressions and zero clicks (the GSC CTR-fix worklist targets exactly this shape of work). Average judge score before: 4.8. After 5 iterations per page: 7.1. We pushed all 200 to production simultaneously; CTR on the cohort lifted from 0.4% to 1.6% over 30 days, a 4x lift.
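The constraint side of this loop is purely mechanical. A sketch; note that keyword density has several competing definitions, so the character-based one below is just one choice to fix up front and keep for the whole run:

```python
# Constraint checker for a candidate meta description: keyword density in
# 0.5-2.0%, under 158 characters, brand name present.
def meta_description_ok(desc: str, primary_keyword: str, brand: str) -> bool:
    if len(desc) >= 158:
        return False
    if brand.lower() not in desc.lower():
        return False
    occurrences = desc.lower().count(primary_keyword.lower())
    density = occurrences * len(primary_keyword) / max(len(desc), 1)
    return 0.005 <= density <= 0.020
```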
What didn't work: the judge model's scoring drifted across long batches. We reset the judge prompt every 25 pages and verified consistency by re-scoring a held-out reference set.
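A sketch of that consistency check, with judge_score standing in for the judge and the 0.3-point tolerance as an arbitrary starting value:

```python
# Drift check: re-score a frozen reference set every 25 pages and compare to
# the baseline taken at the start of the batch.
from statistics import mean

def judge_has_drifted(judge_score, reference_set: list[str],
                      baseline_mean: float, tolerance: float = 0.3) -> bool:
    current = mean(judge_score(item) for item in reference_set)
    return abs(current - baseline_mean) > tolerance
```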
What does NOT transfer
Four categories of work fight autoresearch.
Subjective novelty. "Write a more original blog post" has no mechanical metric. LLM-as-judge for originality is noisy — the model rewards lexical novelty over actual idea novelty and you end up with thesaurus-tier output.

Multi-stakeholder approvals. Anything that requires "does the legal team approve" or "does this brand voice feel right to the founder" cannot be looped. The human review step is the bottleneck; making the LLM faster doesn't speed it up.

Tasks without a stable constraint. "Make the code faster" without "and don't break existing tests" leads to the model deleting features. Goodhart's law writ specific.

Tasks where the search space is too small. Optimizing a 3-line config file is not autoresearch — you can enumerate the possibilities by hand. Autoresearch wins when the candidate space is too large for human enumeration but the metric is cheap to evaluate.

The setup, end-to-end
If you want to run an autoresearch loop today, this is the minimum kit:
- Pick your port. uditgoenka/autoresearch (4.4k⭐, cleanest CLI port) is the default. wanshuiyin/ARIS (9k⭐) adds a critic pass before each merge — slower but higher-quality on contested questions. The full comparison is at /topic/autoresearch.
- Define the metric in a script. Not a prompt — a script. The script reads the candidate output and emits a number. If the metric is "LLM judge score," the script calls the judge model deterministically with a frozen prompt and parses the response.
- Define the constraint as a checker. Same shape: a script that reads the candidate and returns pass/fail. Both metric and constraint should be cheap to run because they run on every iteration; a sketch of this shape follows the list.
- Budget cap. --budget 5.00 is the default we use on exploratory runs. For production loops we set the budget to 2-3x the cost we'd be willing to pay for a successful outcome.
- Logging on disk. Each iteration writes its candidate, its metric, and its constraint result to a markdown file in a working directory. This is the part that distinguishes autoresearch from "ask Claude to iterate" — the loop can resume, can be inspected mid-run, and produces an audit trail.
- Read the trail. Most of the value is not in the final output. It's in the trail — the candidates the loop rejected often surface failure modes you didn't know your domain had.
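Putting the pieces together, here is one possible shape for a single iteration's bookkeeping: the metric and constraint live in standalone scripts (one prints a number, the other exits 0 or 1), and every iteration appends a markdown record to the working directory. The script names and file layout are our convention for this sketch, not anything the ports require:

```python
# One iteration's bookkeeping: run the metric and constraint scripts against a
# candidate, then write a markdown record so the run can be inspected mid-flight
# or resumed later.
import subprocess
from pathlib import Path

def run_iteration(i: int, candidate_path: str,
                  workdir: str = "runs/current") -> tuple[float, bool]:
    # metric.py reads the candidate and prints a single number
    metric = float(subprocess.run(
        ["python", "metric.py", candidate_path],
        capture_output=True, text=True, check=True).stdout.strip())
    # constraint.py exits 0 on pass, non-zero on fail
    ok = subprocess.run(["python", "constraint.py", candidate_path]).returncode == 0

    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"iter_{i:04d}.md").write_text(
        f"# Iteration {i}\n\n"
        f"- candidate: {candidate_path}\n"
        f"- metric: {metric}\n"
        f"- constraint: {'pass' if ok else 'fail'}\n"
    )
    return metric, ok
```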
Where this fails
Cost gets ugly fast. A 4-hour run on claude-opus-4-7 with aggressive subagent fanout can cost $40-80. Set the budget cap. Use claude-haiku-4-5 for non-technical loops where speed beats quality.
The "found a great source" rabbit hole. Both Karpathy's original and uditgoenka's port can anchor on an early candidate and miss better ones. ARIS's review pass mitigates this; the simpler ports do not.
The metric is the bottleneck. Most teams' first autoresearch attempt fails because the metric is too noisy or the constraint is too loose. Iterate on metric/constraint design before iterating on the loop itself. A good metric with a bad loop beats a bad metric with a good loop.
Domain drift. What worked for ML doesn't transfer wholesale. Karpathy's repo assumes you can spin up training runs cheaply; not every domain has an equivalent. Pricing-page autoresearch is fast because rendering and judging a candidate page costs cents; full-stack feature autoresearch is slow because each iteration takes minutes.
What to read next
- /topic/autoresearch — the canonical autoresearch hub, 4-port comparison
- /topic/subagents — when to parallelize across the loop
- /topic/llm-as-judge — judge prompt library
- /topic/superpowers — Obra's adjacent methodology
- /topic/agentic-engineering — the broader framing autoresearch fits inside
- /blog/the-vibe-coding-arc-feb-2025-to-may-2026 — the methodology arc
Sources
- Karpathy, Andrej. "autoresearch" repository. 80,700+ stars. 630 lines.
- Karpathy, Andrej. Announcement tweet, March 7, 2026. 8.6M views.
- Karpathy, Andrej. Sequoia AI Ascent 2026 fireside chat. "Vibe coding raised the floor; agentic engineering raises the ceiling."
- uditgoenka. autoresearch (Claude Code port). 4,400+ stars.
- wanshuiyin. ARIS (Auto Research In Sleep). 9,000+ stars.
- drivelineresearch. autoresearch fork.
- Maleick. autoresearch-claude.
- Hamel Husain. "LLM Evals FAQ". Judge prompt stabilization methodology.
- Eugene Yan. "LLM Evaluators". Pairwise > direct scoring; judge drift over time.
FAQ
Q: Do I need to use Claude Code to run autoresearch?
A: No. Karpathy's original repo is provider-agnostic — it works with the OpenAI SDK directly. The Claude Code ports add tighter integration (skills, hooks, subagents). If you're not on Claude Code, the original or one of the OpenAI-targeted forks is your path.

Q: How long should a run go?
A: Until the metric plateaus for 10 consecutive iterations or the budget runs out, whichever comes first. Plateau detection is the key heuristic — running past it burns budget without improving output.

Q: Can I run autoresearch in CI?
A: Yes, for tasks with short iterations and small budgets. We run the SEO meta-description loop weekly in GitHub Actions; it takes about 18 minutes and costs $4. Long agentic loops (multi-hour) are not CI-shaped — run them on a separate machine.

Q: How is autoresearch different from Superpowers?
A: Superpowers is a methodology framework — brainstorm, plan, implement, TDD, subagent review. Autoresearch is an autonomous iteration loop. You can use both: Superpowers structures the work, autoresearch runs the optimization stages where applicable. Comparison.

Q: What's the smallest task worth using autoresearch on?
A: Anything where the iteration budget is at least 20 candidates and a human can't enumerate the space by hand in less time than the loop takes. Below that, manual iteration is faster.

Q: Can the loop optimize against multiple metrics?
A: Yes, with weighted scoring. We've used 0.7 * conversion_score + 0.3 * brand_voice_score for landing page work. Weight selection becomes its own hyperparameter — set it before the loop runs, don't tune it inside the loop.