Auto Research: Karpathy’s 5-Minute Agent Loop

Auto research is the agent loop where software changes one thing, runs a short experiment, checks a real metric, keeps the winner, and repeats. Karpathy’s AutoResearch made that loop concrete for AI training: 5-minute experiments, roughly 12 runs per hour, and 100+ trials overnight.

A new Search Console signal for this topic appeared this week: the query “auto research” showed 10 impressions, 0 clicks, and an average position of 86.1. Our existing Karpathy AutoResearch page also had 14 impressions at position 75.4 over the last 14 days, so this is an update to that page rather than a new post. Search demand is still early, but the pattern matters for anyone building agents that improve business outcomes instead of just answering questions.

What AutoResearch does

Karpathy’s AutoResearch repo gives an AI agent a small but real LLM training setup and asks it to improve the result. The agent edits train.py, runs a fixed 5-minute training experiment, checks validation bits per byte, keeps the change if the metric improves, and reverts it if it does not.

The important part is not that the task is machine learning. The important part is the closed loop: propose → test → measure → keep or discard. Once that loop exists, the agent can run far more trials than a human team would manually attempt.

Fixed time budget: every run gets 5 minutes, so experiments stay comparable.
Single-file scope: the agent mostly edits one training file, which keeps diffs reviewable.
Binary decision: the metric improved or it did not.
Git as memory: each experiment becomes history the agent can inspect.
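The loop and its four constraints can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not code from Karpathy's repo: `run_loop`, `propose`, `evaluate`, and the `lr` knob are hypothetical names, the metric is synthetic, and an in-memory list stands in for git history.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trial:
    params: dict
    score: float
    kept: bool


def run_loop(
    baseline: dict,
    propose: Callable[[dict], dict],
    evaluate: Callable[[dict], float],
    n_trials: int,
) -> tuple[dict, float, list[Trial]]:
    """Propose -> test -> measure -> keep or discard. Lower score is better."""
    best = dict(baseline)
    best_score = evaluate(best)
    history: list[Trial] = []
    for _ in range(n_trials):
        candidate = propose(best)      # change one thing
        score = evaluate(candidate)    # stands in for one fixed-budget run
        kept = score < best_score      # binary decision on a single metric
        if kept:
            best, best_score = candidate, score
        # Every attempt is recorded, winners and failures alike.
        history.append(Trial(candidate, score, kept))
    return best, best_score, history


# Toy stand-ins (assumptions): one hypothetical knob and a synthetic metric.
rng = random.Random(0)


def propose(params: dict) -> dict:
    new = dict(params)
    new["lr"] = params["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return new


def evaluate(params: dict) -> float:
    # Synthetic loss that is smallest at lr = 0.01; not a real training run.
    return abs(params["lr"] - 0.01)


best, best_score, history = run_loop({"lr": 0.1}, propose, evaluate, n_trials=50)
print(best, best_score, sum(t.kept for t in history))
```

Discarded candidates never become the new starting point, which is the in-memory equivalent of reverting a change that did not improve the metric.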

That design helps explain why the repo spread so quickly. GitHub currently shows tens of thousands of stars and forks, and the idea is simple enough to state in one line: stop asking agents for suggestions and let them run the experiment loop.

Why the 5-minute loop matters

The 5-minute constraint is the product insight. Without a hard budget, the agent can accidentally “improve” results by making the model slower, bigger, or harder to compare. With a fixed budget, every experiment competes under the same conditions.

Karpathy’s README describes the practical result: about 12 experiments per hour and around 100 experiments while you sleep. That turns experimentation from a calendar problem into a throughput problem. Humans still design the system and judge direction, but the agent handles the boring middle: try, measure, log, repeat.
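The throughput figures follow from simple arithmetic. The sketch below assumes an 8-hour night and an optional per-run overhead for setup and logging. Both values are illustrative assumptions, and `runs_in` is a hypothetical helper.

```python
def runs_in(hours: float, minutes_per_run: float = 5.0,
            overhead_minutes: float = 0.0) -> int:
    """Whole experiments that fit in a window at a fixed per-run budget."""
    return int(hours * 60 // (minutes_per_run + overhead_minutes))


print(runs_in(1))                        # 12
print(runs_in(8))                        # 96
print(runs_in(8, overhead_minutes=0.5))  # 87
```

Under these assumptions, "around 100 experiments" corresponds to roughly eight hours of back-to-back runs, and any overhead between runs lowers the count.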

Philipp Schmid’s writeup reported larger follow-on runs as well: roughly 700 experiments, around 20 real improvements, and an 11% faster time-to-GPT-2 result in one nanochat setup. He also highlighted Shopify CEO Tobi Lütke adapting the loop for a query-expansion model: 37 experiments in 8 hours, producing a smaller 0.8B model that scored 19% higher than a previous 1.6B model.

Those numbers are early and domain-specific, but they show the same lesson: when evaluation is clear, agents can compound small improvements faster than a human workflow.

Auto research vs. a normal AI agent

Most AI agents stop at advice. They read data, summarize it, and suggest an action. Auto research goes one step further: it creates an experiment, runs it, and keeps evidence about whether the change worked.

Pattern	What the agent does	What the human gets
Chat assistant	Answers a question	A response to review
Analytics agent	Checks connected data and explains what changed	A diagnosis with numbers
Auto research loop	Runs repeated tests against a metric	A log of attempts, winners, and failures

This distinction matters for non-technical site owners. You do not need an agent that says, “maybe update your title tag.” You need an agent that spots a 0% CTR query, updates the right page without cannibalizing another post, records the assumption, and checks whether clicks improve on the next run.

The business version: experiments on GA4, Search Console, and Shopify data

AutoResearch is built for ML training, but the same loop applies to business data if the metric is clear. The experiment does not need to be “change model architecture.” It can be “change a title tag,” “rewrite a product FAQ,” “adjust a Slack alert threshold,” or “test a new weekly report format.”

Here are practical auto research loops for site owners:

SEO CTR loop: find pages with high impressions and 0 clicks, update title/meta, wait 7–14 days, compare CTR and position.
AEO citation loop: add answer-first blocks and FAQs, monitor long-tail question impressions, then refine the extraction block.
Shopify operations loop: test a lower inventory-alert threshold, track missed stockouts and false alarms, then keep the threshold that reduces noise.
GA4 diagnosis loop: detect a device or channel drop, ask why it happened, save the likely cause, and verify whether the segment recovers.
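As a rough sketch of the first step in the SEO CTR loop, the snippet below filters already-exported Search Console rows for high-impression, zero-click queries and flags queries where two pages compete. `QueryRow`, `zero_click_candidates`, `competing_pages`, the page paths, and the impression cutoff are all hypothetical. The first sample row reuses the numbers quoted in this article, and the rest is made-up sample data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueryRow:
    query: str
    page: str
    impressions: int
    clicks: int
    position: float


def zero_click_candidates(rows, min_impressions=10):
    """Rows seen often enough to matter but never clicked, most-seen first."""
    hits = [r for r in rows if r.impressions >= min_impressions and r.clicks == 0]
    return sorted(hits, key=lambda r: r.impressions, reverse=True)


def competing_pages(rows):
    """Queries that more than one page ranks for (possible cannibalization)."""
    pages = defaultdict(set)
    for r in rows:
        pages[r.query].add(r.page)
    return {q: sorted(p) for q, p in pages.items() if len(p) > 1}


# Illustrative sample rows; page paths are placeholders.
rows = [
    QueryRow("auto research", "/blog/autoresearch", 10, 0, 86.1),
    QueryRow("agent loop", "/blog/autoresearch", 40, 0, 22.0),
    QueryRow("agent loop", "/blog/agents-101", 25, 0, 31.5),
    QueryRow("weekly report format", "/blog/reports", 60, 4, 8.2),
    QueryRow("ga4 alerts", "/blog/ga4", 3, 0, 9.0),
]

candidates = zero_click_candidates(rows)
conflicts = competing_pages(rows)
print([c.query for c in candidates])
print(conflicts)
```

Checking for competing pages before editing is what lets the loop update one existing page rather than create a second page for the same query.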

This is where DataVessel fits. A non-technical owner can ask, “why did iOS traffic drop?” or “which Search Console pages have 0% CTR?” and get an answer with real numbers from GA4, Search Console, Shopify, WordPress, and Slack—not another dashboard to learn.

For content work, this is already close to the SEO Growth Autopilot workflow: recall past decisions, pull Search Console data, avoid cannibalization, update or publish content, save the assumption, and report the result back to Slack.

The hard part is evaluation, not generation

Auto research only works when the metric is hard to game. In ML, a held-out validation set prevents the agent from optimizing a result that looks good but fails in production. In business analytics, the equivalent is a clean baseline and a metric tied to the actual goal.

For SEO, that means you do not reward a page just because impressions rose. You check whether the right query moved, whether CTR changed, whether position held, and whether the page avoided competing with another URL. For Shopify, you do not reward “more alerts.” You reward fewer missed issues and fewer noisy messages.
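One way to make "fewer missed issues and fewer noisy messages" concrete is to score each candidate alert threshold against past days. This is a minimal sketch with invented sample data. `score_threshold`, the cost weights, and the `(units_in_stock, stocked_out_soon)` format are assumptions, not a Shopify API.

```python
def score_threshold(threshold, days, miss_cost=5.0, noise_cost=1.0):
    """Score an alert rule that fires when stock <= threshold.

    days: list of (units_in_stock, stocked_out_soon) pairs.
    Returns (weighted_cost, missed, false_alarms); lower cost is better.
    """
    missed = sum(1 for stock, out in days if out and stock > threshold)
    false_alarms = sum(1 for stock, out in days if not out and stock <= threshold)
    return miss_cost * missed + noise_cost * false_alarms, missed, false_alarms


# Invented history for illustration only.
days = [(3, True), (8, True), (12, False), (15, False),
        (6, False), (20, False), (9, True)]

for t in (5, 10, 15):
    print(t, score_threshold(t, days))

best_threshold = min((5, 10, 15), key=lambda t: score_threshold(t, days)[0])
print(best_threshold)  # 10
```

Weighting a missed stockout more heavily than a false alarm is a design choice; the point is that the rule is rewarded for both outcomes, not for alert volume.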

A good agent needs three guardrails:

A stable metric: CTR, conversion rate, order import failures, refund spike rate, or another number that maps to a real outcome.
A comparison window: last 7 days vs. previous 7 days, or current 14 days vs. the prior period.
Memory: what changed, why it changed, and what result was expected.

Without those guardrails, you get random automation. With them, you get compounding improvement.
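The stable-metric and comparison-window guardrails can be expressed as a small decision function for the SEO case. This is a sketch under assumptions: `Window`, `judge`, the minimum-impressions floor, and the allowed position drop are hypothetical choices, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Window:
    impressions: int
    clicks: int
    position: float  # average position; a larger number is a worse rank

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0


def judge(before: Window, after: Window,
          min_impressions: int = 50, max_position_drop: float = 2.0) -> str:
    """Compare two equal-length windows for the same query and page."""
    # Too little data in either window: do not keep or revert yet.
    if min(before.impressions, after.impressions) < min_impressions:
        return "inconclusive"
    # More impressions do not count as a win if the ranking slipped.
    if after.position - before.position > max_position_drop:
        return "revert"
    return "keep" if after.ctr > before.ctr else "revert"


print(judge(Window(200, 2, 12.0), Window(220, 9, 11.5)))  # keep
print(judge(Window(200, 2, 12.0), Window(400, 3, 18.0)))  # revert
print(judge(Window(10, 0, 86.1), Window(14, 1, 75.4)))    # inconclusive
```

The third call shows why low-volume queries need a longer window before any verdict: one click on a handful of impressions is not evidence either way.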

Where AutoResearch points next

The bigger idea is not that every business needs to train a model overnight. Most do not. The bigger idea is that agent workflows should become measurable loops, not one-off prompts.

That is the difference between “AI wrote a blog post” and “AI found a rising query, updated the right page, added internal links, checked Search Console two weeks later, and learned the title test failed.” One is content generation. The other is operating discipline.

If you want to build that discipline into your own analytics workflow, start smaller than Karpathy did. Pick one metric. Pick one weekly cadence. Use scheduled agents to run the check. Send the results to Slack. Then keep a memory of what you changed and whether it worked.
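The memory step can start as an append-only log file. The sketch below is one possible shape, not a DataVessel feature: `log_change`, `due_for_review`, the field names, the 14-day recheck default, and the sample page are all hypothetical, and it writes to a temporary directory so it is safe to run.

```python
import json
import tempfile
from datetime import date, timedelta
from pathlib import Path


def log_change(log_path: Path, *, target: str, change: str, assumption: str,
               metric: str, baseline: float, made_on: date,
               recheck_days: int = 14) -> dict:
    """Append what changed, why, and when to check the result."""
    record = {
        "target": target,
        "change": change,
        "assumption": assumption,
        "metric": metric,
        "baseline": baseline,
        "made_on": made_on.isoformat(),
        "recheck_on": (made_on + timedelta(days=recheck_days)).isoformat(),
        "result": None,  # filled in when the change is reviewed
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def due_for_review(log_path: Path, today: date) -> list[dict]:
    """Unreviewed changes whose recheck date has arrived."""
    with log_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records
            if r["result"] is None
            and date.fromisoformat(r["recheck_on"]) <= today]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "experiments.jsonl"
    log_change(log, target="/blog/example-post", change="rewrote title tag",
               assumption="a clearer title lifts CTR for the target query",
               metric="ctr", baseline=0.0, made_on=date(2025, 1, 6))
    pending_early = due_for_review(log, date(2025, 1, 10))
    pending_later = due_for_review(log, date(2025, 1, 20))

print(len(pending_early), len(pending_later))  # 0 1
```

A scheduled job could call `due_for_review` once a week and post the pending items to Slack, which keeps the cadence and the memory in one place.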

As frontier models become better at long-horizon work, the loop becomes more practical. Claude Fable 5’s long-running AI agent use cases show why stronger models still need the same ingredients: a metric, a harness, a progress log, tests, and a human review gate before risky changes ship.

Frequently Asked Questions

What is auto research in AI?

Auto research is an autonomous experiment loop where an AI agent proposes a change, runs a test, measures a metric, and keeps or discards the change. Karpathy’s AutoResearch applies this to LLM training, but the same pattern can apply to SEO, analytics, and ecommerce operations.

What is Karpathy’s AutoResearch?

Karpathy’s AutoResearch is an open-source project where an AI agent edits a small LLM training script, runs fixed 5-minute experiments, and keeps changes that improve validation bits per byte. The repo is designed around a simple, repeatable loop rather than a large framework.

Can auto research work without a GPU?

The original Karpathy repo targets single-GPU model training, but the auto research pattern does not require a GPU. Business use cases can run against Search Console, GA4, Shopify, WordPress, or Slack data as long as the agent has a clear metric and a safe action boundary.

How is auto research different from prompt optimization?

Prompt optimization improves instructions for a frozen model. Auto research can change the system being tested, such as code, thresholds, content, or workflows, then measure whether the change improved the target metric.

What is the safest first auto research workflow for a small site?

The safest first workflow is a read-heavy SEO CTR loop: identify high-impression, 0-click Search Console queries, update one existing page, save the assumption, and recheck the same query after 7–14 days. It has a clear metric and low operational risk.