TL;DR: Andrej Karpathy released autoresearch — a system where AI agents autonomously modify training code, run 5-minute experiments, and keep changes that improve performance. One overnight session: 126 experiments, measurable improvement, zero human intervention. The bigger idea: what if thousands of agents collaborate on research like a distributed community?
There’s a loop every ML practitioner knows well. Change the code. Run the experiment. Check the metrics. Adjust. Repeat.
Andrej Karpathy — co-founder of OpenAI, former Tesla AI lead — just automated that loop. And the implications go far beyond ML training.
What Autoresearch Does
Autoresearch is deceptively simple. You give an AI agent access to a training codebase and a set of instructions. The agent then:
- Reads the current code — a single file (train.py) containing the model architecture, optimizer, and training loop
- Makes a change — modifies hyperparameters, architecture, or optimizer settings
- Runs a 5-minute training experiment — Fixed time budget makes results comparable
- Checks if validation loss improved — If yes, keeps the change and commits to Git. If no, reverts.
- Repeats
That’s it. The agent loops through this cycle autonomously. At ~12 experiments per hour, an overnight run produces 100+ experiments.
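The keep-or-revert cycle fits in a few lines. Below is a minimal Python sketch of the idea, not Karpathy's actual code: `run_experiment` is a toy stand-in (the real system launches train.py on a GPU for five minutes and reads the validation loss), and commit/revert is simulated by keeping or discarding a config dict rather than calling Git.

```python
def run_experiment(config):
    # Toy stand-in for one fixed-budget training run. In the real system
    # this launches train.py for ~5 minutes and returns validation loss.
    return 1.0 + abs(config["lr"] - 3e-4) * 1000

def agent_loop(config, proposed_changes):
    """Keep-or-revert loop: try each change, keep it only if loss improves."""
    best = run_experiment(config)
    history = []
    for change in proposed_changes:
        trial = {**config, **change}
        loss = run_experiment(trial)
        kept = loss < best          # did validation loss improve?
        if kept:
            config, best = trial, loss   # real system: git commit
        # otherwise the change is simply discarded (real system: git revert)
        history.append((change, loss, kept))
    return config, best, history

config, best, history = agent_loop(
    {"lr": 1e-3},
    [{"lr": 5e-4}, {"lr": 2e-3}, {"lr": 3e-4}],
)
```

Everything else — the 126 experiments, the overnight run — is just this loop executed many times with an LLM proposing the changes.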
What Happened Overnight
Karpathy shared the results from one overnight session — 126 experiments run by Claude on an H100 GPU over roughly 10.5 hours.
The results:
- Starting validation loss: 0.9979 bits per byte
- Final validation loss: 0.9697 bits per byte
- Best discovery: Applying weight decay to embeddings — a finding the agent arrived at through systematic exploration
- 14 hyperparameters optimized including depth, batch size, learning rates, and window patterns
The agent also discovered what doesn’t work — weight tying catastrophically failed, parallel attention-MLP layers underperformed, multi-query attention degraded results. These negative findings are just as valuable. Knowing what not to try saves research time.
Why 5-Minute Experiments Matter
The fixed 5-minute budget is a key design decision. It means:
- Fast iteration — 12 experiments per hour instead of one per day
- Comparable results — Every experiment uses the same time budget, so improvements are meaningful
- Low cost per experiment — Failed experiments waste 5 minutes of compute, not hours
- Overnight viability — 100+ experiments while you sleep
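Enforcing a wall-clock budget is straightforward in the training script itself. A minimal sketch — the function and parameter names here are illustrative assumptions, not taken from the autoresearch repo:

```python
import time

def train_with_budget(budget_seconds, step_fn):
    # Run as many training steps as fit inside a fixed wall-clock budget.
    # Fixed budgets keep experiments comparable: a change that makes each
    # step faster simply buys more steps inside the same window.
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_seconds:
        step_fn()   # one optimizer step in a real training loop
        steps += 1
    return steps
```

With `budget_seconds=300`, every experiment gets exactly five minutes regardless of how the proposed change affects per-step cost, so final validation loss is directly comparable across runs.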
This is the opposite of how most ML research works. The traditional approach involves careful hypothesis formation, long training runs, and manual analysis. Autoresearch replaces that with high-volume, low-cost exploration — letting the agent try things a human researcher might consider too risky or too speculative to spend time on.
The Bigger Idea: Distributed Agent Research
Here’s where it gets really interesting. Karpathy’s next vision for autoresearch isn’t about one agent running experiments. It’s about thousands of agents collaborating.
In his words: “The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”
The current system grows a single thread of commits — one direction at a time. But Karpathy envisions something like SETI@home for ML research:
- The original repo is a seed from which many research directions branch
- Different agents explore different architectural ideas, optimizer strategies, or hardware platforms simultaneously
- Each agent writes a summary of its findings — like a mini “paper” — as a GitHub Discussion or PR
- Other agents read these findings before starting their own runs, building on what was discovered
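One way this could work mechanically: each agent drops a machine-readable summary of its run into a shared location, and a new agent filters its proposals against what has already been tried. The JSON format and function names below are assumptions for illustration — autoresearch does not define this protocol today:

```python
import json
import pathlib

def load_findings(findings_dir):
    # Read prior agents' mini "papers" (assumed here to be JSON summaries)
    # so a fresh run can build on -- or avoid -- known directions.
    findings = []
    for path in sorted(pathlib.Path(findings_dir).glob("*.json")):
        findings.append(json.loads(path.read_text()))
    return findings

def novel_directions(proposals, findings):
    # Skip proposals that an earlier agent already explored.
    tried = {f["change"] for f in findings}
    return [p for p in proposals if p not in tried]
```

The interesting design question is the summary format itself: it has to be rich enough for another agent to reuse, which is exactly the role a paper plays in a human research community.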
The bottleneck in research has always been human attention and tenacity. When agents handle the exploration, those bottlenecks disappear. As Karpathy puts it: “Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures.”
Git Wasn’t Built for This
There’s a practical problem though. Git assumes one “master” branch with temporary forks that merge back. That model breaks when you have hundreds of agents producing thousands of experimental branches that you’d never want to merge — you’d want to “adopt” and accumulate branches of commits.
Karpathy notes that existing abstractions will “accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.” The tooling hasn’t caught up with what’s now possible.
This is a pattern we’ll see more of. The infrastructure built for human collaboration — Git, GitHub, CI/CD — was designed around human limitations. When agents do the work, those limitations change, and the tools need to evolve.
What This Means Beyond ML
Autoresearch is about ML training today. But the pattern is universal:
- Change code → run test → check results → keep or revert
That loop applies to any domain where you can define a measurable improvement metric. SEO experiments. Ad copy variations. Pricing optimization. Infrastructure tuning.
The question isn’t whether autonomous experimentation becomes the default. It’s how quickly the tooling catches up to make it accessible beyond ML researchers with H100s.
We’re watching that closely at datavessel. The idea of agents that autonomously test, measure, and improve — then report findings to your team — is exactly the direction we’re building toward.
Try It Yourself
Autoresearch is open source. If you have a GPU and want to see what an AI agent discovers about your training pipeline overnight, the repo is here.
If you don’t have an H100 but want AI agents working on your business data — monitoring analytics, generating reports, finding insights — that’s what datavessel does.