Key Takeaways
- Claude Opus 4.6 built a cosmological physics solver from scratch over a few days, a task that groups with domain expertise typically complete over months to years of researcher time
- CLAUDE.md and CHANGELOG.md files give Claude persistent goals and long-term memory across multi-day sessions, including a log of failed approaches so dead ends are never repeated
- The Ralph loop orchestration pattern prevents Claude from stopping prematurely, iterating up to 20 times until success criteria are confirmed
- Anthropic’s C compiler project previously demonstrated multi-session agent work at scale, with Claude working across roughly 2,000 sessions to build a compiler capable of compiling the Linux kernel
Anthropic published research on March 23, 2026 showing Claude running a scientific computing project autonomously for multiple days, reaching sub-percent accuracy on a physics calculation that groups with domain expertise typically complete over months to years. This is not a prototype. It used a real HPC cluster, a real cosmological solver, and Claude Opus 4.6 working without continuous human supervision. For developers, researchers, and technical teams in the US and India already exploring AI automation, the architecture Anthropic describes is replicable today with Claude Code.
What Anthropic Actually Built and Why It Matters
The experiment centered on building a differentiable cosmological Boltzmann solver in JAX, numerical code that predicts the statistical properties of the Cosmic Microwave Background by evolving coupled equations for photons, baryons, neutrinos, and dark matter through the early universe. Groups with domain expertise have built differentiable JAX solvers covering a subset of CLASS features, and these efforts typically represent months to years of researcher time. Claude Opus 4.6 worked on the project from scratch over a few days, reaching sub-percent agreement with the reference CLASS implementation across its various outputs.
The researcher who ran the experiment, Siddharth Mishra-Sharma from Anthropic’s Discovery team, does not specialize in cosmology. He described the task as not being in his core scientific domain and noted he would not be able to complete it himself in any reasonable time frame. A side effect he reported: following Claude’s git commit history turned out to be an effective way to learn the underlying physics, with the commit log reading like lab notes from a fast, hyper-literal postdoc.
The Four-Part Architecture That Makes Multi-Day Runs Work
Getting Claude to operate reliably over days requires structure that most single-session users never set up. Anthropic’s approach uses four interlocking components:
- CLAUDE.md: A root-level instruction file Claude keeps in context throughout the project, encoding goals, design decisions, and rules. Claude can edit this file as it works, updating instructions for future sessions
- CHANGELOG.md: The agent’s portable long-term memory across sessions, functioning as lab notes. It tracks current status, completed tasks, failed approaches with explanations, accuracy tables at checkpoints, and known limitations. Without logging failed attempts, successive sessions will re-attempt the same dead ends
- Test oracle: A reference implementation or quantifiable objective Claude runs continuously to measure progress. In the Boltzmann solver project, this was the CLASS C source code, which Claude used to construct and run unit tests before every commit
- Git coordination: Claude commits and pushes after every meaningful unit of work, creating a recoverable history, making progress visible without active monitoring, and preventing work from being lost if compute allocation runs out mid-session
These four elements address the two core failure modes of long-running agents: context loss between sessions and undefined success criteria. The CHANGELOG.md pattern is particularly critical because an agent that does not record why an approach failed will waste entire sessions re-attempting strategies it already abandoned.
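The commit-and-log discipline above can be sketched as a short shell routine. Everything here is illustrative rather than taken from Anthropic's run: the file names, the commit message, and the RK4-stepper example are hypothetical stand-ins for whatever the agent actually attempted.

```shell
#!/bin/sh
# Sketch of the commit-after-each-unit pattern; paths, messages, and the
# RK4 example are illustrative, not from Anthropic's actual run.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.email "agent@example.com"
git config user.name "claude-agent"

# Record WHY the latest attempt failed, so later sessions do not
# retry the same dead end.
echo "- session 3: explicit RK4 stepper diverged at high k; switched to implicit method" >> CHANGELOG.md
echo "# solver stub" > solver.py

# Commit code and lab notes together so the next session inherits both.
git add CHANGELOG.md solver.py
git commit -q -m "Log failed RK4 attempt; switch to implicit stepper"
git log --format=%s -1
```

The key design choice is that the changelog entry and the code change land in the same commit, so a future session reading git history always sees the reasoning next to the change it explains.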
The Type of Scientific Task That Fits This Model
Not every research problem suits a long-running agent. Anthropic identified three traits that define compatible tasks: the work is well-scoped, the success criteria are clear, and human oversight can be occasional rather than continuous. Concrete examples include reimplementing a numerical solver, converting legacy scientific software from an old Fortran dialect to a modern language, and debugging a large codebase against a reference implementation.
The Boltzmann solver is structurally different from tasks that can be parallelized across many agents. Because it is a deeply coupled pipeline, a small numerical error or poor approximation in one stage can shift everything downstream. Anthropic chose a single sequential agent that spawns subagents as needed and uses the reference implementation to bisect discrepancies, rather than distributing the work. For Indian and US research teams choosing between agent architectures, this distinction matters: parallelization helps when components are independent; it hurts when they are not.
The Ralph Loop: Preventing Premature Completion
One underreported problem with long-horizon agents is agentic laziness: the model claims completion before the task is fully done. Left unchecked, this silently ends sessions short of the stated goal.
The Ralph loop addresses this directly. It is an orchestration pattern that re-prompts Claude when it signals completion, asking whether the work actually meets the defined success criteria. A typical invocation in Claude Code looks like this:
```bash
/ralph-loop:ralph-loop "Please keep working on the task until the success criterion of 0.1% accuracy across the entire parameter range is achieved." --max-iterations 20 --completion-promise "DONE"
```
Claude iterates up to 20 times, checking its own output against the target before confirming completion. Similar patterns include GSD (Get Shit Done) and the native /loop command built into Claude Code. For research teams in India or the US running Claude on overnight compute jobs, this pattern is the practical difference between waking up to a finished task and waking up to an agent that stopped at 60% three hours in.
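The outer-loop logic can be illustrated with a minimal shell sketch. This is not Anthropic's implementation: `run_agent` is a hypothetical stand-in for whatever launches a Claude Code session, and here it simply pretends the agent only reaches the success criterion on its third attempt.

```shell
#!/bin/sh
# Minimal sketch of a Ralph-style outer loop (illustrative only).
# run_agent is a stand-in for launching a Claude Code session; it
# pretends the success criterion is met on attempt 3.
run_agent() {
  if [ "$1" -ge 3 ]; then echo "DONE"; else echo "still iterating"; fi
}

max_iterations=20
i=1
while [ "$i" -le "$max_iterations" ]; do
  result=$(run_agent "$i")
  # Accept completion only when the agreed completion promise appears.
  if [ "$result" = "DONE" ]; then
    echo "completed after $i iterations"
    break
  fi
  i=$((i + 1))
done
```

The iteration cap matters as much as the re-prompt: without it, an agent that can never satisfy the criterion would burn compute indefinitely.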
Running Claude on an HPC Cluster: The Exact Setup
Anthropic’s workflow uses SLURM, the job scheduler standard in academic HPC environments including institutions across the US and India. The job script requests a compute node, launches Claude Code inside a tmux session, and supports runtimes of up to 48 hours:
```bash
#!/bin/bash
#SBATCH --job-name=claude-agent
#SBATCH --partition=GPU-shared
#SBATCH --gres=gpu:h100-32:1
#SBATCH --time=48:00:00
#SBATCH --output=agent_%j.log

cd $PROJECT/my-solver
source .venv/bin/activate
export TERM=xterm-256color
tmux new-session -d -s claude "claude; exec bash"
tmux wait-for claude
```
The tmux wrapper lets you detach from the session, close your laptop, and check progress remotely on your phone via GitHub. Steering the agent mid-run is as simple as SSH-ing into the cluster and re-prompting, or asking a local Claude Code instance to SSH in and run commands on your behalf.
For researchers at IITs, IISc, or US universities where HPC access is standard, this is a production-ready workflow requiring a configured Claude Code environment and a CLAUDE.md file to start.
How Sequential vs. Parallel Agent Architecture Compares
| Dimension | Single Sequential Agent | Parallel Agent Teams |
|---|---|---|
| Best for | Deeply coupled pipelines (e.g., solvers) | Independent, parallelizable subtasks |
| Example task | Cosmological Boltzmann solver | C compiler (2,000 sessions) |
| Error propagation risk | High if parallelized | Low for independent modules |
| Subagent use | Spawned as needed within one session | Distributed across many sessions |
| Memory continuity | CHANGELOG.md across sessions | Requires coordination mechanism |
Anthropic’s C compiler project used roughly 2,000 parallel sessions and successfully compiled the Linux kernel. The Boltzmann solver required the opposite approach because numerical errors in early stages cascade through the entire physics pipeline. Choosing the right architecture upfront is the single most consequential decision before launching a multi-day agent run.
Limitations and Honest Trade-offs
The Boltzmann solver Claude produced is not production-grade. It does not match the CLASS reference implementation to acceptable accuracy in every parameter regime. The agent also had clear gaps in its test coverage for a significant portion of the development cycle, testing the code at only a single parameter point rather than across the full parameter space. Claude made elementary domain errors as well, such as tripping over gauge conventions or spending hours chasing bugs that a domain expert would spot immediately. Long-running autonomous work today depends on the agent having a reliable test oracle; without one, there is no objective way for the agent to know whether it is making real progress.
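The single-point testing gap described above is easy to see in a toy sketch. Nothing here comes from the actual project: `compute_error` is a hypothetical stand-in for comparing solver output against a reference, with made-up numbers, and it returns percent error scaled by 100 as an integer because plain sh has no floating-point arithmetic.

```shell
#!/bin/sh
# Sketch of grid-based validation versus single-point testing.
# compute_error is an illustrative stand-in: it returns percent error
# scaled by 100 (integers, since plain sh has no float arithmetic).
compute_error() {
  case "$1" in
    0.30) echo 8 ;;    # 0.08% at the single fiducial point: looks fine
    *)    echo 240 ;;  # 2.40% elsewhere: single-point testing misses this
  esac
}

threshold=10  # the 0.10% target, in the same scaled units
fail=0
for omega_m in 0.20 0.25 0.30 0.35 0.40; do
  err=$(compute_error "$omega_m")
  if [ "$err" -gt "$threshold" ]; then
    echo "FAIL at omega_m=$omega_m (scaled error=$err)"
    fail=1
  fi
done
if [ "$fail" -eq 0 ]; then echo "all grid points pass"; else echo "grid check failed"; fi
```

A test oracle queried at one parameter point reports success; the same oracle swept over a grid exposes the regressions, which is exactly the coverage gap the agent exhibited.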
What This Means for Developers and Researchers Today
The opportunity cost framing Anthropic used is direct: every night you do not have agents working on well-defined problems is potential progress left unrealized. For researchers at Indian institutions managing compute budgets, or US lab teams with overnight GPU access, the architecture Anthropic describes converts idle compute time into autonomous scientific output.
The prerequisite is discipline in problem definition. Tasks suited to this model share three traits: clear deliverables, quantifiable success criteria, and a reference implementation or test suite to check against. Without all three, a long-running agent has no reliable way to measure whether it is advancing toward the goal.
Frequently Asked Questions (FAQs)
What is long-running Claude for scientific computing?
Long-running Claude refers to Claude Code sessions that operate autonomously for hours or days on scientific computing tasks with minimal human supervision. Anthropic demonstrated this in March 2026 using Claude Opus 4.6 to build a cosmological Boltzmann solver on an HPC cluster, reaching sub-percent accuracy against the CLASS reference implementation over a multi-day run.
What is CLAUDE.md and why does it matter for long-running tasks?
CLAUDE.md is a root-level instruction file that Claude Code loads automatically and keeps in context throughout a project. It stores the project’s goals, design decisions, and operating rules. Claude can also update it during the run, which means future sessions inherit decisions made in earlier ones rather than restarting blind.
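A minimal CLAUDE.md for a project like this might look like the following. The contents are illustrative, not Anthropic's actual file; the goal, rules, and design decision shown are hypothetical examples of the kinds of entries the article describes.

```markdown
# Project: Differentiable Boltzmann solver in JAX

## Goal
Match the CLASS reference implementation to 0.1% accuracy across the
full parameter range, not just at a single fiducial point.

## Rules
- Run the unit tests against CLASS reference outputs before every commit.
- After each unit of work: commit, push, and update CHANGELOG.md.
- Record every failed approach in CHANGELOG.md with the reason it failed.

## Design decisions
- Use an implicit stepper for the stiff photon-baryon equations.
```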
How does Claude avoid losing progress between sessions?
Anthropic recommends a CHANGELOG.md progress file that functions as the agent’s portable long-term memory. It logs completed steps, failed approaches with specific reasons, accuracy milestones at key checkpoints, and known limitations. Without this record, Claude will repeat failed strategies in subsequent sessions.
What is the Ralph loop in Claude Code?
The Ralph loop is an orchestration pattern that re-prompts Claude when it signals task completion, asking it to verify whether the success criteria are actually met before accepting the result. It iterates up to a configured maximum, commonly 20 cycles, before accepting the “DONE” confirmation, preventing premature task termination on long-horizon work.
What kinds of scientific tasks work best for long-running Claude agents?
Tasks that are well-scoped, have clear and measurable success criteria, and include a reference implementation or existing test suite work best. Specific examples from Anthropic’s research include reimplementing numerical solvers, converting legacy scientific code to modern languages, and debugging large codebases against reference outputs. Open-ended discovery tasks with no verifiable output are not well-suited to this model at present.
Why did Anthropic choose a single sequential agent instead of parallel agents for the Boltzmann solver?
The Boltzmann solver is a deeply coupled pipeline where a small numerical error or poor approximation in an early stage can subtly shift everything downstream. This makes it better suited to a single agent working sequentially and drawing from the full context of prior decisions, rather than distributing work across parallel agents that cannot share that causal chain.
Can this workflow run on university HPC clusters in India or the US?
Yes. Anthropic published a working SLURM job script that launches Claude Code inside a tmux session on a GPU node with up to a 48-hour runtime. SLURM is the standard scheduler at most academic HPC facilities in the US and India, including IIT, IISc, and major US research universities, making this workflow directly applicable to academic lab settings.
Is the output from a long-running Claude agent production-ready?
Not always. Anthropic explicitly noted the Boltzmann solver Claude built does not match the reference CLASS implementation to acceptable accuracy in every parameter regime, and the agent had test coverage gaps during significant portions of development. Long-running agents compress research timelines significantly, but human review and domain expertise remain essential before deploying outputs in production scientific contexts.