From AI Coding to Semi-Automated Science

Source asciidoc: `docs/article/from-ai-coding-to-semi-automated-science.adoc`

Artificial intelligence is now moving beyond the familiar role of a coding assistant. The more consequential shift is that AI systems are starting to participate in the research loop itself: proposing modifications, editing code, running experiments, evaluating outcomes, and preserving only the changes that improve results. That is not the same thing as autonomous science in the strong sense, and it is certainly not proof of runaway self-improvement. But it is a meaningful change in the mechanics of progress.

This distinction matters. For the last two years, public discussion has focused on productivity at the level of the individual developer: faster coding, faster debugging, faster prototyping. Those gains are real, but they are not the deepest story. The deeper story is that parts of scientific and engineering iteration are becoming machine-executable. Once that happens, the bottleneck begins to move. The scarce resource is no longer only implementation time. It becomes the quality of the search space, the quality of the evaluator, the cost of experimentation, and the discipline with which successful steps are retained.

A useful concrete example comes from Andrej Karpathy’s autoresearch project. Its design is intentionally modest. The system gives an agent access to a small but real language-model training setup based on nanochat, lets it modify the code, runs short training experiments, checks whether the metric improved, and keeps or discards the change accordingly. In other words, the system does not claim to have solved science. It operationalizes a narrower but highly important idea: some research work can be turned into an iterative search loop with machine participation. The visual pattern of progress is telling. It is not a single heroic breakthrough. It is a staircase of incremental improvements accumulated across many trials.
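The keep-if-improved loop at the heart of such a system can be sketched in a few lines. This is a toy illustration, not Karpathy's actual code: here `evaluate` stands in for a short training run on a simple objective, and `propose` stands in for an agent's edit to the codebase.

```python
import random

def evaluate(params):
    # Toy stand-in for a short training run: lower "loss" is better.
    # A real system would launch a training job and read back a metric.
    x, y = params["x"], params["y"]
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

def propose(params, rng):
    # Toy stand-in for an agent editing code: perturb one design choice.
    key = rng.choice(list(params))
    return {**params, key: params[key] + rng.gauss(0.0, 0.5)}

def research_loop(steps=200, seed=0):
    rng = random.Random(seed)
    best = {"x": 0.0, "y": 0.0}
    best_score = evaluate(best)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score < best_score:               # keep only changes that improve the metric
            best, best_score = candidate, score  # one step of the "staircase"
    return best, best_score
```

Run long enough, the retained sequence traces exactly the pattern described above: many rejected trials punctuated by occasional small accepted improvements.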

That staircase is the real signal. In many technical fields, progress does not arrive as one grand insight. It arrives as repeated hypothesis generation, implementation, testing, comparison, and selection. Human researchers have always known this. A large share of serious work is not the moment of inspiration; it is the disciplined elimination of bad ideas and the careful preservation of better ones. If an AI system can absorb even part of that loop, the throughput of research changes. The machine is not replacing scientific judgment in full, but it is increasing the number of informed shots a team can take per unit of time.

This is why it is more accurate to describe the moment as the rise of semi-automated research rather than “AI creating AI” in some mystical sense. The novelty is not that optimization exists. Science and engineering have always relied on optimization, search, and feedback. The novelty is the combination of a general-purpose language model, a mutable codebase, a controlled execution environment, and an evaluator that can decide whether a modification deserves to survive. That stack creates a practical research primitive: propose, modify, run, score, retain.

Other recent systems show the same broader pattern at larger scope. Sakana AI’s “The AI Scientist” pushed the loop much further, describing a pipeline that can generate ideas, search literature, implement experiments, analyze results, draft a manuscript, and even perform automated reviewing. In 2026, a Nature paper on the system described this as progress toward end-to-end automation of AI research, while also explicitly noting limitations and risks, including misleading experimental comparisons and the possibility of adding noise to scientific literature. The important point is not that the problem is solved. It is that the full research lifecycle is increasingly being decomposed into stages that machines can participate in.

Google DeepMind’s AlphaEvolve points in a similar direction from another angle. Rather than presenting a general science pipeline, it frames the problem as iterative code evolution guided by evaluators. In DeepMind’s description and white paper, AlphaEvolve uses language models to propose code changes, scores the variants with automated evaluators, and keeps those that improve measured performance. The reported outcomes are notable not because they suggest general machine intelligence, but because they show that a well-designed search-and-evaluation loop can produce useful results in hard domains, including optimization of computational infrastructure and selected mathematical problems. The headline should not be “the machine became a scientist.” The headline should be “algorithmic improvement itself is becoming more automatable.”

Microsoft Research’s RD-Agent makes the same trend even more explicit in industrial terms. Its architecture separates “Research” from “Development”: one component proposes or refines ideas based on performance feedback, while another implements and fixes code using execution feedback. That division is revealing. It treats R&D not as a romantic act of inspiration but as an operational system with roles, traces, checkpoints, and measurable outputs. This is exactly how mature organizations scale valuable work: not by waiting for genius, but by structuring iteration.
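That separation of roles can be made concrete with a small sketch. Everything here is hypothetical: the class names and the toy scoring rule are illustrative, not RD-Agent's real interfaces. What the sketch preserves is the division of labor the article describes: one component reasons over performance feedback, the other over execution feedback.

```python
from dataclasses import dataclass, field

@dataclass
class Researcher:
    """Proposes or refines ideas based on performance feedback."""
    history: list = field(default_factory=list)

    def propose(self):
        # Naive policy: keep doubling a capacity knob while scores improve,
        # and fall back to a safe baseline once they stop improving.
        if len(self.history) >= 2 and self.history[-1] <= self.history[-2]:
            return {"width": 32}
        return {"width": 32 * 2 ** len(self.history)}

    def observe(self, score):
        self.history.append(score)

@dataclass
class Developer:
    """Implements an idea and surfaces execution feedback."""
    def build_and_run(self, idea):
        width = idea["width"]
        if width > 128:                       # simulate an out-of-memory failure
            raise MemoryError("width too large")
        return 1.0 - 1.0 / width              # toy score: more capacity, better fit

def rd_loop(rounds=4):
    researcher, developer = Researcher(), Developer()
    best = 0.0
    for _ in range(rounds):
        idea = researcher.propose()
        try:
            score = developer.build_and_run(idea)  # execution feedback
        except MemoryError:
            score = 0.0                            # a failed run is feedback too
        researcher.observe(score)                  # performance feedback
        best = max(best, score)
    return best
```

The design choice worth noticing is that neither role needs to understand the other's internals: they communicate only through proposals and scores, which is what makes the loop traceable and auditable.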

Taken together, these projects suggest a directional change in AI progress. Models are no longer only the object of research. They are becoming instruments for accelerating research on models and adjacent computational problems. That shift has strategic consequences. First, it increases the value of well-designed evaluation. A weak evaluator can industrialize error just as efficiently as a strong evaluator can industrialize progress. If the metric is noisy, narrow, or gameable, automation simply moves faster in the wrong direction. Second, it increases the value of environment design. The more clearly a task can be executed, measured, and compared, the easier it is to place inside an automated loop. Third, it changes the human role. Researchers do not disappear; they move upward. They become architects of search spaces, curators of constraints, designers of evaluators, and auditors of results.
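The point about evaluators can be demonstrated directly. In the toy loop below (illustrative, not drawn from any of the systems above), the same selection rule is run against a clean metric and a noisy one. With noise, the loop retains changes whose reported score exceeds their true quality: it optimizes the measurement error, which is exactly what "industrializing error" looks like in miniature.

```python
import random

def true_quality(x):
    # Ground-truth objective, higher is better; optimum at x = 1.
    return -(x - 1.0) ** 2

def run_loop(noise, steps=300, seed=0):
    rng = random.Random(seed)
    best_x = 0.0
    best_reported = true_quality(best_x)
    for _ in range(steps):
        cand = best_x + rng.gauss(0.0, 0.3)
        # The evaluator reports true quality plus measurement noise.
        reported = true_quality(cand) + rng.gauss(0.0, noise)
        if reported > best_reported:          # retain on the *reported* metric
            best_x, best_reported = cand, reported
    return best_reported, true_quality(best_x)

clean_reported, clean_true = run_loop(noise=0.0)
noisy_reported, noisy_true = run_loop(noise=2.0)
# With a clean evaluator, reported and true quality agree. With a noisy one,
# the retained change looks much better on the metric than it really is.
```

The same dynamic appears in real loops whenever the benchmark is narrow or gameable: selection pressure finds the noise before it finds the signal.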

This is also where restraint matters. It would be a mistake to interpret these developments as evidence that science is about to become fully autonomous across domains. Machine learning is a particularly favorable target because many experiments are already digital, reproducible, and metric-driven. The same recipe does not transfer cleanly to every field. Wet-lab science, long-horizon causal inference, and research areas with ambiguous or delayed feedback remain far harder to automate. Even within AI itself, local benchmark gains can be misleading if they do not generalize. Semi-automated research can amplify overfitting, benchmark gaming, and shallow novelty unless it is governed by rigorous evaluation and human review.

Still, the trend is real. The most important consequence is not that AI writes more code. It is that AI begins to increase the rate at which hypotheses can be generated, tested, and filtered. That is a deeper lever. It changes the production function of progress. In practical terms, the frontier may increasingly belong not only to those with the strongest models, but also to those with the best experimental loops: the cleanest evaluators, the cheapest test environments, the best retention logic, and the strongest discipline around reproducibility.

So the right framing is neither panic nor complacency. We are not watching autonomous science arrive in finished form. We are watching the early industrialization of a research pattern. The shift is from manual iteration toward machine-assisted iteration, and in some narrow but important contexts, toward machine-driven iteration under human-designed constraints. That is already enough to matter.

The decisive question for the next phase of AI is therefore not whether machines can generate impressive text or write acceptable code. It is whether we can build reliable systems in which models help improve models without collapsing quality, truthfulness, or scientific discipline. If that answer continues to improve, then the next leap in AI will not come only from better base models. It will come from better research loops.