Claude Sonnet 4.5 Can Code Autonomously for 30 Hours 🤯
Claude Sonnet 4.5: 30-Hour Autonomous Coding and the Expanding Autonomy Frontier
Overview
This episode of AI Daily Brief covers the release of Anthropic’s Claude Sonnet 4.5, focusing on its coding capabilities and, most notably, its reported ability to code autonomously for up to 30 hours on a single task. The host places this milestone in the context of rapidly evolving agentic AI models, compares performance against OpenAI’s GPT-5 Codex, and discusses what extended autonomous operation means for the future of software development. The headlines segment covers a reported OpenAI Sora 2 launch with a companion TikTok-style app, AI-driven layoffs at Lufthansa, and California’s AI safety bill SB 53.
Source video: No URL provided (AI Daily Brief, published ~2025-09-30)
Prerequisites
- Familiarity with large language models (LLMs) and their use in code generation
- Basic understanding of agentic AI systems (models that plan, execute multi-step tasks, and iterate autonomously)
- Awareness of competing AI coding tools: Claude Code, OpenAI Codex, GPT-5
- Understanding of standard AI benchmarks (SWE-bench, TerminalBench, OSWorld)
- General knowledge of the Claude model family versioning (3.5, 3.6, 3.7, Sonnet 4, Opus 4.1, etc.)
Main Points
Claude Sonnet 4.5: Positioning and Benchmarks
- Anthropic describes Sonnet 4.5 as “the best coding model in the world,” targeting use cases in complex agentic tasks, computer use, reasoning, and math.
- Key benchmark results published by Anthropic:
  - SWE-bench Verified: 77.2% raw (vs. GPT-5 Codex’s 74.5%); 82% with parallel test-time compute
  - TerminalBench (agentic terminal coding): 50% vs. GPT-5’s 43.8%
  - OSWorld (computer use): 61.4% vs. Opus 4.1’s 44.4%
  - Financial analysis benchmark: 55.3% vs. GPT-5’s 46.9%
- Performance is broadly described as placing Sonnet 4.5 alongside or above Opus 4.1 and GPT-5 class models.
First Impressions: Mixed but Mostly Positive
- Some users (e.g., Gosu Coder, Jeremy Mack) reported no immediately perceptible difference from Sonnet 4.0 and questioned whether coding progress had plateaued.
- Others (e.g., Dan Shipper/Every, Simon Willison, Bindu Reddy, Ethan Mollick) noted genuine improvements, particularly in:
  - Speed (~50% faster than previous Claude versions per Every)
  - Instruction-following and parallel tool calling
  - Finance, statistics, and data analysis tasks
- Dan Shipper noted that Sonnet 4.5 is priced the same as Sonnet 4 and is about 5x cheaper than Opus 4.1, which effectively makes Opus redundant for most API use cases.
- Victor Taelin and Eric Provencher characterised the GPT-5 vs. Sonnet 4.5 distinction as “deep reasoning” (GPT-5) vs. “light reasoning” (Sonnet 4.5): Sonnet excels at efficient context use and speed, while GPT-5 excels at exhaustive multi-minute reasoning on hard problems.
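The pricing comparison attributed to Dan Shipper above can be checked with simple arithmetic. The sketch below assumes list prices of $3 / $15 per million input / output tokens for Sonnet 4.5 and $15 / $75 for Opus 4.1. These figures are not quoted in the episode and should be verified against current pricing. The dictionary keys are illustrative labels, not API model identifiers, and the token counts are made up.

```python
# Assumed list prices in USD per million tokens: (input, output).
# Not taken from the episode; verify before relying on them.
PRICES = {
    "sonnet-4.5": (3.00, 15.00),   # illustrative label, not an API model id
    "opus-4.1": (15.00, 75.00),    # illustrative label, not an API model id
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the assumed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200k input tokens and 20k output tokens.
sonnet = request_cost("sonnet-4.5", 200_000, 20_000)
opus = request_cost("opus-4.1", 200_000, 20_000)
print(f"Sonnet 4.5: ${sonnet:.2f}, Opus 4.1: ${opus:.2f}, ratio: {opus / sonnet:.1f}x")
```

Because both the input and output prices differ by the same factor under these assumptions, the ratio stays at 5x regardless of the input/output mix.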
Agentic Use Cases and Enterprise Adoption
- Cognition (Devin): Rebuilt Devin around Sonnet 4.5, citing the biggest leap since Sonnet 3.6:
  - Planning performance up 18%
  - End-to-end eval scores up 12%
  - Multi-hour sessions faster and more reliable
- Notable behaviour: Sonnet 4.5 is reportedly the first model aware of its own context window, proactively summarising progress as the limit approaches. Cognition calls the downside “context anxiety”: the model can take shortcuts or wrap up a task prematurely.
- Factory (enterprise agentic coding): Highlighted more reliable file editing, higher environmental awareness, and avoidance of overthinking simple tasks.
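The context-window behaviour Cognition describes above (summarising progress as the limit approaches) can be pictured with a toy budget tracker. This is only an illustration of the idea, not how Sonnet 4.5 or Devin works internally. The `ContextBudget` class, the 80% threshold, and the whitespace-based token count are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    """Toy tracker that signals when accumulated notes near a context limit."""

    limit: int                    # hypothetical context limit, in "tokens"
    summarize_at: float = 0.8     # assumed fraction of the limit that triggers a summary
    notes: list[str] = field(default_factory=list)
    used: int = 0

    def add(self, text: str) -> bool:
        """Record text; return True once usage reaches the summary threshold."""
        self.notes.append(text)
        self.used += len(text.split())  # crude stand-in for a real tokenizer
        return self.used >= self.limit * self.summarize_at

    def compact(self, summary: str) -> None:
        """Replace the accumulated notes with a single summary entry."""
        self.notes = [summary]
        self.used = len(summary.split())


budget = ContextBudget(limit=10)
first = budget.add("read the failing test output")   # 5 of 10 "tokens" used
second = budget.add("patched the parser module")     # 9 of 10 "tokens" used
if second:
    budget.compact("parser patched")
```

If the threshold is set too low, or usage is overestimated, the tracker compacts earlier than necessary. That is a rough analogue of the premature shortcuts that Cognition attributes to the model misjudging its remaining context.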
The 30-Hour Autonomous Coding Claim
- Anthropic reported Sonnet 4.5 coded a Slack/Teams-style chat application autonomously for 30 hours, producing approximately 11,000 lines of code, stopping only upon task completion.
- This significantly surpasses prior reported milestones:
  - Replit Agent 3: ~200 minutes of autonomous operation
  - GPT-5 Codex (announced ~2 weeks prior): 7+ hours on complex tasks
- Analyst Carlos Perez reverse-engineered the approach from leaked system prompts, identifying key mechanisms:
  - Forcing code >20 lines into durable artifacts (persistent append-only surfaces), one per response
  - Enforcing runtime constraints and governing tool loops
  - Supporting long-horizon autonomy via planning and feedback loops
- Anthropic illustrated progress by having every prior Claude version attempt to clone Claude.ai; a usable, functional clone only became possible with Sonnet 4 and is now built in ~5 hours autonomously.
- SWE-bench scores moved from ~33% to 82% in approximately one year.
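The mechanisms attributed to Carlos Perez's analysis above (routing code longer than 20 lines into an append-only artifact, and bounding the tool loop) can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `ArtifactStore`, `route_output`, `run_loop`, and the step and time budgets are hypothetical names and parameters invented for the example. Only the 20-line threshold comes from the summary.

```python
import time
from typing import Callable, Optional

MAX_INLINE_LINES = 20  # threshold mentioned in the summary above


class ArtifactStore:
    """Append-only store: artifacts can be added but never edited or removed."""

    def __init__(self) -> None:
        self._artifacts: list[str] = []

    def append(self, code: str) -> int:
        """Store code and return its artifact id."""
        self._artifacts.append(code)
        return len(self._artifacts) - 1

    def __len__(self) -> int:
        return len(self._artifacts)


def route_output(code: str, store: ArtifactStore) -> str:
    """Keep short snippets inline; move longer code into one durable artifact."""
    if len(code.splitlines()) > MAX_INLINE_LINES:
        return f"[artifact {store.append(code)}]"
    return code


def run_loop(
    step: Callable[[], Optional[str]],
    store: ArtifactStore,
    max_steps: int,
    max_seconds: float,
) -> list[str]:
    """Call step() until it returns None or a step/time budget is exhausted."""
    start = time.monotonic()
    responses: list[str] = []
    for _ in range(max_steps):
        if time.monotonic() - start > max_seconds:
            break  # runtime constraint reached
        code = step()
        if code is None:
            break  # the task reports completion
        responses.append(route_output(code, store))
    return responses


# Hypothetical agent outputs: a short snippet, a 25-line module, then completion.
outputs = iter(["print('hi')", "\n".join(f"x{i} = {i}" for i in range(25)), None])
store = ArtifactStore()
responses = run_loop(lambda: next(outputs), store, max_steps=10, max_seconds=5.0)
```

Keeping long code out of the response stream is one plausible way to avoid truncation over a long session. The step and time caps ensure the loop ends even if the task never reports completion.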
Imagine with Claude: Generative UI Research Preview
- Anthropic launched a “bonus research preview” called Imagine with Claude, where the model generates entire software UIs and backend functionality on the fly in real time — no pre-written code or predetermined functionality.
- Described as pioneering the “model as backend” concept, similar to the WebSIM paradigm.
- Example use cases: interactive choose-your-own-adventure narratives, side-by-side PM productivity dashboards.
- Limitations noted: dense UIs (e.g., simulated email clients) often malfunction or are too slow; still a generation away from seamless deployment.
- Commentators such as swyx (Latent Space/Cognition) and Josh Bickett view it as a potential new human-computer interaction paradigm: a “generative computer” accessed via natural language.
Claude Code Platform Upgrades
- Claude Agent SDK: Exposes tools, context management systems, and permissions frameworks embedded in Claude Code for developer use.
- Updated terminal interface and a new VS Code extension allowing Claude Code use directly in an IDE.
- Checkpoints feature: Allows instant rollback of Claude’s latest changes — described as highly practical for agentic coding workflows.
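The rollback idea behind the Checkpoints feature above can be illustrated with a small in-memory snapshot stack. This is a conceptual sketch only and says nothing about how Claude Code stores or restores checkpoints. The `Workspace` class and its methods are hypothetical.

```python
class Workspace:
    """Toy in-memory workspace mapping file paths to contents."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self._checkpoints: list[dict[str, str]] = []

    def checkpoint(self) -> None:
        """Save a snapshot of the current files before the agent edits them."""
        self._checkpoints.append(dict(self.files))

    def write(self, path: str, content: str) -> None:
        """Create or overwrite a file."""
        self.files[path] = content

    def rollback(self) -> bool:
        """Restore the most recent snapshot; return False if none exists."""
        if not self._checkpoints:
            return False
        self.files = self._checkpoints.pop()
        return True


ws = Workspace()
ws.write("app.py", "v1")
ws.checkpoint()              # snapshot taken before the agent's changes
ws.write("app.py", "v2")     # the agent edits an existing file
ws.write("new.py", "x = 1")  # and creates a new one
restored = ws.rollback()     # undo both changes in one step
```

Taking the snapshot before each batch of agent edits is what makes the rollback a single step, even when the batch touched several files.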
Headlines: OpenAI Sora 2 and AI Video App
- Sources (Wall Street Journal, Wired) report OpenAI is imminently launching Sora 2 alongside a TikTok-style short-form video app featuring AI-generated vertical video content.
- App features: swipe-feed, recommendation algorithm, like/comment/remix options; no user-uploaded content.
- Users can verify their likeness for inclusion in generated clips; public figures require explicit opt-in, fictional characters require opt-out.
- OpenAI sees a strategic opportunity given uncertainty around TikTok’s U.S. future.
- Debate continues over the societal implications of AI-generated short-form content.
Lufthansa AI-Driven Workforce Reduction
- Lufthansa announced the elimination of 4,000 full-time-equivalent roles by 2030 (~4% of its workforce), primarily from among 10,000 administrative positions.
- Explicitly attributed to digitisation and increased AI adoption improving efficiency.
- Company is financially healthy (targeting 10% operating margins by 2028, up from 4.4% in 2024); framed as proactive restructuring, not distress-driven.
California AI Safety Bill SB 53 Signed
- Governor Newsom signed SB 53, a moderated successor to the vetoed SB 1047.
- Requires leading AI companies to report safety protocols and disclose highest-degree risks (focus: catastrophic risks such as bioweapons assistance or mass casualty facilitation).
- Strengthens whistleblower protections for AI lab employees.
- Anthropic supported the bill; Google and OpenAI opposed it; Meta was neutral.
- Critics (e.g., a16z) argue it regulates model development rather than deployment, potentially entrenching large incumbents and disadvantaging startups.
Key Concepts
- SWE-bench Verified: A standardised benchmark measuring an AI model’s ability to resolve real-world software engineering tasks from GitHub issues.
- TerminalBench: A benchmark evaluating agentic coding performance in terminal/command-line environments.
- OSWorld: A benchmark measuring an AI model’s ability to operate a computer OS autonomously (computer use).
- Agentic coding: AI systems that autonomously plan, execute, iterate, and debug multi-step software development tasks without continuous human prompting.
- Parallel tool calling: The ability of a model to invoke multiple tools simultaneously (e.g., running multiple searches or reading multiple files at once) rather than sequentially.
- Context anxiety: Term coined by Cognition describing Sonnet 4.5’s observed behaviour of taking shortcuts or leaving tasks incomplete when it (sometimes incorrectly) believes it is approaching its context window limit.
- Durable artifacts: Persistent, append-only output surfaces used to store code modules across a long agentic session, preventing truncation.
- Model as backend: A paradigm in which an LLM generates not only the UI but also all backend logic and functionality on the fly, with no pre-written code.
- Imagine with Claude: Anthropic’s research preview demonstrating real-time, fully generative software UI and functionality creation using Sonnet 4.5.
- Light vs. deep reasoning: A distinction drawn by community analysts — Sonnet 4.5 characterised as “light reasoning” (fast, efficient, context-aware) versus GPT-5’s “deep reasoning” (slow, exhaustive, multi-minute deliberation).
- Claude Agent SDK: A developer toolkit exposing Claude Code’s internal tools, context management, and permissions frameworks for building custom agentic applications.
- SB 53: California AI Safety Bill requiring safety protocol reporting and risk disclosure from leading AI developers, with strengthened whistleblower protections.
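Parallel tool calling, as defined in the list above, can be illustrated with Python's `asyncio`. The sketch below assumes a hypothetical `read_file` tool that simulates I/O latency with a sleep. It shows only the difference between sequential and concurrent dispatch on the harness side, not how any particular model or SDK issues tool calls.

```python
import asyncio


async def read_file(path: str) -> str:
    """Hypothetical tool: pretend to read a file after some I/O latency."""
    await asyncio.sleep(0.1)
    return f"contents of {path}"


async def call_sequentially(paths: list[str]) -> list[str]:
    """Run one tool call at a time; total latency grows with the number of calls."""
    return [await read_file(p) for p in paths]


async def call_in_parallel(paths: list[str]) -> list[str]:
    """Start all tool calls at once; gather returns results in input order."""
    return list(await asyncio.gather(*(read_file(p) for p in paths)))


paths = ["a.py", "b.py", "c.py"]
sequential_results = asyncio.run(call_sequentially(paths))
parallel_results = asyncio.run(call_in_parallel(paths))
```

Both functions return the same results. With three simulated 0.1-second calls, the sequential version waits for each in turn, while the concurrent version overlaps the waits.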
Summary
The central finding of this episode is that Claude Sonnet 4.5 represents a significant step forward in autonomous agentic coding, most dramatically illustrated by Anthropic’s claim that the model coded a Slack-equivalent application for 30 consecutive hours without human intervention — more than quadrupling the previously reported record of 7 hours set by GPT-5 Codex just weeks earlier. While first-impression reviews from developers are mixed — with some finding the quality leap over Sonnet 4.0 modest and others observing clear gains in speed, reliability, and non-coding domains like finance and statistics — enterprise agentic platforms such as Cognition’s Devin reported meaningful capability jumps and rebuilt their product around the model. The host argues that agentic coding improvements are not merely a niche technical milestone but a bellwether for broader model capability progress, and that the trajectory of benchmark improvement (SWE-bench rising from ~33% to 82% in roughly one year) signals a relentless pace of change that is beginning to reshape how software is built, potentially even including the models themselves.