GPT-5 is 58% AGI

ai-daily-brief-podcast

Overview

This episode of AI Daily Brief covers two main segments: a headlines roundup of notable AI industry news, followed by a deep-dive discussion of a new academic framework for defining and measuring progress toward Artificial General Intelligence (AGI). The main segment centers on a newly published paper from researchers affiliated with the Center for AI Safety, which proposes a quantifiable, multi-dimensional definition of AGI and uses it to score GPT-4 at 27% and GPT-5 at 58%. The host argues that while AGI definitions are largely irrelevant to day-to-day AI application, they are increasingly consequential for financial markets and investment decisions.

Prerequisites

  • Familiarity with large language models (LLMs) and their general capabilities and limitations
  • Basic understanding of the AI industry landscape (OpenAI, Anthropic, Google DeepMind, Meta, etc.)
  • Awareness of common AI terminology: context windows, hallucinations, fine-tuning, multimodal models, agents
  • General knowledge of the AGI concept and why it is a subject of debate
  • Some familiarity with financial/market dynamics around AI stocks and enterprise SaaS metrics (ARR, valuation)

Main Points

Headlines: Claude Code Comes to the Web and Mobile

  • Anthropic has expanded Claude Code beyond terminals and IDEs to a web app and iOS app.
  • The key new capability is spinning up background agents to run multiple tasks in parallel across different repositories.
  • Features include automatic PR creation and change summaries, supporting asynchronous agentic coding workflows.
  • Product manager Cat Wu indicated the CLI will remain the most intelligent and customizable interface, but the goal is to “put Claude Code everywhere.”

Headlines: Replit Projects $1 Billion in Revenue by End of 2026

  • Replit has reached $240 million ARR, up from $16 million at the end of 2024, a roughly 15x increase in under a year.
  • The company has 150,000 paying customers and 40 million free users; enterprise margins are close to 80%.
  • The consumer segment operates at a loss (loss-leader model), while enterprise adoption — including companies like Duolingo and Zillow — drives revenue.
  • Replit is positioned as a replacement for no-code/low-code tools that historically underdelivered.
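
The growth figure quoted above is a simple ratio of current to prior ARR. A minimal Python sketch of that arithmetic, using only the two ARR figures from this summary; the function name `growth_multiple` is a hypothetical helper chosen for illustration:

```python
def growth_multiple(current, previous):
    """Return how many times larger `current` is than `previous`."""
    if previous <= 0:
        raise ValueError("previous value must be positive")
    return current / previous

# ARR figures in millions of dollars, as reported in the episode.
replit_multiple = growth_multiple(240, 16)
print(f"Replit ARR grew {replit_multiple:.0f}x")
```

The same helper applies to any of the before/after pairs in the headlines, such as downloads or daily active users.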

Headlines: Meta’s Standalone AI App Gaining Traction

  • Meta’s AI app now sees over 300,000 downloads per day, up from ~100,000 in mid-September.
  • Daily active users reached 2.7 million, up from 775,000 the prior month.
  • The growth correlates with the September 25 launch of the Vibes feed (AI-generated image and video content), though the feature was widely criticized as “AI slop.”
  • OpenAI’s Sora app requiring an invite code may also be driving users toward Meta’s freely available platform.

Headlines: OpenEvidence Raises $200M at $6B Valuation

  • OpenEvidence, an AI assistant for doctors, raised $200M at a $6B valuation, following a $210M round in July at a $3.5B valuation.
  • The platform now supports 15 million clinical consultations per month, up from 8.5 million in July.
  • The product is free for registered medical professionals and monetized through advertising; it has expanded into 10,000 medical centers.
  • A key competitive moat: the model is fine-tuned on 100 million real-world clinical consultations, data no competitor possesses.
  • The host notes this illustrates a broader thesis — “data exhaust” from real-world usage may become extremely valuable even as the “bitter lesson” (scale beats specialization in pre-training) has held so far.

Headlines: Suno Fundraising and Music Industry Détente

  • Music-gen startup Suno is in talks to raise $100M at a $2B valuation, quadrupling its previous valuation; it is generating $100M ARR.
  • Suno and competitor Udio face copyright lawsuits from Universal and Warner Music; talks to settle and establish a licensing framework are underway, with labels potentially taking equity stakes.
  • Spotify and Universal Music Group are signaling a more pro-AI, partnership-oriented stance — consistent with the music industry’s historical pattern of eventually monetizing disruptive technologies.

Headlines: Starbucks “All In” on AI

  • Starbucks CEO Brian Niccol highlighted a scaled in-store deployment called the “green dot,” a knowledge assistant for store leaders that covers equipment troubleshooting and drink recipes.
  • Pilots exist for inventory, supply chain forecasting, and scheduling, but none are at scale.
  • Niccol cited measurable ROI in software/code development speed but was cautious about broader claims.
  • Robot baristas were explicitly ruled out for the near term.

Main Segment: Why AGI Definitions Matter (Even If Practically Irrelevant)

  • The host reiterates a recurring position: AGI as a concept has little practical relevance for day-to-day AI application in business.
  • However, AGI definitions are becoming financially material because progress toward AGI is now a factor in how markets value AI stocks, which are central to the broader economy.
  • The week’s context: OpenAI co-founder Andrej Karpathy publicly stated that he believes AGI is still a decade away, in contrast to other estimates of one to two years, which reignited the debate.
  • Karpathy’s own definition of AGI (reflecting original OpenAI intent): a system that can perform any economically valuable task at human performance or better — including physical work, not just knowledge work.

The Landscape of Existing AGI Definitions

  • OpenAI (2023 framework): “AI systems that are generally smarter than humans.”
  • Sam Altman (2025): A system that can “tackle increasingly complex problems at human level in many fields.”
  • OpenAI Five Levels of AI:
    • Level 1 — Chatbots (conversational language)
    • Level 2 — Reasoners (human-level problem solving)
    • Level 3 — Agents (systems that can take actions)
    • Level 4 — Innovators (AI that aids in invention)
    • Level 5 — Organizations (AI that can do the work of an entire organization)
    • Current models are estimated to be in the Level 3–4 range.
  • Gartner: “Intelligence of a machine that can accomplish any intellectual task a human can perform.”
  • Google: Hypothetical intelligence able to “understand or learn any intellectual task a human can.”
  • Amazon: Software “able to perform tasks it is not necessarily trained or developed for.”
  • ARC AGI Prize: AGI is “a system that can efficiently acquire new skills outside of its training data” — emphasizing generalization power over task-specific skill.
  • OpenAI/Microsoft contract definition: AGI achieved when OpenAI produces software generating $100 billion in profits.
  • Elon Musk: Capable of doing anything a human with a computer can do, but not smarter than all humans and computers combined; estimates 3–5 years away; gives Grok 5 a 10% chance.

The New Framework: AGI Definition Paper (agidefinition.ai)

  • Produced by researchers working with the Center for AI Safety, the paper introduces a quantifiable framework grounded in Cattell-Horn-Carroll (CHC) theory — a well-established model of human cognition.
  • AGI is defined as: matching the cognitive versatility and proficiency of a well-educated adult.
  • AI performance is split into 10 equally weighted categories, each scored out of 10:
    1. Reading and Writing
    2. Math
    3. Reasoning
    4. Working Memory
    5. Memory Storage
    6. Memory Retrieval
    7. Visual
    8. Auditory
    9. Speed
    10. Knowledge
  • Each category contains multiple subcategories assessed individually.
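
The scoring rule described above reduces to simple arithmetic: ten equally weighted categories, each scored out of 10, summed into a percentage. The sketch below is one way to express that, not the paper's own tooling. The function name `agi_score` is hypothetical, and the example profile uses made-up numbers rather than the paper's per-category results.

```python
MAX_PER_CATEGORY = 10  # each category is scored out of 10
NUM_CATEGORIES = 10    # ten equally weighted categories

def agi_score(category_scores):
    """Return the overall score as a percentage (0-100).

    category_scores: ten numbers, one per category, in the order listed above.
    """
    if len(category_scores) != NUM_CATEGORIES:
        raise ValueError(
            f"expected {NUM_CATEGORIES} scores, got {len(category_scores)}"
        )
    for score in category_scores:
        if not 0 <= score <= MAX_PER_CATEGORY:
            raise ValueError(f"score {score} is outside 0..{MAX_PER_CATEGORY}")
    # Equal weighting: each category contributes up to 10 of the 100 points.
    return 100 * sum(category_scores) / (NUM_CATEGORIES * MAX_PER_CATEGORY)

# Hypothetical profile (made-up numbers, NOT the paper's results).
example_profile = [9, 10, 6, 4, 0, 3, 4, 5, 3, 8]
print(agi_score(example_profile))
```

Because every category has the same ceiling, a model cannot offset a zero in one category by excelling in another beyond 10 points.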

Benchmark Results: GPT-4 vs. GPT-5

  • GPT-4 scored 27%; GPT-5 scored 58% under this framework.
  • GPT-5 made significant gains in reading/writing and math relative to GPT-4.
  • GPT-5 earned points in several categories where GPT-4 was entirely deficient: reasoning, working memory, memory retrieval, visual, and auditory.
  • Those newly scored areas remain nascent compared to math performance.
  • Dan Hendrycks (Director, Center for AI Safety): “AGI won’t arrive in a year, but it could easily arrive this decade.”

The Biggest Gap: Memory

  • The paper identifies memory — specifically long-term memory storage and reliable retrieval — as “perhaps the most significant bottleneck.”
  • Current models “fake memory” by using large context windows and external retrieval tools, masking real deficits.
  • Both GPT-4 and GPT-5 fail to form lasting memories across sessions and still hallucinate when retrieving facts.
  • This is a common critique raised by AGI skeptics and a genuine limitation for personalization and dependable learning over time.
  • Anthropic’s recent “skills” feature represents early-stage progress in this area, but no model approaches human-level memory capability.

Strengths and Limitations of the Framework

  • Strength: Converts AGI from a vague buzzword into a measurable, trackable numeric score, enabling more objective discussion of progress with each model release.
  • Strength: Highlights what is missing rather than only frontier capabilities — e.g., despite gold-medal performance at the International Mathematical Olympiad, further math gains contribute little to the AGI score compared to closing gaps in audio, visual, and memory.
  • Limitation: The framework’s scope is cognitive ability, not motor control or economic output. A high score does not guarantee business value.
  • The host argues that both cognitive/functional and economic definitions have merit and serve complementary purposes.
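
The point about saturated categories can be made concrete. Because each category is capped at 10, a category already at its cap has no points left to add to the total, so the remaining headroom sits in the weakest areas. A rough sketch follows, with hypothetical function names and made-up scores that are not the paper's per-category results:

```python
CAP = 10  # maximum score per category

def remaining_headroom(scores):
    """Map each category to the points it could still add to the total."""
    return {name: CAP - value for name, value in scores.items()}

def biggest_gaps(scores, top_n=3):
    """Return the top_n categories with the most unclaimed points."""
    headroom = remaining_headroom(scores)
    return sorted(headroom, key=headroom.get, reverse=True)[:top_n]

# Made-up illustrative scores (NOT taken from the paper).
hypothetical = {
    "Math": 10,
    "Memory Storage": 0,
    "Auditory": 5,
    "Visual": 4,
    "Knowledge": 9,
}
print(remaining_headroom(hypothetical)["Math"])
print(biggest_gaps(hypothetical))
```

In this made-up profile, math has zero headroom, so any further math improvement leaves the total unchanged, while memory storage alone could add up to 10 points.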

Key Concepts

  • AGI (Artificial General Intelligence): A contested term broadly referring to AI systems that can match or exceed human cognitive performance across a wide range of tasks; definitions vary significantly by organization and researcher.
  • Cattell-Horn-Carroll (CHC) Theory: A well-validated psychometric model of human intelligence used in this paper as the foundation for categorizing AI cognitive capabilities.
  • ARC AGI Prize: A benchmark and competition designed to test AI generalization ability — specifically the capacity to acquire new skills outside of training data — as a proxy for intelligence rather than task-specific skill.
  • Five Levels of AI (OpenAI framework): A staged model of AI progress from chatbots (Level 1) through autonomous organizations (Level 5), adapted from a Google DeepMind paper.
  • Context window: The amount of text/data a model can process in a single session; models “faking” memory by using large context windows do not retain information across sessions.
  • Data exhaust: Real-world usage data generated as a byproduct of deploying AI tools; argued to be a significant future competitive moat for vertical AI companies.
  • Loss-leader model: A business strategy where a product or segment is offered at a loss to drive adoption, with monetization occurring elsewhere (e.g., enterprise customers).
  • Claude Code: Anthropic’s agentic coding tool, now available via web and iOS in addition to CLI/IDE environments.
  • Vibes feed: Meta’s AI-generated image and video content feed, launched September 25, 2025, correlated with a spike in Meta AI app downloads.
  • OpenEvidence: An AI clinical decision support tool for doctors, trained on medical journals and fine-tuned on 100 million real-world consultations.

Summary

The episode’s central argument is that while AGI definitions are largely academic in terms of practical AI deployment, they are becoming financially consequential as markets increasingly price AI stocks based on perceived progress toward AGI. Against this backdrop, a new paper from researchers affiliated with the Center for AI Safety — grounded in Cattell-Horn-Carroll cognitive theory — proposes a rigorous, quantifiable ten-category framework for measuring AGI progress and applies it to score GPT-4 at 27% and GPT-5 at 58%. The framework’s key contribution is transforming AGI from an endlessly debated buzzword into a trackable scorecard, revealing that while frontier models have achieved extraordinary performance in math and coding, they remain critically deficient in memory, auditory understanding, and visual reasoning. The host concludes that this kind of structured framework is a valuable addition to the field — not because it will resolve all debate, but because it provides a shared heuristic for assessing genuine progress with each successive model generation, even as economic-output-based definitions remain relevant for contractual and market purposes.