Open Questions
The big debates about AI — tracked from when they were first seriously discussed to how they're resolving. Inspired by Dwarkesh Patel's format.
📂 Data lives in questions.json — disagree? Open an issue.
Still Open
Genuine uncertainty remains. These are the hardest questions.
🔴 Why haven't LLMs made genuine scientific discoveries despite knowing everything?
Still Open
Evidence
LLMs have essentially every known fact memorized but haven't made the novel connections that count as discoveries. Humans with far less knowledge routinely find cross-domain insights. Possible explanations: combinatorial explosion (both concepts need to be in attention simultaneously), low learning efficiency, or a creative capacity that pre-training simply doesn't provide.
Analysis
This might be the deepest open question in AI. If knowledge is cheap to store (all of Wikipedia's text fits comfortably on a laptop), why can't models that have it all memorized make the leaps that moderately intelligent humans make? The optimistic read: once models get true agency plus learning efficiency, their knowledge advantage will make them "fucking dominate" (Dwarkesh's framing). The pessimistic read: there's something about generalization and creativity that current architectures fundamentally lack.
🔴 Why are all AI labs converging on the same architecture?
Still Open
Evidence
Every major lab is making "thinking" models with chain-of-thought reasoning. Is this because it's the only thing that works? Or are labs just copying each other while equally promising alternative directions go unexplored?
Analysis
The convergence is suspicious. Either the transformer + CoT + RLHF recipe is genuinely the only viable path (which would be remarkable and somewhat concerning), or there's a massive blind spot where alternative approaches could yield breakthroughs. The success of DeepSeek R1 with a different training recipe hints that diversity might be underexplored.
🔴 Can AI achieve superhuman science without human-level learning efficiency?
Still Open
Evidence
Models take orders of magnitude more data than humans to learn equivalent skills, even skills they perform at the 99th percentile. Einstein generalized from a few thought experiments and murky observations — that's extreme learning efficiency. Creativity and learning efficiency might be the same thing.
Analysis
If creativity IS learning efficiency, then the path to AI scientists isn't more data or bigger models — it's fundamentally better learning algorithms. This reframes the entire scaling debate: maybe we're scaling the wrong thing. The missing middle between pre-training (skim everything) and in-context learning (short-term memory) might be the key unlock.
🔴 Why does test-time compute keep helping even at absurd scale?
Still Open
Evidence
o3-high wrote 43 million words per task on ARC-AGI and scores kept improving. What did it figure out with word 42 million? RL may just upweight ~10 tokens of MCTS-like scaffolding ("wait", "let's backtrack"), which explains why reasoning distills easily.
Analysis
If reasoning is really just 10 tokens of scaffolding that can be distilled trivially, the current test-time compute scaling might be a temporary inefficiency — future models will do by instinct what current ones need millions of words for. But if it's NOT just scaffolding, and longer thinking genuinely produces better results at arbitrary scale, that has profound implications for what intelligence even is.
🔴 Will AI cause explosive economic growth?
Still Open
Evidence
If AI can automate cognitive labor at 1000x lower cost, historical economic models predict dramatic GDP acceleration. But physical world constraints (energy, manufacturing, regulation, human adoption speed) create friction. Software-only tasks are accelerating; physical world integration is slower.
Analysis
The "software-only singularity" scenario is increasingly plausible — dramatic acceleration in digital domains while physical world constraints throttle overall growth. The question is whether the digital acceleration is big enough to move GDP numbers at national/global scale, or whether it stays concentrated in tech sector productivity.
🔴 How much of a parallelization penalty do multi-agent systems pay?
Still Open
Evidence
Breaking problems across multiple agents loses whole-context visibility. But parallel agents can explore more solution space. Current multi-agent systems show coordination overhead but also emergent capabilities. The optimal split between single deep-thinker and many parallel workers is unknown.
Analysis
This matters enormously for the economics of AI. If parallelization penalty is low, you can throw 1000 cheap agents at a problem instead of one expensive one. If it's high, single powerful models win. Early evidence suggests task-dependent: decomposable tasks parallelize well, tightly coupled reasoning doesn't.
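One rough lens on the "parallelization penalty," borrowed from classical parallel computing, is Amdahl's law: if only a fraction p of a task decomposes cleanly across workers, n agents can speed it up by at most 1/((1-p) + p/n). The numbers below are illustrative, not from the source.

```python
def speedup(p: float, n: int) -> float:
    """Amdahl's law: ideal speedup from n parallel workers
    when only a fraction p of the work is decomposable."""
    return 1.0 / ((1.0 - p) + p / n)

# Decomposable task (90% parallel): 1000 agents approach the 1/(1-p) = 10x cap.
print(round(speedup(0.9, 1000), 1))
# Tightly coupled reasoning (30% parallel): 1000 agents barely help.
print(round(speedup(0.3, 1000), 1))
```

This matches the task-dependence noted above: the ceiling is set by the coupled fraction of the work, not by how many agents you can afford.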
Partially Resolved
Evidence is mounting but the jury's still out.
🟡 Will AI models converge in capability?
Partially Resolved
Evidence
Frontier models score within a few points of each other on standard evals (MMLU, HumanEval, SWE-bench). But practitioners consistently prefer different models for different tasks — Claude for complex coding, Gemini for multimodal and research, GPT for general instruction following. The rapid leapfrogging cycle (monthly lead changes) confirms convergence at the benchmark level while user preferences suggest meaningful differentiation in practice.
Analysis
Benchmark convergence among frontier models is nearly absolute — GPT-5, Claude 4, Gemini 3 Pro, and Grok 4 score within a few points of each other on standard evals. But experiential convergence remains partial. Divergent post-training paradigms (OpenAI's RL reasoning focus, Anthropic's Constitutional AI, Google's multimodal integration) create distinct behavioral profiles. Users consistently report different 'vibes' and specialized strengths despite identical paper scores. This suggests benchmarks are measuring the wrong things — or that the gap between 'can do X on a benchmark' and 'reliably does X in production' is where the real differentiation lives.
🟡 Will open-source AI keep up with frontier labs?
Partially Resolved
Evidence
DeepSeek R1 proved open-source can compete on reasoning at dramatically lower cost. Llama 3 closed the gap to ~1 generation behind frontier. But frontier models still lead on the hardest tasks. Kimi and other Chinese labs are still emerging wildcards.
Analysis
The gap is persistent but not growing — which might be enough. Open-source doesn't need to match frontier; it needs to be 'good enough' for most use cases. The real question is whether the economics of training will concentrate or distribute. DeepSeek suggests the latter.
🟡 Is RAG the right architecture for AI memory?
Partially Resolved
Evidence
Long context windows (200K-1M standard by 2026) handle most document Q&A without retrieval pipelines. Agents managing their own memory through filesystem access are more flexible. But enterprise RAG pipelines remain a multi-billion dollar market — dynamic data, cost constraints, and latency requirements ensure RAG's continued relevance at scale.
Analysis
Long-context windows (1M+ tokens) have superseded RAG for ad-hoc document Q&A and short-term working memory. But RAG remains essential for enterprise-scale search, latency-sensitive applications, cost optimization, and querying dynamic databases where full-context ingestion is financially or computationally impractical. RAG isn't dead — it's been demoted from 'default architecture' to 'specialized tool.' The 2023 pattern of 'chunk everything, embed, retrieve, generate' as the ONLY way to give models knowledge is obsolete. RAG as enterprise search infrastructure is alive and growing.
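The 2023 "chunk, embed, retrieve, generate" pattern mentioned above can be sketched in a few lines. This is a toy illustration with invented example chunks: the bag-of-words counter stands in for a real embedding model, and in production the retrieved chunk would be prepended to the model prompt for the "generate" step.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # "Retrieve": rank stored chunks by similarity to the query embedding.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The invoice pipeline retries failed payments after 24 hours.",
    "Dark mode can be toggled in the settings panel.",
]
print(retrieve("when are failed payments retried", chunks))
```

The "demoted to specialized tool" claim is about when this pipeline is worth its complexity: with a 1M-token window you can often skip it and paste the documents in, but a corpus that is too large, too dynamic, or too expensive to re-ingest per query still needs the retrieval step.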
🟡 Is the AI lab funding bubble sustainable?
Partially Resolved
Evidence
Labs keep raising at 10x valuations year-over-year. OpenAI at $300B, Anthropic at $60B, xAI at $50B+. But Nvidia pulling back from a rumored $100B OpenAI investment is the first real crack. Hyperscalers (Microsoft, Google, Amazon) continue massive capex.
Analysis
The question isn't whether there's a bubble — there almost certainly is in some valuations. The question is whether the underlying technology justifies it long-term. Unlike crypto or social media, AI is generating measurable productivity gains now. The funding may be ahead of revenue, but it's not ahead of utility.
🟡 What's the real bottleneck for AI scaling?
Partially Resolved
Evidence
Per Elon Musk (Collision/Dwarkesh interview): energy is the binding constraint in the next year, chips in the 3-4 year timeframe. Hyperscalers appear to be continuing or increasing capital investments into chip manufacturing and data center power.
Analysis
This shifted from 'data' (2023) to 'compute' (2024) to 'energy' (2025-26). The progression tells a story — AI scaling has been so aggressive that it's consuming each resource bottleneck in sequence. Energy is a harder constraint than compute because you can't just throw money at power plants the way you can at chip fabs.
🟡 Will synthetic data / RLVR replace pre-training?
Partially Resolved
Evidence
Grok 4 is the landmark: first major model where more than half of training cost was NOT pre-training. RLVR (reinforcement learning from verifiable rewards) and synthetic data generation crossed a critical threshold. Pre-training is becoming the foundation layer, not the main event.
Analysis
This is a paradigm shift in how we think about training. If most of the intelligence comes from post-training (RLVR, synthetic data, reasoning fine-tuning), then the pre-training data moat matters less. It also means smaller labs with clever post-training recipes could compete with pre-training giants.
🟡 Are hallucinations a fundamental problem?
Partially Resolved
Evidence
Hallucination rates have dropped dramatically through better training, RLHF, chain-of-thought, and tool use. Not eliminated — and leading researchers (including Yann LeCun) argue they cannot be fully eliminated within current autoregressive architectures. The practical resolution is mitigation through verification layers, not architectural elimination.
Analysis
Hallucinations are an inherent feature of probabilistic generative architectures, not a solvable engineering bug. However, grounding, tool use, and RLHF have reduced hallucination rates enough that they're no longer a commercial blocker for human-in-the-loop workflows. The 2023 framing ('AI can't be trusted') was wrong — but so is declaring the problem 'solved.' Hallucinations remain a fundamental barrier to fully autonomous, high-stakes agentic systems. The computer reliability analogy holds: computers still have bugs, but they're reliable enough to run the world. LLMs still hallucinate, but they're reliable enough to be massively useful.
🟡 How far will context windows go?
Partially Resolved
Evidence
Went from 8K (early 2023) → 32K → 128K → 200K → 1M standard in under 3 years. Gemini showed 1M+ is technically feasible. 200K-1M is becoming the standard range for frontier models.
Analysis
The trend line suggests context windows will keep growing, but the question is whether infinite context or agent-managed memory wins. At some point, stuffing everything into context becomes less efficient than letting the agent decide what to retrieve. The answer might be 'both' — huge context as a capability, with agents managing what fills it.
🟡 Will AI replace human programmers?
Partially Resolved
Evidence
Coding was AI's killer app. Most new production code is increasingly AI-written or AI-assisted. Autonomous coding agents can handle multi-file changes, write tests, fix bugs. But human judgment on architecture, product decisions, and edge cases remains essential.
Analysis
The question was always wrong. It's not 'replace' — it's 'transform.' The role shifts from writing code to directing AI that writes code. The analogy isn't automation replacing factory workers; it's power tools replacing hand tools. You still need the carpenter, but one carpenter does what ten did before. The trajectory is clear, the timeline is the only debate.
🟡 Can agents run autonomously for extended periods?
Partially Resolved
Evidence
Agent runtimes have gone from seconds → minutes → hours. Still worse than humans on reliability. But 1000x cheaper, so the calculus is different — you can run them continuously and accept lower per-task success rates.
Analysis
The framing matters here. If you compare a single agent run to a single human session, agents lose on quality. But that's the wrong comparison. The right comparison is: what happens when you can run 1000 agents for the cost of one human? Reliability per run matters less when you can just keep running them. The question isn't 'are agents as good as humans?' — it's 'is the cost-adjusted output positive?' And it increasingly is.
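The cost-adjusted calculus above can be made concrete with back-of-envelope numbers. All figures here are hypothetical, chosen only to match the "1000x cheaper, less reliable per run" shape, and the independence assumption is optimistic (real agent failures are often correlated).

```python
# Hypothetical numbers for illustration only (not from the source).
human_cost = 50.0   # dollars per human task attempt
human_p = 0.95      # human per-attempt success probability
agent_cost = 0.05   # per agent run, ~1000x cheaper
agent_p = 0.30      # agent per-run success probability

runs = int(human_cost / agent_cost)         # the same budget buys 1000 runs
# Probability that at least one run succeeds, assuming independent retries.
p_any_success = 1 - (1 - agent_p) ** runs

print(runs)                        # 1000
print(round(p_any_success, 4))     # effectively 1.0 under these assumptions
print(human_p)                     # 0.95 for the single human attempt
```

The point is not the specific numbers but the shape: once retries are nearly free, per-run reliability stops being the binding constraint for tasks where success is cheap to verify.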
🟡 Why is reliable agency harder than reasoning?
Partially Resolved
Evidence
Agency has improved (seconds to minutes to hours of autonomous operation) but remains unreliable for complex multi-step tasks. The causes appear to be error compounding and planning horizon limits rather than fundamental architectural constraints. Unitree robots and coding agents suggest agency improves with better environments and RL, not architectural breakthroughs.
Analysis
The 'Agency Gap': reliable, stateful digital agency (planning, error-recovery, multi-step execution) turns out to be orders of magnitude harder than single-turn intelligence. This echoes a pattern from physical robotics, but the causes in software are different: error compounding over long horizons, lack of ground-truth verification for intermediate steps, context degradation in long sequences, and the combinatorial explosion of real-world state spaces. The bottleneck may be building enough realistic RL environments with smooth reward landscapes, not the architecture itself.
🟡 When will RL become the dominant training workload over pre-training?
Partially Resolved
Evidence
Grok 4 crossed the threshold: >50% of training cost was NOT pre-training. Dario Amodei has noted that labs are spending only on the order of millions on RL despite hundreds of millions on base models. The bottleneck appears to be RL environment construction — building complex, realistic, hard-to-reward-hack challenges.
Analysis
If RL is this powerful and this cheap, the question is why labs aren't spending 100x more on it. The answer seems to be infrastructure: we need more environments with smooth reward landscapes. As horizon lengths increase, each RL sample requires hours of agentic compute before you can evaluate it. This could slow progress or shift advantage to smaller labs doing clever RL on cheaper base models.
Resolved
Questions that seemed open months ago and are now largely settled.
🟢 Will frontier AI costs keep falling?
Resolved — Yes
Evidence
Costs have dropped roughly 10x every 12-18 months since GPT-4's launch. What cost $60/M tokens in 2023 costs under $1/M in 2026. This trend shows no signs of stopping — each new generation brings better performance at lower cost.
Analysis
This is perhaps the most consequential resolved question. Exponential cost deflation means every AI application that's marginally viable today becomes obviously viable in 18 months. It's the engine driving adoption — not capability improvements, but cost improvements making existing capabilities accessible.
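The "10x every 12-18 months" claim can be sanity-checked against the $60 → under $1 figure with simple arithmetic, taking 15 months as the midpoint of the quoted range:

```python
def projected_price(start: float, months: float, months_per_10x: float = 15.0) -> float:
    """Price per million tokens after `months`, assuming a steady
    10x decline every `months_per_10x` months."""
    return start / 10 ** (months / months_per_10x)

# $60/M tokens in 2023, projected 36 months out (2026):
print(round(projected_price(60.0, 36), 2))  # ~0.24, consistent with "under $1/M"
```

Three years at the midpoint rate is a ~250x decline, so the quoted endpoints are internally consistent; even the slow end of the range (10x per 18 months) lands at about $0.60/M.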
🟢 Will AI coding become a leisure activity?
Emerging Yes
Evidence
People are replacing video game time and bar time with 'vibe coding' — building things with AI for fun. The barrier to creation has dropped so low that programming is becoming entertainment. Already happening for early adopters; spreading rapidly.
Analysis
This one resolved faster than anyone expected. When building a working app takes 30 minutes instead of 30 days, the act of creation becomes accessible as leisure. It's the same shift that happened with music production (GarageBand), video (TikTok), and writing (blogs). AI did it for software. The cultural implications are enormous — an entire generation learning to build things for fun.
This page is open source
Think a question is missing? Disagree with a resolution status? Have new evidence? All debates happen on GitHub. What gets merged becomes the record.