Open source — debates & discussions happen on GitHub.

Open Questions

The big debates about AI — tracked from when they were first seriously discussed to how they're resolving. Inspired by Dwarkesh Patel's format.

📂 Data lives in questions.json — disagree? Open an issue.

6 Open
12 Partial
2 Resolved

Still Open

Genuine uncertainty remains. These are the hardest questions.

🔴 Why haven't LLMs made genuine scientific discoveries despite knowing everything?

Still Open
debated 2024 · updated 2026-03

Evidence

LLMs have essentially every known fact memorized but haven't made novel connections that count as discoveries. Humans with far less knowledge routinely find cross-domain insights. Possible explanations: combinatorial explosion (need both concepts in attention simultaneously), low learning efficiency, or creativity requires something pre-training doesn't provide.

Analysis

This might be the deepest open question in AI. If knowledge is cheap to store (all of Wikipedia's text fits on a thumb drive), why can't models that have it all memorized make the leaps that moderately intelligent humans make? The optimistic read: once models get true agency + learning efficiency, their knowledge advantage will make them "fucking dominate" (Dwarkesh's framing). The pessimistic read: there's something about generalization and creativity that current architectures fundamentally lack.

🔴 Why are all AI labs converging on the same architecture?

Still Open
debated 2025 · updated 2026-03

Evidence

Every major lab is making "thinking" models with chain-of-thought reasoning. Is this because it's the only thing that works? Or are labs copying each other while equally promising alternative directions go unexplored?

Analysis

The convergence is suspicious. Either the transformer + CoT + RLHF recipe is genuinely the only viable path (which would be remarkable and somewhat concerning), or there's a massive blind spot where alternative approaches could yield breakthroughs. The success of DeepSeek R1 with a different training recipe hints that diversity might be underexplored.

🔴 Can AI achieve superhuman science without human-level learning efficiency?

Still Open
debated 2024 · updated 2026-03

Evidence

Models need orders of magnitude more data than humans to learn equivalent skills, even skills they perform at the 99th percentile. Einstein generalized from a few thought experiments and murky observations — that's extreme learning efficiency. Creativity and learning efficiency might be the same thing.

Analysis

If creativity IS learning efficiency, then the path to AI scientists isn't more data or bigger models — it's fundamentally better learning algorithms. This reframes the entire scaling debate: maybe we're scaling the wrong thing. The missing middle between pre-training (skim everything) and in-context learning (short-term memory) might be the key unlock.

🔴 Why does test-time compute keep helping even at absurd scale?

Still Open
debated 2025 · updated 2026-03

Evidence

o3-high wrote 43 million words per task on ARC-AGI and scores kept improving. What did it figure out with word 42 million? RL may just upweight ~10 tokens of MCTS-like scaffolding ("wait", "let's backtrack"), which explains why reasoning distills easily.

Analysis

If reasoning is really just 10 tokens of scaffolding that can be distilled trivially, the current test-time compute scaling might be a temporary inefficiency — future models will do by instinct what current ones need millions of words for. But if it's NOT just scaffolding, and longer thinking genuinely produces better results at arbitrary scale, that has profound implications for what intelligence even is.

🔴 Will AI cause explosive economic growth?

Still Open
debated 2024 · updated 2026-03

Evidence

If AI can automate cognitive labor at 1000x lower cost, historical economic models predict dramatic GDP acceleration. But physical world constraints (energy, manufacturing, regulation, human adoption speed) create friction. Software-only tasks are accelerating; physical world integration is slower.

Analysis

The "software-only singularity" scenario is increasingly plausible — dramatic acceleration in digital domains while physical world constraints throttle overall growth. The question is whether the digital acceleration is big enough to move GDP numbers at national/global scale, or whether it stays concentrated in tech sector productivity.

🔴 How much of a parallelization penalty do multi-agent systems pay?

Still Open
debated 2025 · updated 2026-03

Evidence

Breaking problems across multiple agents loses whole-context visibility. But parallel agents can explore more solution space. Current multi-agent systems show coordination overhead but also emergent capabilities. The optimal split between single deep-thinker and many parallel workers is unknown.

Analysis

This matters enormously for the economics of AI. If parallelization penalty is low, you can throw 1000 cheap agents at a problem instead of one expensive one. If it's high, single powerful models win. Early evidence suggests task-dependent: decomposable tasks parallelize well, tightly coupled reasoning doesn't.
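That tradeoff can be sketched as a toy probability model. All numbers below are illustrative placeholders (not from any benchmark): each agent's per-attempt success rate gets discounted by a coordination penalty, modeling the whole-context visibility lost when a problem is split up.

```python
def p_any_success(p_single: float, n_agents: int, penalty: float = 0.0) -> float:
    """Probability that at least one of n independent agents succeeds.

    `penalty` (0..1) discounts each agent's success rate, standing in for
    the context lost when a problem is decomposed across agents.
    """
    p_eff = p_single * (1.0 - penalty)
    return 1.0 - (1.0 - p_eff) ** n_agents

# Ten weak agents (20% each) under a mild 25% penalty beat one strong
# solo agent (60%); under a harsh 75% penalty they fall behind it.
swarm_mild = p_any_success(0.20, n_agents=10, penalty=0.25)   # ~0.80
swarm_harsh = p_any_success(0.20, n_agents=10, penalty=0.75)  # ~0.40
solo = p_any_success(0.60, n_agents=1)                        # 0.60
```

The model assumes agent attempts are independent — which decomposable tasks roughly satisfy and tightly coupled reasoning doesn't, matching the early evidence above.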

Partially Resolved

Evidence is mounting but the jury's still out.

🟡 Will AI models converge in capability?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Frontier models score within a few points of each other on standard evals (MMLU, HumanEval, SWE-bench). But practitioners consistently prefer different models for different tasks — Claude for complex coding, Gemini for multimodal and research, GPT for general instruction following. The rapid leapfrogging cycle (monthly lead changes) confirms convergence at the benchmark level while user preferences suggest meaningful differentiation in practice.

Analysis

Benchmark convergence among frontier models is nearly absolute — GPT-5, Claude 4, Gemini 3 Pro, and Grok 4 score within a few points of each other on standard evals. But experiential convergence remains partial. Divergent post-training paradigms (OpenAI's RL reasoning focus, Anthropic's Constitutional AI, Google's multimodal integration) create distinct behavioral profiles. Users consistently report different 'vibes' and specialized strengths despite identical paper scores. This suggests benchmarks are measuring the wrong things — or that the gap between 'can do X on a benchmark' and 'reliably does X in production' is where the real differentiation lives.

🟡 Will open-source AI keep up with frontier labs?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

DeepSeek R1 proved open-source can compete on reasoning at dramatically lower cost. Llama 3 closed the gap to ~1 generation behind frontier. But frontier models still lead on the hardest tasks. Kimi and other Chinese labs are still emerging wildcards.

Analysis

The gap is persistent but not growing — which might be enough. Open-source doesn't need to match frontier; it needs to be 'good enough' for most use cases. The real question is whether the economics of training will concentrate or distribute. DeepSeek suggests the latter.

🟡 Is RAG the right architecture for AI memory?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Long context windows (200K-1M standard by 2026) handle most document Q&A without retrieval pipelines. Agents managing their own memory through filesystem access are more flexible. But enterprise RAG pipelines remain a multi-billion dollar market — dynamic data, cost constraints, and latency requirements ensure RAG's continued relevance at scale.

Analysis

Long-context windows (1M+ tokens) have superseded RAG for ad-hoc document Q&A and short-term working memory. But RAG remains essential for enterprise-scale search, latency-sensitive applications, cost optimization, and querying dynamic databases where full-context ingestion is financially or computationally impractical. RAG isn't dead — it's been demoted from 'default architecture' to 'specialized tool.' The 2023 pattern of 'chunk everything, embed, retrieve, generate' as the ONLY way to give models knowledge is obsolete. RAG as enterprise search infrastructure is alive and growing.
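For readers who never built one, the 2023 pattern named above (chunk, embed, retrieve, generate) fits in a few lines. This is a deliberately toy sketch — production systems use learned embedding models and vector databases, not the bag-of-words similarity below:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Real pipelines use learned vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank pre-chunked documents by similarity to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Invoices are processed nightly by the billing service.",
    "The auth service issues JWTs with a 15 minute expiry.",
]
context = retrieve("how long do auth tokens last", chunks, k=1)
# The retrieved chunk(s) then go into the prompt for generation.
```

The demotion described above is exactly this: when the whole corpus fits in a 1M-token window, the `retrieve` step is unnecessary; when it's a live, multi-terabyte database, it isn't.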

🟡 Is the AI lab funding bubble sustainable?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Labs keep raising at valuations roughly 10x higher year-over-year: OpenAI at $300B, Anthropic at $60B, xAI at $50B+. But Nvidia pulling back from a rumored $100B OpenAI investment is the first real crack. Hyperscalers (Microsoft, Google, Amazon) continue massive capex.

Analysis

The question isn't whether there's a bubble — there almost certainly is in some valuations. The question is whether the underlying technology justifies it long-term. Unlike crypto or social media, AI is generating measurable productivity gains now. The funding may be ahead of revenue, but it's not ahead of utility.

🟡 What's the real bottleneck for AI scaling?

Partially Resolved
debated 2024 · updated 2026-03

Evidence

Per Elon Musk (Collision/Dwarkesh interview): energy is the binding constraint in the next year, chips in the 3-4 year timeframe. Hyperscalers appear to be continuing or increasing capital investments into chip manufacturing and data center power.

Analysis

This shifted from 'data' (2023) to 'compute' (2024) to 'energy' (2025-26). The progression tells a story — AI scaling has been so aggressive that it's consuming each resource bottleneck in sequence. Energy is a harder constraint than compute because you can't just throw money at power plants the way you can at chip fabs.

🟡 Will synthetic data / RLVR replace pre-training?

Partially Resolved
debated 2024 · updated 2026-01

Evidence

Grok 4 is the landmark: first major model where more than half of training cost was NOT pre-training. RLVR (reinforcement learning from verifiable rewards) and synthetic data generation crossed a critical threshold. Pre-training is becoming the foundation layer, not the main event.

Analysis

This is a paradigm shift in how we think about training. If most of the intelligence comes from post-training (RLVR, synthetic data, reasoning fine-tuning), then the pre-training data moat matters less. It also means smaller labs with clever post-training recipes could compete with pre-training giants.

🟡 Are hallucinations a fundamental problem?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Hallucination rates have dropped dramatically through better training, RLHF, chain-of-thought, and tool use. Not eliminated — and leading researchers (including Yann LeCun) argue they cannot be fully eliminated within current autoregressive architectures. The practical resolution is mitigation through verification layers, not architectural elimination.

Analysis

Hallucinations are an inherent feature of probabilistic generative architectures, not a solvable engineering bug. However, grounding, tool use, and RLHF have reduced hallucination rates enough that they're no longer a commercial blocker for human-in-the-loop workflows. The 2023 framing ('AI can't be trusted') was wrong — but so is declaring the problem 'solved.' Hallucinations remain a fundamental barrier to fully autonomous, high-stakes agentic systems. The computer reliability analogy holds: computers still have bugs, but they're reliable enough to run the world. LLMs still hallucinate, but they're reliable enough to be massively useful.

🟡 How far will context windows go?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Went from 8K (early 2023) → 32K → 128K → 200K → 1M standard in under 3 years. Gemini showed 1M+ is technically feasible. 200K-1M is becoming the standard range for frontier models.

Analysis

The trend line suggests context windows will keep growing, but the question is whether infinite context or agent-managed memory wins. At some point, stuffing everything into context becomes less efficient than letting the agent decide what to retrieve. The answer might be 'both' — huge context as a capability, with agents managing what fills it.

🟡 Will AI replace human programmers?

Partially Resolved
debated 2023 · updated 2026-03

Evidence

Coding was AI's killer app. Most new production code is increasingly AI-written or AI-assisted. Autonomous coding agents can handle multi-file changes, write tests, fix bugs. But human judgment on architecture, product decisions, and edge cases remains essential.

Analysis

The question was always wrong. It's not 'replace' — it's 'transform.' The role shifts from writing code to directing AI that writes code. The analogy isn't automation replacing factory workers; it's power tools replacing hand tools. You still need the carpenter, but one carpenter does what ten did before. The trajectory is clear; the timeline is the only debate.

🟡 Can agents run autonomously for extended periods?

Partially Resolved
debated 2024 · updated 2026-03

Evidence

Agent runtimes have gone from seconds → minutes → hours. Still worse than humans on reliability. But 1000x cheaper, so the calculus is different — you can run them continuously and accept lower per-task success rates.

Analysis

The framing matters here. If you compare a single agent run to a single human session, agents lose on quality. But that's the wrong comparison. The right comparison is: what happens when you can run 1000 agents for the cost of one human? Reliability per run matters less when you can just keep running them. The question isn't 'are agents as good as humans?' — it's 'is the cost-adjusted output positive?' And it increasingly is.
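That cost-adjusted calculus is simple enough to write down. All numbers below are made-up placeholders, not measured rates — the point is the shape of the comparison, not the values:

```python
def value_per_dollar(success_rate: float, value_per_success: float,
                     cost_per_run: float) -> float:
    """Expected value produced per dollar spent on attempts."""
    return success_rate * value_per_success / cost_per_run

# Hypothetical task worth $100 when done right:
human = value_per_dollar(success_rate=0.90, value_per_success=100.0,
                         cost_per_run=50.0)   # 1.8  — reliable but expensive
agent = value_per_dollar(success_rate=0.30, value_per_success=100.0,
                         cost_per_run=0.05)   # 600  — flaky but ~1000x cheaper
```

On these invented numbers the agent loses badly per run yet wins by orders of magnitude per dollar — which is the whole argument.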

🟡 Why is reliable agency harder than reasoning?

Partially Resolved
debated 2024 · updated 2026-03

Evidence

Agency has improved (seconds to minutes to hours of autonomous operation) but remains unreliable for complex multi-step tasks. The causes appear to be error compounding and planning horizon limits rather than fundamental architectural constraints. Unitree robots and coding agents suggest agency improves with better environments and RL, not architectural breakthroughs.

Analysis

The 'Agency Gap': reliable, stateful digital agency (planning, error-recovery, multi-step execution) turns out to be orders of magnitude harder than single-turn intelligence. This echoes a pattern from physical robotics, but the causes in software are different: error compounding over long horizons, lack of ground-truth verification for intermediate steps, context degradation in long sequences, and the combinatorial explosion of real-world state spaces. The bottleneck may be building enough realistic RL environments with smooth reward landscapes, not the architecture itself.

🟡 When will RL become the dominant training workload over pre-training?

Partially Resolved
debated 2024 · updated 2026-03

Evidence

Grok 4 crossed the threshold: >50% training cost was NOT pre-training. Dario Amodei noted labs are only spending ~M on RL despite hundreds of millions on base models. The bottleneck appears to be RL environment construction — building complex, realistic, hard-to-reward-hack challenges.

Analysis

If RL is this powerful and this cheap, the question is why labs aren't spending 100x more on it. The answer seems to be infrastructure: we need more environments with smooth reward landscapes. As horizon lengths increase, each RL sample requires hours of agentic compute before you can evaluate it. This could slow progress or shift advantage to smaller labs doing clever RL on cheaper base models.

Resolved

Questions that seemed open months ago and are now largely settled.

🟢 Will frontier AI costs keep falling?

Resolved — Yes
debated 2023 · updated 2026-03

Evidence

Costs have dropped roughly 10x every 12-18 months since GPT-4's launch. What cost $60/M tokens in 2023 costs under $1/M in 2026. This trend shows no signs of stopping — each new generation brings better performance at lower cost.

Analysis

This is perhaps the most consequential resolved question. Exponential cost deflation means every AI application that's marginally viable today becomes obviously viable in 18 months. It's the engine driving adoption — not capability improvements, but cost improvements making existing capabilities accessible.
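The deflation math is worth making explicit. A quick sketch, assuming the slower end of the trend (10x every 18 months) holds steady — the 12-month case is even steeper:

```python
def projected_cost(start_cost: float, months: float,
                   drop_factor: float = 10.0,
                   drop_period_months: float = 18.0) -> float:
    """Cost per million tokens after `months`, assuming a steady
    `drop_factor`x price drop every `drop_period_months` months."""
    return start_cost / drop_factor ** (months / drop_period_months)

# GPT-4-era $60/M tokens, three years on:
projected_cost(60.0, months=36)  # -> 0.6, i.e. $0.60/M — "under $1/M"
```

The same function says a $10/M workload today costs about $1/M in 18 months, which is the "marginally viable becomes obviously viable" engine described above.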

🟢 Will AI coding become a leisure activity?

Emerging Yes
debated 2025 · updated 2026-03

Evidence

People are replacing video game time and bar time with 'vibe coding' — building things with AI for fun. The barrier to creation has dropped so low that programming is becoming entertainment. Already happening for early adopters; spreading rapidly.

Analysis

This one resolved faster than anyone expected. When building a working app takes 30 minutes instead of 30 days, the act of creation becomes accessible as leisure. It's the same shift that happened with music production (GarageBand), video (TikTok), and writing (blogs). AI did it for software. The cultural implications are enormous — an entire generation learning to build things for fun.

🔓

This page is open source

Think a question is missing? Disagree with a resolution status? Have new evidence? All debates happen on GitHub. What gets merged becomes the record.