Sorry, you're right

We all know this quote. I was watching this podcast yesterday.

At minute 2:14, he starts to explain an issue AI models are having, one we have all experienced using Agent: fixing one bug while introducing another, then going around in circles, fixing the new bug while re-introducing the old one. So it seems this is not per se a Replit Agent issue, but has to do with AI training in general. Let me know your thoughts.

You’re right :rofl: it is a fascinating discussion.

I just wish I had 2hrs to listen to all of it :blush: Does anyone have a GPT summary of the entire podcast?

Haven’t finished watching it yet. I’ll try to make a summary for you guys later.

I am sure I saw a post somewhere about an app you can upload videos to, and it creates a text summary of the dialogue using GPT. Sorry, can’t remember where, though.

Episode Overview – “The Strange Gap Between AI Eval Performance and Real‑World Impact”
(All times are from the start of the recording)

| Timestamp | Key Point / Theme | What’s Discussed |
|---|---|---|
| 00:00-00:14 | Opening – “Is this really happening?” | The hosts marvel that AI progress feels like science fiction turned reality, yet it feels strangely ordinary. |
| 00:14-00:45 | Slow-takeoff perception | Investing ~1% of GDP in AI seems under-the-radar; the shift feels “normal” despite its magnitude. |
| 00:45-01:15 | Abstractness of AI impact | News of big funding rounds is the only tangible sign for most people; the change isn’t yet felt on the ground. |
| 01:15-02:20 | Eval vs. economic impact paradox | Models ace hard benchmark tests, but the economic value lags far behind. The hosts wonder why. |
| 02:20-02:56 | Bug-fixing loop illustration | A concrete example of a model fixing a bug only to introduce another, showing brittleness despite high eval scores. |
| 02:56-04:06 | Two possible explanations | Whimsical: RL training makes models overly single-minded. Technical: RL environments are hand-crafted to chase eval metrics, leading to reward-hacking. |
| 04:06-05:16 | Eval-driven RL “reward hacking” | Companies build RL environments that directly optimize for good eval numbers, not for genuine capability. |
| 05:16-06:59 | Competitive-programming analogy | Two “students”: one practices 10k hours on coding contests, the other only 100 hours. The latter ends up more versatile, mirroring models that over-specialize. |
| 06:59-08:34 | Pre-training vs. RL | Pre-training uses massive, undifferentiated data (free “10k-hour practice”). RL adds focused, costly fine-tuning; the balance is still unclear. |
| 08:34-10:14 | Human-level analogies for pre-training | Comparing pre-training to early childhood learning or billions of years of evolution; both have strengths and gaps. |
| 10:14-12:56 | Emotions as a “value function” | Discussion of how damage to emotional centers impairs decision-making, hinting that emotions act like a built-in value function for humans. |
| 12:56-15:34 | ML value-function basics | How reinforcement learning propagates reward signals early (e.g., losing a chess piece) vs. waiting for a final outcome; why current RL is inefficient (see the sketch after the table). |
| 15:34-18:21 | Why human value functions are simple yet robust | Evolution gave us simple, hard-wired reward signals (emotions) that work across many domains. |
| 18:21-20:36 | Scaling beyond the “parameter-data-compute” law | The word “scaling” shaped research direction. Pre-training was a scalable recipe; now we’re hitting data limits and must rethink the recipe. |
| 20:36-22:38 | From scaling to research era | Pre-2020 = research era; 2020-2025 = scaling era; now we’re swinging back to research, but with massive compute resources. |
| 22:38-24:41 | What to “scale” next? | RL scaling consumes huge compute for modest learning gains; value-function improvements could make compute usage far more productive. |
| 24:41-27:07 | Fundamental problem: poor generalization | Models need far more data than humans and struggle to transfer learning; sample efficiency vs. continual learning are highlighted. |
| 27:07-30:18 | Human priors vs. learned priors | Evolution gives us strong priors for vision, locomotion, etc., but not for abstract domains like math or coding; yet humans still learn them fast. |
| 30:18-33:41 | RL scaling curves (sigmoid shape) | RL learning is slow, then rapid, then plateaus, contrasting with pre-training curves. Entropy-based analysis explains why. |
| 33:41-35:40 | Gemini 3 demo – from question to experiment | The host describes using Gemini 3 to formulate a hypothesis, generate code, run a toy experiment, and uncover a learning-rate insight. |
| 35:40-37:32 | Back to the research vibe | What the community can expect: more idea-driven work, not just “bigger compute”; compute still matters but isn’t the sole differentiator. |
| 37:32-41:52 | Compute allocation (research vs. inference) | Even billion-dollar labs spend most of their compute on inference/product; research budgets are a smaller, yet sufficient, portion. |
| 41:52-43:49 | Strategic outlook for SSI & superintelligence | Discussion about focusing on research, and the balance between “straight-shot” AGI aims and pragmatic timelines. |
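
For anyone curious about the “ML value-function basics” segment (12:56-15:34), here is a rough Python sketch of the textbook distinction the hosts are gesturing at: a value function lets the learner get a training signal at every step (like noticing a lost chess piece mid-game), whereas a “final outcome only” learner waits until the whole episode ends. The toy environment, names, and numbers below are my own illustration, not anything from the podcast.

```python
# Toy comparison: per-step TD(0) updates vs. updating only from the final episode return.
# Environment: a simple chain of states 0..5; stepping right costs nothing,
# and reaching the last state pays +1.

STATES = 6      # states 0..5; state 5 is terminal with reward +1
GAMMA = 1.0     # no discounting in this toy example
ALPHA = 0.1     # learning rate

def run_episode():
    """Walk right with reward 0 until the terminal state, which pays +1."""
    trajectory = []
    state = 0
    while state < STATES - 1:
        next_state = state + 1
        reward = 1.0 if next_state == STATES - 1 else 0.0
        trajectory.append((state, reward, next_state))
        state = next_state
    return trajectory

# --- TD(0): the value estimate is updated at every step, so credit
#     propagates backwards from the goal without waiting for the episode to end.
v_td = [0.0] * STATES
for _ in range(200):
    for state, reward, next_state in run_episode():
        target = reward + GAMMA * v_td[next_state]
        v_td[state] += ALPHA * (target - v_td[state])

# --- "Final outcome only": every visited state is nudged toward the total
#     episode return, so no learning signal exists until the episode finishes.
v_mc = [0.0] * STATES
for _ in range(200):
    episode = run_episode()
    total_return = sum(r for _, r, _ in episode)
    for state, _, _ in episode:
        v_mc[state] += ALPHA * (total_return - v_mc[state])

print("TD(0) values:        ", [round(v, 2) for v in v_td])
print("Final-outcome values:", [round(v, 2) for v in v_mc])
```

In this deterministic chain both approaches converge to the same values, but the mechanism differs: TD gets feedback at every transition, while the outcome-only learner must wait for the end of the trajectory, which is (as I understand the hosts’ point) part of why relying purely on final-outcome rewards makes current RL so sample-inefficient.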

Take‑aways

  1. Eval performance → real‑world impact gap – Current benchmarks don’t guarantee economic usefulness.
  2. Reward‑hacking via RL – Over‑optimizing for evals creates brittle behavior.
  3. Generalization deficiency – Models need far more data than humans and fail to transfer skills.
  4. Value functions & emotions – Simple, evolution‑shaped reward signals may be the key to human‑like learning.
  5. Scaling is ending; research is returning – With data limits reached, the community must design new training “recipes” rather than just throw more compute at the problem.

This timeline gives listeners a roadmap to the main arguments and the shifting narrative from “AI is already here but invisible” to “we need smarter, more general training methods to bridge the eval‑real‑world divide.”