Fei-Fei Li's Spatial Intelligence and World Models explained | AI with Kyle

Highlights

Not much happening in the world of AI. For once! So I took the opportunity to run a deep dive into “world models” and spatial intelligence.

🧠 What Are World Models? The Next Evolution Beyond Language

Discussed at [00:00:19]

Dr. Fei-Fei Li (Stanford, ImageNet creator) published a manifesto: AI needs "spatial intelligence" - the ability to understand and interact with 3D space, physics, and reality itself.

Kyle's take: Think of it this way: ChatGPT is brilliant at language but has never seen a sunset, caught a ball, or spilled coffee. It's a "wordsmith in the dark."

World models are AI systems that understand reality - gravity, object permanence, cause and effect. Not just describing physics but EXPERIENCING it in simulation. Babies learn this way - they touch, break, throw things BEFORE they speak. They build an internal model of how the world works. That's what we're trying to give AI.

Intelligence evolved from movement and interaction with the physical world, not language. Single cells sensing light and moving toward it - that's the seed of intelligence.

Perception leads to action, action changes environment, changed environment requires new perception. This loop, over millions of years, built brains. Language came last, built on top of spatial understanding.
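To make that loop concrete, here is a minimal sketch in Python. It is a toy illustration, not anything from Dr. Li's article: the one-dimensional world and the names `sense` and `run_loop` are assumptions made up for this example. A "cell" senses which side the light is on, moves one step, and then has to sense again from its new position.

```python
def sense(position, light_position):
    """Perception: return -1, 0, or +1 for the direction of the light."""
    if light_position > position:
        return 1
    if light_position < position:
        return -1
    return 0


def run_loop(start, light_position, max_steps=100):
    """Repeat perceive -> act -> perceive again until the light is reached."""
    position = start
    path = [position]
    for _ in range(max_steps):
        direction = sense(position, light_position)  # perceive
        if direction == 0:
            break  # already at the light, nothing left to do
        position += direction  # act; the new position changes the next perception
        path.append(position)
    return path


path = run_loop(start=0, light_position=3)
print(path)  # [0, 1, 2, 3]
```

Each pass through the loop depends on the result of the previous action, which is the whole point: perception and action feed each other.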

We’ve taught AI backwards - language first, reality never.

Dr. Li quotes Wittgenstein: "The limits of my language mean the limits of my world." But for AI, it's reversed - limits of world understanding mean limits of intelligence. Without spatial grounding, LLMs hit a ceiling. They can describe physics brilliantly but can't stack a child's building blocks. With world models, AI gets what drove evolution: embodied experience in space that remembers its past and predicts its future.

Source: Dr. Fei-Fei Li's X thread and Substack article

🔑 Three Things Every World Model Needs (The Technical Bit)

Discussed at [00:11:52]

Dr. Li outlines three essential capabilities: Generative (create worlds), Multimodal (all senses), Interactive (persistence).

Kyle's take: Let me break this down simply.

Generative: Not one world like a video game, but infinite possible worlds - each physically accurate. Imagine AI that can spawn endless training environments.
Multimodal: Process everything at once - vision, sound, touch, depth, movement. Humans don't process senses separately; neither should AI.
Interactive: The world remembers what happened and evolves. Drop a glass, it stays broken. This is the killer difference from LLMs. ChatGPT forgets everything between chats. Sora videos vanish after 10 seconds. World models persist - like having a Minecraft server that never resets. The computational requirements are staggering but this is how intelligence actually works - through continuous interaction with persistent reality.
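The persistence point can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real world model is built: `PersistentWorld` and `stateless_step` are hypothetical names invented for the example. The first keeps its state between interactions; the second rebuilds the world from scratch on every call.

```python
class PersistentWorld:
    """Toy world whose state survives between interactions."""

    def __init__(self):
        self.objects = {"glass": "intact"}
        self.history = []  # record of everything that has happened

    def step(self, action, target):
        if action == "drop" and self.objects.get(target) == "intact":
            self.objects[target] = "broken"
        self.history.append((action, target))
        return dict(self.objects)  # return a copy of the current state


def stateless_step(action, target):
    """Toy contrast: the world is rebuilt on every call, so nothing persists."""
    objects = {"glass": "intact"}
    if action == "drop" and objects.get(target) == "intact":
        objects[target] = "broken"
    return objects


world = PersistentWorld()
world.step("drop", "glass")
after_look = world.step("look", "glass")
print(after_look)                       # {'glass': 'broken'}
print(stateless_step("look", "glass"))  # {'glass': 'intact'}
```

In the persistent version the glass stays broken on the next look. In the stateless version it is intact again, because nothing carried over from the earlier drop.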

📈 The Three-Stage Path to Spatial Intelligence

Discussed at [00:04:54]

Stage 1 (Now): Creative tools. Stage 2 (Soon): Robot training. Stage 3 (Future): Scientific discovery. We're currently in Stage 1.

Kyle's take: Stage 1 is the toy stage, closer to video games and movie-making tools. Think Sora, just much more advanced: tools that let us create and stage worlds using AI.

Stage 2 revolutionises robotics - instead of training self-driving cars on real streets (dangerous, slow), train them millions of times in simulation.

Stage 3 is the endgame: simulate actual physics. Like…all of it. Run a million drug trials in the time one real trial takes. Test fusion reactors without building them. Fold proteins at light speed. This isn't improvement - it's transformation.

Dr. Li (smartly!) avoids saying "AGI" or "singularity", but this is clearly a path from narrow AI to general intelligence.

🎯 Why "Spatial Intelligence" Changes Everything

Discussed at [00:18:18]

Spatial intelligence = perception + action + memory in 3D space. It's what separates truly intelligent beings from language processors.

Kyle's take: Here's what clicked for me:

Source: Philosophical framework from Dr. Li's paper

Member Questions:

"Is this the path to AGI?" Asked at [00:30:42]

Kyle's response: Dr. Li never says "AGI" - smart, avoids that endless debate! But implicitly yes. Dr. Li is hinting that LLMs alone won't get there. They're missing the foundation all intelligence is built on - spatial understanding.

"Why does OpenAI waste money on Sora videos?" Asked during discussion

Kyle's response: They're not wasting it! Every frame teaches object permanence, physics, causality. While everyone mocks the cat videos, OpenAI's building foundations for reality simulation. It just looks silly for now as we are in the toy stage.

"How is this different from video games?" Implicit in discussion

Kyle's response: Games have one handcrafted world with programmed physics. World models generate infinite worlds, each physically accurate, learning physics from data not programming. It's the difference between Skyrim and infinite possible Skyrims that understand reality.
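"Learning physics from data, not programming" can be shown with a tiny Python sketch. It is a deliberately simple stand-in: real world models learn from video and sensor data with neural networks, not a one-parameter fit. The value `TRUE_G`, the sample times, and the function `estimate_gravity` are assumptions made up for this example. The estimator never sees the value of gravity. It only sees how far an object fell at each moment and recovers the constant by least squares from d = ½·g·t².

```python
TRUE_G = 9.81  # assumed value, used only to generate the toy observations

times = [0.5, 1.0, 1.5, 2.0]                        # seconds since release
distances = [0.5 * TRUE_G * t ** 2 for t in times]  # metres fallen


def estimate_gravity(times, distances):
    """Least-squares fit of d = (g / 2) * t**2, using only the observations."""
    xs = [t ** 2 for t in times]
    slope = sum(x * d for x, d in zip(xs, distances)) / sum(x * x for x in xs)
    return 2 * slope


g_est = estimate_gravity(times, distances)
print(round(g_est, 2))  # 9.81
```

A game engine would have the constant typed in by a developer. Here it comes out of the observations, which is the same idea at a vastly smaller scale.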

Want the full unfiltered discussion? Join me tomorrow for the daily AI news live stream where we dig into the stories and you can ask questions directly.

Streaming on YouTube (with screen share) and TikTok (follow and turn on live notifications).

Audio Podcast on iTunes and Spotify.