Thinking Machines Just Built AI That Listens While It Talks

Mira Murati's startup unveiled a full-duplex interaction model that processes audio, video, and text simultaneously in 200-millisecond chunks. Current AI feels like a walkie-talkie. This feels like a phone call.

May 20, 2026
Key takeaways
  • Every AI assistant you use today works like a walkie-talkie. You speak. It listens. It processes. Then it replies. You wait during every step. This creates a delay of roughly 1 to 2 seconds that makes natural conversation impossible.
  • On May 12, Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, unveiled something fundamentally different: an interaction model that can listen, see, and talk at the same time. The technical term is full duplex, and it could change how we interact with AI entirely.

What Full Duplex Actually Means

Current AI voice systems are half duplex. Like a walkie-talkie, only one side communicates at a time. You finish speaking, the model detects that you have stopped, it processes your input, and then it responds. Even the best systems, such as GPT Realtime and Gemini Flash Live, have a noticeable gap.

Thinking Machines' model, called TML Interaction Small, is full duplex. It processes incoming audio, video, and text in continuous 200-millisecond chunks called micro-turns. The AI ingests input and generates output simultaneously, so it can nod, interject, say "I see," or remain silent at precisely the right moment.
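To make the micro-turn idea concrete, here is a minimal Python sketch. It assumes nothing about the real model: the energy threshold and the `toy_policy` function are hypothetical stand-ins, and only the 200-millisecond chunk length comes from the announcement. The point is that a decision is made on every chunk, rather than once after the user stops talking.

```python
from typing import Iterator, List, Sequence

CHUNK_MS = 200  # micro-turn length quoted in the article


def micro_turns(samples: Sequence[float], sample_rate: int,
                chunk_ms: int = CHUNK_MS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks of an audio buffer."""
    size = sample_rate * chunk_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def toy_policy(chunk: Sequence[float], threshold: float = 0.1) -> str:
    """Hypothetical stand-in for the model: listen while the user is loud."""
    energy = sum(x * x for x in chunk) / max(len(chunk), 1)
    return "listen" if energy > threshold else "speak"


def run_duplex(samples: Sequence[float], sample_rate: int) -> List[str]:
    """Make one decision per micro-turn instead of waiting for end of speech."""
    return [toy_policy(chunk) for chunk in micro_turns(samples, sample_rate)]


# 200 ms of loud input followed by 200 ms of silence, at a toy 1 kHz rate.
decisions = run_duplex([1.0] * 200 + [0.0] * 200, sample_rate=1000)
print(decisions)
```

A half-duplex system would instead buffer the whole utterance and act once at the end, which is where the extra delay comes from.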

In demos, the model was asked to count pushups while watching through a camera. It counted accurately in real time, responding to visual cues without waiting for audio prompts. It was also asked to spot bugs in code while a developer was writing it, and it flagged issues mid-typing. These capabilities are out of reach for current models, which act only once they detect that you have stopped speaking.

The response latency is 0.40 seconds, roughly the speed of natural human conversation. For comparison, GPT Realtime 2.0 clocks in at 1.18 seconds and Gemini 3.1 Flash Live at 0.57 seconds.

AI Voice Response Latency
Model | Latency | Mode
TML Interaction Small | 0.40 seconds | Full duplex
Gemini 3.1 Flash Live | 0.57 seconds | Half duplex
GPT Realtime 2.0 | 1.18 seconds | Half duplex
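The gaps between these figures are easier to judge as differences and ratios. The short Python sketch below does that arithmetic using only the three latencies quoted above; the `compare` helper is an illustrative name, not part of any vendor tooling.

```python
# Response latencies quoted in the article, in milliseconds.
LATENCY_MS = {
    "TML Interaction Small": 400,
    "Gemini 3.1 Flash Live": 570,
    "GPT Realtime 2.0": 1180,
}


def compare(baseline: str, latencies: dict) -> dict:
    """Return (extra delay in ms, slowdown factor) for each model vs. the baseline."""
    base = latencies[baseline]
    return {
        name: (ms - base, ms / base)
        for name, ms in latencies.items()
        if name != baseline
    }


gaps = compare("TML Interaction Small", LATENCY_MS)
for name, (extra_ms, factor) in gaps.items():
    print(f"{name}: +{extra_ms} ms, about {factor:.1f}x the baseline latency")
```

On these numbers, Gemini 3.1 Flash Live adds 170 milliseconds and GPT Realtime 2.0 adds 780 milliseconds, which is close to three times the TML figure.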

The Architecture Behind It

The system uses a dual-model architecture. A 276-billion-parameter Mixture-of-Experts model with 12 billion active parameters handles the real-time conversation. This is the interaction model, which stays in constant exchange with the user, processing dialog, maintaining presence, and generating immediate responses within the 200-millisecond window.
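The gap between total and active parameters is what makes a Mixture-of-Experts model cheap to run per token: only a few experts fire for each input. The Python sketch below works out the active share from the two published figures and shows a generic top-k router. The expert count, the value of k, and the gate scores are hypothetical, since the announcement does not state them, and this is a textbook routing scheme rather than Thinking Machines' actual design.

```python
import math
from typing import List, Tuple

TOTAL_PARAMS_B = 276   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 12   # active parameters per token, in billions (from the article)


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def top_k_route(gate_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active share per token: {fraction:.1%}")

# Hypothetical gate scores for four experts, routing to the best two.
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
print(routes)
```

Roughly 4 percent of the weights do the work for each token, which helps explain how a model this large can plausibly respond inside a 200-millisecond window.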

A separate background model handles deeper tasks like reasoning, web searches, and tool use. This runs asynchronously while the interaction model keeps the conversation flowing. So while the AI is explaining something to you, the background model can quietly look up information, run calculations, or call external APIs without creating dead air.
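The split between a fast foreground loop and a slow background job can be sketched with Python's standard asyncio library. Everything here is illustrative: `background_lookup` and `converse` are hypothetical names, the sleep stands in for a real search or tool call, and the timings are shortened so the demo finishes quickly.

```python
import asyncio
from typing import List


async def background_lookup(query: str, delay_s: float) -> str:
    """Hypothetical slow task (search, tool call); the sleep simulates the wait."""
    await asyncio.sleep(delay_s)
    return f"result for {query!r}"


async def converse(query: str, tick_s: float = 0.2,
                   lookup_delay_s: float = 0.5) -> List[str]:
    """Keep emitting speech every tick while the background task runs."""
    transcript: List[str] = []
    task = asyncio.create_task(background_lookup(query, lookup_delay_s))
    while not task.done():
        transcript.append("filler")   # interaction model keeps talking
        await asyncio.sleep(tick_s)   # one micro-turn passes
    transcript.append(task.result())  # fold the finished result into the reply
    return transcript


# Shortened timings so the demo runs in a fraction of a second.
transcript = asyncio.run(converse("pricing", tick_s=0.01, lookup_delay_s=0.05))
print(transcript)
```

The foreground loop never blocks on the lookup, so there is no dead air. It only checks on each tick whether the result has arrived.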

The technical approach is called encoder-free early fusion. Instead of using separate pre-trained encoders for audio and video (like Whisper for speech), the system takes in raw audio and image patches through lightweight embedding layers and trains everything jointly from scratch. This lets the model learn how to coordinate audio, video, and text within a single training process.
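A rough picture of early fusion, in plain Python: each modality gets a single lightweight linear projection into a shared embedding width, and the results are joined into one sequence for a single model to consume. All dimensions, the random weights, and the helper names are assumptions made for illustration. The real system's layer shapes and training setup are not described at this level of detail.

```python
import random
from typing import List

Vector = List[float]

D_MODEL = 8      # shared embedding width (hypothetical)
AUDIO_FRAME = 4  # raw audio samples per frame (hypothetical)
PATCH = 6        # raw pixel values per image patch (hypothetical)
VOCAB = 10       # text vocabulary size (hypothetical)


def make_linear(in_dim: int, out_dim: int, seed: int) -> List[Vector]:
    """Random weight matrix with out_dim rows of in_dim values each."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def embed(weights: List[Vector], x: Vector) -> Vector:
    """Apply one linear projection: the 'lightweight embedding layer'."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def split(values: Vector, size: int) -> List[Vector]:
    """Cut raw input into whole fixed-size frames or patches."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


audio_proj = make_linear(AUDIO_FRAME, D_MODEL, seed=0)
patch_proj = make_linear(PATCH, D_MODEL, seed=1)
token_table = make_linear(D_MODEL, VOCAB, seed=2)  # one row per token id


def fuse(audio: Vector, pixels: Vector, token_ids: List[int]) -> List[Vector]:
    """Build one interleavable sequence from raw audio, raw patches, and text."""
    sequence = [embed(audio_proj, f) for f in split(audio, AUDIO_FRAME)]
    sequence += [embed(patch_proj, p) for p in split(pixels, PATCH)]
    sequence += [token_table[t] for t in token_ids]
    return sequence


fused = fuse([0.0] * 8, [0.5] * 12, [1, 2, 3])
print(len(fused), "vectors of width", len(fused[0]))
```

Because no frozen encoder such as Whisper sits in front, the projections here would be trained together with the rest of the model, which is the "jointly from scratch" part of the description.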

Source: Verified from Thinking Machines Lab blog, TechCrunch, VentureBeat, and SiliconANGLE, May 2026.

Who Is Behind This

Thinking Machines Lab was founded by Mira Murati, who served as CTO of OpenAI until her departure in 2024. The company raised a $2 billion seed round that valued it at $12 billion, making it one of the best-funded AI startups in history.

The team includes co-founder Lilian Weng, former VP of Applied Research at OpenAI, and CTO Soumith Chintala, the creator of PyTorch, who previously spent years at Meta. The company has grown to approximately 130 employees.

Thinking Machines also announced an NVIDIA partnership to deploy at least one gigawatt of next-generation Vera Rubin systems, and expanded its Google Cloud relationship to use AI Hypercomputer infrastructure with NVIDIA GB300 systems.

The pedigree explains the ambition. This team has built some of the most capable AI systems in the world. Their bet is that making AI genuinely interactive, not just intelligent, is the next frontier.

When You Can Use It

You cannot use it yet. TML Interaction Small is currently in a limited research preview available only to select partners. A broader public release is planned for later in 2026, but no specific date has been announced.

The practical implications are significant for anyone building voice AI products: customer service bots that can acknowledge frustration while looking up account details, fitness apps where the AI counts reps by watching your camera, real-time translation that feels like a conversation rather than a series of recordings, and code review assistants that spot bugs as you type.

For consumers, the most likely way you will experience full-duplex AI is through products built on top of Thinking Machines' models, not through the models directly. Enterprise voice automation companies like Avoca AI and RingCentral AIR Pro are already exploring similar capabilities.

Meanwhile, OpenAI and Google are almost certainly working on their own full-duplex solutions. Amazon has been developing Sonic models for the same purpose. The race is on, and Thinking Machines has set the benchmark.

Source: Availability details from Thinking Machines Lab blog and TechCrunch, May 2026.

AI Tools Mentor

AiToolsMentor.com · Verified AI tool pricing