[Figure: Multimodal tokenization pipeline. Four modalities are each paired with an FSQ Tokenizer: Language (e.g. "Wave to a friend", "Practice martial arts"), Music (audio waveforms, rhythm patterns), Trajectory (path waypoints, navigation goals), and Motion (human pose data, movement sequences). Their tokens feed a Shared Token Pool, a common code embedding space covering Language, Music, Trajectory, and Motion. A Multimodal LLM operates on this pool for unified understanding, mapping input to output.]
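
The figure names an FSQ Tokenizer for every modality but does not describe it. Below is a minimal sketch, assuming FSQ stands for finite scalar quantization: each latent dimension is bounded, rounded to a small number of levels, and the resulting digits are packed into one integer token. The level counts in `LEVELS`, the function names, and the latent width are all hypothetical and not taken from the source.

```python
import numpy as np

# Hypothetical level counts per latent dimension (odd, so levels are symmetric
# around zero). The source does not state the real configuration.
LEVELS = np.array([5, 5, 5])


def fsq_quantize(z, levels=LEVELS):
    """Bound each latent dimension with tanh, then round to the nearest level."""
    half = (levels - 1) / 2                # e.g. 2.0 for 5 levels
    bounded = np.tanh(z) * half            # values lie in (-half, half)
    return np.round(bounded).astype(int)   # integers in [-half, half]


def codes_to_token(codes, levels=LEVELS):
    """Pack per-dimension codes into a single token id (mixed-radix encoding)."""
    digits = codes + (levels - 1) // 2                      # shift to 0 .. L-1
    basis = np.concatenate([[1], np.cumprod(levels[:-1])])  # [1, 5, 25]
    return int(digits @ basis)


codebook_size = int(np.prod(LEVELS))       # 125 possible tokens
latent = np.array([0.0, 0.0, 0.0])         # stand-in for an encoder output
token = codes_to_token(fsq_quantize(latent))
```

Because the codebook is implicit in the level grid, no learned codebook lookup is needed. How tokens from different modalities are arranged inside the shared pool is not specified in the figure, so this sketch leaves that out.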

[Figure: Causal Decoder. Frame-wise DoF decoding uses a causal convolution (k=3) over a padded input to produce output frames for real-time streaming. The decoded joint DoF values (29 dims) go to a Motion Tracker for humanoid robot control, which yields the robot motion output.]
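
The causal convolution with k=3 and padding shown in the decoder figure can be illustrated with a small NumPy sketch. It assumes the padding is k-1 zero frames on the left, so each output frame depends only on the current and two previous input frames, which is what makes frame-by-frame streaming possible. The latent width `C_IN`, the random weights, and the names `causal_conv` and `StreamingCausalConv` are hypothetical. Only k=3 and the 29 output dimensions come from the figure.

```python
import numpy as np

K = 3        # kernel size, from the figure
N_DOF = 29   # joint DoF dimensions, from the figure
C_IN = 8     # hypothetical latent width (not given in the source)


def causal_conv(frames, weight):
    """Offline causal convolution.

    frames: (T, C_in); weight: (k, C_in, C_out). Returns (T, C_out).
    """
    k, c_in, _ = weight.shape
    pad = np.zeros((k - 1, c_in))                   # left padding only
    padded = np.concatenate([pad, frames], axis=0)  # (T + k - 1, C_in)
    # Output t sees padded[t : t + k], i.e. original frames t-k+1 .. t.
    return np.stack([
        np.einsum("kc,kco->o", padded[t:t + k], weight)
        for t in range(frames.shape[0])
    ])


class StreamingCausalConv:
    """Same computation, one frame at a time, using a (k-1)-frame buffer."""

    def __init__(self, weight):
        self.weight = weight
        k, c_in, _ = weight.shape
        self.buffer = np.zeros((k - 1, c_in))  # plays the role of the padding

    def step(self, frame):
        window = np.concatenate([self.buffer, frame[None, :]], axis=0)
        self.buffer = window[1:]               # keep the most recent k-1 frames
        return np.einsum("kc,kco->o", window, self.weight)


rng = np.random.default_rng(0)
weight = rng.normal(size=(K, C_IN, N_DOF))
frames = rng.normal(size=(5, C_IN))

offline = causal_conv(frames, weight)
stream = StreamingCausalConv(weight)
online = np.stack([stream.step(f) for f in frames])
```

The streaming variant emits each 29-dimensional frame as soon as its input arrives and matches the offline result, since no output ever depends on a future frame.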