UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang1,2,3,9,11* Zimo He4,2,9,11* Wanhe Yu1,3,5,9,11 Lexi Pang1,3,5,9,11 Yunhao Li6,3,9,11 Hongjie Li7,3,9,11
Jieming Cui1,2,3,9,11 Yuhan Li8,2 Yizhou Wang4,9,10 Yixin Zhu3,1,9,11,12 Siyuan Huang2,9
1 Institute for AI, Peking University 2 Beijing Institute for General Artificial Intelligence (BIGAI) 3 School of Psychological and Cognitive Sciences, Peking University
4 School of Computer Science, Peking University 5 Yuanpei College, Peking University 6 School of Foreign Languages, Peking University 7 School of EECS, Peking University
8 Huazhong University of Science and Technology 9 State Key Lab of General AI 10 Nat'l Eng. Research Center of Visual Technology
11 Beijing Key Laboratory of Behavior and Mental Health, Peking University 12 Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence
* Equal contribution ✉ yixin.zhu@pku.edu.cn, syhuang@bigai.ai
Abstract

We propose UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enabling humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying multimodal inputs through a shared discrete codebook built with finite scalar quantization (FSQ), UniAct ensures cross-modal alignment while constraining generated motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UA-Net, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.

Architecture

Method Framework

UniAct accepts a diverse range of multimodal inputs, including text, music, trajectories, and reference motions. A finite scalar quantizer encodes each input into discrete tokens that live in a shared embedding space, enabling seamless fusion across modalities. A fine-tuned multimodal large language model then processes these tokens and generates motion tokens representing the intended robot movements. Finally, a causal decoder converts the motion tokens into continuous degree-of-freedom (DoF) values, which are streamed to a motion tracker to drive the robot in real time, as sketched below.
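
To make the data flow concrete, here is a minimal sketch of the two-stage pipeline described above. Everything in it is illustrative rather than the released implementation: the FSQ level configuration (`FSQ_LEVELS`), the joint count (`NUM_DOF`), and the `generate_motion_tokens` and `causal_decode` functions are hypothetical stand-ins for the fine-tuned MLLM and the trained causal decoder.

```python
# Illustrative sketch of a UniAct-style pipeline; names, dimensions, and
# codebook levels are assumptions, not the authors' implementation.
import numpy as np

FSQ_LEVELS = np.array([7, 5, 5, 5, 5])  # assumed per-dimension FSQ levels
NUM_DOF = 29                            # assumed humanoid degree-of-freedom count


def fsq_quantize(z):
    """Finite scalar quantization: bound each latent dimension and round it
    to one of a small, fixed number of levels."""
    half = (FSQ_LEVELS - 1) / 2.0
    bounded = np.tanh(z) * half         # squash each dimension into [-half, half]
    return np.round(bounded)            # one discrete code per dimension


def fsq_code_to_index(code):
    """Flatten a per-dimension code into a single token id in the shared
    codebook via mixed-radix encoding."""
    half = (FSQ_LEVELS - 1) / 2.0
    digits = (code + half).astype(int)  # shift codes to the range [0, L-1]
    index, base = 0, 1
    for d, L in zip(digits, FSQ_LEVELS):
        index += d * base
        base *= L
    return int(index)


def generate_motion_tokens(instruction_tokens, num_steps=16):
    """Placeholder for the fine-tuned MLLM: emit motion token ids one at a
    time, conditioned on the multimodal instruction tokens."""
    rng = np.random.default_rng(sum(instruction_tokens))
    codebook_size = int(np.prod(FSQ_LEVELS))
    for _ in range(num_steps):
        yield int(rng.integers(codebook_size))  # toy sampling, not a real model


def causal_decode(motion_token, history):
    """Placeholder causal decoder: map the current motion token, using only
    past context, to continuous DoF targets for the motion tracker."""
    rng = np.random.default_rng(motion_token)
    dof = 0.1 * rng.standard_normal(NUM_DOF)
    if history:                                  # causal smoothing over past frames only
        dof = 0.7 * history[-1] + 0.3 * dof
    return dof


if __name__ == "__main__":
    # Toy multimodal prompt: a few latents quantized into shared-codebook tokens.
    instruction = [
        fsq_code_to_index(fsq_quantize(np.random.randn(len(FSQ_LEVELS))))
        for _ in range(4)
    ]
    history = []
    for tok in generate_motion_tokens(instruction):
        dof_targets = causal_decode(tok, history)  # decode each token as it arrives
        history.append(dof_targets)
        # here the DoF targets would be streamed to the motion tracker in real time
    print(f"streamed {len(history)} frames of {NUM_DOF}-DoF targets")
```

Because the decoder is causal, each motion token can be converted to DoF targets as soon as it is generated rather than after the full sequence is complete, which is what makes streaming control within the sub-500 ms latency budget possible.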