UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang1,2,3,9,11* Zimo He4,2,9,11* Wanhe Yu1,3,5,9,11 Lexi Pang1,3,5,9,11 Yunhao Li6,3,9,11 Hongjie Li7,3,9,11
Jieming Cui1,2,3,9,11 Yuhan Li8,2 Yizhou Wang4,9,10 Yixin Zhu3,1,9,11,12 Siyuan Huang2,9
1 Institute for AI, Peking University 2 Beijing Institute for General Artificial Intelligence (BIGAI) 3 School of Psychological and Cognitive Sciences, Peking University
4 School of Computer Science, Peking University 5 Yuanpei College, Peking University 6 School of Foreign Languages, Peking University 7 School of EECS, Peking University
8 Huazhong University of Science and Technology 9 State Key Lab of General AI 10 Nat'l Eng. Research Center of Visual Technology
11 Beijing Key Laboratory of Behavior and Mental Health, Peking University 12 Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence
* Equal contribution ✉ yixin.zhu@pku.edu.cn, syhuang@bigai.ai
Abstract

We propose UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enabling humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying multimodal inputs through a shared discrete codebook built with finite scalar quantization (FSQ), UniAct ensures cross-modal alignment while constraining generated motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UA-Net, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.

Architecture

Method Framework

UniAct accepts a diverse range of multimodal inputs, including text, music, trajectories, and reference motions. A finite scalar quantizer encodes each input into discrete tokens that live in a shared embedding space, enabling seamless fusion across modalities. A fine-tuned multimodal large language model then processes these tokens and generates motion tokens representing the intended robot movements. Finally, a causal decoder converts the motion tokens into continuous degree-of-freedom (DoF) values, which are streamed to a motion tracker to drive the robot in real time, as sketched below.
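
To make the data flow concrete, here is a minimal sketch of the two-stage pipeline described above. Everything in it is illustrative rather than the released implementation: the FSQ level configuration (`FSQ_LEVELS`), the joint count (`NUM_DOF`), and the `generate_motion_tokens` and `causal_decode` functions are hypothetical stand-ins for the fine-tuned MLLM and the trained causal decoder.

```python
# Illustrative sketch of a UniAct-style pipeline; names, dimensions, and
# codebook levels are assumptions, not the authors' implementation.
import numpy as np

FSQ_LEVELS = np.array([7, 5, 5, 5, 5])  # assumed per-dimension FSQ levels
NUM_DOF = 29                            # assumed humanoid degree-of-freedom count


def fsq_quantize(z):
    """Finite scalar quantization: bound each latent dimension and round it
    to one of a small, fixed number of levels."""
    half = (FSQ_LEVELS - 1) / 2.0
    bounded = np.tanh(z) * half         # squash each dimension into [-half, half]
    return np.round(bounded)            # one discrete code per dimension


def fsq_code_to_index(code):
    """Flatten a per-dimension code into a single token id in the shared
    codebook via mixed-radix encoding."""
    half = (FSQ_LEVELS - 1) / 2.0
    digits = (code + half).astype(int)  # shift codes to the range [0, L-1]
    index, base = 0, 1
    for d, L in zip(digits, FSQ_LEVELS):
        index += d * base
        base *= L
    return int(index)


def generate_motion_tokens(instruction_tokens, num_steps=16):
    """Placeholder for the fine-tuned MLLM: emit motion token ids one at a
    time, conditioned on the multimodal instruction tokens."""
    rng = np.random.default_rng(sum(instruction_tokens))
    codebook_size = int(np.prod(FSQ_LEVELS))
    for _ in range(num_steps):
        yield int(rng.integers(codebook_size))  # toy sampling, not a real model


def causal_decode(motion_token, history):
    """Placeholder causal decoder: map the current motion token, using only
    past context, to continuous DoF targets for the motion tracker."""
    rng = np.random.default_rng(motion_token)
    dof = 0.1 * rng.standard_normal(NUM_DOF)
    if history:                                  # causal smoothing over past frames only
        dof = 0.7 * history[-1] + 0.3 * dof
    return dof


if __name__ == "__main__":
    # Toy multimodal prompt: a few latents quantized into shared-codebook tokens.
    instruction = [
        fsq_code_to_index(fsq_quantize(np.random.randn(len(FSQ_LEVELS))))
        for _ in range(4)
    ]
    history = []
    for tok in generate_motion_tokens(instruction):
        dof_targets = causal_decode(tok, history)  # decode each token as it arrives
        history.append(dof_targets)
        # here the DoF targets would be streamed to the motion tracker in real time
    print(f"streamed {len(history)} frames of {NUM_DOF}-DoF targets")
```

Because the decoder is causal, each motion token can be converted to DoF targets as soon as it is generated rather than after the full sequence is complete, which is what makes streaming control within the sub-500 ms latency budget possible.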