We propose UniAct, a two-stage framework that integrates a fine-tuned multimodal large language model (MLLM) with a causal streaming pipeline, enabling humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs in a shared discrete codebook built with finite scalar quantization (FSQ), UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in success rate for zero-shot tracking of imperfect reference motions. We validate UniAct on UA-Net, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. These results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
UniAct begins by accepting a diverse range of multimodal inputs, including text, music, trajectories, and reference motions. A finite scalar quantizer encodes these inputs into discrete tokens in a shared embedding space, allowing seamless fusion across modalities. The fine-tuned MLLM then consumes these tokens and generates motion tokens that represent the intended robot movements. Finally, a causal decoder converts the motion tokens into continuous degree-of-freedom (DoF) values, which are streamed to a motion tracker that drives the robot in real time.
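To make the pipeline concrete, the following is a minimal PyTorch sketch of the tokenize-generate-decode flow. The FSQ levels, hidden sizes, 29-DoF output, and the GRU-based causal decoder are illustrative assumptions rather than the released UniAct implementation; in the full system the MLLM, not the input quantizer, produces the motion tokens.

```python
# Minimal sketch of a UniAct-style tokenize -> generate -> decode pipeline.
# Module names, dimensions, and FSQ levels are illustrative assumptions.
import torch
import torch.nn as nn


class FSQ(nn.Module):
    """Finite scalar quantization: round each latent channel to a fixed
    number of levels, giving discrete codes without a learned codebook."""

    def __init__(self, levels=(8, 8, 8, 5, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # z: (batch, time, len(levels)) continuous latents
        half = (self.levels - 1) / 2
        bounded = torch.tanh(z) * half              # squash into [-half, half]
        quantized = torch.round(bounded)            # snap to the integer grid
        # Straight-through estimator keeps gradients flowing during training.
        return bounded + (quantized - bounded).detach()

    def to_indices(self, q):
        # Flatten the per-channel codes into a single token index per timestep.
        half = (self.levels - 1) / 2
        digits = torch.round(q + half).long()
        base = torch.cumprod(
            torch.cat([torch.ones(1, device=q.device), self.levels[:-1]]), dim=0
        ).long()
        return (digits * base).sum(dim=-1)


class CausalMotionDecoder(nn.Module):
    """Causal decoder mapping motion tokens to continuous DoF targets one
    step at a time, so output can be streamed to the motion tracker."""

    def __init__(self, vocab_size, num_dofs, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # strictly causal
        self.head = nn.Linear(hidden, num_dofs)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.embed(tokens), state)
        return self.head(h), state


if __name__ == "__main__":
    fsq = FSQ()
    decoder = CausalMotionDecoder(vocab_size=8 * 8 * 8 * 5 * 5 * 5, num_dofs=29)

    # Encode a dummy multimodal latent sequence into shared discrete tokens.
    latents = torch.randn(1, 16, 6)
    tokens = fsq.to_indices(fsq(latents))

    # An MLLM would autoregressively emit motion tokens here; we reuse the
    # input tokens as a stand-in and decode them step by step (streaming).
    state = None
    for t in range(tokens.shape[1]):
        dof_targets, state = decoder(tokens[:, t:t + 1], state)
        # dof_targets: (1, 1, 29) joint targets sent to the motion tracker.
```

The step-by-step decoding loop is what enables streaming: each motion token is converted to DoF targets as soon as it arrives, rather than after the full sequence has been generated.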
Generating complex humanoid actions directly from natural language prompts with high fidelity and semantic understanding.
Synchronizing movements with musical beats and rhythm patterns.
Following precise paths for accurate spatial navigation.
Translating inputs across different modalities for robust action synthesis (illustrated in the sketch below).
Integrating GVHMR for real-time motion retargeting and imitation.
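As a complement to the pipeline sketch above, the snippet below illustrates one way heterogeneous inputs could be projected into the shared latent space before FSQ tokenization. The encoder choices, modality names, and feature dimensions (e.g. 768-dimensional text embeddings, xyz waypoints) are assumptions for exposition, not UniAct's actual front end.

```python
# Illustrative sketch of projecting heterogeneous inputs into the shared
# latent space that feeds the FSQ tokenizer (see the sketch above).
# Encoder choices and feature dimensions are assumptions for exposition.
import torch
import torch.nn as nn


class ModalityFrontEnd(nn.Module):
    """Maps per-modality features to a common latent width so that text,
    music, and trajectory conditions all share one discrete codebook."""

    def __init__(self, latent_dim=6, text_dim=768, audio_dim=128, traj_dim=3):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, latent_dim),        # e.g. LM embeddings
            "music": nn.Linear(audio_dim, latent_dim),      # e.g. mel/beat features
            "trajectory": nn.Linear(traj_dim, latent_dim),  # e.g. xyz waypoints
        })

    def forward(self, modality, features):
        # features: (batch, time, feature_dim) for the given modality
        return self.proj[modality](features)


if __name__ == "__main__":
    front_end = ModalityFrontEnd()
    # A 2-second trajectory sampled at 15 Hz: 30 xyz waypoints.
    waypoints = torch.randn(1, 30, 3)
    shared_latents = front_end("trajectory", waypoints)  # (1, 30, 6)
    # shared_latents would then be quantized by the FSQ module and interleaved
    # with the other modalities' tokens before being passed to the MLLM.
```

Projecting every modality to the same latent width before quantization is what allows a single codebook, and therefore a single token vocabulary, to serve text, music, trajectory, and motion-reference conditions alike.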