MotionMaster: Generalizable Text-Driven Motion Generation and Editing

Nan Jiang1,2,3,7,8,9*, Yunhao Li3,6,7,8*, Lexi Pang1,3,5,7,8*, Zimo He4,2,7,8,9, Siyuan Huang2,7✉️, Yixin Zhu3,1,7,8,9✉️
1Institute for AI, Peking University    2Beijing Institute for General Artificial Intelligence (BIGAI)    3School of Psychological and Cognitive Sciences, Peking University
4School of Computer Science, Peking University    5Yuanpei College, Peking University    6School of Foreign Languages, Peking University
7State Key Lab of General AI    8Beijing Key Laboratory of Behavior and Mental Health, Peking University
9Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence
*Indicates Equal Contribution    ✉️Indicates Corresponding Author
CVPR 2026
Teaser

MotionMaster achieves a degree of text generalization and long-horizon motion generation and editing unseen in previous methods. By grounding motion natively in the shared embedding space of a pretrained MLLM, MotionMaster inherits the model's rich action semantics and long-horizon reasoning capabilities. This leap rests on three core contributions: MotionGB, a richly annotated 10,000-hour dataset; a novel FSQ-based tokenizer that balances local joint accuracy with global trajectory coherence; and a unified finetuning paradigm with temporal-augmentation and semantic-balancing strategies that facilitate learning.

Multi-Level Text-to-Motion Generation

Long-Horizon Generation (2 Actions)

Long-Horizon Generation (3 Actions)

Long-Horizon Generation (5 Actions)

Multi-Level Text-Guided Motion Editing

Abstract

Synthesizing realistic human motion from natural language holds transformative potential for animation, robotics, and virtual reality. Recent methods handle single-action sequences and simple textual instructions, yet multi-action compositions and precise editing remain elusive due to limited data diversity, inadequate representations, and fragmented pipelines. Critically, most existing methods train motion generation models from scratch, failing to exploit the rich action semantics and long-horizon reasoning already encoded in pretrained MLLMs. Here we show that finetuning a pretrained MLLM with large-scale motion data yields strong zero-shot generalization across diverse text-guided motion generation and editing tasks. We present MotionMaster, a unified framework built on three components: MotionGB, a 10,000-hour dataset expanded from 400 hours of verified motion capture via spatial-temporal augmentation; an FSQ-based tokenizer that preserves both local joint accuracy and global trajectory coherence; and a finetuned MLLM with motion and language tokens in a shared embedding space. MotionMaster outperforms prior methods by 41.6% in multi-action semantic consistency and 20.8% in body-part composition. These results demonstrate that pretraining knowledge from MLLMs transfers effectively to motion understanding, opening a viable path toward general-purpose motion intelligence.

Method

Overview of MotionMaster Framework

Overview of MotionMaster. (a) The FSQ-based motion tokenizer encodes joint positions into localized features, quantizes them into discrete tokens, and supervises reconstruction via a loss computed in global coordinates. (b) For text-to-motion generation, the finetuned MLLM autoregressively decodes motion tokens conditioned on a text prompt. (c) For text-guided editing, the original motion is provided as additional context, and the MLLM selectively modifies the relevant tokens while preserving the remainder of the sequence.
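The quantization step in (a) builds on finite scalar quantization (FSQ), which replaces a learned codebook with per-channel rounding onto a small integer grid. The page does not specify MotionMaster's actual level counts or feature dimensions, so the values below are illustrative assumptions; this is a minimal NumPy sketch of the generic FSQ operation, not the paper's implementation:

```python
import numpy as np

def fsq_quantize(z, levels):
    """FSQ: bound each feature channel, then round it onto a grid of
    `levels[i]` integer values. Odd level counts keep the grid symmetric."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half   # squash channel i into (-half[i], half[i])
    return np.round(bounded)      # snap to the nearest grid level

def fsq_token_index(q, levels):
    """Map a quantized vector to a single discrete token id (mixed-radix)."""
    levels = np.asarray(levels)
    digits = (q + (levels - 1) / 2.0).astype(int)     # shift to [0, L-1]
    basis = np.cumprod(np.concatenate(([1], levels[:-1])))
    return int(np.dot(digits, basis))

levels = [7, 5, 5, 5]                   # implied codebook size: 7*5*5*5 = 875
z = np.array([0.3, -1.2, 0.0, 2.5])     # a hypothetical encoder output
q = fsq_quantize(z, levels)             # -> [1., -2., 0., 2.]
tok = fsq_token_index(q, levels)        # a single id in [0, 875)
```

In training, the rounding is made differentiable with a straight-through estimator; the resulting token ids are what the finetuned MLLM in (b) decodes autoregressively alongside text tokens.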

BibTeX

@inproceedings{jiang2026motionmaster,
  title={MotionMaster: Generalizable Text-Driven Motion Generation and Editing},
  author={Jiang, Nan and Li, Yunhao and Pang, Lexi and He, Zimo and Huang, Siyuan and Zhu, Yixin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}