ACM Transactions on Graphics · SIGGRAPH 2026

SMP

Reusable Score-Matching Motion Priors
for Physics-Based Character Control

Pretrained motion diffusion models, repurposed as reward models, train policies for 100+ styles, diverse tasks, object interactions, and real-world robots.

  1. Yuxuan Mu1*
  2. Ziyu Zhang1*
  3. Yi Shi1*
  4. Dun Yang2
  5. Minami Matsumoto2
  6. Kotaro Imamura2
  7. Guy Tevet3
  8. Chuan Guo4
  9. Michael Taylor2
  10. Chang Shu5
  11. Pengcheng Xi5
  12. Xue Bin Peng1,6

*Joint first authors

01

Reusable

A single frozen diffusion model serves as the reward model across many tasks and styles, from locomotion and steering, to dodgeball and zombie-walk.

02

Modular

A motion prior is trained once, independently of any task or policy, and serves as a reward model without ever accessing the original motion data.

03

Composable

A general pretrained motion prior can be shaped or composed to synthesize new styles not present in the original dataset without further training.

Abstract

Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when applied to downstream tasks.

In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, they can be frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks.

We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method creates reusable and modular motion priors that produce high-quality motions comparable to state-of-the-art adversarial imitation learning methods. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters.

How it works

A motion diffusion model is trained on a reference motion dataset. During policy training, the pretrained diffusion model is repurposed as a reward model: it compares the noise added to the agent’s motion against the noise it predicts, and the residual evaluates how far the behavior is from the reference data manifold.
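The reward described above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: `noise_pred` is a stand-in for the frozen pretrained diffusion model, and the quadratic residual and negation are assumed reward shaping.

```python
import math
import random

def sds_reward(motion, noise_pred, t, alpha_bar):
    """Score a motion clip with a frozen diffusion model (sketch).

    motion:     flat list of floats (a short window of character states)
    noise_pred: callable (noisy_motion, t) -> predicted noise, the frozen prior
    t:          diffusion timestep at which to evaluate
    alpha_bar:  cumulative noise schedule, alpha_bar[t] in (0, 1)
    """
    a = alpha_bar[t]
    eps = [random.gauss(0.0, 1.0) for _ in motion]           # sampled noise
    noisy = [math.sqrt(a) * x + math.sqrt(1 - a) * e         # forward diffusion
             for x, e in zip(motion, eps)]
    eps_hat = noise_pred(noisy, t)                           # model's denoising guess
    # Residual between added and predicted noise: small when the motion
    # lies near the reference data manifold, large otherwise.
    err = sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(motion)
    return -err                                              # higher reward = more natural
```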

Schematic overview. A pretrained motion diffusion model serves as a reusable reward model for motion naturalness via score distillation sampling. The model can be style-conditioned, enabling the policy to learn specific skills or styles without retraining the prior or accessing the original motion data.

Ensemble Score-Matching

Instead of evaluating SDS at a single random diffusion timestep, SMP aggregates SDS losses over a fixed set of timesteps. This substantially reduces variance without introducing significant bias, thereby stabilizing PPO training.
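The timestep ensemble can be sketched as follows, assuming the same noise-residual error as above; `noise_pred` and the particular timestep set are placeholders, not values from the paper.

```python
import math
import random

def ensemble_sds_error(motion, noise_pred, timesteps, alpha_bar):
    """Average the SDS residual over a fixed set of diffusion timesteps
    instead of one random timestep -- lower variance, similar mean (sketch).

    noise_pred: frozen prior, (noisy_motion, t) -> predicted noise
    """
    total = 0.0
    for t in timesteps:
        a = alpha_bar[t]
        eps = [random.gauss(0.0, 1.0) for _ in motion]
        noisy = [math.sqrt(a) * x + math.sqrt(1 - a) * e
                 for x, e in zip(motion, eps)]
        eps_hat = noise_pred(noisy, t)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(motion)
    return total / len(timesteps)
```

Averaging over several fixed timesteps trades a little extra compute per reward query for a much tighter estimate, which matters when the reward feeds a policy-gradient update.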

Adaptive Normalization

SDS magnitudes differ sharply across noise levels and can vary across diffusion model checkpoints. We normalize each timestep’s SDS error by its running mean, substantially reducing the burden of manual hyperparameter tuning.
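A per-timestep running-mean normalizer could look like the sketch below; the EMA coefficient `beta` and the class interface are assumptions for illustration, not the paper's exact scheme.

```python
class RunningMeanNormalizer:
    """Normalize each timestep's SDS error by its running mean so reward
    scale is comparable across noise levels (sketch)."""

    def __init__(self, beta=0.99, eps=1e-8):
        self.beta = beta        # EMA coefficient (assumed value)
        self.eps = eps          # guards against division by zero
        self.mean = {}          # timestep -> running mean of SDS error

    def normalize(self, t, err):
        m = self.mean.get(t, err)               # init with first observation
        m = self.beta * m + (1 - self.beta) * err
        self.mean[t] = m
        return err / (m + self.eps)             # ~1.0 when err is typical for t
```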

Generative State Initialization

The same diffusion model is sampled to generate initial states for training. This matches reference-state-initialization’s exploration benefits without retaining the original motion dataset.
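A minimal reverse-process sampler illustrates the idea, assuming a DDPM-style schedule; `noise_pred`, the schedule, and the variance choice are placeholders rather than the paper's exact sampler.

```python
import math
import random

def sample_initial_state(noise_pred, alpha, alpha_bar, dim):
    """Draw an episode start state by running the frozen diffusion model's
    reverse process from pure noise (minimal DDPM-style sketch)."""
    x = [random.gauss(0.0, 1.0) for _ in range(dim)]         # start from noise
    for t in reversed(range(len(alpha))):
        a, ab = alpha[t], alpha_bar[t]
        eps_hat = noise_pred(x, t)
        # DDPM mean update: remove the predicted noise component.
        x = [(xi - (1 - a) / math.sqrt(1 - ab) * e) / math.sqrt(a)
             for xi, e in zip(x, eps_hat)]
        if t > 0:                                            # add sampling noise
            sigma = math.sqrt(1 - a)
            x = [xi + sigma * random.gauss(0.0, 1.0) for xi in x]
    return x
```

Each sampled state is then used to initialize an episode, giving the policy diverse, data-like starting configurations without storing the reference clips themselves.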

§1 · Modularity

One prior, 100+ styles

We train a single style-conditioned diffusion model on the 20-hour 100STYLE dataset. Using classifier-free guidance, the same frozen model is reshaped into distinct style-specific priors — no style-specific datasets, no discriminator retraining. Diffusion model predictions conditioned on different styles can be composed to synthesize styles that don’t exist in the original dataset, such as Elated + FlickLegs and GracefulArms + Spin.

Style composition. A pretrained 100-style prior is adapted via classifier-free guidance and per-body-part mixing in the diffusion model’s ε-space to produce new stylistic behaviors — all while the agent performs a downstream target-location task.
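The guidance and composition steps can be sketched as small operations in the model's ε-space; these helpers (names and signatures assumed) illustrate the mechanism, not the released code.

```python
def cfg_eps(eps_uncond, eps_style, w):
    """Classifier-free guidance in noise-prediction space:
    eps = eps_uncond + w * (eps_style - eps_uncond).
    w > 1 pushes the prediction further toward the style condition."""
    return [u + w * (s - u) for u, s in zip(eps_uncond, eps_style)]

def compose_styles(eps_a, eps_b, mask):
    """Per-body-part mixing: take style A's prediction on masked
    dimensions and style B's elsewhere (mask entries in {0.0, 1.0}),
    e.g. arms from one style and legs from another."""
    return [m * a + (1 - m) * b for a, b, m in zip(eps_a, eps_b, mask)]
```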

§2 · Reusability

One prior, many tasks

A single motion prior trained on the LaFAN1 locomotion dataset is reused to train policies for different tasks. The character automatically discovers gait transitions — backward jogs, side-steps, slow walks near targets — and, for dodgeball, develops agile jumping and dodging skills that do not explicitly appear in the reference data.

LaFAN1 locomotion prior reused across multiple downstream tasks.

Steering

Following a target heading and a target velocity. Side-steps and cross-stepping gaits emerge when the facing direction and the movement direction disagree.

Target Location

Reaching 2D target positions. The character speeds up when far from the target and transitions smoothly into walking as it approaches.

Dodgeball

Balls are launched at up to 25 m/s from up to 10 m away — less than 0.5 s to react. Agile dodging skills resembling human-like strategies emerge purely from locomotion data.

§3 · Human–Object Interaction

Beyond locomotion: object and scene interaction

SMP extends naturally to joint human–object priors by modelling character and object motions together. This enables complex tasks such as picking up a box and climbing stairs, with coordinated whole-body manipulation and stable foot contacts.

Object Carry

Walk to a box, pick it up, then carry it to an arbitrary target location. The prior jointly models character and object motions.

Stair Traversal

Using a human–scene interaction prior, the character ascends and descends a staircase with natural foot placement.

§4 · Data efficiency

Skill emergence from 3 seconds of data

SMP works even when reference data is scarce. With only three seconds of walking, jogging, and running clips, the character learns to continuously modulate its gait across a wide speed range — developing transitions such as walk→jog→sprint that were never in the dataset.

Target speed trained from a 3-second motion prior. The policy follows continuous target speeds sampled from [1.2, 6.8] m/s, automatically adjusting stride frequency and gait.

§5 · Real World

From simulation to a real humanoid

An SMP-trained policy is deployed directly on a Unitree G1 humanoid. With asymmetric actor-critic training, domain randomization, and proprioception-only observations, the robot exhibits natural locomotion, robust recovery from external perturbations, and agile motion skills.

Real-world Unitree G1. SMP transfers from simulation to hardware, showing agile motion skills without runtime motion planners or hand-crafted heuristics.

Full video

A complete walkthrough of the paper — background, motivation, method, and additional qualitative results across all task settings.


Watch the video walkthrough

Read the paper (PDF) →

Citation

@article{mu2026smp,
  title   = {SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control},
  author  = {Mu, Yuxuan and Zhang, Ziyu and Shi, Yi and Yang, Dun and
             Matsumoto, Minami and Imamura, Kotaro and Tevet, Guy and
             Guo, Chuan and Taylor, Michael and Shu, Chang and
             Xi, Pengcheng and Peng, Xue Bin},
  journal = {ACM Transactions on Graphics (Proceedings of SIGGRAPH 2026)},
  year    = {2026},
}