MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching

Yen-Siang Wu1, Chi-Pin Huang1, Fu-En Yang2, Yu-Chiang Frank Wang1,2
1National Taiwan University, 2NVIDIA

MotionMatcher can customize pre-trained T2V diffusion models with a user-provided reference video (left). Once customized, the diffusion model can transfer the precise motion (including object movements and camera framing) in the reference video to a variety of scenes (middle and right).

Abstract

Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over precise object movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods choose to fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such a strategy suffers from content leakage from the reference video and cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performance, validating the design of our framework.

Method

(a) We fine-tune the pre-trained T2V diffusion model (T2V-DM) using the motion feature matching objective. Unlike the standard pixel-level DDPM loss, we align the motion features of the predicted noisy video with those of the ground truth noisy video. To extract motion features from noisy latent videos, we use a pre-trained T2V-DM (frozen) as a feature extractor. (b) We leverage the cross-attention (CA) maps and temporal self-attention (TSA) maps in the pre-trained T2V diffusion model to extract motion cues. The final motion features are the combination of the CA maps and TSA maps.
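The sketch below illustrates this feature-level objective in PyTorch, assuming a DDIM-style one-step estimate of the noisy video at timestep t-1. It is not the exact MotionMatcher implementation: the model call signatures, the `return_attention_maps` flag, and the `extract_motion_features` helper are hypothetical stand-ins for the frozen T2V-DM's attention-map extraction.

```python
import torch
import torch.nn.functional as F

def extract_motion_features(frozen_dm, x_noisy, text_emb, t):
    """Hypothetical helper: run the frozen pre-trained T2V-DM on a noisy latent
    video and gather its cross-attention (CA) and temporal self-attention (TSA)
    maps, which together act as the motion features."""
    # `return_attention_maps` is an assumed hook; in practice one would register
    # forward hooks on the attention layers of the frozen model.
    _, attn_maps = frozen_dm(x_noisy, t, encoder_hidden_states=text_emb,
                             return_attention_maps=True)
    return torch.cat([m.flatten(start_dim=1) for m in attn_maps], dim=1)

def motion_matching_loss(dm, frozen_dm, x0, text_emb, alphas_cumprod):
    """Sketch of one fine-tuning step with a motion feature matching objective.
    Instead of the pixel-level DDPM loss on predicted noise, the loss compares
    motion features of the predicted and ground-truth noisy videos."""
    b = x0.shape[0]
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)
    t = torch.randint(1, len(a_bar), (b,), device=x0.device)

    # Forward diffusion to timestep t (broadcast alpha_bar over the video dims).
    shape = (b,) + (1,) * (x0.dim() - 1)
    a_t, a_prev = a_bar[t].view(shape), a_bar[t - 1].view(shape)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps

    # Predicted noise from the model being fine-tuned (call signature assumed).
    eps_pred = dm(x_t, t, encoder_hidden_states=text_emb).sample

    # Predicted vs. ground-truth noisy videos at timestep t-1 (DDIM-style step).
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    x_prev_pred = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps_pred
    x_prev_gt = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

    # Match motion features extracted by the frozen T2V-DM.
    feat_pred = extract_motion_features(frozen_dm, x_prev_pred, text_emb, t - 1)
    with torch.no_grad():
        feat_gt = extract_motion_features(frozen_dm, x_prev_gt, text_emb, t - 1)
    return F.mse_loss(feat_pred, feat_gt)
```

Gradients flow into the fine-tuned model only through its noise prediction; the frozen T2V-DM serves purely as the motion feature extractor.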


Example results



One-to-one results


One-to-many results


More samples based on CogVideoX



Retrieval results


The video with the most similar motion features shares the same motion despite having a different appearance. In contrast, the video that is most similar in latent space has a nearly identical appearance but opposite motion, while the video with the most similar residual frames contains unrelated motion. These results verify that our motion features capture rich motion information rather than irrelevant details of visual appearance.
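As a rough illustration, the following PyTorch snippet sketches the nearest-neighbor retrieval compared above, using cosine similarity in three representation spaces: motion features, raw latents, and residual (frame-difference) latents. The tensor shapes and random toy data are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

def nearest_neighbor(query, gallery):
    """Index of the gallery video whose representation has the highest
    cosine similarity to the query representation."""
    q = F.normalize(query.reshape(1, -1), dim=1)
    g = F.normalize(gallery.reshape(gallery.shape[0], -1), dim=1)
    return int((q @ g.T).argmax())

# Toy shapes: 100 gallery videos, 16 frames of 4x32x32 latents, and 512-dim
# motion features (the real features are the CA + TSA maps of the frozen T2V-DM).
gallery_latents = torch.randn(100, 16, 4, 32, 32)
gallery_motion  = torch.randn(100, 512)
query_latent    = torch.randn(16, 4, 32, 32)
query_motion    = torch.randn(512)

# Retrieval in the three spaces compared above.
idx_motion   = nearest_neighbor(query_motion, gallery_motion)           # motion features
idx_latent   = nearest_neighbor(query_latent, gallery_latents)          # raw latents
idx_residual = nearest_neighbor(query_latent[1:] - query_latent[:-1],   # residual frames
                                gallery_latents[:, 1:] - gallery_latents[:, :-1])
```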

BibTeX

@article{wu2025motionmatcher,
  title    = {MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching},
  author   = {Wu, Yen-Siang and Huang, Chi-Pin and Yang, Fu-En and Wang, Yu-Chiang Frank},
  journal  = {arXiv preprint arXiv:2502.13234},
  year     = {2025}
}