RoMo teaser: large-scale, taxonomy-organized human motion dataset

Abstract

Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences.

We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics.

We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

813,938 motion sequences · 1,238 hours of motion · 54 categories · 2,065 subcategories, further divided into atomic actions

Semantic Taxonomy

Two levels of the taxonomy are displayed: 54 categories and 2,065 subcategories, with each node sized by its sequence count.

Taxonomy-aware Filtering

A five-stage filtering pipeline distills ~14 years (125K hours) of raw web video down to 1.3K hours of high-quality, well-annotated motion, about 1% of the original footage. Each stage is described below with the hours that remain after it.

00

Raw web video

Human-motion videos crawled from YouTube, Kinetics-700, HOIGen, VideoUFO and more — unfiltered, noisy, fully in-the-wild.

125K hours (~14 years)
01

Meta Filter

An LLM screens each video's metadata: it must depict a real human action, a single person, with the full body visible, and not be AI-generated. Clips below 24 FPS are dropped.

65K hours remain (−48% vs. the previous stage)
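
The screening criteria above amount to a conjunction of boolean checks plus a frame-rate floor. The following is a minimal sketch, assuming the LLM's answers have already been parsed into a record; `VideoMeta`, its field names, and `passes_meta_filter` are hypothetical and are not taken from the RoMo pipeline.

```python
from dataclasses import dataclass

MIN_FPS = 24.0  # clips below this frame rate are dropped (stated in the text)


@dataclass
class VideoMeta:
    """Hypothetical record holding the LLM's answers for one video."""
    real_human_action: bool
    single_person: bool
    full_body_visible: bool
    ai_generated: bool
    fps: float


def passes_meta_filter(meta: VideoMeta) -> bool:
    """Return True only if every screening criterion holds."""
    return (
        meta.real_human_action
        and meta.single_person
        and meta.full_body_visible
        and not meta.ai_generated
        and meta.fps >= MIN_FPS
    )
```

A 23.976 FPS clip fails this check even if every content criterion passes, because the floor is applied strictly at 24 FPS.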
02

Scene Detection

PySceneDetect splits transitions and discontinuities into clean shots; near-static scenes are removed using inter-frame differences.

39K hours remain (−41% vs. the previous stage)
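
The near-static check can be illustrated with a simple inter-frame difference. This is a sketch under stated assumptions: frames are flattened grayscale pixel lists, the score is the mean absolute difference between consecutive frames, and the threshold of 2.0 is illustrative rather than RoMo's actual setting. Shot splitting itself (done with PySceneDetect in the pipeline) is not shown.

```python
from typing import Sequence

Frame = Sequence[float]  # flattened grayscale frame, values in [0, 255]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two same-sized frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def is_near_static(frames: Sequence[Frame], threshold: float = 2.0) -> bool:
    """Flag a shot whose average inter-frame difference is below threshold.

    The threshold is an illustrative assumption, not a published value.
    """
    if len(frames) < 2:
        return True  # a single frame carries no motion
    diffs = [
        mean_abs_diff(frames[i], frames[i + 1])
        for i in range(len(frames) - 1)
    ]
    return sum(diffs) / len(diffs) < threshold
```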
03

Human Detection

YOLOv8 keeps clips dominated by a single prominent person; ViTPose 2D-pose checks reject heavy truncation and tiny subjects.

11K hours remain (−72% vs. the previous stage)
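
The three rejection rules above (one prominent person, no tiny subjects, no heavy truncation) can be sketched as a per-frame decision on detector output. The sketch assumes person boxes and 2D keypoint confidences have already been produced by a detector and pose model; `keep_frame` and all thresholds are hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def keep_frame(
    boxes: List[Box],
    keypoint_conf: Sequence[float],
    frame_height: float,
    min_height_frac: float = 0.3,
    dominance: float = 2.0,
    min_visible_frac: float = 0.8,
    conf_thresh: float = 0.5,
) -> bool:
    """Keep a frame with one prominent, large, mostly visible person.

    keypoint_conf holds 2D-pose confidences for the largest person.
    All thresholds are illustrative assumptions.
    """
    if not boxes or len(keypoint_conf) == 0:
        return False
    ranked = sorted(boxes, key=box_area, reverse=True)
    main = ranked[0]
    # Reject tiny subjects: the main person must span enough frame height.
    if (main[3] - main[1]) / frame_height < min_height_frac:
        return False
    # Require prominence: the main box must clearly dominate any other.
    if len(ranked) > 1 and box_area(main) < dominance * box_area(ranked[1]):
        return False
    # Reject heavy truncation: most keypoints must be confidently detected.
    visible = sum(1 for c in keypoint_conf if c >= conf_thresh)
    return visible / len(keypoint_conf) >= min_visible_frac
```

A clip-level decision could then require that a sufficient share of frames pass, which is left out here.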
04

Motion Estimation

GVHMR lifts each clip into a 3D SMPL sequence (24 joints, resampled to 30 FPS); low-quality reconstructions are filtered out.

3.8K hours remain (−65% vs. the previous stage)
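
Resampling to a fixed 30 FPS can be illustrated with linear interpolation over time. This is a simplified sketch that interpolates 3D joint positions; SMPL pose parameters are rotations, for which spherical interpolation would be the more appropriate choice, and the method actually used for RoMo is not stated here. `resample_motion` is a hypothetical helper.

```python
from typing import List

Pose = List[List[float]]  # one frame: 24 joints x 3 coordinates


def resample_motion(
    frames: List[Pose], src_fps: float, dst_fps: float = 30.0
) -> List[Pose]:
    """Linearly interpolate joint positions onto a dst_fps timeline."""
    if not frames:
        return []
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    last = len(frames) - 1
    duration = last / src_fps  # seconds covered by the clip
    n_out = int(duration * dst_fps + 1e-9) + 1
    out: List[Pose] = []
    for k in range(n_out):
        t = k * src_fps / dst_fps  # fractional source-frame index
        i = min(int(t), last)
        j = min(i + 1, last)
        w = t - i  # blend weight between frames i and j
        out.append([
            [(1 - w) * a + w * b for a, b in zip(joint_i, joint_j)]
            for joint_i, joint_j in zip(frames[i], frames[j])
        ])
    return out
```

For example, a one-second clip stored as 61 frames at 60 FPS comes out as 31 frames at 30 FPS, and a 15 FPS clip gains interpolated in-between frames.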
05

Motion Filter

An adaptive, per-category dynamic-score threshold removes static and low-activity clips — the final quality gate before the dataset.

1.3K hours remain (−66% vs. the previous stage)
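
One way to read "adaptive, per-category threshold" is that each category gets its own cutoff derived from the score distribution of its clips, so a low-motion category such as yoga is not judged by the standard of dance. The sketch below assumes a dynamic score defined as mean joint speed and a quantile-based cutoff; both are illustrative stand-ins, since the actual score and threshold rule are not given here, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def dynamic_score(frames: Sequence[Sequence[Sequence[float]]],
                  fps: float = 30.0) -> float:
    """Mean joint speed (position units per second) over a clip."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += math.dist(p, c) * fps
            count += 1
    return total / count if count else 0.0


def category_thresholds(scored: List[Tuple[str, float]],
                        quantile: float = 0.3) -> Dict[str, float]:
    """Per-category cutoff at the given quantile of that category's scores."""
    by_cat: Dict[str, List[float]] = defaultdict(list)
    for cat, score in scored:
        by_cat[cat].append(score)
    thresholds = {}
    for cat, scores in by_cat.items():
        scores.sort()
        idx = min(int(quantile * len(scores)), len(scores) - 1)
        thresholds[cat] = scores[idx]
    return thresholds


def keep_clip(cat: str, score: float, thresholds: Dict[str, float]) -> bool:
    """Keep a clip whose score reaches its own category's cutoff."""
    return score >= thresholds[cat]
```

With this rule the same score can pass in one category and fail in another, which is the behavior a single global threshold cannot provide.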
RoMo — 813,938 sequences · 1,238 h · only ~1% of the raw footage survives.
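
The funnel figures can be re-derived from the rounded hour counts quoted for each stage. The short script below is a worked check, not part of the RoMo tooling. Because the inputs are rounded, a recomputed drop can differ by a point from the quoted one (Scene Detection comes out at 40% here versus the quoted 41%).

```python
# Hours remaining after each stage, as quoted in the stage descriptions.
STAGES = [
    ("Raw web video", 125_000),
    ("Meta Filter", 65_000),
    ("Scene Detection", 39_000),
    ("Human Detection", 11_000),
    ("Motion Estimation", 3_800),
    ("Motion Filter", 1_300),
]


def funnel_report(stages):
    """Return (name, hours, drop % vs. previous stage, % of raw kept)."""
    raw = stages[0][1]
    prev = raw
    rows = []
    for name, hours in stages:
        drop_pct = round(100 * (1 - hours / prev))
        kept_pct = round(100 * hours / raw, 1)
        rows.append((name, hours, drop_pct, kept_pct))
        prev = hours
    return rows


if __name__ == "__main__":
    for name, hours, drop, kept in funnel_report(STAGES):
        print(f"{name:<18} {hours:>8,} h  -{drop}%  {kept}% of raw")
    # 125K hours expressed in years of continuous footage.
    print(round(125_000 / (24 * 365), 1), "years")
```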

Dataset Statistics

Figures generated from the RoMo dataset.

Scale vs. prior datasets

Core clip counts and hours from Table 1 of the paper, with caption diversity for each dataset.

Where the motions come from

Motion sequence counts and percentages from the source-distribution figure.

Sequences per category

Category counts from the RoMo data; total = 813,938 sequences.

Semantic diversity (t-SNE)

t-SNE of Sentence-T5 caption embeddings. RoMo spans a broader semantic space than prior datasets.

Motion Toolbox

A single library to standardize how motion datasets are measured, converted, and viewed.

Standard metrics

Unified FID, diversity, multimodality and per-category scores so results are comparable across papers.
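
To make the idea of per-category scores concrete, here is a minimal sketch of a diversity metric computed globally and per taxonomy category. It is not the Motion Toolbox API, which is not shown on this page; `diversity` and `per_category_diversity` are hypothetical names. Diversity is taken here as the mean Euclidean distance over all pairs of motion feature vectors, whereas published evaluations often estimate it from randomly sampled pairs.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence


def diversity(features: Sequence[Sequence[float]]) -> float:
    """Mean pairwise Euclidean distance between feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def per_category_diversity(
    features: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, float]:
    """Group features by category label and score each group separately."""
    groups: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(feat)
    return {label: diversity(feats) for label, feats in groups.items()}
```

A global score can look healthy while one category has collapsed to near-identical outputs; the per-category breakdown exposes that case.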

Data conversion

Convert between SMPL, R15, and common motion formats with consistent root-motion handling.

Visualization

One-line GLB rendering and side-by-side comparison — the same pipeline powering this gallery.

BibTeX

@inproceedings{Zhang2026RoMo,
  author    = {Zhang, Jiahao and Liu, Joseph and Lee, Young-Yoon and Moon, Seonghyeon and Zordan, Victor and Tevet, Guy and Liu, Karen and Gould, Stephen and Jacob, Oren and Jiang, Haomiao and Kapadia, Mubbasir and Ben-Shabat, Yizhak},
  title     = {RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}