PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
(ICCV'2023)

Di Yang1     Quan Kong3     Lorenzo Garattoni2
Gianpiero Francesca2     François Brémond1

1Inria,  Université Côte d'Azur    2Toyota Motor Europe    3Woven by Toyota

Corresponding author

Abstract

Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization. We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how each of a set of learned atomic motion primitives contributes to the observed motion. A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads. This representation introduces structure and compositionality, and improves generalization across distinct forms of supervision. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experimental results show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks on real-world datasets.
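To make the idea of a "trajectory in a primitive coefficient space" concrete, the following is a minimal NumPy sketch and not the PRISM implementation. It assumes, purely for illustration, that each frame is a rest pose plus a linear combination of primitives; the abstract does not state how the learned primitives are combined. All names (`primitives`, `coeffs`, `W_cls`, `W_det`) and all sizes are hypothetical, and the primitives and head weights are random stand-ins rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, joints, primitives, action classes.
T, J, K, C = 16, 5, 3, 4
D = J * 3  # flattened 3D skeleton pose per frame

# Stand-in primitive bank (random here; learned in the actual framework).
primitives = rng.standard_normal((K, D))
rest_pose = rng.standard_normal(D)

# One coefficient vector per frame: the sequence is a trajectory in R^K.
coeffs = rng.standard_normal((T, K))

# ASSUMPTION: linear composition of primitives around a rest pose.
motion = rest_pose + coeffs @ primitives  # shape (T, D)

# Under that assumption, coefficients can be recovered by least squares.
solution, *_ = np.linalg.lstsq(primitives.T, (motion - rest_pose).T, rcond=None)
recovered = solution.T  # shape (T, K)

# Hypothetical lightweight task heads on top of the shared representation.
W_cls = rng.standard_normal((K, C))  # clip-level classification head
W_det = rng.standard_normal((K, C))  # frame-wise detection head

clip_logits = recovered.mean(axis=0) @ W_cls  # shape (C,)
frame_logits = recovered @ W_det              # shape (T, C)
```

In this toy setting the same coefficient trajectory feeds both heads, which mirrors the transfer pattern described in the abstract: one shared representation, with only small task-specific layers differing between classification and frame-wise detection.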
