Abstract: Robot manipulation succeeds only when perception preserves the aspects of a scene that matter for action. Yet most robot learning pipelines still rely on visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. DynaFLIP uses image transitions, language, and 3D flow as training-time supervision to shape an image-only encoder, requiring no extra modalities at test time. Our key idea is to align these modalities jointly by minimizing the generalized simplex volume spanned by their embeddings, capturing higher-order structure beyond anchor-based pairwise objectives. To avoid geometrically ambiguous low-volume solutions and trivial collapse, we augment the objective with a cosine regularizer between selected modality pairs and embed the resulting cosine-augmented energy in a contrastive framework. We construct image-language-3D flow triplets from heterogeneous human and robot videos and evaluate the resulting representations across simulated and real-world manipulation. DynaFLIP consistently outperforms strong baselines on LIBERO and in a real-world VLA setup, with especially strong gains under visual, spatial, and semantic distribution shifts. Additional analyses confirm that the encoder attends to manipulation-relevant regions and preserves control-relevant information. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
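To make the objective concrete, the sketch below shows one plausible PyTorch form of a cosine-augmented volume energy inside a contrastive loss; it is an illustration under stated assumptions, not the paper's implementation. For three unit embeddings the squared parallelotope volume equals the determinant of their 3x3 Gram matrix, which has the closed form 1 + 2abc - a^2 - b^2 - c^2 in the pairwise cosines a, b, c. The function names, the choice of regularized pairs (image-language and image-flow), and the hyperparameters `lam` and `tau` are all assumptions introduced here.

```python
# Minimal sketch of a cosine-augmented simplex-volume contrastive loss.
# Assumptions (not from the paper): regularized pairs, names, lam, tau.
import torch
import torch.nn.functional as F

def gram_det3(a, b, c):
    # Determinant of the Gram matrix [[1,a,b],[a,1,c],[b,c,1]] of three
    # unit vectors with pairwise cosines a, b, c; this is the squared
    # volume of the parallelotope (proportional to the simplex volume).
    return 1.0 + 2.0 * a * b * c - a**2 - b**2 - c**2

def dynaflip_loss(img, lang, flow, tau=0.07, lam=1.0):
    # img, lang, flow: (B, d) embeddings of matched triplets.
    img, lang, flow = (F.normalize(x, dim=-1) for x in (img, lang, flow))
    a = img @ lang.T            # (B, B): cos(img_i, lang_j)
    b = img @ flow.T            # (B, B): cos(img_i, flow_j)
    c = (lang * flow).sum(-1)   # (B,):   cos(lang_j, flow_j)
    # Squared simplex volume of (img_i, lang_j, flow_j); low volume means
    # the three embeddings are nearly coplanar / aligned.
    vol2 = gram_det3(a, b, c.unsqueeze(0)).clamp_min(0.0)
    # Cosine regularizer on selected pairs discourages degenerate
    # low-volume solutions (e.g. antipodal or collapsed embeddings).
    energy = vol2 - lam * (a + b)
    # Contrastive framework: matched triplets (the diagonal) should have
    # the lowest energy among all in-batch pairings.
    logits = -energy / tau
    target = torch.arange(img.shape[0], device=img.device)
    return F.cross_entropy(logits, target)
```

In this reading, the volume term supplies the higher-order (three-way) alignment signal that anchor-based pairwise objectives miss, while the cosine term pins down the sign ambiguity that volume alone leaves open.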