EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow

Daesol Cho1, Youngseok Jang2, Danfei Xu1, and Sehoon Ha1
1Georgia Institute of Technology    2KAIST

Abstract

Egocentric human videos provide a scalable source of manipulation demonstrations; however, deploying them on robots requires active viewpoint control to maintain task-critical visibility, which human viewpoint imitation often fails to provide due to human-specific priors. We propose EgoAVFlow, which learns manipulation and active vision from egocentric videos through a shared 3D flow representation that supports geometric visibility reasoning and transfers without robot demonstrations. EgoAVFlow uses diffusion models to predict robot actions, future 3D flow, and camera trajectories, and refines viewpoints at test time with reward-maximizing denoising under a visibility-aware reward computed from predicted motion and scene geometry. Real-world experiments under actively changing viewpoints show that EgoAVFlow consistently outperforms prior human-demo-based baselines, demonstrating effective visibility maintenance and robust manipulation without robot demonstrations.

Video

Overview

Thumbnail image

EgoAVFlow learns manipulation and active viewpoint control from egocentric human videos by predicting future 3D flow and optimizing camera viewpoints for visibility, yielding viewpoint-robust robot execution without robot demonstrations.

Overview of the robot experiments. For object manipulation and viewpoint adjustment, both robots operate simultaneously: the black robot performs object manipulation, while the gray robot, equipped with an RGBD camera, actively adjusts the camera viewpoint to maintain visibility of the manipulated object.

Method overview diagram

Method overview. EgoAVFlow consists of three diffusion models. The robot policy $\pi_r$ produces future robot action sequences. The flow generation model $f$ predicts future 3D flows from the outputs of $\pi_r$. The view policy $\pi_v$ produces future camera viewpoints from the outputs of $\pi_r$, $f$, and reconstructed mesh surfaces through a visibility-aware reward-maximizing denoising process. In viewpoints (A), most query points are invisible (Red LOS) because they are occluded by the table's mesh surface or fall outside the FoV, whereas in viewpoints (B) these points are visible (Green LOS), yielding a higher visibility reward.
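The reward-maximizing denoising idea can be sketched with a 1-D toy problem. This is an illustrative assumption, not the paper's implementation: `reward_guided_denoise`, `denoise_step`, and the quadratic reward below are hypothetical stand-ins for the view policy $\pi_v$, its diffusion sampler, and the visibility reward $R_{vis}$.

```python
import torch

def reward_guided_denoise(denoise_step, reward_fn, x, num_steps, scale=0.1):
    # Hypothetical sketch: before each reverse-diffusion step, nudge the
    # intermediate sample along the gradient of a differentiable reward,
    # biasing the final sample toward high-reward (e.g. high-visibility)
    # outputs.
    for t in reversed(range(num_steps)):
        x = x.detach().requires_grad_(True)
        r = reward_fn(x).sum()                     # reward on current sample
        (grad,) = torch.autograd.grad(r, x)
        x = denoise_step(x.detach() + scale * grad, t)
    return x.detach()

# Toy setup: the "denoiser" pulls samples toward 0, while the reward
# prefers samples near 2; guidance trades off between the two.
target = torch.tensor([2.0])
denoise_step = lambda x, t: 0.9 * x
reward_fn = lambda x: -(x - target) ** 2

x0 = torch.tensor([5.0])
plain = x0.clone()
for t in reversed(range(20)):
    plain = denoise_step(plain, t)                 # unguided sampling
guided = reward_guided_denoise(denoise_step, reward_fn, x0.clone(), 20)
```

The guided sample ends with a strictly higher reward than the unguided one, mirroring how the view policy's denoising is steered toward viewpoints with higher $R_{vis}$.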

Dataset

Good visibility

Bad visibility

Task description: There are four tasks: spray, doll, towel, and toilet paper. Each task requires appropriate viewpoint adjustment; otherwise, the object becomes occluded by the robot or by elements of the environment, such as a table or drawer.

Finding 1: Continuous viewpoint adjustment is necessary for reliable visibility

Visibility comparison

Visibility is computed separately for each fixed viewpoint. No single viewpoint maintains full visibility throughout the execution, indicating that the viewpoint must be adjusted continuously online to maximize visibility.
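A per-viewpoint visibility fraction can be sketched as follows. This is our simplified assumption of the computation, not the paper's exact procedure: a query point counts as visible if it lies inside the camera's FoV cone and no occluder blocks the line of sight. The paper tests occlusion against reconstructed mesh surfaces, whereas the hypothetical `visible_fraction` below uses infinite planes as stand-in occluders.

```python
import numpy as np

def visible_fraction(points, cam_pos, cam_forward, fov_deg, occluders):
    # Fraction of query points visible from one viewpoint: a point is
    # visible if it lies inside the FoV cone and no occluder plane
    # (n . x = offset) intersects the line of sight before the point.
    cam_forward = cam_forward / np.linalg.norm(cam_forward)
    cos_half = np.cos(np.radians(fov_deg / 2))
    visible = 0
    for p in points:
        ray = p - cam_pos
        dist = np.linalg.norm(ray)
        d = ray / dist
        if np.dot(d, cam_forward) < cos_half:      # outside FoV cone
            continue
        blocked = False
        for n, offset in occluders:
            denom = np.dot(n, d)
            if abs(denom) < 1e-9:                  # ray parallel to plane
                continue
            t = (offset - np.dot(n, cam_pos)) / denom
            if 1e-6 < t < dist - 1e-6:             # hit before the point
                blocked = True
                break
        if not blocked:
            visible += 1
    return visible / len(points)

# Camera at the origin looking along +x with a 90-degree FoV; one point
# ahead of the camera, one off to the side, and a "table" plane at x = 0.5.
cam, fwd = np.zeros(3), np.array([1.0, 0.0, 0.0])
pts = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
table = [(np.array([1.0, 0.0, 0.0]), 0.5)]
frac_open = visible_fraction(pts, cam, fwd, 90.0, [])
frac_occluded = visible_fraction(pts, cam, fwd, 90.0, table)
```

In this toy scene, the side point falls outside the FoV cone, and adding the table plane additionally occludes the forward point.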

Finding 2: Visibility-aware viewpoint planning outperforms human viewpoint imitation

Due to the visibility-maximizing viewpoint adjustments, EgoAVFlow maintains visibility of the query points and their predicted future flows, whereas human viewpoint imitation (HVI) fails to keep them in view, causing the query points to move out of the FoV.

Visibility reward

Visibility reward: For all tasks, EgoAVFlow achieves higher average visibility rewards $R_{vis}$ than human viewpoint imitation (HVI), demonstrating our method's visibility maintenance capability.

Finding 3: 3D flow policy outperforms under actively varying viewpoints

Success rates robot policy baseline

Success rates of EgoAVFlow, HVI (Human Viewpoint Imitation), and robot policy learning baselines

All baselines are trained on the same dataset as our method, and no real robot data is used. Because AMPLIFY and Phantom require image input, we use diffusion-based inpainting to remove the human from each frame and overlay a robot onto the resulting frames, enabling policy training on synthesized robot observations. To isolate the effect of the policy representation, we use the same viewpoint adjustment module for all methods: our view policy $\pi_v$ with visibility-maximizing denoising. As the results show, EgoAVFlow outperforms all baselines. Because the viewpoint changes continuously during evaluation under our proposed view policy, these results demonstrate that the proposed 3D flow-based policy is view-invariant and benefits from its inherent 3D representation, yielding viewpoint-robust manipulation.
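The view-invariance claim can be illustrated with a toy example: a 3D flow vector expressed in a shared world frame is recovered exactly regardless of camera orientation, whereas image-space observations change with every viewpoint. `world_flow_from_camera` and `rot_z` are hypothetical helpers for this sketch, not part of the method itself.

```python
import numpy as np

def rot_z(theta):
    # Rotation about the z-axis (a stand-in for different camera poses).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def world_flow_from_camera(flow_cam, R_wc):
    # Rotate camera-frame flow back into the world frame (row vectors):
    # f_world = R_wc @ f_cam  <=>  flow_cam @ R_wc.T
    return flow_cam @ R_wc.T

flow_world = np.array([[0.10, 0.00, 0.05]])   # true object motion, world frame
recovered_flows = []
for theta in (0.0, 0.7, 1.4):                 # three camera orientations
    R_wc = rot_z(theta)                       # world-from-camera rotation
    flow_cam = flow_world @ R_wc              # same motion seen in camera frame
    recovered_flows.append(world_flow_from_camera(flow_cam, R_wc))
```

All three recovered flows match `flow_world`, even though the raw camera-frame flows differ across viewpoints: a policy consuming world-frame 3D flow sees the same input no matter where the camera moves.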

Failure Analysis

Failure analysis

Failure analysis: Since AMPLIFY and Phantom are not inherently 3D-aware, they suffer from distribution shift when evaluated under unseen viewpoints. EgoZero uses 3D points, but it cannot handle non-static scenes, such as when an object moves due to gripper contact. As a result, these robot policy baselines account for most manipulation-related failures, such as grasping and object-pose failures. For out-of-view failures, human viewpoint imitation (HVI) accounts for most cases because it does not use our view policy for visibility maintenance. EgoAVFlow accounts for the second-largest share of these failures; however, this does not indicate worse performance. Many baseline rollouts are already counted as early grasping failures and thus never reach the later stages; if they progressed further, their failure proportions would be comparable to our method's.