Diffusion-based Synthetic Dataset Generation for Egocentric 3D Human Pose Estimation
Published in ECCV 2024 Workshop SyntheticData4CV, 2024
Authors
Kyohei Hayakawa, Dong-Hyun Hwang, Chen-Chieh Liao, and Hideki Koike
NAVER Cloud and Tokyo Institute of Technology
Abstract
Egocentric 3D human pose estimation, which estimates a person's 3D pose from a camera mounted on their body, operates in a specialized camera domain designed to capture the entire body. This specialization makes it difficult to collect diverse, accurately labeled real-world training data from the egocentric perspective. Consequently, most existing methods rely on synthetic data for training, which increases the mean joint error on real-world images due to the domain gap. Some works address this issue by generating pseudo-labels from synchronized real egocentric and exocentric images and using them for training. However, this approach is costly in terms of data collection, making it difficult to scale and to apply to other camera setups.
In this work, we propose a novel method that employs a diffusion model with ControlNet to generate real-world-like images, thereby reducing the domain gap. The proposed method relies only on synthetic data, which is easy to acquire, and a small amount of text-captioned real-world data, and it readily applies to egocentric 3D human pose estimation across various camera setups. Experiments with two different camera setups demonstrate that models trained on images generated by the proposed method achieve higher accuracy on real-world data. Specifically, the PA-MPJPE of the Mo2Cap2 model improved by 8.9% on the SceneEgo test set and by 3.4% on the GlobalEgoMocap test set.
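To illustrate the general idea, the sketch below shows ControlNet-conditioned image generation with the Hugging Face diffusers library. The conditioning signal (a depth map rendered from a synthetic egocentric scene), the pretrained checkpoints, and the prompt are illustrative assumptions and not the exact pipeline or models used in the paper.

```python
# Minimal sketch of ControlNet-conditioned image generation with diffusers.
# NOTE: the depth-conditioning checkpoint, base model, and prompt below are
# illustrative assumptions, not the configuration used in the paper.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Load a ControlNet conditioned on depth maps and a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning image: e.g. a depth map rendered from the synthetic egocentric
# scene, which fixes the pose and geometry while the diffusion model supplies
# realistic appearance, so the original synthetic pose labels remain valid.
control_image = Image.open("synthetic_depth.png").convert("RGB")

# Text prompt in the style of captions from a small real-world dataset.
prompt = "a person seen from a head-mounted fisheye camera, indoor scene, photorealistic"

# Generate a real-world-like training image from the synthetic conditioning input.
result = pipe(prompt, image=control_image, num_inference_steps=30).images[0]
result.save("generated_egocentric.png")
```

In such a setup, the generated images can be paired with the ground-truth 3D joints of the original synthetic frames, so the pose estimator is trained on realistic-looking images without any additional manual annotation.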