Recent advancements in 3D Human Pose Estimation using fully-supervised learning approach have shown impressive results; however, these methods heavily rely on large amounts of annotated 3D data, which are challenging to obtain outside controlled laboratory environments. Therefore, in this study, we propose a new self-supervised training method designed to train a 3D human pose estimation network using unlabeled multi-view images. The model trains relative depths between joints without any 3D annotation by satisfying multi-view consistency constraints from unlabeled multi-view videos without camera calibration, while simultaneously learning representations of multiple plausible pose hypotheses. For this reason, we call our proposed network a Multi-Hypothesis Canonical Lifting Network (MHCanonNet). By enriching the diversity of extracted features and keeping various possibilities open, our network accurately estimates the final 3D pose. The key idea lies in the design of a novel and unbiased reconstruction objective function that combines multiple hypotheses from different viewpoints. The proposed approach demonstrates state-of-the-art results not only on two popular benchmark datasets, Human3.6M and MPI-INF-3DHP but also on an in-the-wild dataset, Ski-Pose, surpassing existing self-supervised training methods.
Bibliographical noteFunding Information:
This work was partially supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No. 2019-0-00079 , Artificial Intelligence Graduate School Program(Korea University) , No. 2022-0-00984 , Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation ).
© 2023 Elsevier Ltd
- 3D human pose
- Multi-view geometry
- Self-supervised learning
ASJC Scopus subject areas
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence