MHCanonNet: Multi-Hypothesis Canonical lifting Network for self-supervised 3D human pose estimation in the wild video

Hyun Woo Kim, Gun Hee Lee, Woo Jeoung Nam, Kyung Min Jin, Tae Kyung Kang, Geon Jun Yang, Seong Whan Lee

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)


Recent advances in fully supervised 3D human pose estimation have shown impressive results; however, these methods rely heavily on large amounts of annotated 3D data, which are difficult to obtain outside controlled laboratory environments. In this study, we therefore propose a new self-supervised method for training a 3D human pose estimation network on unlabeled multi-view images. The model learns relative depths between joints without any 3D annotation by enforcing multi-view consistency constraints on unlabeled multi-view videos without camera calibration, while simultaneously learning representations of multiple plausible pose hypotheses. For this reason, we call the proposed network the Multi-Hypothesis Canonical Lifting Network (MHCanonNet). By enriching the diversity of extracted features and keeping multiple possibilities open, the network accurately estimates the final 3D pose. The key idea is a novel, unbiased reconstruction objective function that combines multiple hypotheses from different viewpoints. The proposed approach achieves state-of-the-art results not only on two popular benchmark datasets, Human3.6M and MPI-INF-3DHP, but also on an in-the-wild dataset, Ski-Pose, surpassing existing self-supervised training methods.

Original language: English
Article number: 109908
Journal: Pattern Recognition
Publication status: Published - 2024 Jan

Bibliographical note

Publisher Copyright:
© 2023 Elsevier Ltd


  • 3D human pose
  • Multi-view geometry
  • Self-supervised learning

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

