Two-dimensional human pose estimation is a challenging task whose goal is to localize key anatomical landmarks (e.g., elbows, knees, shoulders) given an image of a person in some pose. Current state-of-the-art pose estimators rely on thousands of labeled images to fine-tune transformers or train deep convolutional neural networks. These methods are labor-intensive, requiring tools like Amazon Mechanical Turk to crowdsource pose labels on individual frames. Self-supervised methods, on the other hand, re-frame pose estimation as a reconstruction problem (i.e., given one part of the input data, reconstruct another part), doing away with the need for ground-truth labels. This lets them leverage the vast amount of visual content that has yet to be labeled, though currently at the cost of lower accuracy than their supervised counterparts. In this paper, we explore how to improve unsupervised pose estimation systems. We (1) conduct an in-depth analysis of the relationship between reconstruction loss and pose estimation accuracy, (2) propose an efficient model architecture that quickly learns to localize joints, and (3) introduce a consistency metric that measures how well a model's pose estimates preserve body proportions. Importantly, we arrive at a model that outperforms the original model of Schmidtke et al. [1] that inspired our work, and we find that a combination of carefully engineered reconstruction losses and encoded inductive biases can help coordinate pose learning alongside reconstruction in a self-supervised paradigm.
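To make the body-proportion consistency idea concrete, here is a minimal illustrative sketch, not the paper's exact metric: the function name `proportion_consistency`, the limb index layout, and the specific formula (stability of per-frame limb-length ratios across a sequence) are assumptions introduced only for illustration.

```python
import numpy as np

# Hypothetical skeleton: pairs of keypoint indices defining limb segments.
# The index layout (shoulders, elbows, wrists, hips, knees, ankles) is assumed.
LIMBS = [
    (0, 2), (1, 3),    # shoulder -> elbow (left, right)
    (2, 4), (3, 5),    # elbow -> wrist
    (6, 8), (7, 9),    # hip -> knee
    (8, 10), (9, 11),  # knee -> ankle
]

def limb_lengths(keypoints):
    """keypoints: (J, 2) array of predicted (x, y) joint locations for one frame."""
    return np.array([np.linalg.norm(keypoints[a] - keypoints[b]) for a, b in LIMBS])

def proportion_consistency(pose_sequence, eps=1e-8):
    """Score in (0, 1]: how stable predicted limb-length *ratios* stay across
    a sequence of poses of the same person.

    pose_sequence: (T, J, 2) array of predicted keypoints over T frames.
    Ratios rather than raw lengths are used so that global scale changes
    (e.g., the person moving toward the camera) are not penalized.
    """
    lengths = np.stack([limb_lengths(p) for p in pose_sequence])      # (T, n_limbs)
    # Normalize each frame by its total limb length -> per-frame proportions.
    proportions = lengths / (lengths.sum(axis=1, keepdims=True) + eps)
    # Coefficient of variation of each proportion across time, averaged over limbs.
    cv = proportions.std(axis=0) / (proportions.mean(axis=0) + eps)
    return 1.0 / (1.0 + cv.mean())

# Toy usage: 30 frames, 12 joints of random predictions (low consistency expected).
poses = np.random.rand(30, 12, 2)
print(proportion_consistency(poses))
```

A perfectly proportion-consistent estimator would keep every limb ratio constant over time and score 1.0; noisier or anatomically implausible predictions drive the score toward 0.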