CHAIRS: Towards Full-Body Articulated Human-Object Interaction


1 Peking University, 2 Beijing Institute for General Artificial Intelligence, 3 Tsinghua University
* Equal contributors, + Work done during internship at BIGAI

Abstract

Fine-grained capture of 3D human-object interaction (HOI) boosts human activity understanding and facilitates downstream visual tasks, including action recognition, holistic scene reconstruction, and human motion synthesis. Despite its significance, existing works mostly assume that humans interact with rigid objects using only a few body parts, which limits their scope. In this paper, we address the challenging problem of full-body articulated human-object interaction (f-AHOI), wherein whole human bodies interact with articulated objects whose parts are connected by movable joints. We present CHAIRS, a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions between 46 participants and 74 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects throughout the entire interactive process, as well as realistic and physically plausible full-body interactions. We show the value of CHAIRS with object pose estimation. By learning the geometrical relationships in HOI, we devise the very first model that leverages human pose estimation to tackle the estimation of articulated object poses and shapes during whole-body interactions. Given an image and an estimated human pose, our model first reconstructs the pose and shape of the object, then optimizes the reconstruction according to a learned interaction prior. Under both evaluation settings (i.e., with or without knowledge of the objects' geometries and structures), our model significantly outperforms baselines. We hope CHAIRS will promote the community towards finer-grained interaction understanding. We will make the data/code publicly available.






Examples from the proposed f-AHOI dataset. CHAIRS contains fine-grained interactions between 46 participants and 74 sittable objects with drastically different kinematic structures, providing multi-view RGB-D sequence inputs and ground-truth 3D meshes of humans and articulated objects for 16.2 hours of recordings.


Method Overview



The overall architecture of our model. The reconstruction model uses the predicted voxelized human to guide the pose estimation of the interacting object. We further regress the root 6D pose of the object from the image feature and the SMPL-X parameters. Both predictions, together with an interaction prior, are then used to optimize the final estimated pose.
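To make the "voxelized human" input more concrete, the following is a minimal sketch of how a set of 3D points (for example, vertices of a body mesh) could be turned into a binary occupancy grid. It is not the authors' implementation: the function name `voxelize_points`, the grid resolution, and the normalized bounds are all assumptions chosen for illustration.

```python
import numpy as np


def voxelize_points(points, grid_size=32, bounds=(-1.0, 1.0)):
    """Return a boolean occupancy grid marking voxels that contain a point.

    Hypothetical helper: grid_size and bounds are illustrative choices,
    not values taken from the CHAIRS paper.
    """
    lo, hi = bounds
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)

    # Discard points that fall outside the cubic volume [lo, hi]^3.
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    points = points[inside]

    # Map coordinates to integer voxel indices along each axis.
    idx = np.floor((points - lo) / (hi - lo) * grid_size).astype(np.int64)
    # Points lying exactly on the upper bound belong to the last voxel.
    idx = np.minimum(idx, grid_size - 1)

    grid = np.zeros((grid_size, grid_size, grid_size), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


# Toy input: two points inside the volume and one far outside it.
toy_points = [[0.0, 0.0, 0.0], [0.99, 0.99, 0.99], [5.0, 5.0, 5.0]]
toy_grid = voxelize_points(toy_points, grid_size=4)
```

A grid like this has a fixed shape regardless of how many vertices the mesh has, which is one reason occupancy grids are a convenient input for 3D convolutional networks.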



Dataset Clips


   

   






Result Examples



The optimization results of our method on the CHAIRS dataset. The first row is the original RGB input. The second and third rows are the results of our full model with mesh reconstruction and part-level 6D pose estimation. The fourth row is the result of mesh reconstruction without knowledge of the object.
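As a rough illustration of what a part-level 6D pose means, the sketch below places the vertices of one rigid object part in the world frame using a rotation and a translation. It assumes the pose is given as a 3x3 rotation matrix and a 3-vector; the helper names `rotation_about_z` and `transform_part` are hypothetical and do not come from the CHAIRS code.

```python
import numpy as np


def rotation_about_z(angle_rad):
    """Rotation matrix for a counter-clockwise turn about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def transform_part(vertices, rotation, translation):
    """Apply a rigid 6-DoF pose (rotation, then translation) to part vertices.

    vertices: (N, 3) array in the part's local frame.
    rotation: (3, 3) rotation matrix; translation: (3,) vector.
    """
    vertices = np.asarray(vertices, dtype=np.float64)
    # Row-vector convention: v_world = R @ v_local + t for every vertex.
    return vertices @ np.asarray(rotation).T + np.asarray(translation)


# Toy example: rotate a single vertex by 90 degrees about z, then lift it by 1.
part_vertices = np.array([[1.0, 0.0, 0.0]])
posed_vertices = transform_part(
    part_vertices, rotation_about_z(np.pi / 2), np.array([0.0, 0.0, 1.0])
)
```

For an articulated object, one such transform per part would be applied, so that a seat, a backrest, and a base can each move rigidly while remaining connected at their joints.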





The optimization results of our method on in-the-wild images of a person sitting on a piece of sittable furniture.





The optimization results of our method on the BEHAVE dataset.