EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras (Extended Abstract)

Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei, Hans-Peter Seidel, Bernt Schiele, Christian Theobalt

Introduction

Traditional optical skeletal motion-capture methods – both marker-based and marker-less – use several cameras typically placed around a scene in an outside-in arrangement, with camera views approximately converging in the center of a confined recording volume. This greatly constrains the spatial extent of motions that can be recorded; simply enlarging the recording volume by using more cameras, for instance to capture an athlete, is not scalable. In other cases, a scene may be cluttered with objects or furniture, or other dynamic scene elements, such as people in close interaction, may obstruct a motion-captured person in the scene or create unwanted dynamics in the background. In such cases, even state-of-the-art outside-in marker-less optical methods that succeed with just a few cameras, and are designed for outdoor scenes, quickly fail. This problem can partly be bypassed with motion-capture methods that use body-worn sensors. Shiratori et al. propose to wear 16 cameras placed on body parts facing inside-out, and capture the skeletal motion through structure-from-motion relative to the environment. This clever solution requires instrumentation, calibration and a static background, but allows free roaming and was inspirational for our egocentric approach.

We propose EgoCap: an egocentric motion-capture approach that estimates full-body pose from a pair of optical cameras carried by lightweight headgear (see Figure 1). The body-worn cameras are oriented such that their field of view covers the user’s body entirely, forming an arrangement that is independent of external sensors – an optical inside-in method. It reduces the setup effort, enables free roaming, and minimizes body instrumentation. EgoCap decouples the estimation of local body pose with respect to the headgear cameras and global headgear position, which we infer by structure-from-motion on the scene.

Our first contribution is a new egocentric inside-in sensor rig with only two head-mounted, downward-facing commodity video cameras with fisheye lenses (see Figure 1 left). The rig can be attached to a helmet or a head-mounted VR display, and, hence, requires less instrumentation and calibration than other body-worn systems. The stereo fisheye optics keep the whole body in view in all poses, despite the cameras’ proximity to the body.

Our second contribution is a new marker-less motion-capture algorithm tailored to the strongly distorted egocentric fisheye views. It combines a generative model-based skeletal pose estimation approach with evidence from a trained ConvNet-based body-part detector, and is designed to work with unsegmented frames and general backgrounds (Section 2).

Our third contribution is a new approach for automatically creating body-part detection training datasets. We record test subjects in front of a green screen with an existing outside-in marker-less motion-capture system to get ground-truth skeletal poses, which are reprojected into the simultaneously recorded head-mounted fisheye views to get 2D body-part annotations. We augment the training frames by replacing the green screen with random background images, and vary the appearance in terms of color and shading by intrinsic recoloring. With this technique, we annotate 100,000 images of egocentric videos of eight people in different clothing. We provide the dataset for research purposes.
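The reprojection step can be illustrated with a minimal sketch. It assumes an ideal equidistant fisheye model (r = f·θ) and a known rigid transform from the outside-in system's world frame to the head-mounted camera; the actual lens model and calibration used for EgoCap are not specified here, and the function names `project_equidistant`, `to_camera` and `annotate_joints` are hypothetical.

```python
import math

def project_equidistant(point, focal_px, center):
    """Project a 3D point given in camera coordinates (z along the optical
    axis) with the equidistant fisheye model r = f * theta.
    The model choice is an assumption made for this sketch."""
    x, y, z = point
    rho = math.hypot(x, y)          # distance of the point from the optical axis
    if rho == 0.0:
        # A point on the optical axis maps to the principal point.
        return (center[0], center[1])
    theta = math.atan2(rho, z)      # angle between the viewing ray and the axis
    r = focal_px * theta            # radial image distance in pixels
    return (center[0] + r * x / rho, center[1] + r * y / rho)

def to_camera(point_world, rotation, translation):
    """Rigid transform x_cam = R * x_world + t, with R given as three rows."""
    return tuple(
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def annotate_joints(joints_world, rotation, translation, focal_px, center):
    """Turn ground-truth 3D joints into 2D annotations for one fisheye view."""
    return {
        name: project_equidistant(
            to_camera(p, rotation, translation), focal_px, center
        )
        for name, p in joints_world.items()
    }

# Example with an identity camera pose (hypothetical values).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = annotate_joints(
    {"head": (0.0, 0.0, 1.0), "hand": (1.0, 0.0, 1.0)},
    identity, (0.0, 0.0, 0.0), focal_px=200.0, center=(320.0, 240.0),
)
```

Because θ is measured from the optical axis, points at 90° or more from the axis still receive finite image coordinates, which is why a fisheye model can keep the whole body in view from a camera mounted close to the head.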

Egocentric Inside-In Motion Capture

Our egocentric setup separates human motion capture into two subproblems: (1) local skeleton pose estimation with respect to the camera rig, and (2) global rig pose estimation relative to the environment. Global pose is estimated with existing structure-from-motion techniques. We formulate skeletal pose estimation as an analysis-by-synthesis-style optimization problem in the pose parameters $\mathbf{p}^{t}$, which maximizes the alignment of a projected 3D human body model in the left $\mathcal{I}_{\text{left}}^{t}$ and the right $\mathcal{I}_{\text{right}}^{t}$ stereo fisheye views, at each video time step $t$. We use a hybrid alignment energy combining evidence from a generative image-formation model, as well as from a discriminative detection approach:

$$E(\mathbf{p}^{t}) = E_{\text{color}}(\mathbf{p}^{t}) + E_{\text{detection}}(\mathbf{p}^{t}) + E_{\text{pose}}(\mathbf{p}^{t}) + E_{\text{smooth}}(\mathbf{p}^{t})$$

$E_{\text{color}}$ is an extension of a generative ray-casting model to the strongly distorted fisheye views, which provides differentiable visibility through a volumetric representation. $E_{\text{detection}}$ constrains $\mathbf{p}^{t}$ to 2D joint detections obtained from an existing ConvNet, which was fine-tuned on the previously introduced dataset. $E_{\text{pose}}$ penalizes violations of anatomical joint-angle limits as well as poses deviating strongly from the rest pose, and $E_{\text{smooth}}$ regularizes temporal changes.
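How the four terms might fit together can be sketched as follows. This is a simplified illustration, not the formulation used in EgoCap: each term is written as a cost (lower is better, a sign convention chosen for this sketch), the color term is passed in as a precomputed number because it would require a renderer, and the quadratic forms, the `rest_weight` parameter and the per-term `weights` are all assumptions.

```python
def detection_cost(projected, detected):
    """Sum of squared 2D distances between projected model joints and
    the detected joint locations (a stand-in for the detection term)."""
    return sum(
        (projected[j][0] - detected[j][0]) ** 2
        + (projected[j][1] - detected[j][1]) ** 2
        for j in detected
    )

def pose_cost(angles, limits, rest, rest_weight=0.1):
    """Squared joint-limit violations plus a weak pull toward the rest pose."""
    cost = 0.0
    for joint, angle in angles.items():
        lower, upper = limits[joint]
        # Zero inside [lower, upper], otherwise the distance to the nearest limit.
        violation = max(lower - angle, angle - upper, 0.0)
        cost += violation ** 2 + rest_weight * (angle - rest[joint]) ** 2
    return cost

def smooth_cost(angles, previous):
    """Squared change of each joint angle between consecutive time steps."""
    return sum((angles[j] - previous[j]) ** 2 for j in angles)

def hybrid_cost(color_cost, projected, detected, angles, previous,
                limits, rest, weights):
    """Weighted sum of the four terms; the weights are hypothetical."""
    terms = {
        "color": color_cost,
        "detection": detection_cost(projected, detected),
        "pose": pose_cost(angles, limits, rest),
        "smooth": smooth_cost(angles, previous),
    }
    return sum(weights[name] * value for name, value in terms.items())

# Toy example with a single joint (all values hypothetical).
total = hybrid_cost(
    color_cost=2.0,
    projected={"hand": (3.0, 4.0)}, detected={"hand": (0.0, 0.0)},
    angles={"elbow": 0.5}, previous={"elbow": 0.25},
    limits={"elbow": (0.0, 1.0)}, rest={"elbow": 0.0},
    weights={"color": 1.0, "detection": 1.0, "pose": 1.0, "smooth": 1.0},
)
```

In a real optimizer the projected joint positions would themselves be a function of the pose parameters, so that the gradient of the total cost can drive the pose update at each time step.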

Evaluation and Applications

We first evaluate the learned body-part detectors using the percentage of correct keypoints (PCK) metric on a validation set of 1000 images of two subjects who are not part of the training set. Background augmentation during training brings a clear improvement of 67 PCK points. Clothing recoloring yields a further, significant improvement of 3 PCK points.
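For reference, PCK can be computed along the following lines. This sketch assumes a fixed pixel threshold; PCK thresholds are often normalized by a reference length such as the torso or bounding-box size, and the threshold used for the numbers above is not stated here. The function name `pck` and the data layout are hypothetical.

```python
import math

def pck(predicted, ground_truth, threshold):
    """Percentage of keypoints whose predicted 2D location lies within
    `threshold` of the ground-truth location.

    `predicted` and `ground_truth` are lists with one dict per image,
    each mapping a joint name to an (x, y) tuple."""
    correct = 0
    total = 0
    for pred_frame, gt_frame in zip(predicted, ground_truth):
        for joint, gt_xy in gt_frame.items():
            total += 1
            if math.dist(pred_frame[joint], gt_xy) <= threshold:
                correct += 1
    return 100.0 * correct / total

# Toy example: three of four keypoints fall within 10 pixels.
gt = [{"head": (100.0, 50.0), "hand": (200.0, 300.0)},
      {"head": (110.0, 55.0), "hand": (210.0, 310.0)}]
pred = [{"head": (103.0, 54.0), "hand": (200.0, 305.0)},
        {"head": (110.0, 55.0), "hand": (250.0, 310.0)}]
score = pck(pred, gt, threshold=10.0)
```

A difference of "67 PCK points" then refers to the difference between two such percentages, measured with the same threshold on the same validation images.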

3D Body Pose Accuracy.

We quantitatively evaluate the 3D body pose accuracy of our approach on ground-truth data obtained with the Captury Studio. The average Euclidean 3D distance over all 18 joints, for which detection labels are available, is 7 ± 1 cm for a challenging 250-frame walking sequence with occlusions, and 7 ± 1 cm on a long sequence of 750 frames of gesturing and interaction. It meets the accuracy of outside-in approaches using 2–3 cameras.
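An error figure of this kind can be computed as in the sketch below. It assumes that the ± value is a standard deviation and that errors are pooled over all joints and frames; neither choice is stated above, and the function names are hypothetical.

```python
import math
import statistics

def joint_errors(estimated, reference):
    """Euclidean 3D distance for every joint in every frame.

    Both arguments are lists with one dict per frame, each mapping a
    joint name to an (x, y, z) position in centimeters."""
    return [
        math.dist(est_frame[joint], ref_frame[joint])
        for est_frame, ref_frame in zip(estimated, reference)
        for joint in ref_frame
    ]

def summarize(errors):
    """Mean and population standard deviation of the pooled errors."""
    return statistics.mean(errors), statistics.pstdev(errors)

# Toy example with two frames and one joint (values in cm, hypothetical).
reference = [{"knee": (0.0, 0.0, 0.0)}, {"knee": (10.0, 0.0, 0.0)}]
estimated = [{"knee": (6.0, 0.0, 0.0)}, {"knee": (10.0, 8.0, 0.0)}]
mean_cm, std_cm = summarize(joint_errors(estimated, reference))
```

Averaging per frame first and then taking the deviation across frames would give the same mean but a different spread, so the pooling choice matters when comparing against reported numbers.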

Large-scale Motion Capture.

We successfully tested on a basketball sequence outdoors, which shows quick motion and close interaction, on an outdoor walking sequence, and on a large-scale biking sequence (Figure 1, third column).

Constrained/Crowded Spaces.

We also tested EgoCap for motion capture in a crowded scene, where many spectators are interacting and occluding the tracked user from the outside (Figure 1, fourth column). In such a setting, as well as in settings with many obstacles and narrow sections, outside-in motion capture, even with a dense camera system, would be difficult.

Immersive VR.

The EgoCap headgear (Figure 1, first column) is designed to be used in virtual reality (VR) applications (Figure 1, last column). Current HMD-based systems only track the pose of the display; our approach adds motion capture of the wearer’s full body, which enables a much higher level of immersion.

Conclusion

We presented EgoCap, the first approach for marker-less egocentric full-body motion capture with a head-mounted fisheye stereo rig. EgoCap enables motion capture in dense and crowded scenes, and reconstruction of large-scale activities that would not fit into the constrained recording volumes of outside-in motion-capture methods. It is particularly suited for HMD-based VR applications: two cameras attached to an HMD enable full-body pose reconstruction of the wearer’s own virtual body, paving the way for immersive VR experiences and interactions.

Acknowledgements

This research was funded by the ERC Starting Grant project CapReal (335545).
