Talks and presentations

Mobile Human Pose Estimation (Microsoft Research Asia)

May 01, 2020

Talk, Virtual

Microsoft Research Asia Invited Talk

We present MoVNect, a lightweight deep neural network that captures 3D human pose from a single RGB camera. To improve the model's overall performance, we apply knowledge distillation based on teacher-student learning to 3D human pose estimation. Real-time post-processing makes the CNN output yield temporally stable 3D skeletal information, which can be used directly in applications. We implement a 3D avatar application that runs in real time on mobile devices, demonstrating that our network achieves both high accuracy and fast inference. Extensive evaluations on the Human3.6M dataset and on mobile devices show the advantages of our lightweight model and the proposed training method over previous 3D pose estimation approaches.
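As a rough illustration of the teacher-student scheme described above, the sketch below blends ground-truth supervision with a frozen teacher's predictions and adds a simple temporal smoother. The function names, output representation, and the alpha/beta values are assumptions for illustration, not details confirmed by the talk.

```python
# Hedged sketch of teacher-student knowledge distillation for pose
# estimation. All names and hyperparameters are illustrative
# assumptions, not values taken from MoVNect.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, gt, alpha=0.5):
    """Blend ground-truth supervision with the teacher's predictions.

    alpha trades off imitating the larger, more accurate teacher
    against fitting the ground-truth labels; 0.5 is an assumed default.
    """
    loss_gt = F.mse_loss(student_out, gt)           # supervised term
    loss_kd = F.mse_loss(student_out, teacher_out)  # distillation term
    return (1.0 - alpha) * loss_gt + alpha * loss_kd

def smooth_joints(prev_joints, curr_joints, beta=0.8):
    """Exponential smoothing as a stand-in for the real-time
    post-processing that stabilizes the 3D skeleton across frames."""
    return beta * prev_joints + (1.0 - beta) * curr_joints

# Usage (teacher frozen, student lightweight), assuming `teacher`,
# `student`, `images`, and `gt` already exist:
# with torch.no_grad():
#     teacher_out = teacher(images)
# loss = distillation_loss(student(images), teacher_out, gt)
```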

Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality (NAVER)

February 01, 2019

Talk, Green Factory, South Korea

Free-viewpoint video (FVV) is an advanced form of media that provides a more immersive user experience than traditional media: because users can watch the content from any desired viewpoint, it allows genuine interaction and is emerging as a next-generation medium. In creating FVV content, existing systems require complex, specialized capture equipment and offer low end-user usability, since considerable expertise is needed to operate them. This inconveniences individuals and small organizations who want to create content, limits end users' ability to produce FVV-based user-generated content (UGC), and inhibits the creation and sharing of such content. To tackle these problems, this work proposes ParaPara, an end-to-end system that, unlike previously proposed systems, uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos. First, the system detects persons in the monocular video with a deep neural network, calculates a real-world homography matrix from minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using standard image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The content synthesized by the proposed system is deployed on Microsoft HoloLens, where the user can freely place it in the real world and watch it from any viewpoint.
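To make the homography step concrete, here is a minimal OpenCV sketch of how four user-marked ground-plane correspondences can map a detected person's foot point to a pseudo-3D world position. The coordinate values, function names, and bounding-box convention are illustrative assumptions, not ParaPara's actual interface.

```python
# Hedged sketch of ground-plane mapping: the user marks four image
# points with known real-world ground coordinates, and detected
# persons are projected onto that plane. All values below are
# illustrative assumptions.
import cv2
import numpy as np

# Four user-clicked pixel positions on the ground plane...
image_pts = np.array([[320, 480], [960, 470], [1100, 700], [200, 720]],
                     dtype=np.float32)
# ...and their corresponding real-world ground coordinates in meters.
world_pts = np.array([[0.0, 0.0], [5.0, 0.0], [5.0, 3.0], [0.0, 3.0]],
                     dtype=np.float32)

# Homography from the image plane to the real-world ground plane.
H, _ = cv2.findHomography(image_pts, world_pts)

def person_world_position(bbox):
    """Map a detected person's foot point (bottom-center of an
    (x, y, w, h) bounding box) to a pseudo-3D ground-plane position."""
    x, y, w, h = bbox
    foot = np.array([[[x + w / 2.0, y + h]]], dtype=np.float32)
    wx, wy = cv2.perspectiveTransform(foot, H)[0, 0]
    return wx, wy, 0.0  # z = 0: the person stands on the ground plane

print(person_world_position((600, 300, 80, 200)))
```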