Carsten Stoll

I am a Research Scientist at Meta, developing next-generation digital humans.

I earned my PhD in Computer Graphics and Computer Vision from the Max-Planck Institute for Informatics with Prof. Christian Theobalt. I also led a group on Optical Performance Capture at the Max-Planck Center for Visual Computing and Communication and spent a year as a visiting researcher at Weta Digital.

I cofounded The Captury, a company focused on real-time markerless motion capture, and then joined Facebook Reality Labs to work on full-body virtual humans. In 2020 I joined Epic Games to work on next-generation MetaHumans and real-time machine learning methods. In 2025 I rejoined Meta to help push virtual humans to the next level.

My research spans computer graphics, geometric modeling, animation, and computer vision. My goal is to create and control fully photorealistic, believable digital human avatars.

Projects

MetaHuman Bodies

At Epic Games I led the research and technical development of a parametric body model for MetaHumans, released with Unreal Engine 5.6. While prior versions of MetaHumans were based on discrete body types, the parametric model allows fine-grained direct and measurement-based editing of body shapes.

ML Deformer

I developed the ML model driving Unreal Engine's ML Deformer plugin, which creates high-fidelity characters with deformations driven by full muscle, flesh, and cloth simulation running in real time. ML Deformer is now widely used in game development, including the Witcher 4 tech demo shown at Unreal Fest 2025.

Momentum

At Facebook/Oculus Research I developed the Momentum library for human kinematics and numerical optimization, used to prototype markerless motion capture algorithms and in early versions of upper-body tracking for the Oculus VR headset. We also published MHR, the Momentum Human Rig, an open-source parametric body model.

Publications

2026

Large-scale Codec Avatars

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, and Shunsuke Saito

IEEE/CVF CVPR, 2026

LCA is a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner. Inspired by the success of large language models, we introduce a pre/post-training paradigm for 3D avatar modeling: pretraining on diverse multi-view studio data and post-training on in-the-wild images. Given a handful of images, LCA produces identity-preserving 3D avatars driven by facial expressions, full-body motion, and fine-grained hand poses.

2025

MHR: Momentum Human Rig

Aaron Ferguson, Ahmed A. A. Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, David Otte, Eric Vignola, Fabian Prada, Federica Bogo, Igor Santesteban, Javier Romero, Jenna Zarate, Jeongseok Lee, Jinhyung Park, Jinlong Yang, John Doublestein, Kishore Venkateshan, Kris Kitani, Ladislav Kavan, Marco Dal Farra, Matthew Hu, Matthew Cioffi, Michael Fabris, Michael Ranieri, Mohammad Modarres, Petr Kadlecek, Rawal Khirodkar, Rinat Abdrashitov, Romain Prévost, Roman Rajbhandari, Ronald Mallet, Russell Pearsall, Sandy Kao, Sanjeev Kumar, Scott Parrish, Shoou-I Yu, Shunsuke Saito, Takaaki Shiratori, Te-Li Wang, Tony Tung, Yichen Xu, Yuan Dong, Yuhua Chen, Yuanlu Xu, Yuting Ye, and Zhongshi Jiang

arXiv, 2025

MHR is a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. It enables expressive, anatomically plausible human animation with non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines. MHR is available as open source.

2024

HUMOS: Human Motion Model Conditioned on Body Shape

Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael J. Black, Daniel Holden, and Carsten Stoll

ECCV, 2024

We introduce a generative motion model conditioned on body shape. Most existing models ignore how body proportions influence the way people move, relying on a standardized average body. Using cycle consistency, intuitive physics, and stability constraints on unpaired data, our approach generates diverse, physically plausible motions that reflect individual body characteristics.

EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Nicola Garau, Giulia Martinelli, Niccolo Bisagno, Denis Tome, and Carsten Stoll

ECCV T-CAP Workshop, 2024

EPOCH is a framework for monocular 3D human pose estimation that uses the full perspective camera model instead of common weak-perspective approximations. It jointly estimates camera parameters and 3D pose through a lifter network and a regressor network, establishing an unambiguous 2D-to-3D relationship. The approach achieves state-of-the-art results on Human3.6M and MPI-INF-3DHP with improved generalization to unseen data.

fNeRF: High Quality Radiance Fields from Practical Cameras

Yi Hua, Christoph Lassner, Carsten Stoll, and Iain Matthews

arXiv, 2024

Standard Neural Radiance Fields use a pinhole camera model, which bakes defocus blur into the reconstruction. We propose a modified ray casting approach that leverages the optics of real lenses with finite aperture, modeling partial occlusions more faithfully than existing approximations. This yields sharper reconstructions with up to 3 dB PSNR improvement on synthetic and real datasets.

PhISANet: Phonetically Informed Speech Animation Network

Salvador Medina, Sarah L. Taylor, Carsten Stoll, Gareth Edwards, Alex Hauptmann, Shinji Watanabe, and Iain Matthews

ICASSP, 2024

PhISANet leverages neural audio representations pre-trained on large speech corpora to map audio signals into animation parameters for the lower face and tongue of realistic 3D models. A CTC-based phonetic loss provides additional supervision, improving the phonetic accuracy of generated lip and tongue animation.

2023

Personalized 3D Human Pose and Shape Refinement

Tom Wehrbein, Bodo Rosenhahn, Iain Matthews, and Carsten Stoll

IEEE/CVF ICCVW, 2023

We propose a test-time optimization approach that refines 3D human pose and shape estimates for specific individuals, improving on generic feed-forward predictions. By exploiting temporal consistency and person-specific body shape priors during inference, the method adapts to individual body proportions and motion patterns without retraining.

2022

Speech Driven Tongue Animation

Salvador Medina, Denis Tomè, Carsten Stoll, Mark Tiede, Kevin Munhall, Alex Hauptmann, and Iain Matthews

IEEE/CVF CVPR, 2022

Most speech-driven facial animation focuses on lip motion and neglects the inner mouth. We introduce a large-scale speech and motion capture dataset focused on tongue, jaw, and lip movement, and propose a deep learning method using self-supervised audio feature encoders. The approach generalizes well to unseen speakers and content, enabling realistic tongue animation from audio alone.

2021

ANR: Articulated Neural Rendering for Virtual Avatars

Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, and Christoph Lassner

IEEE CVPR, 2021

We extend Deferred Neural Rendering to articulated human bodies, addressing challenges like mesh deformation inaccuracies and pose-dependent dynamics. ANR uses a neural shading step that explicitly accounts for geometric misalignment between a coarse mesh and the true surface. User studies show a clear preference for our approach over existing avatar creation and animation methods.

2020

TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video

Tiancheng Zhi, Christoph Lassner, Tony Tung, Carsten Stoll, Srinivasa G. Narasimhan, and Minh Vo

ECCV, 2020

TexMesh reconstructs detailed human meshes with high-resolution full-body albedo texture from RGB-D video. By leveraging the captured environment illumination, we estimate local surface geometry and albedo, then use photometric constraints for self-supervised adaptation to real-world sequences. After a brief self-adaptation step, the method runs at interactive frame rates.

PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations

Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Carsten Stoll, and Christian Theobalt

ECCV, 2020

We introduce a mid-level patch-based implicit surface representation that generalizes across object categories. By learning patch-level signed distance functions in a canonical space, a model trained on a single ShapeNet category can represent detailed shapes from any other category using far less training data. The representation also supports shape interpolation, point cloud completion, and non-rigid deformation.

2015

Outdoor Human Motion Capture by Simultaneous Optimization of Pose and Camera Parameters

Ahmed Elhayek, Carsten Stoll, Kwang In Kim, and Christian Theobalt

Eurographics, 2015

We present a markerless motion capture method for outdoor settings with hand-held or moving cameras. The approach simultaneously optimizes skeletal pose and camera parameters, extending performance capture beyond controlled studio environments where camera positions are pre-calibrated.

2013

Capturing Relightable Human Performances under General Uncontrolled Illumination

Guannan Li, Chenglei Wu, Carsten Stoll, Yebin Liu, Kiran Varanasi, Qionghai Dai, and Christian Theobalt

Eurographics, 2013

We capture human performances that can be relit under novel illumination from multi-view video recorded under general, uncontrolled lighting. The method jointly estimates dynamic geometry, reflectance properties, and illumination, enabling realistic relighting of captured performances in post-production.

Markerless Motion Capture of Multiple Characters Using Multiview Image Segmentation

Yebin Liu, Juergen Gall, Carsten Stoll, Qionghai Dai, Hans-Peter Seidel, and Christian Theobalt

IEEE PAMI, 2013

We present a markerless motion capture approach for simultaneously tracking multiple characters from multi-view video. Multi-view image segmentation separates overlapping subjects, and the method jointly estimates the skeletal poses of all characters, enabling capture in scenes with close interactions where traditional methods fail due to occlusions.

On-set performance capture of multiple actors with a stereo camera

Chenglei Wu, Carsten Stoll, Levi Valgaerts, and Christian Theobalt

SIGGRAPH Asia, 2013

We demonstrate on-set performance capture of multiple actors using only a single stereo camera pair, enabling markerless motion capture directly during film production. The method recovers skeletal motion and surface deformation of several actors simultaneously, making it practical for use on real film sets with minimal equipment.

2012

Coherent Spatiotemporal Filtering, Upsampling and Rendering of RGBZ Videos

Christian Richardt, Carsten Stoll, Neil A. Dodgson, Hans-Peter Seidel, and Christian Theobalt

Eurographics, 2012

We present a framework for coherent spatiotemporal processing of combined color and depth (RGBZ) video. The method jointly filters, upsamples, and renders noisy or low-resolution depth data alongside color imagery, producing temporally stable results suitable for free-viewpoint video and 3D display applications.

Spatio-temporal motion tracking with unsynchronized cameras

Ahmed Elhayek, Carsten Stoll, Nils Hasler, Kwang In Kim, Hans-Peter Seidel, and Christian Theobalt

IEEE CVPR, 2012

We propose markerless human motion tracking from multiple unsynchronized cameras, removing a key practical limitation of multi-view capture. The approach jointly optimizes spatio-temporal alignment and skeletal pose, enabling motion capture with commodity cameras that lack hardware synchronization.

Feature-Based Multi-video Synchronization with Subframe Accuracy

Ahmed Elhayek, Carsten Stoll, Kwang In Kim, Hans-Peter Seidel, and Christian Theobalt

DAGM, 2012

We present a feature-based approach to temporally synchronize multiple video streams with subframe accuracy. Image features establish correspondences across views, from which precise temporal offsets are estimated, enabling the use of unsynchronized consumer cameras for multi-view reconstruction and motion capture.

2011

Video-based characters: creating new human performances from a multi-view video database

Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gaurav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz, and Christian Theobalt

SIGGRAPH Asia, 2011

We synthesize novel human performances by recombining motion segments from a multi-view video database. The approach creates character animations that were never actually performed, stitching captured clips together while maintaining visual coherence and realistic transitions between segments.

Markerless motion capture of interacting characters using multi-view image segmentation

Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt

CVPR, 2011

We propose a markerless capture method that handles closely interacting characters by using multi-view image segmentation to resolve inter-person occlusions. The approach jointly segments and tracks multiple people, enabling robust motion capture in scenarios where subjects are in close physical contact.

Fast articulated motion tracking using a sums of Gaussians body model

Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt

ICCV, 2011

We represent the human body as a sum of Gaussian density functions for fast articulated tracking. This smooth, differentiable representation enables efficient gradient-based optimization for pose estimation from multi-view silhouettes, achieving real-time performance for full-body markerless motion capture.

2010

Video-based reconstruction of animatable human characters

Carsten Stoll, Juergen Gall, Edilson Aguiar, Sebastian Thrun, and Christian Theobalt

SIGGRAPH Asia, 2010

We present a complete pipeline for reconstructing fully animatable human characters from multi-view video. The method captures both skeletal motion and detailed surface deformations, producing rigged characters that can be reanimated with new motions while preserving realistic soft-tissue dynamics from the original footage.

Joint Estimation of Motion, Structure and Geometry from Stereo Sequences

Levi Valgaerts, Andrés Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, and Christian Theobalt

ECCV, 2010

We propose a variational framework for jointly estimating dense optical flow, scene depth, and 3D surface geometry from stereo image sequences. Coupling these problems into a single energy functional exploits their mutual dependencies, yielding more accurate and consistent results than solving them independently.

2009

Template based shape processing

Carsten Stoll

Saarland University, 2009 (PhD Thesis)

My PhD dissertation presents a unified framework for using template meshes to process and reconstruct 3D shapes. It covers template deformation for point cloud fitting, performance capture from multi-view video, and volumetric shape editing, establishing foundational techniques for subsequent work in human body capture.

Estimating body shape of dressed humans

Nils Hasler, Carsten Stoll, Bodo Rosenhahn, Thorsten Thormählen, and Hans-Peter Seidel

Shape Modeling International, 2009

We address the problem of estimating the underlying body shape of a person wearing clothing. A statistical body model is fitted to 3D scan data of dressed subjects, using shape priors to infer the body surface beneath garments. This enables body shape estimation from practical input without requiring the subject to undress.

A Statistical Model of Human Pose and Body Shape

Nils Hasler, Carsten Stoll, Martin Sunkel, Bodo Rosenhahn, and Hans-Peter Seidel

Eurographics, 2009

We present a statistical model that jointly captures the variability of human body shape and pose from a large corpus of 3D body scans. The model disentangles shape identity from pose-dependent deformations, enabling synthesis of novel body shapes in arbitrary poses and providing a strong prior for body reconstruction.

Motion capture using joint skeleton tracking and surface estimation

Juergen Gall, Carsten Stoll, Edilson Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel

IEEE CVPR, 2009

We propose a markerless motion capture method that jointly tracks the skeleton and estimates the body surface from multi-view video. Coupling skeletal pose estimation with non-rigid surface deformation in a unified optimization captures both the articulated motion and detailed surface geometry of the performer.

2008

Performance capture from sparse multi-view video

Edilson Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun

ACM SIGGRAPH, 2008

We demonstrate high-quality performance capture from only a sparse set of cameras, significantly reducing the number required compared to prior dense capture setups. A template mesh is deformed to match multi-view silhouettes and sparse feature correspondences, recovering detailed non-rigid surface motion of the performer.

2007

Marker-less Deformable Mesh Tracking for Human Shape and Motion Capture

Edilson Aguiar, Christian Theobalt, Carsten Stoll, and Hans-Peter Seidel

IEEE CVPR, 2007

This is one of the early methods for markerless human shape and motion capture, using deformable mesh tracking from multi-view video. We track a template mesh through a sequence by optimizing a deformation field that matches observed silhouettes and image features across all views.

A Volumetric Approach to Interactive Shape Editing

Carsten Stoll, Edilson Aguiar, Christian Theobalt, and Hans-Peter Seidel

Technical Report, 2007

We propose a volumetric approach for interactive 3D shape editing, where deformations are defined in the embedding volume rather than directly on the surface. This enables intuitive, physically-inspired modifications that preserve surface detail and topology during large-scale edits.

Rapid Animation of Laser-scanned Humans

Edilson Aguiar, Christian Theobalt, Carsten Stoll, and Hans-Peter Seidel

IEEE VR, 2007

We present a method for quickly rigging and animating detailed 3D human models from laser scans. A skeleton and skinning weights are automatically fitted to the scanned geometry, enabling rapid creation of animatable characters from static scan data without manual rigging.

Geodesics guided constrained texture deformation

Carsten Stoll, Zachi Karni, and Hans-Peter Seidel

Pacific Graphics, 2007

We introduce a method for deforming texture coordinates on 3D meshes guided by geodesic distances, enabling constrained texture mapping that follows the intrinsic surface geometry. Geodesic paths guide how textures stretch and flow across the shape, preserving coherence during surface deformation.

2006

BSP Shapes

Carsten Stoll, Hans-Peter Seidel, and Marc Alexa

Shape Modeling and Applications, 2006

We present a shape representation based on binary space partitioning (BSP) trees that supports efficient Boolean operations and constructive solid geometry. The approach provides a compact, hierarchical representation of complex 3D shapes enabling fast inside/outside queries and set operations.

Incremental Raycasting of Piecewise Quadratic Surfaces on the GPU

Carsten Stoll, Stefan Gumhold, and Hans-Peter Seidel

Symposium on Interactive Raytracing, 2006

We introduce a GPU-based incremental raycasting algorithm for rendering piecewise quadratic surfaces at interactive rates. The method exploits the mathematical structure of quadratic patches for efficient ray-surface intersection, enabling high-quality rendering of smooth curved surfaces in real time.

Template Deformation for Point Cloud Fitting

Carsten Stoll, Zachi Karni, Christian Rössl, Hitoshi Yamauchi, and Hans-Peter Seidel

Point Based Graphics @ SIGGRAPH, 2006

We present a method for fitting a template mesh to unstructured point cloud data through non-rigid deformation. The template is deformed to match the target point set while preserving mesh quality and surface detail, providing a robust way to obtain clean, connected meshes from noisy scan data.

2005

Visualization with stylized line primitives

Carsten Stoll, Stefan Gumhold, and Hans-Peter Seidel

IEEE Vis, 2005

We propose stylized line primitives for scientific visualization, enabling expressive non-photorealistic rendering of flow fields and vector data. The method renders line-based visual elements with controllable attributes like thickness, opacity, and curvature to effectively convey complex data patterns.

2003

Direction Fields over Point-Sampled Geometry

Marc Alexa, Tobias Klug, and Carsten Stoll

WSCG, 2003

We define and compute smooth direction fields directly on point-sampled surfaces without requiring an explicit mesh. This enables downstream operations like texture synthesis and non-photorealistic rendering on point-based geometry by establishing a consistent tangent frame across the point set.