Carsten Stoll
I am a Research Scientist at Meta, developing next-generation digital humans.
I earned my PhD in Computer Graphics and Computer Vision from the Max-Planck Institute for Informatics with Prof. Christian Theobalt. I also led a group on Optical Performance Capture at the Max-Planck Center for Visual Computing and Communication and spent a year as a visiting researcher at Weta Digital.
I cofounded The Captury, a company focused on real-time markerless motion capture, and then joined Facebook Reality Labs to work on full-body virtual humans. In 2020 I joined Epic Games to work on next-generation MetaHumans and real-time machine learning methods. In 2025 I rejoined Meta to help push virtual humans to the next level.
My research spans computer graphics, geometric modeling, animation, and computer vision. My goal is to create and control fully photorealistic, believable digital human avatars.
Projects
MetaHuman Bodies
At Epic Games I led the research and technical development of a parametric body model for MetaHumans, released with Unreal Engine 5.6. While prior versions of MetaHumans were based on discrete body types, the parametric model allows fine-grained direct and measurement-based editing of body shapes.
ML Deformer
I developed the ML model driving Unreal Engine's ML Deformer plugin, which creates high-fidelity characters with deformations driven by full muscle, flesh, and cloth simulation running in real time. ML Deformer is now widely used in game development, including in The Witcher 4 tech demo shown at Unreal Fest 2025.
Momentum
At Facebook/Oculus Research I developed the Momentum library for human kinematics and numerical optimization. It was used to prototype markerless motion capture algorithms and to drive early versions of upper-body tracking for the Oculus VR headset. We also published MHR, the Momentum Human Rig, an open-source parametric body model.
Publications
2026
Large-scale Codec Avatars
IEEE/CVF CVPR, 2026
LCA is a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner. Inspired by the success of large language models, we introduce a pre/post-training paradigm for 3D avatar modeling: pretraining on diverse multi-view studio data and post-training on in-the-wild images. Given a handful of images, LCA produces identity-preserving 3D avatars driven by facial expressions, full-body motion, and fine-grained hand poses.
2025
MHR: Momentum Human Rig
arXiv, 2025
MHR is a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. It enables expressive, anatomically plausible human animation with non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines. MHR is available as open source.
2024
HUMOS: Human Motion Model Conditioned on Body Shape
ECCV, 2024
We introduce a generative motion model conditioned on body shape. Most existing models ignore how body proportions influence the way people move, relying on a standardized average body. Using cycle consistency, intuitive physics, and stability constraints on unpaired data, our approach generates diverse, physically plausible motions that reflect individual body characteristics.
EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans
ECCV T-CAP Workshop, 2024
EPOCH is a framework for monocular 3D human pose estimation that uses the full perspective camera model instead of common weak-perspective approximations. It jointly estimates camera parameters and 3D pose through a lifter network and a regressor network, establishing an unambiguous 2D-to-3D relationship. The approach achieves state-of-the-art results on Human3.6M and MPI-INF-3DHP with improved generalization to unseen data.
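To illustrate the difference between the two camera models mentioned above, here is a minimal NumPy sketch. It is not EPOCH's implementation; the function names, the focal length, and the sample joints are assumptions chosen purely for illustration.

```python
import numpy as np

def perspective_project(points, focal):
    """Full perspective: each point is divided by its own depth."""
    points = np.asarray(points, dtype=float)
    return focal * points[:, :2] / points[:, 2:3]

def weak_perspective_project(points, focal):
    """Weak perspective: every point is divided by the mean depth."""
    points = np.asarray(points, dtype=float)
    mean_depth = points[:, 2].mean()
    return focal * points[:, :2] / mean_depth

# Two hypothetical joints at the same lateral offset but different depths.
joints = np.array([[0.5, 0.0, 2.0],
                   [0.5, 0.0, 4.0]])
persp = perspective_project(joints, focal=1000.0)
weak = weak_perspective_project(joints, focal=1000.0)
```

Under weak perspective both joints project to the same image location, so depth differences within the body are lost. The full perspective model keeps them distinct, which is what makes the 2D-to-3D relationship less ambiguous.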
fNeRF: High Quality Radiance Fields from Practical Cameras
arXiv, 2024
Standard Neural Radiance Fields use a pinhole camera model, which bakes defocus blur into the reconstruction. We propose a modified ray casting approach that leverages the optics of real lenses with finite aperture, modeling partial occlusions more faithfully than existing approximations. This yields sharper reconstructions with up to 3 dB PSNR improvement on synthetic and real datasets.
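For a sense of scale, a 3 dB PSNR gain corresponds to roughly halving the mean squared error. The sketch below uses the standard PSNR definition for images scaled to a known peak value; the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Since PSNR is logarithmic in the error, `mse_ratio_for_gain(3.0)` evaluates to just under 2.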
PhISANet: Phonetically Informed Speech Animation Network
ICASSP, 2024
PhISANet leverages neural audio representations pre-trained on large speech corpora to map audio signals into animation parameters for the lower face and tongue of realistic 3D models. A CTC-based phonetic loss provides additional supervision, improving the phonetic accuracy of generated lip and tongue animation.
2023
Personalized 3D Human Pose and Shape Refinement
IEEE/CVF ICCVW, 2023
We propose a test-time optimization approach that refines 3D human pose and shape estimates for specific individuals, improving on generic feed-forward predictions. By exploiting temporal consistency and person-specific body shape priors during inference, the method adapts to individual body proportions and motion patterns without retraining.
2022
Speech Driven Tongue Animation
IEEE/CVF CVPR, 2022
Most speech-driven facial animation focuses on lip motion and neglects the inner mouth. We introduce a large-scale speech and motion capture dataset focused on tongue, jaw, and lip movement, and propose a deep learning method using self-supervised audio feature encoders. The approach generalizes well to unseen speakers and content, enabling realistic tongue animation from audio alone.
2021
ANR: Articulated Neural Rendering for Virtual Avatars
IEEE/CVF CVPR, 2021
We extend Deferred Neural Rendering to articulated human bodies, addressing challenges like mesh deformation inaccuracies and pose-dependent dynamics. ANR uses a neural shading step that explicitly accounts for geometric misalignment between a coarse mesh and the true surface. User studies show a clear preference for our approach over existing avatar creation and animation methods.
2020
TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video
ECCV, 2020
TexMesh reconstructs detailed human meshes with high-resolution full-body albedo texture from RGB-D video. By leveraging the captured environment illumination, we estimate local surface geometry and albedo, then use photometric constraints for self-supervised adaptation to real-world sequences. After a brief self-adaptation step, the method runs at interactive frame rates.
PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations
ECCV, 2020
We introduce a mid-level patch-based implicit surface representation that generalizes across object categories. By learning patch-level signed distance functions in a canonical space, a model trained on a single ShapeNet category can represent detailed shapes from any other category using far less training data. The representation also supports shape interpolation, point cloud completion, and non-rigid deformation.
2015
Outdoor Human Motion Capture by Simultaneous Optimization of Pose and Camera Parameters
Eurographics, 2015
We present a markerless motion capture method for outdoor settings with hand-held or moving cameras. The approach simultaneously optimizes skeletal pose and camera parameters, extending performance capture beyond controlled studio environments where camera positions are pre-calibrated.
2013
Capturing Relightable Human Performances under General Uncontrolled Illumination
Eurographics, 2013
We capture human performances that can be relit under novel illumination from multi-view video recorded under general, uncontrolled lighting. The method jointly estimates dynamic geometry, reflectance properties, and illumination, enabling realistic relighting of captured performances in post-production.
Markerless Motion Capture of Multiple Characters Using Multiview Image Segmentation
IEEE PAMI, 2013
We present a markerless motion capture approach for simultaneously tracking multiple characters from multi-view video. Multi-view image segmentation separates overlapping subjects, and the method jointly estimates the skeletal poses of all characters. This enables capture in scenes with close interactions, where traditional methods fail due to occlusions.
On-set performance capture of multiple actors with a stereo camera
SIGGRAPH Asia, 2013
We demonstrate on-set performance capture of multiple actors using only a single stereo camera pair, enabling markerless motion capture directly during film production. The method recovers skeletal motion and surface deformation of several actors simultaneously, making it practical for use on real film sets with minimal equipment.
2012
Coherent Spatiotemporal Filtering, Upsampling and Rendering of RGBZ Videos
Eurographics, 2012
We present a framework for coherent spatiotemporal processing of combined color and depth (RGBZ) video. The method jointly filters, upsamples, and renders noisy or low-resolution depth data alongside color imagery, producing temporally stable results suitable for free-viewpoint video and 3D display applications.
Spatio-temporal motion tracking with unsynchronized cameras
IEEE CVPR, 2012
We propose markerless human motion tracking from multiple unsynchronized cameras, removing a key practical limitation of multi-view capture. The approach jointly optimizes spatio-temporal alignment and skeletal pose, enabling motion capture with commodity cameras that lack hardware synchronization.
Feature-Based Multi-video Synchronization with Subframe Accuracy
DAGM, 2012
We present a feature-based approach to temporally synchronize multiple video streams with subframe accuracy. Image features establish correspondences across views, from which precise temporal offsets are estimated. This enables the use of unsynchronized consumer cameras for multi-view reconstruction and motion capture.
2011
Video-based characters: creating new human performances from a multi-view video database
SIGGRAPH Asia, 2011
We synthesize novel human performances by recombining motion segments from a multi-view video database. The approach creates character animations that were never actually performed, stitching captured clips together while maintaining visual coherence and realistic transitions between segments.
Markerless motion capture of interacting characters using multi-view image segmentation
IEEE CVPR, 2011
We propose a markerless capture method that handles closely interacting characters by using multi-view image segmentation to resolve inter-person occlusions. The approach jointly segments and tracks multiple people, enabling robust motion capture in scenarios where subjects are in close physical contact.
Fast articulated motion tracking using a sums of Gaussians body model
ICCV, 2011
We represent the human body as a sum of Gaussian density functions for fast articulated tracking. This smooth, differentiable representation enables efficient gradient-based optimization for pose estimation from multi-view video, achieving real-time performance for full-body markerless motion capture.
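One reason a Gaussian body model suits gradient-based optimization is that the overlap of two Gaussians has a closed form that is smooth in the blob positions. The following is a minimal sketch of that idea for isotropic 2D blobs. The function names and the unnormalized Gaussian convention are assumptions; the paper's exact normalization and similarity measure may differ.

```python
import numpy as np

def gaussian_overlap(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form integral over the plane of the product of two
    unnormalized isotropic 2D Gaussians exp(-|x - mu|^2 / (2 sigma^2))."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    var_sum = sigma_a ** 2 + sigma_b ** 2
    dist_sq = np.sum((mu_a - mu_b) ** 2)
    scale = 2.0 * np.pi * sigma_a ** 2 * sigma_b ** 2 / var_sum
    return scale * np.exp(-dist_sq / (2.0 * var_sum))

def sog_similarity(model, image):
    """Sum of pairwise overlaps between two lists of (mu, sigma) blobs."""
    return sum(gaussian_overlap(ma, sa, mb, sb)
               for ma, sa in model for mb, sb in image)
```

Because every term is an exponential of a squared distance, the similarity is differentiable with respect to each blob center, so a pose that moves the model blobs can be refined by following the gradient.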
2010
Video-based reconstruction of animatable human characters
SIGGRAPH Asia, 2010
We present a complete pipeline for reconstructing fully animatable human characters from multi-view video. The method captures both skeletal motion and detailed surface deformations, producing rigged characters that can be reanimated with new motions while preserving realistic soft-tissue dynamics from the original footage.
Joint Estimation of Motion, Structure and Geometry from Stereo Sequences
ECCV, 2010
We propose a variational framework for jointly estimating dense scene flow, scene structure, and the epipolar geometry of the stereo setup from stereo image sequences. Coupling these problems into a single energy functional exploits their mutual dependencies, yielding more accurate and consistent results than solving them independently.
2009
Template based shape processing
Saarland University, 2009 (PhD Thesis)
My PhD dissertation presents a unified framework for using template meshes to process and reconstruct 3D shapes. It covers template deformation for point cloud fitting, performance capture from multi-view video, and volumetric shape editing, establishing foundational techniques for subsequent work in human body capture.
Estimating body shape of dressed humans
Shape Modeling International, 2009
We address the problem of estimating the underlying body shape of a person wearing clothing. A statistical body model is fitted to 3D scan data of dressed subjects, using shape priors to infer the body surface beneath garments. This enables body shape estimation from practical input without requiring the subject to undress.
A Statistical Model of Human Pose and Body Shape
Eurographics, 2009
We present a statistical model that jointly captures the variability of human body shape and pose from a large corpus of 3D body scans. The model disentangles shape identity from pose-dependent deformations, enabling synthesis of novel body shapes in arbitrary poses and providing a strong prior for body reconstruction.
Motion capture using joint skeleton tracking and surface estimation
IEEE CVPR, 2009
We propose a markerless motion capture method that jointly tracks the skeleton and estimates the body surface from multi-view video. Coupling skeletal pose estimation with non-rigid surface deformation in a unified optimization captures both the articulated motion and detailed surface geometry of the performer.
2008
Performance capture from sparse multi-view video
ACM SIGGRAPH, 2008
We demonstrate high-quality performance capture from only a sparse set of cameras, significantly reducing the number required compared to prior dense capture setups. A template mesh is deformed to match multi-view silhouettes and sparse feature correspondences, recovering detailed non-rigid surface motion of the performer.
2007
Marker-less Deformable Mesh Tracking for Human Shape and Motion Capture
IEEE CVPR, 2007
This is one of the early methods for markerless human shape and motion capture, based on deformable mesh tracking from multi-view video. We track a template mesh through a sequence by optimizing a deformation field that matches observed silhouettes and image features across all views.
A Volumetric Approach to Interactive Shape Editing
Technical Report, 2007
We propose a volumetric approach for interactive 3D shape editing, where deformations are defined in the embedding volume rather than directly on the surface. This enables intuitive, physically inspired modifications that preserve surface detail and topology during large-scale edits.
Rapid Animation of Laser-scanned Humans
IEEE VR, 2007
We present a method for quickly rigging and animating detailed 3D human models from laser scans. A skeleton and skinning weights are automatically fitted to the scanned geometry, enabling rapid creation of animatable characters from static scan data without manual rigging.
Geodesics guided constrained texture deformation
Pacific Graphics, 2007
We introduce a method for deforming texture coordinates on 3D meshes guided by geodesic distances, enabling constrained texture mapping that follows the intrinsic surface geometry. Geodesic paths guide how textures stretch and flow across the shape, preserving coherence during surface deformation.
2006
BSP Shapes
Shape Modeling and Applications, 2006
We present a shape representation based on binary space partitioning (BSP) trees that supports efficient Boolean operations and constructive solid geometry. The approach provides a compact, hierarchical representation of complex 3D shapes enabling fast inside/outside queries and set operations.
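As a generic illustration of how a BSP tree answers inside/outside queries, here is a small 2D sketch. It is not the representation from the paper; the `BSPNode` class, the leaf convention (`True` for inside, `False` for outside), and the example square are all assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass
class BSPNode:
    normal: Tuple[float, float]      # splitting line normal (2D for brevity)
    offset: float                    # the line is normal . p = offset
    front: Union["BSPNode", bool]    # subtree or leaf for normal . p > offset
    back: Union["BSPNode", bool]     # subtree or leaf for normal . p <= offset

def contains(tree, point):
    """Walk from the root to a leaf; a leaf is True (inside) or False (outside)."""
    while isinstance(tree, BSPNode):
        side = tree.normal[0] * point[0] + tree.normal[1] * point[1] - tree.offset
        tree = tree.front if side > 0 else tree.back
    return tree

def complement(tree):
    """Boolean NOT: flip every leaf and keep the partition unchanged."""
    if isinstance(tree, BSPNode):
        return BSPNode(tree.normal, tree.offset,
                       complement(tree.front), complement(tree.back))
    return not tree

# Hypothetical example: the unit square [0, 1] x [0, 1] as four half-planes.
square = BSPNode((-1, 0), 0.0, False,          # x < 0 is outside
         BSPNode((1, 0), 1.0, False,           # x > 1 is outside
         BSPNode((0, -1), 0.0, False,          # y < 0 is outside
         BSPNode((0, 1), 1.0, False, True))))  # y > 1 is outside, else inside
```

A query costs one dot product per tree level, and the complement only touches the leaves, which hints at why set operations map naturally onto this structure.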
Incremental Raycasting of Piecewise Quadratic Surfaces on the GPU
Symposium on Interactive Raytracing, 2006
We introduce a GPU-based incremental raycasting algorithm for rendering piecewise quadratic surfaces at interactive rates. The method exploits the mathematical structure of quadratic patches for efficient ray-surface intersection, enabling high-quality rendering of smooth curved surfaces in real time.
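The structure being exploited is that substituting a ray into a quadratic surface equation leaves a quadratic in the ray parameter. The sketch below shows that generic ray-quadric intersection on the CPU with NumPy. It does not reproduce the paper's incremental scheme or its GPU implementation, and the function name and matrix convention are assumptions.

```python
import numpy as np

def ray_quadric_intersect(Q, origin, direction):
    """Smallest non-negative t with origin + t * direction on the quadric
    p^T Q p = 0 (p homogeneous, Q symmetric 4x4), or None if there is no hit."""
    o = np.append(np.asarray(origin, float), 1.0)      # point: w = 1
    d = np.append(np.asarray(direction, float), 0.0)   # vector: w = 0
    a = d @ Q @ d
    b = 2.0 * (o @ Q @ d)
    c = o @ Q @ o
    if abs(a) < 1e-12:                 # equation degenerates to linear
        if abs(b) < 1e-12:
            return None
        t = -c / b
        return t if t >= 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:                       # ray misses the surface
        return None
    root = np.sqrt(disc)
    candidates = ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))
    hits = sorted(t for t in candidates if t >= 0)
    return hits[0] if hits else None

# Unit sphere x^2 + y^2 + z^2 - 1 = 0 as a symmetric 4x4 matrix.
unit_sphere = np.diag([1.0, 1.0, 1.0, -1.0])
```

For example, `ray_quadric_intersect(unit_sphere, [0, 0, -3], [0, 0, 1])` finds the near side of the sphere, while a ray offset to `[0, 2, -3]` returns `None`.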
Template Deformation for Point Cloud Fitting
Point Based Graphics @ SIGGRAPH, 2006
We present a method for fitting a template mesh to unstructured point cloud data through non-rigid deformation. The template is deformed to match the target point set while preserving mesh quality and surface detail, providing a robust way to obtain clean, connected meshes from noisy scan data.
2005
Visualization with stylized line primitives
IEEE Vis, 2005
We propose stylized line primitives for scientific visualization, enabling expressive non-photorealistic rendering of flow fields and vector data. The method renders line-based visual elements with controllable attributes like thickness, opacity, and curvature to effectively convey complex data patterns.
2003
Direction Fields over Point-Sampled Geometry
WSCG, 2003
We define and compute smooth direction fields directly on point-sampled surfaces without requiring an explicit mesh. This enables downstream operations like texture synthesis and non-photorealistic rendering on point-based geometry by establishing a consistent tangent frame across the point set.