Part II: Human Data to Robot Policy

Chapter 4: Can Human Data Alone Suffice? — Teleop-Free Approaches

Written: 2026-04-07 Last updated: 2026-04-15

Summary

Starting in 2025, learning robot manipulation policies from human data alone (zero robot data) became possible. X-Sim achieves this with a single RGBD video, Human2Sim2Robot with a single demonstration, and EgoZero with smart glasses only. However, these approaches exhibit systematic limitations on contact-rich tasks, with validated task ranges limited to 5–13. These limitations strengthen the case for Data A + Data B co-training (Chapter 5) and tactile information (Chapter 3).

4.1 Introduction

"Can Data B alone enable robot control?" corresponds to TacTeleOp [#26]'s Hypothesis H1. Until 2024, the answer was largely negative — the embodiment gap (kinematic, visual, dynamic differences between human and robot) was considered too large for direct transfer. In 2025, a series of studies overturned this assumption.

4.2 Real-to-Sim-to-Real: Routing Through Simulation

X-Sim (Cornell, CoRL 2025 Oral)

X-Sim ^[1] presents a 3-stage pipeline starting from a single human RGBD video: (1) photorealistic simulation reconstruction + object trajectory extraction, (2) RL training in simulation using object trajectory as an embodiment-agnostic reward, (3) synthetic rollout generation across diverse viewpoints/lighting followed by distillation into a diffusion policy.

Metric	Value
Task progress improvement	+30% (vs baselines)
Data collection time	10× reduction vs BC
Input	1 human RGBD video
Robot data	0

X-Sim's key insight is using object trajectory as reward. How the human moved the object (object-centric representation) is embodiment-agnostic, functioning as a reward signal regardless of human-robot kinematic differences.

Limitations: (1) RGBD input required (less accessible than RGB-only). (2) Only 5 tasks validated, no contact-rich tasks (screwing, etc.). (3) Object-centric reward may be weak for tool use or deformable objects. (4) Gripper-based, dexterous hand unvalidated.

Human2Sim2Robot (Stanford, CoRL 2025)

Human2Sim2Robot ^[2] learned dexterous hand (Allegro) policies from a single human RGB-D video. It uses object 6D pose trajectory as dense reward and initializes RL exploration with pre-manipulation hand pose.

vs Baseline	Improvement
vs Replay	+67%
vs Object-Aware Replay	+55%
vs BC (data augmentation)	+68%

Evaluated on 7 real-world tasks (KUKA + Allegro Hand) with 10 rollouts each. Ablation confirmed object pose trajectory is more stable than hand tracking reward — hand tracking propagates pose estimation errors.

Author-stated limitations: (1) KUKA + Allegro only. (2) Rigid objects only. (3) Pose estimation ambiguity for symmetric/reflective objects. (4) Single-object, single-task policies. (5) Digital twin reconstruction required per environment.

TacPlay [#27] connection: The authors' acknowledged limitation of "pose estimation noise" is precisely where tactile can complement. Direct tactile measurement at the contact surface bypasses occlusion and noise in vision-based pose estimation.

4.3 Zero Robot Data: Direct Transfer Without Simulation

EgoZero (NYU/Berkeley, 2025)

EgoZero ^[3] collected egocentric human demonstrations from Project Aria smart glasses and learned manipulation policies with zero robot data. The key is an egocentric 3D point-based unified state-action space that abstracts kinematic differences between human and robot into a morphology-agnostic representation.

Metric	Value
Success rate (7-task avg)	70%
Collection time per task	20 min
Robot data	0

EgoZero's 70% shows that human data alone can reach a substantial level, while simultaneously meaning 30% failure exists. The gap to industrial requirements (95%+) of 25%p defines the space for tactile and co-training.

Limitations: Franka Panda gripper only, no dexterous hand, Project Aria (Meta proprietary) dependency, complete absence of tactile/force information.

VidBot (TU Munich/ETH, CVPR 2025)

VidBot ^[4] learns 3D affordances from in-the-wild monocular RGB human videos. It reconstructs metric-scale 3D hand trajectories via depth foundation model + structure-from-motion and generates fine-grained interaction trajectories with a diffusion model. It reported approximately +20% improvement over existing methods across 13 tasks.

The ability to achieve zero-shot transfer from RGB alone is notable, but the complete absence of force information suggests a ceiling on contact-rich tasks.

4.4 Pretraining-Based Approaches

LAPA (ICLR 2025)

LAPA ^[5] learns discrete latent actions via VQ-VAE between frames, uses these for VLA pretraining, then fine-tunes with small amounts of robot data.

Experiment	LAPA	OpenVLA	Difference
Real-world avg	50.1%	43.9%	+6.2%p
Unseen objects	57.8%	46.2%	+11.6%p
Compute cost	272 H100-hr	21,500 A100-hr	~30× efficiency

LAPA's conclusion that "human video pretraining is more efficient than robot data" directly supports TacTeleOp's B pretrain → A fine-tune structure. However, weak cross-environment generalization (Language Table 33.6% vs ActionVLA 64.8%) and limitations on fine-grained motions like grasping were reported.

VideoDex (CMU, CoRL 2023 / IJRR 2024)

VideoDex [Shaw et al., 2023/2024] extracted hand motions from internet videos, retargeted them to LEAP Hand, and learned via pretrained visual embeddings + Neural Dynamical Policies. A pioneer of the "human video → robot dexterous policy" paradigm, but requiring 120–175 demos per task for fine-tuning.

4.5 Comparative Analysis: State of Teleop-Free Approaches

Paper	Method	Input	Robot Data	Key Result	Contact-rich
X-Sim	Sim RL + object reward	1 RGBD	0	+30%	Unvalidated
Human2Sim2Robot	Sim RL + pose reward	1 RGB-D	0	+55–68%	Partial
EgoZero	3D point policy	Aria glasses	0	70%	Unvalidated
VidBot	3D affordance	Monocular RGB	0	+20%	Unvalidated
LAPA	Latent action pretrain	Internet video	Minimal	30× efficiency	Weak
VideoDex	Retarget + pretrain	Internet video	120–175/task	Pretrain effect	Unvalidated
UMI	Handheld gripper	Diffusion Policy	0 robot data	Cup 100%, tossing 87.5%, outdoor 71.7%	No
ACT-1	Skill Capture Glove	Skill Transform+model	0 robot data	33 manipulation types, Airbnb zero-shot	No

Pattern Analysis

2025 as the inflection point: All zero-robot-data studies concentrate in 2025.
Vision-centric rewards dominate: Object trajectory (X-Sim), object pose (Human2Sim2Robot), video similarity (Human2Bot) — all vision-based.
Complete absence of tactile: Including UMI and ACT-1, none of these approaches use tactile information.
Gripper bias: EgoZero, X-Sim, VidBot, and UMI are gripper-based. Only Human2Sim2Robot and VideoDex use dexterous hands. ACT-1 uses a dexterous hand but depends on Skill Transform.
Contact-rich unvalidated: Screw driving, capping, and precision assembly are largely absent.

4.6 Key Discussion: The Contact-Rich Wall

Teleop-free achievements are impressive, but a systematic gap exists for contact-rich tasks. The causes are clear:

Visual reward limitations: Object trajectories and poses do not capture force distributions at contact surfaces. In capping, even if the object is correctly positioned, insufficient torque causes failure — visual rewards cannot distinguish this failure mode.

Amplified sim-to-real gap: Contact dynamics is the least accurate domain in simulation. Friction, deformation, and surface condition errors are maximized in contact-rich tasks.

Pose estimation limitations: As Human2Sim2Robot's authors acknowledged, occlusion at contact surfaces and pose estimation errors for reflective/transparent objects are particularly severe for contact-rich tasks.

UMI ^[8] [#35] achieved high success rates on non-contact-rich tasks such as cup arrangement (100%), but lacking tactile sensors, its extension to contact-rich tasks remains unvalidated. ACT-1 ^[9] [#29]'s espresso extraction demo suggests contact-rich potential, but without systematic benchmarks, quantitative evaluation is impossible. The 10% failure cases behind Skill Transform's 90% success rate also remain undisclosed.

From a deployment perspective, Habilis-β ^[10] [#33] proposed deployment metrics — TPH (tasks per hour) and MTBI (mean time between interventions) — evaluating sustained operational efficiency rather than single-episode success rates. It reported 4.75× productivity improvement over π0.5 in simulation. This reminds us that sustainable operational viability, not just contact-rich success rates, is a critical evaluation axis for real deployment.

Figure 4.1: DexForce extracts force-informed actions by augmenting observed robot positions with contact forces from 6-axis force-torque sensors. Enables high-quality demonstrations for contact-rich tasks like unscrewing nuts, opening AirPods cases, and flipping boxes. Source: Chen et al. (2025), Fig. 1

This analysis yields two paths:

Path 1: Add Data A (small robot data) to Data B to fill the contact-rich gap (Chapter 5, co-training).
Path 2: Replace visual rewards with tactile rewards to directly address contact-rich tasks (Chapter 9, TacPlay).

4.7 Connection to Our Direction

Teleop-free studies provide three implications for TacGlove/TacTeleOp/TacPlay:

Figure 4.2: PP-Tac system overview. Leverages tactile feedback from round-shaped sensors (R-Tac) on a dexterous robotic hand to grasp thin, deformable, paper-like objects using diffusion-based policies inspired by human strategies. Source: Lin et al. (2025), Fig. 1

Data B's baseline value confirmed: X-Sim, EgoZero, and VidBot showed meaningful policies can be learned from human data alone. The premise that TacTeleOp's Data B collection has potential value is confirmed.

70% ceiling exists: EgoZero's 70% suggests a practical upper bound for vision-only Data B. Breaking this ceiling requires tactile (Chapter 3) and co-training (Chapter 5).

Tactile reward opportunity: All approaches including UMI and ACT-1 rely on visual rewards; tactile rewards have not even been attempted. TacPlay's "tactile-target RL" occupies this empty space.

The next chapter analyzes co-training approaches that overcome Data B-only limitations through combination with Data A (Chapter 5).

References

Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation for Robot Learning. CoRL 2025 Oral. https://portal-cornell.github.io/X-Sim/ scholar
Lum, T. G. W., et al. (2025). Human2Sim2Robot: Dexterous Manipulation Transfer via Simulation. CoRL 2025. scholar
Liu, V., et al. (2025). EgoZero: Robot Policy Learning from Egocentric Video without Robot Data. arXiv. scholar
Chen, H., et al. (2025). VidBot: Learning Robot Manipulation from Internet Videos. CVPR 2025. scholar
Ye, S., et al. (2025). LAPA: Latent Action Pretraining from Videos. ICLR 2025. scholar
Shaw, K., et al. (2023/2024). VideoDex: Learning Dexterous Manipulation from Internet Videos. CoRL 2023 / IJRR 2024. scholar
Ghunaim, Y., et al. (2025). Human2Bot: Zero-Shot Robot Learning from Human Videos. Autonomous Robots. scholar
Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS 2024. #35 scholar
Sunday, E. (2025). ACT-1: Humanoid Hand for Human-Level Manipulation. Physical Intelligence Blog. #29 scholar
Habilis Team (2026). Habilis-β: On-Device VLA for Sustained Autonomous Operation. arXiv. #33 scholar