Chapter 1: Why Human Hand Data? — Limitations of Teleoperation and Alternatives
Summary
The core bottleneck in robot manipulation learning is data collection. Teleoperation suffers from a throughput ceiling of roughly 10 demonstrations per hour and a fundamental inability to handle complex contact-rich tasks. This chapter analyzes these bottlenecks and presents quantitative evidence for why human hand data (Data B) is a compelling alternative in terms of scale, realism, and diversity.
1.1 Introduction: The Data Scarcity Problem
Robot manipulation policy learning demands large-scale data. As demonstrated by RT-2's hundreds of thousands of episodes and pi0 [#2]'s industrial-scale heterogeneous training, a strong positive correlation exists between data scale and policy performance. Yet most research labs possess only hundreds to thousands of robot demonstration episodes — this remains the most fundamental bottleneck in robot learning today.
Teleoperation, the traditional approach to data acquisition, has a skilled operator remotely control the robot to produce demonstrations. While it generates robot-executable trajectories (Data A), it carries limitations that make it structurally unscalable.
1.2 Three Bottlenecks of Teleoperation
Throughput Ceiling
Teleoperation data collection typically yields 5–10 demonstrations per hour. DEXOP [1] [#10] reports up to 5 per minute (300/hour), but this figure reflects short grasping episodes. For multi-step contact-rich tasks — such as tightening screws or opening caps — each episode takes several minutes, reducing effective throughput to 10–20 per hour.
At this rate, accumulating the 20,854 hours of data used by EgoScale [2] would require tens of thousands of hours of skilled operator labor — a practically infeasible scale.
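The labor estimate above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 3-minute average episode length (a figure chosen for illustration, not reported by any of the cited papers) together with the 10 demos/hour throughput from the text:

```python
# Back-of-the-envelope check of the operator-labor claim (assumed figures).
EGOSCALE_HOURS = 20_854      # target corpus size, from EgoScale [2]
DEMOS_PER_HOUR = 10          # typical teleoperation throughput (see above)
AVG_DEMO_MINUTES = 3         # assumed length of a contact-rich episode

# Hours of demonstrated behavior produced per hour of operator labor.
demo_hours_per_labor_hour = DEMOS_PER_HOUR * AVG_DEMO_MINUTES / 60  # 0.5

labor_hours = EGOSCALE_HOURS / demo_hours_per_labor_hour
print(f"{labor_hours:,.0f} operator-hours")  # 41,708 operator-hours
```

Even under these generous assumptions the result lands in the tens of thousands of operator-hours, consistent with the claim in the text.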
Task Impossibility
DEXOP [1] provides the clearest demonstration of teleoperation's fundamental limits. On the drilling task, teleoperation success rate was 0%. Simultaneously controlling individual robot fingers to grasp, press, and rotate a drill via remote interface exceeds the capability of current teleoperation systems. In contrast, DEXOP's passive exoskeleton enabled direct demonstration of this task.
This result implies that teleoperation is not merely "slow" — for certain contact-rich tasks, it is fundamentally impossible. Many industrial processes — capping, precision assembly, flexible part handling — fall into this category.
Cost Structure
AirExo [13] directly compared the cost structures of teleoperation versus alternative data collection. A teleoperation platform (robot + control system) costs approximately $60,000, while AirExo's passive exoskeleton costs just $600 — a 100× difference. AoE [3] further widened this gap with a $20 smartphone neck mount for large-scale egocentric data collection.
The cost issue extends beyond individual labs. Covering multiple processes, object variations, and shift changes in industrial settings requires parallel data collection, for which teleoperation is poorly suited. UMI [14] [#35] further disrupted this cost structure with a $371 handheld gripper that eliminates the robot entirely — collecting demonstration data using only a GoPro fisheye camera and IMU-based SLAM.
| Collection Method | Cost | Key Feature |
|---|---|---|
| Teleoperation | ~$60,000 | Robot + control system required |
| AirExo (passive exoskeleton) | $600 | Low-cost, arm-level |
| UMI (handheld gripper) | $371 | No robot needed, GoPro fisheye + IMU SLAM |
| AoE (smartphone mount) | $20 | Lowest cost, egocentric video only |
1.3 Human Hand Data as an Alternative
Defining Data A vs Data B
This survey distinguishes demonstration data for robot learning into two categories based on collection method. Data A (teleoperation data) refers to robot-executable trajectory data collected by skilled operators remotely controlling the robot. Data B (human hand data) refers to human demonstration data collected by workers wearing wearable sensors (gloves, glasses) while performing tasks naturally, without a robot. This distinction serves as the foundation for all subsequent chapters.
| | Data A (Teleoperation) | Data B (Human Hand Data) |
|---|---|---|
| Collection method | Remote robot control | Natural work with glove/glasses |
| Collector | Skilled robot operator | On-site worker or general population |
| Cost | High ($60K+ system) | Low ($600 or less) |
| Throughput | ~10 demos/hr | Natural work speed |
| Robot required | Yes | No |
| Executability | Directly robot-executable | Requires retargeting |
| Distribution | Constrained by robot kinematics | Natural human distribution |
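The "requires retargeting" row in the table above is the key structural difference between the two data types: human hand poses must be mapped onto robot commands before execution. As a minimal illustration, the sketch below retargets two fingertip keypoints to a parallel-jaw gripper command (grasp midpoint plus jaw width). The function name, the `scale` parameter, and the 0.085 m jaw limit are illustrative assumptions; real pipelines also handle wrist pose, inverse kinematics, and temporal filtering.

```python
import numpy as np

def retarget_to_gripper(thumb_tip, index_tip, scale=1.0, max_width=0.085):
    """Map human thumb/index fingertip positions (meters, camera frame)
    to a parallel-jaw gripper command. A minimal sketch: the grasp point
    is the fingertip midpoint, the jaw opening is the fingertip distance
    clipped to an assumed 0.085 m jaw limit."""
    thumb_tip = np.asarray(thumb_tip, dtype=float)
    index_tip = np.asarray(index_tip, dtype=float)
    midpoint = (thumb_tip + index_tip) / 2                 # target end-effector position
    width = scale * np.linalg.norm(thumb_tip - index_tip)  # desired jaw opening
    return midpoint, min(width, max_width)

# Fingers 6 cm apart -> grasp at their midpoint with a 6 cm opening.
mid, w = retarget_to_gripper([0.10, 0.00, 0.30], [0.10, 0.06, 0.30])
```

Even this toy mapping makes the embodiment gap concrete: a five-finger human grasp is collapsed into two gripper degrees of freedom, which is why retargeting quality is a recurring theme in later chapters.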
Data B offers three core advantages.
Scale
EgoDex [10] collected 829 hours, 90 million frames, and 194 tasks of dexterous manipulation data using Apple Vision Pro. BuildAI [11] released 10,000 hours of factory egocentric data from 2,153 workers. Such scales are unattainable through teleoperation.
EgoScale [2] pretrained on 20,854 hours of egocentric human video and demonstrated a log-linear scaling law (R² = 0.9983). This means that performance improves predictably as the volume of human data increases.
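A log-linear scaling law of this kind means performance grows linearly in the logarithm of data volume. The sketch below fits such a law to synthetic points; the coefficients (0.20, 0.08) are invented for illustration and are not EgoScale's actual numbers:

```python
import numpy as np

# Hypothetical (data_hours, success_rate) points following a log-linear
# trend; illustrates the form of the fit, not EgoScale's reported values.
hours = np.array([100, 500, 2_000, 10_000, 20_000], dtype=float)
success = 0.20 + 0.08 * np.log10(hours)   # synthetic, noiseless data

# Fit success = a + b * log10(hours) by least squares.
b, a = np.polyfit(np.log10(hours), success, deg=1)

# Coefficient of determination (1.0 here since the data is noiseless;
# EgoScale reports R^2 = 0.9983 on real training runs).
pred = a + b * np.log10(hours)
r2 = 1 - np.sum((success - pred) ** 2) / np.sum((success - success.mean()) ** 2)
```

The practical payoff of such a law is predictability: one can extrapolate how much additional human data a target success rate would require before collecting it.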
Realism
Humans work in real environments, with real objects, using natural strategies. This data captures natural adaptations to object shape, material, and weight, free from the distortions that robot kinematics impose on teleoperation data. EgoMimic [4] reported that adding 1 hour of human hand data to 2 hours of robot data yields a +34–228% improvement over 3 hours of robot data alone. This scaling trend, in which human data reaches higher asymptotic performance per unit of collection time than robot data, quantitatively demonstrates the value of this realism.
Diversity
Teleoperation typically relies on one or two skilled operators, limiting strategy diversity. Human hand data, by contrast, captures multiple workers performing the same task with individual strategies. EgoDex includes data across 194 tasks from multiple participants; Ego4D [12] captured daily activities of 931 participants across 9 countries.
1.4 Empirical Evidence for Human Hand Data
Until 2024, robot control from human data alone was widely considered infeasible. Results beginning in 2025 overturned this assumption.
| Paper | Approach | Key Result | Robot Data |
|---|---|---|---|
| X-Sim [5] | 1 RGBD → sim RL | +30% task progress, 10× collection savings | 0 |
| Human2Sim2Robot [6] | 1 demo → sim RL → Allegro Hand | +55–68% vs baselines | 0 |
| EgoZero [7] | Smart glasses → 3D points | 70% avg across 7 tasks | 0 |
| VidBot [8] | Internet RGB → affordance | 13-task zero-shot | 0 |
| LAPA [9] | Internet video pretrain → fine-tune | 30× efficiency, +6.2%p vs OpenVLA | Minimal |
| UMI [14] | Handheld gripper → diffusion policy | Cup arrangement 100%, dynamic tossing 87.5%, outdoor generalization 71.7% | 0 |
| ACT-1 [15] | Skill Capture Glove + Skill Transform | 90% transfer success, 33 manipulation tasks with 0 robot data + Airbnb zero-shot | 0 |
The consistent implication is that human hand data (Data B) is a viable data source for robot learning. X-Sim (CoRL 2025 Oral) and Human2Sim2Robot demonstrated that a single human demonstration can yield dexterous manipulation policies (Chapter 4).
However, important limitations remain. These studies were validated on only 5–13 lab tasks, with contact-rich tasks (screw driving, capping) largely absent. EgoZero's 70% success rate falls short of industrial requirements (>95%). This gap motivates the need for Data A + Data B co-training (Chapter 5) and tactile information (Chapter 3).
1.5 Defining the Core Questions
From the above analysis, we define three questions that thread through this entire survey:
- Can Data B alone enable robot control? — X-Sim, EgoZero, and VidBot provide positive signals, but limitations exist for contact-rich tasks (Chapter 4).
- Does combining Data A + Data B improve performance? — EgoMimic +34–228%, EgoScale R²=0.9983, AoE 45%→95% (Close Laptop task) provide strong affirmative evidence. However, co-training that includes tactile data has not been attempted (Chapter 5).
- Can teleoperation be entirely eliminated? — This is the ultimate question TacPlay aims to answer. By equipping the robot with the same tactile glove and learning the embodiment gap through autonomous exploration, Data A itself may become unnecessary (Chapter 9).
1.6 Connection to Our Direction
TacGlove/TacTeleOp [#26] addresses Question 2; TacPlay [#27] addresses Question 3. Once we recognize that teleoperation's bottleneck is not merely an efficiency issue but a structural limitation, large-scale collection and utilization of human hand data (Data B) becomes a necessity, not an option. The next chapter analyzes the specific hardware — gloves, exoskeletons, smart glasses — used to collect this Data B (Chapter 2).
References
- [1] Fang, H.-S., Agrawal, P., et al. (2025). DEXOP: Dexterous Manipulation with Passive Exoskeleton. IEEE RA-L. https://arxiv.org/abs/2509.04441 #10
- [2] Zheng, R., et al. (2026). EgoScale: Egocentric Video Pretraining for Scalable Robot Learning. arXiv. https://research.nvidia.com/labs/gear/egoscale/
- [3] Yang, B., et al. (2026). AoE: Always-on Egocentric Data Collection for Robot Learning. arXiv.
- [4] Kareer, S., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv. https://arxiv.org/abs/2410.24221
- [5] Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation for Robot Learning. CoRL 2025 Oral. https://portal-cornell.github.io/X-Sim/
- [6] Lum, T. G. W., et al. (2025). Human2Sim2Robot: Dexterous Manipulation Transfer via Simulation. CoRL 2025.
- [7] Liu, V., et al. (2025). EgoZero: Robot Policy Learning from Egocentric Video without Robot Data. arXiv.
- [8] Chen, H., et al. (2025). VidBot: Learning Robot Manipulation from Internet Videos. CVPR 2025.
- [9] Ye, S., et al. (2025). LAPA: Latent Action Pretraining from Videos. ICLR 2025.
- [10] Hoque, R., et al. (2025). EgoDex: A Large-Scale Egocentric Dexterous Manipulation Dataset. arXiv.
- [11] BuildAI (2025). Egocentric-10K: 10,000 Hours of Factory Egocentric Video. Hugging Face.
- [12] Grauman, K., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR 2022.
- [13] SJTU (2024). AirExo: Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild. ICRA 2024.
- [14] Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS 2024. https://umi-gripper.github.io/ #35
- [15] Sunday Robotics (2025). ACT-1: A Robot Foundation Model. Technical Report. https://www.sundayrobotics.com/act-1 #29