Chapter 1: Why Human Hand Data? — Limitations of Teleoperation and Alternatives
Summary
The core bottleneck in robot manipulation learning is data collection. Teleoperation suffers from a throughput ceiling of roughly 10 demonstrations per hour and a fundamental inability to handle complex contact-rich tasks. This chapter analyzes these bottlenecks and presents quantitative evidence for why human hand data (Data B) is a compelling alternative in terms of scale, realism, and diversity.
1.1 Introduction: The Data Scarcity Problem
Robot manipulation policy learning demands large-scale data. As demonstrated by RT-2's hundreds of thousands of episodes and pi0 [#2]'s industrial-scale heterogeneous training, a strong positive correlation exists between data scale and policy performance. Yet most research labs possess only hundreds to thousands of robot demonstration episodes — this remains the most fundamental bottleneck in robot learning today.
Teleoperation, the traditional approach to data acquisition, has a skilled operator remotely control the robot to produce demonstrations. While it generates robot-executable trajectories (Data A), it carries limitations that make it structurally unscalable.
1.2 Three Bottlenecks of Teleoperation
Throughput Ceiling
Teleoperation data collection typically yields 5–10 demonstrations per hour. DEXOP [1] [#10] reports up to 5 per minute (300/hour), but this figure reflects short grasping episodes. For multi-step contact-rich tasks — such as tightening screws or opening caps — each episode takes several minutes, reducing effective throughput to 10–20 per hour.
At this rate, accumulating the 20,854 hours of data used by EgoScale [2] would require tens of thousands of hours of skilled operator labor — a practically infeasible scale.
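The labor estimate above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 3-minute average episode length (a figure chosen for illustration, not reported by any of the cited papers) together with the 10 demos/hour throughput from the text:

```python
# Back-of-the-envelope check of the operator-labor claim (assumed figures).
EGOSCALE_HOURS = 20_854      # target corpus size, from EgoScale [2]
DEMOS_PER_HOUR = 10          # typical teleoperation throughput (see above)
AVG_DEMO_MINUTES = 3         # assumed length of a contact-rich episode

# Hours of demonstrated behavior produced per hour of operator labor.
demo_hours_per_labor_hour = DEMOS_PER_HOUR * AVG_DEMO_MINUTES / 60  # 0.5

labor_hours = EGOSCALE_HOURS / demo_hours_per_labor_hour
print(f"{labor_hours:,.0f} operator-hours")  # 41,708 operator-hours
```

Even under these generous assumptions the result lands in the tens of thousands of operator-hours, consistent with the claim in the text.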
Task Impossibility
DEXOP [1] provides the clearest demonstration of teleoperation's fundamental limits. On the drilling task, teleoperation success rate was 0%. Simultaneously controlling individual robot fingers to grasp, press, and rotate a drill via remote interface exceeds the capability of current teleoperation systems. In contrast, DEXOP's passive exoskeleton enabled direct demonstration of this task.
This result implies that teleoperation is not merely "slow" — for certain contact-rich tasks, it is fundamentally impossible. Many industrial processes — capping, precision assembly, flexible part handling — fall into this category.
Cost Structure
AirExo [13] directly compared the cost structures of teleoperation versus alternative data collection. A teleoperation platform (robot + control system) costs approximately $60,000, while AirExo's passive exoskeleton costs just $600 — a 100× difference. AoE [3] further widened this gap with a $20 smartphone neck mount for large-scale egocentric data collection.
The cost issue extends beyond individual labs. Covering multiple processes, object variations, and shift changes in industrial settings requires parallel data collection, for which teleoperation is poorly suited. UMI [14] [#35] further disrupted this cost structure with a $371 handheld gripper that eliminates the robot entirely — collecting demonstration data using only a GoPro fisheye camera and IMU-based SLAM.
| Collection Method | Cost | Key Feature |
|---|---|---|
| Teleoperation | ~$60,000 | Robot + control system required |
| AirExo (passive exoskeleton) | $600 | Low-cost, arm-level |
| UMI (handheld gripper) | $371 | No robot needed, GoPro fisheye + IMU SLAM |
| AoE (smartphone mount) | $20 | Lowest cost, egocentric video only |
1.3 Human Hand Data as an Alternative
Defining Data A vs Data B
This survey distinguishes demonstration data for robot learning into two categories based on collection method. Data A (teleoperation data) refers to robot-executable trajectory data collected by skilled operators remotely controlling the robot. Data B (human hand data) refers to human demonstration data collected by workers wearing wearable sensors (gloves, glasses) while performing tasks naturally, without a robot. This distinction serves as the foundation for all subsequent chapters.
| | Data A (Teleoperation) | Data B (Human Hand Data) |
|---|---|---|
| Collection method | Remote robot control | Natural work with glove/glasses |
| Collector | Skilled robot operator | On-site worker or general population |
| Cost | High ($60K+ system) | Low ($600 or less) |
| Throughput | ~10 demos/hr | Natural work speed |
| Robot required | Yes | No |
| Executability | Directly robot-executable | Requires retargeting |
| Distribution | Constrained by robot kinematics | Natural human distribution |
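The "requires retargeting" row in the table above is the key structural difference between the two data types: human hand poses must be mapped onto robot commands before execution. As a minimal illustration, the sketch below retargets two fingertip keypoints to a parallel-jaw gripper command (grasp midpoint plus jaw width). The function name, the `scale` parameter, and the 0.085 m jaw limit are illustrative assumptions; real pipelines also handle wrist pose, inverse kinematics, and temporal filtering.

```python
import numpy as np

def retarget_to_gripper(thumb_tip, index_tip, scale=1.0, max_width=0.085):
    """Map human thumb/index fingertip positions (meters, camera frame)
    to a parallel-jaw gripper command. A minimal sketch: the grasp point
    is the fingertip midpoint, the jaw opening is the fingertip distance
    clipped to an assumed 0.085 m jaw limit."""
    thumb_tip = np.asarray(thumb_tip, dtype=float)
    index_tip = np.asarray(index_tip, dtype=float)
    midpoint = (thumb_tip + index_tip) / 2                 # target end-effector position
    width = scale * np.linalg.norm(thumb_tip - index_tip)  # desired jaw opening
    return midpoint, min(width, max_width)

# Fingers 6 cm apart -> grasp at their midpoint with a 6 cm opening.
mid, w = retarget_to_gripper([0.10, 0.00, 0.30], [0.10, 0.06, 0.30])
```

Even this toy mapping makes the embodiment gap concrete: a five-finger human grasp is collapsed into two gripper degrees of freedom, which is why retargeting quality is a recurring theme in later chapters.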
Data B offers three core advantages.
Scale
EgoDex [10] collected 829 hours, 90 million frames, and 194 tasks of dexterous manipulation data using Apple Vision Pro. BuildAI [11] released 10,000 hours of factory egocentric data from 2,153 workers. Such scales are unattainable through teleoperation.
EgoScale [2] pretrained on 20,854 hours of egocentric human video and demonstrated a log-linear scaling law (R² = 0.9983). This means that performance improves predictably as the volume of human data increases.
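A log-linear scaling law of this kind means performance grows linearly in the logarithm of data volume. The sketch below fits such a law to synthetic points; the coefficients (0.20, 0.08) are invented for illustration and are not EgoScale's actual numbers:

```python
import numpy as np

# Hypothetical (data_hours, success_rate) points following a log-linear
# trend; illustrates the form of the fit, not EgoScale's reported values.
hours = np.array([100, 500, 2_000, 10_000, 20_000], dtype=float)
success = 0.20 + 0.08 * np.log10(hours)   # synthetic, noiseless data

# Fit success = a + b * log10(hours) by least squares.
b, a = np.polyfit(np.log10(hours), success, deg=1)

# Coefficient of determination (1.0 here since the data is noiseless;
# EgoScale reports R^2 = 0.9983 on real training runs).
pred = a + b * np.log10(hours)
r2 = 1 - np.sum((success - pred) ** 2) / np.sum((success - success.mean()) ** 2)
```

The practical payoff of such a law is predictability: one can extrapolate how much additional human data a target success rate would require before collecting it.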
Realism
Humans work in real environments, with real objects, using natural strategies. This data captures natural adaptations to object shape, material, and weight, free from the distortions that robot kinematics impose on teleoperation data. EgoMimic [4] reported that adding 1 hour of human hand data to 2 hours of robot data yields a +34–228% improvement over 3 hours of robot data alone. This scaling trend, in which human data reaches higher asymptotic performance per unit of collection time than robot data, quantitatively demonstrates the value of this realism.
Diversity
Teleoperation typically relies on one or two skilled operators, limiting strategy diversity. Human hand data, by contrast, captures multiple workers performing the same task with individual strategies. EgoDex includes data across 194 tasks from multiple participants; Ego4D [12] captured daily activities of 931 participants across 9 countries.
1.4 Empirical Evidence for Human Hand Data
Until 2024, robot control from human data alone was widely considered infeasible. Results beginning in 2025 overturned this assumption.
| Paper | Approach | Key Result | Robot Data |
|---|---|---|---|
| X-Sim [5] | 1 RGBD → sim RL | +30% task progress, 10× collection savings | 0 |
| Human2Sim2Robot [6] | 1 demo → sim RL → Allegro Hand | +55–68% vs baselines | 0 |
| EgoZero [7] | Smart glasses → 3D points | 70% avg across 7 tasks | 0 |
| VidBot [8] | Internet RGB → affordance | 13-task zero-shot | 0 |
| LAPA [9] | Internet video pretrain → fine-tune | 30× efficiency, +6.2%p vs OpenVLA | Minimal |
| UMI [14] | Handheld gripper → diffusion policy | Cup arrangement 100%, dynamic tossing 87.5%, outdoor generalization 71.7% | 0 |
| ACT-1 [15] | Skill Capture Glove + Skill Transform | 90% transfer success, 33 manipulation tasks with 0 robot data + Airbnb zero-shot | 0 |
The consistent implication is that human hand data (Data B) is a viable data source for robot learning. X-Sim (CoRL 2025 Oral) and Human2Sim2Robot demonstrated that a single human demonstration can yield dexterous manipulation policies (Chapter 4).
However, important limitations remain. These studies were validated on only 5–13 lab tasks, with contact-rich tasks (screw driving, capping) largely absent. EgoZero's 70% success rate falls short of industrial requirements (>95%). This gap motivates the need for Data A + Data B co-training (Chapter 5) and tactile information (Chapter 3).
1.5 Defining the Core Questions
From the above analysis, we define three questions that thread through this entire survey:
- Can Data B alone enable robot control? — X-Sim, EgoZero, and VidBot provide positive signals, but limitations exist for contact-rich tasks (Chapter 4).
- Does combining Data A + Data B improve performance? — EgoMimic +34–228%, EgoScale R²=0.9983, AoE 45%→95% (Close Laptop task) provide strong affirmative evidence. However, co-training that includes tactile data has not been attempted (Chapter 5).
- Can teleoperation be entirely eliminated? — This is the ultimate question TacPlay aims to answer. By equipping the robot with the same tactile glove and learning the embodiment gap through autonomous exploration, Data A itself may become unnecessary (Chapter 9).
1.6 Connection to Our Direction
TacGlove/TacTeleOp [#26] addresses Question 2; TacPlay [#27] addresses Question 3. Once we recognize that teleoperation's bottleneck is not merely an efficiency issue but a structural limitation, large-scale collection and utilization of human hand data (Data B) becomes a necessity, not an option. The next chapter analyzes the specific hardware — gloves, exoskeletons, smart glasses — used to collect this Data B (Chapter 2).
References
- [1] Fang, H.-S., Agrawal, P., et al. (2025). DEXOP: Dexterous Manipulation with Passive Exoskeleton. IEEE RA-L. https://arxiv.org/abs/2509.04441 #10
- [2] Zheng, R., et al. (2026). EgoScale: Egocentric Video Pretraining for Scalable Robot Learning. arXiv. https://research.nvidia.com/labs/gear/egoscale/
- [3] Yang, B., et al. (2026). AoE: Always-on Egocentric Data Collection for Robot Learning. arXiv.
- [4] Kareer, S., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv. https://arxiv.org/abs/2410.24221
- [5] Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation for Robot Learning. CoRL 2025 Oral. https://portal-cornell.github.io/X-Sim/
- [6] Lum, T. G. W., et al. (2025). Human2Sim2Robot: Dexterous Manipulation Transfer via Simulation. CoRL 2025.
- [7] Liu, V., et al. (2025). EgoZero: Robot Policy Learning from Egocentric Video without Robot Data. arXiv.
- [8] Chen, H., et al. (2025). VidBot: Learning Robot Manipulation from Internet Videos. CVPR 2025.
- [9] Ye, S., et al. (2025). LAPA: Latent Action Pretraining from Videos. ICLR 2025.
- [10] Hoque, R., et al. (2025). EgoDex: A Large-Scale Egocentric Dexterous Manipulation Dataset. arXiv.
- [11] BuildAI (2025). Egocentric-10K: 10,000 Hours of Factory Egocentric Video. Hugging Face.
- [12] Grauman, K., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR 2022.
- [13] SJTU (2024). AirExo: Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild. ICRA 2024.
- [14] Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS 2024. https://umi-gripper.github.io/ #35
- [15] Sunday Robotics (2025). ACT-1: A Robot Foundation Model. Technical Report. https://www.sundayrobotics.com/act-1 #29