Chapter 5: Synergy of Teleoperation and Human Data — Co-training
Summary
Co-training human data (Data B) with robot data (Data A) consistently outperforms either alone: EgoMimic gains +34–228% per task, EgoScale fits a log-linear scaling law with R² = 0.9983, AoE lifts Close Laptop from 45% to 95%, and DEXOP reaches 0.513 success versus 0.425 for teleop-only. Furthermore, pi0 observed emergent alignment of human-robot representations at scale. Most co-training studies use no tactile data; UMI-FT showed the first signal with finger-level F/T sensing, but distributed tactile co-training remains entirely unexplored.
5.1 Introduction
Chapter 4 confirmed that Data B alone can reach ~70%. This chapter asks: "How much further can Data A push it?" This corresponds to TacTeleOp [#26] Hypothesis H2, the area with the richest existing evidence.
The core idea of co-training is simple: Data B teaches "what to do," Data A teaches "how this robot does it." Data B provides scale and realism; Data A provides executability and alignment.
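To make the mechanics concrete, the sketch below shows a minimal co-training loop: one shared policy takes gradient steps on batches drawn from both datasets. The network, the behavior-cloning loss, the batch stubs, and the 1:3 robot:human mixing ratio are all illustrative placeholders, not any cited paper's recipe.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 7))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-ins for real dataloaders over Data A (robot) and Data B (human).
robot_batch = lambda n=32: (torch.randn(n, 64), torch.randn(n, 7))
human_batch = lambda n=32: (torch.randn(n, 64), torch.randn(n, 7))

for step in range(200):
    # One shared policy: ~25% of steps see robot data (executability),
    # the rest see human data (scale). The ratio is a free hyperparameter.
    obs, act = robot_batch() if step % 4 == 0 else human_batch()
    loss = nn.functional.mse_loss(policy(obs), act)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because gradients from both sources flow through the same weights, Data A anchors the action space to this robot while Data B supplies the bulk of task coverage.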
5.2 Quantitative Evidence for Co-training
EgoMimic: "1 Hour of Hand > 2 Hours of Robot" (Georgia Tech, 2024)
EgoMimic [1] pioneered the co-training paradigm. It collected egocentric human data via Project Aria glasses and co-trained on a low-cost bimanual manipulator designed to minimize human-robot kinematic gap.
| Comparison | Result |
|---|---|
| 2 hr robot + 1 hr hand vs 3 hr robot (ACT) | +34–228% (per task) |
| Key finding | Human data shows higher asymptotic performance per unit time than robot data |
The key implication: per unit of collection time, human hand data trends toward a higher asymptotic performance than teleoperation data. This is the most direct justification for TacTeleOp's "800-hour worker data collection" strategy.
Limitations: only 3 tasks validated, custom robot (general-purpose robots unvalidated), gripper/bimanual (dexterous hand unvalidated), no tactile.
EgoScale: Log-Linear Scaling Law (NVIDIA, 2026)
EgoScale [2] established the theoretical foundation for co-training: pretraining a flow-based VLA on 20,854 hours of egocentric human video, followed by lightweight aligned human-robot mid-training, yields:
| Metric | Value |
|---|---|
| Scaling law | Log-linear, R² = 0.9983 |
| vs no pretraining | +54% |
| Cross-robot transfer | Effective even for lower-DoF robots |
An R² of 0.9983 means the relationship between human data volume and performance is almost perfectly predictable within the fitted range. This allows data-investment ROI to be estimated in advance and provides a basis for projecting the expected effect of TacTeleOp's "800-hour investment."
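The fit itself is elementary; the sketch below fits S(n) = a + b·log10(n) by least squares and reports R². The data points are made up for illustration; only the procedure, not the numbers, reflects EgoScale.

```python
import numpy as np

hours = np.array([100.0, 500.0, 2_000.0, 8_000.0, 20_854.0])  # Data B volume
success = np.array([0.31, 0.42, 0.52, 0.61, 0.68])            # hypothetical

b, a = np.polyfit(np.log10(hours), success, deg=1)            # slope, intercept
pred = a + b * np.log10(hours)
r2 = 1.0 - np.sum((success - pred) ** 2) / np.sum((success - success.mean()) ** 2)
print(f"S(n) ≈ {a:.3f} + {b:.3f}·log10(n),  R² = {r2:.4f}")

# A high R² on such a fit is what permits extrapolating the return on an
# additional data investment (e.g., an 800-hour corpus).
print(f"predicted success at 800 h: {a + b * np.log10(800):.3f}")
```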
Key question: Does tactile data follow the same log-linear scaling? EgoScale is vision-only. If tactile information density differs from visual, the scaling curve may differ — converging faster (less data needed) or slower (more data needed). This is one of TacTeleOp's core research questions (Chapter 10).
AoE: Dramatic Effect of Minimal Teleop + Abundant Human Data (2026)
AoE [3] most directly demonstrates TacTeleOp's scenario, using a $20 smartphone neck mount for always-on egocentric collection and the FLARE framework for co-training:
| Task | Teleop only (50 demos) | + AoE (200 human demos) |
|---|---|---|
| Pick and Place | 45.0% | 75.0% |
| Close Laptop | 45.0% | 95.0% |
| Fold Scarf | - | 10.0% (HW latency) |
| Push Bowl & Pour Seed | 0.0% | 20.0% |
In an ablation, even 10 teleop demos + 200 AoE demos lifted Close Laptop from 0% to 55%, showing that a minimal amount of robot data can unlock the dramatic effect of abundant human data.
TacTeleOp connection: AoE's 50 teleop + 200 human → 45%→95% (Close Laptop task) directly validates TacTeleOp's "small Data A (50–100 teleop) + large Data B (50,000+ demos)" scenario. Pour Seed's 20% reveals vision-only co-training's contact-rich ceiling, the opportunity space for tactile addition.
DEXOP: Exoskeleton Data + Minimal Teleop (MIT, 2025)
DEXOP [4] [#10] mixed passive exoskeleton data with teleop data:
| Comparison | Result |
|---|---|
| 160 exo + 40 teleop | 0.513 success |
| 200 teleop only | 0.425 success |
| Collection time | The exo mix took half the time |
The pattern "non-robot data source + small robot data > pure robot data" aligns with EgoMimic and AoE.
pi0: Emergent Alignment at Scale (Physical Intelligence, 2025)
pi0 [5] [#2] pretrained a VLA on large-scale heterogeneous data (internet vision-language + robot + human video) and discovered emergent alignment — human-robot representations automatically converging without explicit alignment loss during co-finetuning.
| Metric | Value |
|---|---|
| Co-finetuning effect | 2× improvement across 4 scenarios |
| Emergent alignment | Human hand–robot gripper representations auto-converge |
This finding raises a fundamental question: With sufficient data scale, does the human-robot gap resolve automatically without explicit retargeting or alignment loss? If so, TacTeleOp's explicit cross-embodiment retargeting (Stage 2) may be unnecessary.
Three key constraints exist: (1) pi0 requires industrial-scale compute, infeasible for academic labs. (2) Observed only for grippers, unvalidated for dexterous hands. (3) Whether emergent alignment occurs for the tactile modality is unexplored. The third constraint is especially important — whether tactile emergent alignment occurs is a novel research question TacTeleOp can explore.
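One concrete way to test for such convergence is linear CKA (Kornblith et al., 2019) between paired human and robot embeddings; pi0 does not publish this probe, so the sketch below, run on synthetic features, is an assumed methodology rather than their method.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n_samples x dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
human = rng.standard_normal((256, 64))        # human-video embeddings (synthetic)
q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
robot = human @ q                             # same content in a rotated basis
print(f"CKA = {linear_cka(human, robot):.2f}")  # ≈ 1.0 when representations align
```

CKA near 1.0 on paired human/robot frames of the same task would be quantitative evidence that alignment has emerged; tracking it against data scale would localize where the convergence begins.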
DexWM: World Model Approach (Meta FAIR, 2025)
DexWM [6] learned a world model from 900+ hours of human/robot video and performed dexterous manipulation via MPC. It achieved 83% zero-shot grasping success and +50% over Diffusion Policy.
The world model approach is another form of co-training: learning world dynamics from human data and applying them to robot action planning. Incorporating tactile state into TacPlay [#27]'s autonomous exploration world model could enable more accurate contact prediction.
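The planning loop behind such an approach is compact. Below is a random-shooting MPC sketch with a toy dynamics stand-in; the `rollout` interface, the cost function, and all dimensions are assumptions, not DexWM's published API.

```python
import torch

def plan(world_model, state, cost_fn, horizon=5, n_samples=64, act_dim=3):
    """Random-shooting MPC: sample candidate action sequences, roll each
    through the learned world model, execute the first action of the
    cheapest rollout."""
    actions = torch.randn(n_samples, horizon, act_dim)
    states = world_model.rollout(state, actions)   # predicted futures
    costs = cost_fn(states)                        # shape: (n_samples,)
    return actions[costs.argmin(), 0]              # execute first action only

class ToyModel:
    """Stand-in dynamics: state drifts by the cumulative action sum."""
    def rollout(self, state, actions):
        return state + actions.cumsum(dim=1)

goal = torch.ones(3)
a0 = plan(ToyModel(), state=torch.zeros(3),
          cost_fn=lambda s: ((s[:, -1] - goal) ** 2).sum(-1))
print(a0)
```

A tactile-aware variant would simply extend the predicted state with contact channels, letting the cost function penalize predicted contact errors directly.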
UMI-FT: First Signal of Force/Torque-Aware Co-training (Columbia/Stanford, 2026)
UMI-FT [10] [#36] equipped the UMI (Chapter 4) handheld gripper with CoinFT — a 2 g, 20 mm diameter, 3 mm thick capacitive 6-axis force/torque sensor — on each finger, making it the first study to embed contact force information into in-the-wild demonstration data.
| Task | ACP (tactile+compliance) | DP+Force | DP only |
|---|---|---|---|
| Whiteboard wiping | 92% | 28% | 16% |
| Lightbulb insertion | 95% | 60% | 0% |
| Pumpkin skewering (wild) | 100% | — | 20% |
The key finding is the Adaptive Compliance Policy (ACP): it simultaneously predicts a reference pose (9D), a virtual target pose (9D), a stiffness scalar, a gripper width, and a grasp-force target, switching between compliant control (low stiffness) during approach and rigid control (high stiffness) during insertion. This demonstrates that force information is decisive for contact-rich tasks where pure position control fails (DP only: 0% on lightbulb insertion).
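The control-side idea is ordinary Cartesian impedance with a policy-chosen stiffness. The sketch below (3D position only, illustrative gains, not UMI-FT's released controller) shows how a single predicted scalar switches the arm between compliant and rigid behavior.

```python
import numpy as np

def impedance_force(x, x_target, stiffness, xdot=None, damping_ratio=1.0):
    """F = K (x* - x) - D x_dot, with D set for critical damping."""
    k = np.asarray(stiffness, dtype=float)   # policy-predicted, per step
    d = 2.0 * damping_ratio * np.sqrt(k)     # critical damping per axis
    xdot = np.zeros_like(x) if xdot is None else xdot
    return k * (x_target - x) - d * xdot

x, x_star = np.zeros(3), np.array([0.0, 0.0, 0.01])
f_soft = impedance_force(x, x_star, stiffness=50.0)     # approach: yields to contact
f_stiff = impedance_force(x, x_star, stiffness=2000.0)  # insertion: tracks rigidly
print(f_soft, f_stiff)
```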
TacTeleOp connection: UMI-FT demonstrated that finger-level (2 sensors) force/torque sensing dramatically improves contact-rich co-training performance. However, UMI-FT's sensing amounts to just two 6-axis F/T sensors, one per gripper finger, qualitatively different in information density from TacGlove's 5-finger, 24-channel distributed tactile. "If two sensors lift insertion from 0% to 95% and wiping from 16% to 92%, what happens with 24-channel distributed tactile?" is the new question TacTeleOp will answer.
5.3 Formalizing the Co-training Effect
Synthesizing the above results:
- Data B = scale + realism ("what to do, with what force")
- Data A = executability + alignment ("how this robot does it")
- Data A + Data B > Data A alone (EgoMimic, DEXOP)
- Data A + Data B > Data B alone (X-Sim, EgoZero's 70% ceiling)
- Small A + Large B ≈ optimal (AoE, DEXOP)
EgoScale's log-linear scaling means this effect increases predictably with Data B volume.
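In our notation (not taken from any of the cited papers), the mixed objective and the empirical trend can be written as:

```latex
% Co-training objective over Data A (robot) and Data B (human), with a
% mixing weight \lambda, plus the scaling trend in Data B volume n_B.
\mathcal{L}(\pi) = \lambda\, \mathbb{E}_{(o,a)\sim\mathcal{D}_A}\big[\ell(\pi(o), a)\big]
                 + (1-\lambda)\, \mathbb{E}_{(o,a)\sim\mathcal{D}_B}\big[\ell(\pi(o), a)\big],
\qquad S(n_B) \approx \alpha + \beta \log n_B
```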
5.4 Comparative Analysis
| Study | Human Data | Robot Data | Method | Key Result | Tactile |
|---|---|---|---|---|---|
| EgoMimic | Aria (visual) | teleop | Co-train | +34–228% | No |
| EgoScale | 20,854 hr ego | mid-training set | Pretrain + mid-train | +54%, R²=0.9983 | No |
| AoE | 200 demos | 50 teleop | Co-train (FLARE) | 45→95% (Close Laptop) | No |
| DEXOP | 160 exo | 40 teleop | Mixed training | 0.513>0.425 | No |
| pi0 | internet+human | robot | Co-finetuning | 2×, emergent | No |
| DexWM | 829 hr human | 100 hr robot | World model | 83% grasping | No |
| UMI-FT | In-the-wild demos (with force) | none (handheld) | Diffusion + ACP | 92% wiping, 95% bulb | Finger F/T |
Apart from UMI-FT, all studies use no tactile data, and even UMI-FT is limited to one 6-axis F/T sensor per gripper finger (2 total). Distributed tactile co-training is completely unexplored. This is TacTeleOp's core opportunity.
5.5 Key Discussion: The Unexplored Domain of Tactile Co-training
That co-training works is an established fact. UMI-FT achieved the first breakthrough with finger-level force/torque sensing (whiteboard wiping 16%→92%), but the vast majority of current co-training remains vision-only, and how co-training behaves with distributed tactile remains an open question:
- Tactile scaling law: Does EgoScale's visual log-linear hold for tactile? If tactile has higher information density, it may converge faster — large effect from less data.
- Tactile emergent alignment: Does visual emergent alignment observed in pi0 occur for tactile? Could OSMO's Embodiment Bridge accelerate it?
- Optimal ratio: What is the optimal mixing ratio of Data B (with tactile) and Data A? Same as vision co-training?
- Contact-rich differentiation: As vision-only co-training plateaued at 20% on AoE Pour Seed, and UMI-FT overcame this with 2 F/T sensors, does 24-channel distributed tactile yield further improvement?
UMI-FT provided the first quantitative evidence that "force information is decisive for contact-rich co-training." However, UMI-FT's sensor configuration (one 6-axis F/T per gripper finger) and TacGlove's distributed tactile (5-finger, 24-channel) are qualitatively different in information density. "How do co-training's scaling law and performance ceiling change when distributed tactile is included?" remains an open question, and answering it constitutes TacTeleOp's core scientific contribution.
5.6 Connection to Our Direction
TacTeleOp extends the co-training effects confirmed in this chapter to the tactile domain through three stages (sketched in code after the list):
- Stage 1 (B pretrain): Pretrain on 800 hours of tactile+visual worker data
- Stage 2 (Retargeting): UniTacHand [#16] UV map-based tactile alignment
- Stage 3 (A fine-tune): Fine-tune with 50–100 teleop demos
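A schematic of the three stages, with every function a stub standing in for real training code; this is a reading aid, not the project's actual pipeline.

```python
def pretrain(policy, data):            # Stage 1 stub: Data B at scale
    return policy

def retarget_uv_tactile(data):         # Stage 2 stub: UV-map tactile alignment
    return data

def midtrain(policy, data):            # Stage 2 stub: aligned mid-training
    return policy

def finetune(policy, demos):           # Stage 3 stub: small Data A set
    return policy

def train_tacteleop(policy, worker_data, teleop_demos):
    policy = pretrain(policy, worker_data)               # ~800 h tactile+visual
    policy = midtrain(policy, retarget_uv_tactile(worker_data))
    return finetune(policy, teleop_demos)                # 50-100 teleop demos
```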
If EgoMimic showed the effect of vision co-training, TacTeleOp aims to show the effect of tactile co-training. If EgoScale demonstrated the vision scaling law, TacTeleOp explores the tactile scaling law (Chapter 8, Chapter 10).
References
1. Kareer, S., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv. https://arxiv.org/abs/2410.24221
2. Zheng, R., et al. (2026). EgoScale: Egocentric Video Pretraining for Scalable Robot Learning. arXiv. https://research.nvidia.com/labs/gear/egoscale/
3. Yang, B., et al. (2026). AoE: Always-on Egocentric Data Collection for Robot Learning. arXiv.
4. Fang, H.-S., et al. (2025). DEXOP: Dexterous Manipulation with Passive Exoskeleton. IEEE RA-L. [#10]
5. Physical Intelligence (2025). pi0: A General-Purpose Robot Policy. arXiv. [#2]
6. Goswami, R. G., et al. (2025). DexWM: Dexterous World Models from Human and Robot Data. arXiv.
7. Yang, R., et al. (2025). EgoVLA: Egocentric Vision-Language-Action Model. arXiv.
8. PhysBrain (2025). Egocentric2Embodiment Pipeline. arXiv.
9. RoboWheel (2024). HOI-Based Cross-Embodiment Robot Learning. arXiv.
10. Chi, C., et al. (2026). In-the-Wild Compliant Manipulation with UMI-FT. arXiv. [#36]