Part III: Our Proposals

Chapter 9: TacPlay — Teleop-Free Autonomous Learning

Written: 2026-04-07 · Last updated: 2026-04-09

Summary

Note: TacPlay is a future research direction for next year, building on the outcomes of TacGlove (Chapter 7) and TacTeleOp (Chapter 8). This year focuses on TacGlove hardware development and TacTeleOp multi-object grasping validation; TacPlay will be pursued after these results are secured.

TacPlay [#27], a successor to TacGlove [#26]/TacTeleOp, completely eliminates teleoperation. By mounting the same tactile glove on the robot, setting human tactile patterns as "targets," and having the robot learn the embodiment gap through autonomous play, it extends OSMO's passive Embodiment Bridge to active learning and DexH2R's kinematic residuals to tactile residuals. "Active cross-embodiment learning in tactile space" is unprecedented, and cross-task generalization of tactile residuals is this work's most ambitious hypothesis.

9.1 Introduction: Is Teleoperation Truly Necessary?

TacTeleOp (Chapter 8) proposes co-training with small Data A (teleop) + large Data B (worker). This reduces but does not eliminate teleoperation. Each new process still requires 50–100 teleop episodes, demanding skilled operators and robot access.

TacPlay's question is more radical: Can Data A itself be made unnecessary?

Chapter 4 confirmed that X-Sim, EgoZero, and VidBot achieve teleop-free learning from visual data alone. TacPlay aims to achieve the same in tactile space rather than visual space. Where visual rewards show limitations on contact-rich tasks (Chapter 4), tactile rewards aim to overcome them.

9.2 Phase 1: Human Contact Prior — Tactile Recipe

Tactile recipes are extracted from TacTeleOp's Data B — structured representations of temporal tactile patterns during manipulation. For example, a capping recipe: (1) thumb and index contact cap edge (t=0), (2) grip stabilization at 2–4N normal force (t=0.5s), (3) clockwise shear for rotation (t=1–3s), (4) increase normal force upon torque resistance (t=3–5s), (5) release after target torque (t=5–6s).

These recipes are statistically extracted from thousands of episodes, averaging individual noise and representing per-object/per-worker variation as distributions.
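As a concrete illustration, the per-phase structure described above can be sketched as a small data type plus an averaging routine. This is a minimal sketch, not the actual extraction pipeline: the names `RecipePhase` and `extract_recipe`, and the assumed episode layout (time × sensors × 3-axis force), are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RecipePhase:
    """One phase of a tactile recipe: a time window with force-target
    distributions (mean, std) aggregated across episodes."""
    t_start: float
    t_end: float
    normal_force: tuple  # (mean, std) in newtons
    shear_force: tuple   # (mean, std) in newtons

def extract_recipe(episodes, t_start, t_end, dt=0.01):
    """Average one time window across episodes into a force-target
    distribution. Each episode: array of shape (steps, sensors, 3),
    where axis -1 is (shear_x, shear_y, normal)."""
    i0, i1 = int(t_start / dt), int(t_end / dt)
    window = np.stack([ep[i0:i1] for ep in episodes])   # (eps, steps, sensors, 3)
    normal = window[..., 2]                             # z component = normal force
    shear = np.linalg.norm(window[..., :2], axis=-1)    # xy magnitude = shear force
    return RecipePhase(t_start, t_end,
                       (normal.mean(), normal.std()),
                       (shear.mean(), shear.std()))
```

Representing each phase as a (mean, std) pair rather than a single trajectory is what lets per-object and per-worker variation survive as a distribution instead of being averaged away.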

Compared to ExoStart [4] [#9], which extracts binary rewards from 9–15 exoskeleton demos, TacPlay's recipes extend this to continuous 3-axis tactile targets — from "was this pose reached?" to "was this tactile pattern reproduced?"

9.3 Phase 2: Robot Autonomous Tactile Play — Core Contribution

The same TacGlove is mounted on the robot hand (Allegro or LEAP). The robot receives Phase 1's tactile recipe as "target" and autonomously interacts with objects, exploring how to reproduce human tactile patterns with its own kinematics.

Reward Function

r_t = -\| \tau_t^{robot} - \tau_t^{target} \|_2 + \alpha \cdot \mathbb{1}[\text{task\_progress}]
  • Term 1 (tactile similarity): L2 distance between robot's current tactile reading and human target pattern. Closer = higher reward.
  • Term 2 (task progress): Sparse indicator for task completion. Tactile matching alone cannot guarantee task completion (same tactile pattern may yield different outcomes).
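The reward above translates directly into code. The sketch below assumes `tau_robot` and `tau_target` are flat arrays of per-sensor force readings in the same units (the shared-sensor premise of Section 9.3); the function name and signature are illustrative, not part of any existing API.

```python
import numpy as np

def tacplay_reward(tau_robot, tau_target, task_done, alpha=1.0):
    """r_t = -||tau_robot - tau_target||_2 + alpha * 1[task_progress].

    Term 1: dense tactile similarity (negative L2 distance to the
    human target pattern). Term 2: sparse task-progress indicator,
    needed because tactile matching alone cannot certify completion.
    """
    similarity = -np.linalg.norm(np.asarray(tau_robot) - np.asarray(tau_target))
    progress = alpha * float(task_done)
    return similarity + progress
```

Note that the dense term is maximized at 0 (perfect pattern match), so the sparse bonus `alpha` sets the scale of how much task completion outweighs residual tactile mismatch.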

The key insight enabling this approach: the same physical sensor generates data on both sides (OSMO [#18]'s Embodiment Bridge principle). When the human produced 2.3N normal force and 0.8N shear on sensor 3 while turning a cap, the robot must produce similar values on the same sensor 3. Same sensor = same units, same scale — no domain adaptation needed.

Of course, kinematic differences mean the robot cannot exactly replicate human patterns. A 5-fingered human and 4-fingered robot create different contact distributions. This systematic difference is precisely the tactile residual.

Tactile Residual Learning

Through repeated autonomous play, the robot automatically learns the systematic difference between human and robot tactile patterns:

\Delta_{tactile} = f(\tau^{human}, \tau^{robot}, \text{kinematics})

Two key differences from DexH2R's kinematic residuals: (1) defined in tactile space, directly capturing contact dynamics differences; (2) cross-task generalization hypothesis — since kinematic differences are physical constants independent of object/task, tactile residuals may generalize across tasks.
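Because the residual is hypothesized to be a systematic (task-independent) bias, even a linear model over the human tactile pattern captures the idea. The sketch below is a deliberately simple stand-in for f(τ^human, τ^robot, kinematics) — a least-squares fit of the robot-minus-human offset; the class name and the omission of an explicit kinematics input are simplifying assumptions.

```python
import numpy as np

class TactileResidual:
    """Minimal sketch of Delta_tactile: a linear map from the human
    tactile pattern to the systematic robot-minus-human offset."""

    def fit(self, tau_human, tau_robot):
        # tau_*: (samples, channels), paired readings from autonomous play.
        X = np.hstack([tau_human, np.ones((len(tau_human), 1))])  # bias column
        Y = tau_robot - tau_human                                  # observed residual
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return self

    def predict(self, tau_human):
        X = np.hstack([tau_human, np.ones((len(tau_human), 1))])
        return X @ self.W
```

If H3 holds, a model fit on capping-play data should predict the offset on label-application data with no refit; if it fails, the same class is simply refit per task (the fallback in Section 9.5).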

9.4 Phase 3: Contact-Guided Deployment

The final deployed policy:

\pi_{robot} = \pi_{human} + \Delta_{residual}

When adding new processes: (1) human demonstrates tactile recipe via Data B (no teleop), (2) apply existing tactile residual (if cross-task generalization holds), (3) optional fine-tuning via 2–4 hours of autonomous play per object.

Teleoperation is completely eliminated in this pipeline.
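The deployed composition π_robot = π_human + Δ_residual is an additive correction, which can be sketched generically. Both `pi_human` and `residual` are hypothetical callables (observation → action/target vector); the point is only that deployment requires no new learned policy, just the sum.

```python
import numpy as np

def compose_policy(pi_human, residual):
    """Sketch of pi_robot = pi_human + Delta_residual: the human-derived
    base command corrected by the learned embodiment residual."""
    def pi_robot(obs):
        return np.asarray(pi_human(obs)) + np.asarray(residual(obs))
    return pi_robot
```

Swapping in a new `pi_human` (a new tactile recipe from Data B) while keeping `residual` fixed is exactly the cross-task reuse that step (2) of the pipeline depends on.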

Cost Comparison

|                       | Teleop-only | TacTeleOp    | TacTeleOp+TacPlay        |
|-----------------------|-------------|--------------|--------------------------|
| Teleop hours          | 33+         | 8            | 0                        |
| Robot autonomous play | 0           | 0            | 2–4 hr/object            |
| Adding new process    | More teleop | Small teleop | Play only                |
| Human operator needed | Yes         | Yes          | No (overnight operation) |

9.5 Core Hypotheses

H1: Tactile targets work as cross-embodiment reward

Just as X-Sim's visual reward (object trajectory) works cross-embodiment, TacPlay's tactile reward (pattern similarity) should work cross-embodiment. TacPlay operates at contact level rather than object level, providing finer reward signals for contact-rich tasks.

H2: Autonomous play can replace teleop

ExoStart's 9–15 demos → RL → >50% success provides precedent. Human2Sim2Robot's 1 demo → sim RL for dexterous manipulation is additional precedent. TacPlay's tactile targets (richer than ExoStart's binary reward) should enable comparable or better convergence.

H3: Tactile residuals generalize cross-task

Most ambitious and riskiest hypothesis. Physical rationale: kinematic differences are systematic biases independent of object/task — the same robot is always kinematically different in the same way. Therefore, tactile residuals learned on capping may be valid for label application.

Fallback: Even if cross-task generalization fails, task-specific tactile residuals superior to DexH2R's visual residuals on contact-rich tasks still constitutes a contribution. In contact-rich settings, vision suffers from contact surface occlusion, while tactile directly observes the contact surface.

9.6 Differentiation from Related Work

vs OSMO Embodiment Bridge

|                | OSMO                              | TacPlay                            |
|----------------|-----------------------------------|------------------------------------|
| Data alignment | Passive (collect from same sensor)| Active (use same sensor as reward) |
| Gap learning   | None                              | Automatic tactile residual learning|
| Teleop         | Required                          | Not required                       |

vs DexH2R Residual RL

|                     | DexH2R                     | TacTlay → TacPlay              |
|---------------------|----------------------------|--------------------------------|
| Residual space      | Kinematic (joint/position) | Tactile (contact force/pattern)|
| Reward              | Visual task reward         | Tactile pattern similarity     |
| Generalization      | Task-specific              | Cross-task hypothesis          |
| Contact observation | Indirect (visual)          | Direct (tactile)               |

vs X-Sim (Teleop-Free)

|                 | X-Sim                      | TacPlay                    |
|-----------------|----------------------------|----------------------------|
| Reward          | Object trajectory (visual) | Tactile pattern similarity |
| Strength domain | Non-contact                | Contact-rich               |
| Precision       | Object-level               | Contact-level              |

9.7 Risks and Mitigation

R1: RL convergence failure (Severity: High)

Tactile-target RL has never been attempted. Reward may be sparse or non-smooth.

Mitigation: (1) Sim-first verification (MuJoCo + TACTO). (2) Reference ExoStart's auto-curriculum RL + dynamics filtering. (3) Reward shaping: curriculum starting from partial pattern matching.
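Mitigation (3), the partial-pattern-matching curriculum, can be sketched as a staged reward: early stages score only the strongest target channels, later stages require the full pattern. This is an illustrative design, not a validated schedule; `shaped_reward`, `stage`, and `n_stages` are hypothetical names.

```python
import numpy as np

def shaped_reward(tau_robot, tau_target, progress, stage, n_stages=4, alpha=1.0):
    """Curriculum sketch: stage 0 matches only the highest-magnitude
    target channel(s); the final stage requires the full pattern."""
    k = max(1, int(len(tau_target) * (stage + 1) / n_stages))
    idx = np.argsort(-np.abs(tau_target))[:k]   # rank channels by target magnitude
    similarity = -np.linalg.norm(tau_robot[idx] - tau_target[idx])
    return similarity + alpha * float(progress)
```

Densifying the reward this way is one standard answer to the sparse/non-smooth reward risk: the agent first learns to reproduce the dominant contacts before being penalized for every sensor channel.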

R2: Tactile transfer quality (Severity: High)

Kinematic differences may make the tactile reward signal meaningless despite the shared glove.

Mitigation: (1) Pre-measure human/robot tactile similarity on same object/grip. (2) Apply domain adaptation if similarity is too low. (3) Residual learning itself corrects this difference.

R3: Safety (Severity: Medium)

Autonomous real-world play risks object damage, environment damage, or self-damage.

Mitigation: (1) Force limits (within OSMO's 0.3–80N range). (2) Position limits. (3) Initial human supervision. (4) Sim-first, then real transfer.

9.8 Key Discussion: Is This Truly Novel?

Honest assessment of TacPlay's novelty:

Strongest novelty (P1): "Active cross-embodiment learning in tactile space" — to our knowledge, no published work has attempted this. OSMO is passive, DexH2R is vision-based. Combining active learning with a tactile embodiment bridge is novel.

Riskiest claim (P3): Cross-task generalization of tactile residuals. Near-zero existing evidence. High impact if correct, threatens core contribution if wrong. Fallback essential: task-specific tactile residuals superior to DexH2R is still a contribution.

OSMO v2 threat: Meta FAIR may add active learning to OSMO within 6 months. Mitigation: stretchable hardware differentiation + early submission + industrial application.

9.9 Connection to Our Direction

TacPlay is only possible on top of TacGlove and TacTeleOp:

  • TacGlove's tactile glove → TacPlay's Embodiment Bridge
  • TacTeleOp's Data B → TacPlay's tactile recipe source
  • TacTeleOp's co-training results → TacPlay's baseline comparison

TacPlay is a future research direction for next year. This year's priority is TacGlove hardware validation and TacTeleOp multi-object grasping experiments. Once these results are secured, TacPlay will extend the system to fully autonomous learning. When all three are presented together, the full system's originality is secured.

The next chapter presents experimental designs for validating TacGlove, TacTeleOp, and TacPlay (Chapter 10).

References

  1. Yin, J., et al. (2025). OSMO: A Large-Scale Tactile Glove. arXiv. https://arxiv.org/abs/2512.08920 [#18]
  2. DexH2R (2024). Task-Oriented Residual RL for Dexterous Transfer. arXiv.
  3. Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation. CoRL 2025 Oral.
  4. Si, Z., et al. (2025). ExoStart: Exoskeleton-Aided Dexterous Manipulation. arXiv. [#9]
  5. Lum, T. G. W., et al. (2025). Human2Sim2Robot. CoRL 2025.
  6. Park, M., & Park, Y.-L., et al. (2024). Stretchable Glove. Nature Communications. [#6]
  7. Physical Intelligence (2025). pi0. arXiv. [#2]
  8. Zheng, R., et al. (2026). EgoScale. arXiv.
  9. Li, et al. (2025). ManipTrans. CVPR 2025.