Chapter 8: TacTeleOp — Multi-Object Grasping and Tactile Co-training
Summary
TacTeleOp implements sequential multi-object grasping with the ROBOTIS HX5-D20 5-finger 20-DOF robot hand, picking up lipstick-sized boxes one by one from a tabletop and accumulating them in the palm. TacGlove (Chapter 7) is mounted on both the robot and the human operator, enabling simultaneous collection of Data A (teleoperation) and Data B (human demonstration) for co-training, and the pipeline integrates Montana/Murray's classical grasp theory into a learning-based framework. The goal is to address five fundamental limitations shared by existing multi-object grasping research [MultiGrasp, SeqMultiGrasp, SeqGrasp]: no tactile sensing, palm-up-only grasping, 4-finger hands, simulation dependence, and no human demonstration data. Developed in Prof. Park Jong-Woo's lab.
8.1 Introduction: From Hardware to Data Pipeline
Chapter 7's TacGlove [#26] established the hardware foundation of a stretchable tactile glove. However, hardware alone does not produce manipulation capability. Three elements are now needed: (1) a suitable robot hand for TacGlove mounting, (2) a challenging manipulation task for that hand, and (3) a learning pipeline connecting human and robot data. TacTeleOp concretizes these three elements as ROBOTIS HX5-D20 + sequential multi-object grasping + Data A/B co-training.
While TacPlay [#27] (Chapter 9) eliminates teleoperation entirely through autonomous play, TacTeleOp improves teleoperation. Both share the same HX5-D20 + TacGlove platform but differ in data collection strategy.
8.2 ROBOTIS HX5-D20 Robot Hand
Specifications
The ROBOTIS HX5-D20 is a 5-finger 20-DOF dexterous hand released in late 2025/early 2026.
| Item | Specification |
|---|---|
| Fingers | 5 |
| DOF | 20 |
| Actuators | DYNAMIXEL XM335, direct drive |
| Control frequency | 1 kHz |
| Max fingertip force | 14 N |
| Max payload | 15 kg |
| Weight | 1.36 kg |
| Built-in tactile | Fingertip pressure sensor (1 per finger, basic) |
| Communication | RS-485, ROS 2 + ros2_control |
| Price | ~$10,000 USD |
Comparison with Existing Robot Hands
| Hand | DOF | Fingers | Weight | Cost | Tactile |
|---|---|---|---|---|---|
| ROBOTIS HX5-D20 | 20 | 5 | 1.36 kg | ~$10K | Fingertip pressure |
| Allegro Hand V4 | 16 | 4 | 1.08 kg | ~$16K | None built-in |
| LEAP Hand | 16 | 4 | ~0.5 kg | <$2K | None built-in |
| Shadow Hand | 20+ | 5 | ~4 kg | ~$100K+ | 129 sensors |
| Psyonic Ability | 6 | 5 | 0.49 kg | ~$10K | 30 fingertip sensors |
Rationale for Selecting HX5-D20
Four key reasons drive the selection:
- 5 fingers = human-like kinematics: Allegro/LEAP's 4 fingers (16 DOF) have a large kinematic gap from the human hand (27 DOF). HX5-D20's 5-finger 20 DOF narrows this gap. The 5th finger (pinky) is critical for sequential multi-object grasping -- it stabilizes already-grasped objects while other fingers pick up the next one.
- TacGlove mounting compatibility: The 1.36 kg form factor and 5-finger structure are kinematically compatible with TacGlove's whole-hand tactile sensor layout.
- Reasonable price point: At ~$10K, it is 1/10th the cost of Shadow Hand ($100K+) and cheaper than Allegro ($16K), making multi-platform setups realistic.
- ROS 2 native: ros2_control support enables immediate integration with existing research infrastructure.
Built-in Tactile Sensor Limitations
The HX5-D20's built-in tactile sensors are limited to one basic pressure sensor per fingertip. This suffices for binary contact detection but is inadequate for shear force measurement, force distribution mapping, and slip detection. Adding TacGlove provides 24-channel (8 sensors x 3-axis) whole-hand tactile, compensating for these limitations.
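To make the slip-detection claim concrete, a minimal Coulomb-friction check over TacGlove-style 3-axis channels might look like the following sketch. The friction coefficient and the 0.8 threshold are illustrative assumptions, not TacGlove firmware:

```python
import numpy as np

MU = 0.6  # assumed object/skin friction coefficient -- illustrative only

def slip_risk(taxels):
    """Ratio of tangential (shear) to normal force per 3-axis channel.
    taxels: (n_sensors, 3) array with columns (fx, fy, fz), fz = normal.
    Values approaching MU mean the contact is near the Coulomb friction
    cone boundary, i.e. incipient slip."""
    taxels = np.asarray(taxels, float)
    shear = np.linalg.norm(taxels[:, :2], axis=1)
    normal = np.clip(taxels[:, 2], 1e-6, None)  # guard divide-by-zero
    return shear / normal

# Channel 0 is close to the cone boundary; channel 1 grips safely.
readings = np.array([[0.5, 0.2, 1.0],
                     [0.05, 0.0, 2.0]])
flagged = np.where(slip_risk(readings) > 0.8 * MU)[0]
print(flagged)  # [0]
```

In a full system the same ratio would be tracked over time per channel, since a rising trend toward the cone boundary is a stronger slip predictor than any single frame.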
Important fact: The HX5-D20 is recent enough that no learning-based manipulation research has been published using it. This is both a risk and an opportunity -- TacTeleOp could become the first learning-based manipulation study on HX5-D20.
8.3 Multi-Object Grasping Scenario
Target Task
Pick up lipstick-sized small boxes from a tabletop one by one, accumulating them in the palm. After grasping the first object, hold it stably while sequentially adding the second and third objects. This task directly corresponds to multi-container handling in cosmetics manufacturing (Chapter 7 Section 7.5).
The difficulty of this scenario is underscored by the F-TAC Hand result [22]: with 70% of the palmar surface covered at 0.1 mm resolution, the multi-object delivery adaptation rate climbs from 53.5% (no tactile) to nearly 100%. The gap quantifies the claim that "palm contact signals are central to collision avoidance and re-placement decisions" -- tactile is not merely observed but directly shapes the decision. TacTeleOp treats this as a reference point: can TacGlove's 24-channel tactile stream close an equivalent loop through a low-cost magnetic modality?
Fundamental Limitations of Existing Research
Three 2024-2025 multi-object grasping studies share common limitations:
| Study | Hand | Objects | Success Rate | Setting | Tactile | Orientation |
|---|---|---|---|---|---|---|
| MultiGrasp [PKU, RA-L 2024] | Shadow | 2 | 44% (sim) | Sim only | None | Palm-up |
| SeqMultiGrasp [USC, arXiv Mar 2025] | Allegro | 2 | 56.7% (real) | Sim+Real | None | Palm-up |
| SeqGrasp [KTH, arXiv Mar 2025] | Allegro | 3-4 | 50% (real) | Sim+Real | None | Palm-up |
Five shared limitations:
- No tactile sensing: All three use only vision and joint positions. They cannot detect whether objects are slipping within the hand.
- Palm-up constraint: All rely on objects falling into an upward-facing palm via gravity. This diverges from real-world scenarios of picking objects up from a table (palm-down).
- 4-finger limitation: Allegro and LEAP have 4 fingers, lacking a pinky to stabilize already-grasped objects. Even MultiGrasp's Shadow Hand shows limited finger coordination strategies.
- Simulation dependence: All learn via RL or optimization in simulation before sim-to-real transfer. None leverage human demonstration data.
- Rigid objects only: Generalization to flexible objects or diverse materials is unvalidated.
Single-Object Manipulation Predecessors
UMI [21] [#35] reported high success rates on single-object manipulation (cup arrangement 100%, dynamic tossing 87.5%) using handheld gripper-based in-the-wild data collection. However, it is limited to a 2-DoF gripper and cannot address multi-object sequential grasping or dexterous manipulation, and lacks tactile sensors. TacTeleOp extends this to 5-finger dexterous hand + 24-channel distributed tactile + multi-object sequential grasping.
TacTeleOp's Differentiation
| Dimension | Existing Work | TacTeleOp |
|---|---|---|
| Tactile | None | TacGlove 24ch whole-hand |
| Grasp orientation | Palm-up | Palm-down (tabletop pick) |
| Finger count | 4 (Allegro/LEAP) | 5 (HX5-D20) |
| Data source | Sim-to-real (RL/optimization) | Human demo + teleoperation |
| Theory basis | None or partial | Montana/Murray integration |
8.4 Teleoperation Challenges and TacGlove Solution
Three Reasons Teleoperation Is Difficult
- Kinematic mismatch: The human hand's 27 DOF must be mapped to a robot hand's 4-24 DOF (depending on platform). Even the HX5-D20 (20 DOF) does not correspond one-to-one to the human hand's 27 DOF, and its inter-finger coupling patterns differ.
- Control latency: Vision-based teleoperation incurs 50-200 ms latency. Glove-based systems reduce this to 10-30 ms, but precision tasks like multi-object grasping remain affected.
- No tactile feedback: Most teleoperation systems provide no tactile feedback to the operator, resulting in excessive force (crushing objects) or insufficient force (dropping objects).
TacGlove Solution: Bidirectional Tactile Bridge
TacGlove (Chapter 7) simultaneously mitigates all three problems:
- TacGlove on robot: Mounting TacGlove on HX5-D20 provides 24ch whole-hand tactile. This data serves as (1) an input modality during learning and (2) potential feedback to the operator during teleoperation.
- TacGlove on human: The same TacGlove worn by humans during daily tasks generates Data B. Same sensor = same tactile space (Embodiment Bridge, Chapter 6).
Tactile Feedback's Effect on Data Quality
OSMO [#18] demonstrated that tactile feedback improves not just teleoperation performance but also collected data quality: 72% with tactile vs 56% without on a wiping task (+16%p). DOGlove went further, showing that tactile feedback alone enables manipulation without vision.
These results imply two things for TacTeleOp: (1) tactile feedback improves Data A quality, and (2) higher-quality Data A elevates the entire co-training pipeline's performance.
8.5 Montana/Murray Grasp Theory Connection
Core Concepts of Classical Theory
Classical multi-fingered grasping theory comprises three key frameworks:
- Montana [2]: Contact kinematics -- mathematically models rolling/sliding contact between fingers and object surfaces, surface curvature, and contact state transitions.
- Murray, Li, Sastry [3]: Force closure and wrench space analysis -- analyzes the set of forces/torques that N contact points can exert on an object to determine grasp stability.
- Mason & Salisbury [4]: Grasp mechanics -- defines the geometric conditions for force/form closure and grasp quality metrics.
Connection to Learning-Based Approaches
Modern research has begun integrating classical theory into learning pipelines. SeqMultiGrasp [USC, 2025] incorporates differentiable force closure as a loss term, guiding the learner toward physically stable grasps. This approach yields higher sample efficiency than pure RL and reduces physically implausible grasps.
Tactile Sensors = the Signals Classical Theory Requires
Classical grasp theory requires contact location, normal force, shear force, and contact area as inputs. Existing multi-object grasping studies sidestep tactile sensing by relying on the simulator's idealized contact models to supply these quantities. In the real world, however, contact models are inaccurate, so the measured signals from tactile sensors become essential.
In TacTeleOp, TacGlove's 8 three-axis sensors are used to assess force closure condition satisfaction in real time. For example, detecting slip of already-grasped objects (shear force changes) and confirming sufficient grasp force on new objects (normal force thresholds) occur simultaneously.
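A minimal sketch of how a force-closure check can be computed from contact locations and normals, assuming a Coulomb friction model and a sampled (approximate, not exact) positive-span test. This is illustrative, not TacTeleOp's actual controller:

```python
import numpy as np

def friction_cone_wrenches(p, n, mu=0.5, k=8):
    """Discretize the Coulomb friction cone at contact point p (normal n
    pointing into the object) into k edge forces, returning the 6-D
    wrenches (force; torque about the object origin) they generate."""
    n = np.asarray(n, float) / np.linalg.norm(n)
    t1 = np.cross(n, [1.0, 0.0, 0.0])
    if np.linalg.norm(t1) < 1e-6:                  # n parallel to x-axis
        t1 = np.cross(n, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    thetas = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    edges = [n + mu * (np.cos(th) * t1 + np.sin(th) * t2) for th in thetas]
    return np.array([np.concatenate([f, np.cross(p, f)]) for f in edges])

def force_closure(points, normals, mu=0.5, n_dirs=500, eps=1e-6, seed=0):
    """Sampled force-closure test: the contact wrenches positively span
    R^6 iff, for every direction d, some wrench has a positive component
    along d. We test the 12 coordinate axes plus random directions --
    an approximation, not an exact certificate."""
    W = np.vstack([friction_cone_wrenches(p, n, mu)
                   for p, n in zip(points, normals)])
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 6))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    dirs = np.vstack([np.eye(6), -np.eye(6), dirs])
    best = (W @ dirs.T).max(axis=0)                # best wrench per direction
    return bool(best.min() > eps)

# Three symmetric frictional contacts on a disk: force closure.
# Two antipodal point contacts: cannot resist torque about their axis.
pts3 = np.array([[1.0, 0.0, 0.0], [-0.5, 0.866, 0.0], [-0.5, -0.866, 0.0]])
pts2 = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
print(force_closure(pts3, -pts3), force_closure(pts2, -pts2))  # True False
```

In the multi-object setting, `points`/`normals` would come from the tactile-estimated contact set over all held objects, and the check would run per object at each control step.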
8.6 Data Engine: Data A + Data B
Data B -- Human Demonstrations (Large-Scale)
Five cosmetics factory workers wear TacGlove + smart glasses while performing daily tasks including multi-object handling.
| Item | Value |
|---|---|
| Workers | 5 |
| Collection period | 20 days |
| Daily hours | 8 hours |
| Total hours | 800 |
| Estimated episodes | 50,000+ |
| Modalities | Joint angles + tactile (24ch) + egocentric RGB |
"Data B = scale + realism" -- includes natural distributions of per-object grasp strategies, contact sequences, and force profiles. In sequential multi-object grasping specifically, the human finger coordination strategy (which fingers stabilize held objects while which fingers pick up new ones) is knowledge not easily discovered by RL.
Data A -- Teleoperation (Small-Scale)
Robot data collected by controlling HX5-D20 via TacGlove-based teleoperation.
| Item | Value |
|---|---|
| Per process | 50-100 episodes |
| Total collection time | ~8 hours (1 week) |
| Robot | ROBOTIS HX5-D20 |
| Modalities | Robot joint + tactile (same TacGlove) + third-person RGB |
"Data A = executability" -- provides feasible trajectories and tactile patterns under the robot's kinematic constraints. AoE's "50 teleop + 200 human -> 45%->95% (Close Laptop task)" (Chapter 5) is the direct precedent for this ratio.
8.7 Co-training Pipeline
Stage 1: Data B Pretrain
Pretrain a visual-tactile foundation model on 800 hours of Data B. The model learns to simultaneously predict future hand positions + future tactile patterns. Following EgoScale's [9] flow-based VLA pretrain methodology with tactile modality added to inputs. No robot data is needed at this stage.
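The Stage 1 objective can be sketched as a simple multi-task loss over predicted future poses and tactile frames. The 0.5 tactile weight is an assumption; the chapter does not specify the balance:

```python
import numpy as np

def pretrain_loss(pred_pose, true_pose, pred_tac, true_tac, w_tac=0.5):
    """Stage-1 multi-task objective: MSE on future hand poses plus a
    weighted MSE on future 24-channel tactile frames. w_tac = 0.5 is an
    assumed weighting, tuned in practice."""
    l_pose = float(np.mean((pred_pose - true_pose) ** 2))
    l_tac = float(np.mean((pred_tac - true_tac) ** 2))
    return l_pose + w_tac * l_tac

# Example: perfect tactile prediction, unit pose error -> loss 1.0
print(pretrain_loss(np.zeros(3), np.ones(3), np.zeros(24), np.zeros(24)))  # 1.0
```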
Stage 2: Cross-Embodiment Retargeting
Map human joint angles and tactile patterns to HX5-D20 space.
- Joint angles: MANO parameter -> HX5-D20 joint command mapping. The 5-finger correspondence simplifies retargeting compared to Allegro/LEAP.
- Tactile: Same TacGlove (Embodiment Bridge) allows direct tactile transfer. Systematic biases from kinematic differences are corrected in Stage 3.
- Visual: Apply Mirage [17] or H2R [2025] visual gap solutions (Chapter 6).
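The joint-angle leg of Stage 2 can be sketched as a calibrated linear map with joint-limit clamping. The limits and the least-squares fit are illustrative assumptions; real retargeting typically adds per-finger scaling and an optimization step:

```python
import numpy as np

N_HUMAN, N_ROBOT = 27, 20        # MANO-style DOF -> HX5-D20 DOF
LIM_LO, LIM_HI = -1.57, 1.57     # hypothetical symmetric joint limits (rad)

def fit_retarget_map(Q_human, Q_robot):
    """Least-squares linear retargeting map A with Q_robot ~= Q_human @ A.
    Q_human: (n, 27) calibration poses; Q_robot: (n, 20) paired robot poses."""
    A, *_ = np.linalg.lstsq(Q_human, Q_robot, rcond=None)
    return A

def retarget(q_human, A):
    """Map one human pose to robot joint commands, clamped to joint limits."""
    return np.clip(q_human @ A, LIM_LO, LIM_HI)

# Synthetic check: generate paired poses from a known map and recover it.
rng = np.random.default_rng(0)
A_true = rng.normal(scale=0.2, size=(N_HUMAN, N_ROBOT))
Q_h = rng.uniform(-1.0, 1.0, size=(100, N_HUMAN))
Q_r = Q_h @ A_true
A_fit = fit_retarget_map(Q_h, Q_r)
```

The calibration pairs would in practice come from the operator mirroring a scripted set of robot poses while wearing TacGlove, which is what makes the 5-finger correspondence simpler than with 4-finger hands.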
Stage 3: Data A Fine-tune
Fine-tune with 50-100 teleop demos per process. Following EgoMimic's [10] co-training architecture, the Data B pretrained model receives additional Data A training. This stage secures robot executability.
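A minimal sketch of the per-batch domain balancing implied by the A/B ratio. The 50/50 sampling probability is an assumed starting point, not a value the chapter fixes:

```python
import random

def cotrain_batches(data_a, data_b, batch_size=32, p_a=0.5,
                    n_batches=100, seed=0):
    """Yield mini-batches where each sample comes from the small robot
    dataset (Data A) with probability p_a, else from the large human
    dataset (Data B). Oversampling A per batch, rather than mixing by
    raw size (|A| << |B|), follows the EgoMimic-style balancing recipe."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [rng.choice(data_a) if rng.random() < p_a else rng.choice(data_b)
               for _ in range(batch_size)]

# Example: 50 teleop episodes vs 50,000 human episodes.
data_a = [("A", i) for i in range(50)]
data_b = [("B", i) for i in range(50_000)]
samples = [s for batch in cotrain_batches(data_a, data_b) for s in batch]
frac_a = sum(1 for src, _ in samples if src == "A") / len(samples)
```

Without this balancing, Data A (roughly 0.1% of the pool by episode count) would be drowned out and the fine-tuned policy would drift back toward human kinematics.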
8.8 Cosmetics Process Application Scenario
Cosmetics manufacturing is the most natural application environment for TacTeleOp's multi-object grasping (Chapter 7 Section 7.5 for details). Picking up multiple small containers (lipstick, mascara, eyeliner) from a conveyor and placing them into packaging boxes is precisely sequential multi-object grasping. This task is currently mostly manual, and dedicated automation equipment has low ROI due to high-mix low-volume characteristics. A general-purpose robot hand + learning-based approach is the appropriate solution.
8.9 Differentiation from OSMO in Data/Scale
TacTeleOp uses the same data engine as TacGlove (Chapter 7), sharing the differentiation from OSMO:
| Dimension | OSMO | UMI | UMI-FT | TacTeleOp | Ratio |
|---|---|---|---|---|---|
| Data scale | ~2 hours, 140 demos | ~hundreds of demos | ~hundreds of demos | 800 hours, 50,000+ demos | ~400x |
| Task | 1 (wiping) | Single object | Single object | Multi-object grasping + industrial | Multiple |
| Robot hand | xArm (2-finger gripper) | Gripper (2-DoF) | Gripper (2-DoF) | HX5-D20 (5-finger, 20 DOF) | - |
| Tactile | Glove (12ch) | Vision only | Wrist F/T (2 sensors) | Distributed tactile (24ch) | - |
| Success rate | 72% (wiping) | 100% (cup) | 92% (wiping) | Target 90%+ | - |
| Glove material | Rigid | - | - | Stretchable | - |
| Co-training | Unvalidated | Unvalidated | Unvalidated | Core contribution | - |
| Grasp theory | Not applied | Not applied | Not applied | Montana/Murray integration | - |
| Gap solution | Physical identity | Physical identity | ACP | TacGlove co-design | - |
Critical note: TacTeleOp can appropriately claim "the first to address multi-object grasping with tactile sensing" -- all three existing studies operated without tactile. However, avoid claiming "the first tactile data engine" (OSMO preempted this).
8.10 Core Hypotheses and Expected Results
H1: Data B alone is conditionally viable
Based on X-Sim [12] and EgoZero [14], Data B with tactile is expected to achieve higher baseline performance than vision-only Data B. Target: 50-60% on 2-object grasping, 70-80% on single-object grasping with Data B only.
H2: A+B combination is superior
Based on EgoMimic [10] +34-228% and AoE [11] 45%->95% (Close Laptop task), Data A + Data B co-training should show significant improvement over either alone. Target: 50%->80%+ improvement on 2-object grasping.
H3: Tactile addition is significant
Based on OSMO +16%p, VTDexManip +20%, and DexUMI [#8] ablation (failure without tactile), tactile-inclusive co-training should show meaningful improvement over vision-only. Multi-object grasping particularly benefits from slip detection and force regulation, so tactile contribution is expected to be larger than for single-object grasping.
8.11 Limitations and Open Questions
- HX5-D20 is unproven: No learning-based manipulation research has been published using it, leaving sim-to-real gap, control stability, and sensor noise as unknowns. Compared to Allegro/LEAP's rich research ecosystem, this is a significant risk.
- High difficulty of multi-object grasping: The current state-of-the-art is only 56.7% (real) for 2 objects. Even with tactile and human data, reaching practical success rates for 3+ objects will require substantial engineering effort.
- Initial teleoperation quality limitations: The HX5-D20 teleoperation interface is not yet mature, so initial Data A quality may be low. This could reduce Stage 3 fine-tuning effectiveness.
- Additional difficulty of palm-down grasping: Transitioning from palm-up to palm-down means gravity works against grasp stability (pulling objects out of the hand), making stable multi-object grasping considerably harder.
- Learning 5-finger coordination strategies: Leveraging the 5th finger is the key differentiator, but learning this automatically requires sufficient 5-finger utilization data (Data B).
8.12 The Shift to Active Palm: An Enabling Factor for Palm-Down Multi-Object Grasping
Existing multi-object grasping studies all operate under a palm-up assumption — gravity pulls objects into the palm. Real industrial settings (picking small containers from a tabletop, assembly-line part handling), however, are dominated by palm-down configurations. Under palm-down, the palm is no longer a passive supporting surface that gravity helps; it becomes a decision site that must actively modulate contact area and force.
This transition is articulated by a recent npj Robotics paper [23]: "Most prior work has concentrated on fingertips, leaving the functional role of the palm largely overlooked." That work couples a high-resolution visuotactile active palm (the palm itself is actuated) with reconfigurable fingers, showing that palm-finger coupling is central to contact-rich manipulation success.
The design space of palm actuation is mapped in a comprehensive review [24], "Actuated Palms for Soft Robotic Hands: Review and Perspectives": passive vs. active, pneumatic / cable-driven / tendon, rigid / compliant / hybrid. Within this taxonomy the palm serves four functional roles: (i) force distribution, (ii) workspace extension, (iii) grasp stability, and (iv) conformability. F-TAC Hand's high-resolution passive palm and TacPalm SoftHand's ICA-based closed-loop trigger sit at different points along the active axis.
For TacTeleOp two implications follow. First, palm-down scenarios fundamentally require palm–finger coordination, so palm-side sensing (TacGlove's thenar/hypothenar/central triad) is not a "nice-to-have" but a determinant of success. Second, although the HX5-D20 itself has a fixed palm, TacGlove's 24-channel tactile stream exposes the palm-side decision signal to the upper-level policy, letting most functions of an active palm be emulated at the software layer. TacPlay (Chapter 9) explores this signal space autonomously, learning to reproduce the human palm–finger coordination strategies under the robot's kinematics.
References
- ROBOTIS (2025). HX5-D20 Dexterous Robot Hand. https://www.robotis.com/
- Montana, D. J. (1988). The Kinematics of Contact and Grasp. IJRR, 7(3).
- Murray, R. M., Li, Z., & Sastry, S. S. (1994). A Mathematical Introduction to Robotic Manipulation. CRC Press.
- Mason, M. T., & Salisbury, J. K. (1985). Robot Hands and the Mechanics of Manipulation. MIT Press.
- Li, Y., et al. (2024). MultiGrasp: Multi-Object Grasping with Dexterous Hands. IEEE RA-L. https://arxiv.org/abs/2310.15599
- Li, H., et al. (2025). SeqMultiGrasp: Sequential Multi-Object Grasping via Diffusion. arXiv. https://arxiv.org/abs/2503.12579
- Wan, W., et al. (2025). SeqGrasp: Sequential Grasping via Opposition Space. arXiv. https://arxiv.org/abs/2503.11806
- Yin, J., et al. (2025). OSMO: A Large-Scale Tactile Glove. arXiv. https://arxiv.org/abs/2512.08920 #18
- Zheng, R., et al. (2026). EgoScale: Egocentric Video Pretraining. arXiv.
- Kareer, S., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv.
- Yang, B., et al. (2026). AoE: Always-on Egocentric Data Collection. arXiv.
- Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation. CoRL 2025 Oral.
- Liu, V., et al. (2025). EgoZero: Robot Policy from Egocentric Video. arXiv.
- Liu, Q., et al. (2025). VTDexManip: Visual-Tactile Dataset. ICLR 2025.
- Xu, M., et al. (2025). DexUMI: Universal Manipulation Interface. arXiv. #8
- Yang, R., et al. (2025). EgoVLA: Egocentric VLA with MANO. arXiv.
- Chen, L. Y., et al. (2024). Mirage: Cross-Painting Transfer. RSS 2024.
- Park, M., & Park, Y.-L., et al. (2024). Stretchable Glove for Hand Motion Estimation. Nature Communications. #6
- Sunday Robotics (2025). ACT-1: Skill Capture Glove & Skill Transform. #29
- Chi, C., et al. (2024). UMI: Universal Manipulation Interface. RSS 2024. #35
- Chi, C., et al. (2024). UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers. arXiv.
- Zhao, Z., et al. (2025). Embedding high-resolution touch across robotic hands enables adaptive human-like grasping (F-TAC Hand). Nature Machine Intelligence. https://arxiv.org/abs/2412.14482 #39
- Zhou, Y., Lee, W. S., Gu, Y., & She, Y. (2026). Tactile-reactive gripper with an active palm for dexterous manipulation. npj Robotics, 4, 13. https://www.nature.com/articles/s44182-026-00079-y
- Pozzi, M., Malvezzi, M., Prattichizzo, D., & Salvietti, G. (2024). Actuated Palms for Soft Robotic Hands: Review and Perspectives. IEEE/ASME Transactions on Mechatronics, 29(2):902-921.
- Zhang, N., Ren, J., Dong, Y., Gu, G., & Zhu, X. (2025). Soft Robotic Hand with Tactile Palm-Finger Coordination (TacPalm SoftHand). Nature Communications, 16:2395. https://doi.org/10.1038/s41467-025-57741-6 #40