Chapter 8: TacTeleOp — Multi-Object Grasping and Tactile Co-training
Summary
TacTeleOp implements sequential multi-object grasping with the ROBOTIS HX5-D20 5-finger 20-DOF robot hand, picking up lipstick-sized boxes one by one from a tabletop and accumulating them in the palm. TacGlove (Chapter 7) is mounted on both the robot and the human operator, enabling simultaneous collection of Data A (teleoperation) and Data B (human demonstration) for co-training, and the pipeline integrates Montana/Murray's classical grasp theory into a learning-based framework. The goal is to address five fundamental limitations shared by existing multi-object grasping research [MultiGrasp, SeqMultiGrasp, SeqGrasp]: no tactile sensing, palm-up-only grasping, 4-finger hands, simulation dependence, and no human demonstration data. Developed in Prof. Park Jong-Woo's lab.
8.1 Introduction: From Hardware to Data Pipeline
Chapter 7's TacGlove [#26] established the hardware foundation of a stretchable tactile glove. However, hardware alone does not produce manipulation capability. Three elements are now needed: (1) a suitable robot hand for TacGlove mounting, (2) a challenging manipulation task for that hand, and (3) a learning pipeline connecting human and robot data. TacTeleOp concretizes these three elements as ROBOTIS HX5-D20 + sequential multi-object grasping + Data A/B co-training.
While TacPlay [#27] (Chapter 9) eliminates teleoperation entirely through autonomous play, TacTeleOp improves teleoperation. Both share the same HX5-D20 + TacGlove platform but differ in data collection strategy.
8.2 ROBOTIS HX5-D20 Robot Hand
Specifications
The ROBOTIS HX5-D20 is a 5-finger 20-DOF dexterous hand released in late 2025/early 2026.
| Item | Specification |
|---|---|
| Fingers | 5 |
| DOF | 20 |
| Actuators | DYNAMIXEL XM335, direct drive |
| Control frequency | 1 kHz |
| Max fingertip force | 14 N |
| Max payload | 15 kg |
| Weight | 1.36 kg |
| Built-in tactile | Fingertip pressure sensor (1 per finger, basic) |
| Communication | RS-485, ROS 2 + ros2_control |
| Price | ~$10,000 USD |
Comparison with Existing Robot Hands
| Hand | DOF | Fingers | Weight | Cost | Tactile |
|---|---|---|---|---|---|
| ROBOTIS HX5-D20 | 20 | 5 | 1.36 kg | ~$10K | Fingertip pressure |
| Allegro Hand V4 | 16 | 4 | 1.08 kg | ~$16K | None built-in |
| LEAP Hand | 16 | 4 | ~0.5 kg | <$2K | None built-in |
| Shadow Hand | 20+ | 5 | ~4 kg | ~$100K+ | 129 sensors |
| Psyonic Ability | 6 | 5 | 0.49 kg | ~$10K | 30 fingertip sensors |
Rationale for Selecting HX5-D20
Four key reasons drive the selection:
- 5 fingers = human-like kinematics: Allegro/LEAP's 4 fingers (16 DOF) have a large kinematic gap from the human hand (27 DOF). HX5-D20's 5-finger 20 DOF narrows this gap. The 5th finger (pinky) is critical for sequential multi-object grasping -- it stabilizes already-grasped objects while other fingers pick up the next one.
- TacGlove mounting compatibility: The 1.36 kg form factor and 5-finger structure are kinematically compatible with TacGlove's whole-hand tactile sensor layout.
- Reasonable price point: At ~$10K, it is 1/10th the cost of Shadow Hand ($100K+) and cheaper than Allegro ($16K), making multi-platform setups realistic.
- ROS 2 native: ros2_control support enables immediate integration with existing research infrastructure.
Built-in Tactile Sensor Limitations
The HX5-D20's built-in tactile sensors are limited to one basic pressure sensor per fingertip. This suffices for binary contact detection but is inadequate for shear force measurement, force distribution mapping, and slip detection. Adding TacGlove provides 24-channel (8 sensors x 3-axis) whole-hand tactile, compensating for these limitations.
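To make the slip-detection claim concrete, a minimal Coulomb-friction check over TacGlove-style 3-axis channels might look like the following sketch. The friction coefficient and the 0.8 threshold are illustrative assumptions, not TacGlove firmware:

```python
import numpy as np

MU = 0.6  # assumed object/skin friction coefficient -- illustrative only

def slip_risk(taxels):
    """Ratio of tangential (shear) to normal force per 3-axis channel.
    taxels: (n_sensors, 3) array with columns (fx, fy, fz), fz = normal.
    Values approaching MU mean the contact is near the Coulomb friction
    cone boundary, i.e. incipient slip."""
    taxels = np.asarray(taxels, float)
    shear = np.linalg.norm(taxels[:, :2], axis=1)
    normal = np.clip(taxels[:, 2], 1e-6, None)  # guard divide-by-zero
    return shear / normal

# Channel 0 is close to the cone boundary; channel 1 grips safely.
readings = np.array([[0.5, 0.2, 1.0],
                     [0.05, 0.0, 2.0]])
flagged = np.where(slip_risk(readings) > 0.8 * MU)[0]
print(flagged)  # [0]
```

In a full system the same ratio would be tracked over time per channel, since a rising trend toward the cone boundary is a stronger slip predictor than any single frame.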
Important fact: The HX5-D20 is recent enough that no learning-based manipulation research has been published using it. This is both a risk and an opportunity -- TacTeleOp could become the first learning-based manipulation study on HX5-D20.
8.3 Multi-Object Grasping Scenario
Target Task
Pick up lipstick-sized small boxes from a tabletop one by one, accumulating them in the palm. After grasping the first object, hold it stably while sequentially adding the second and third objects. This task directly corresponds to multi-container handling in cosmetics manufacturing (Chapter 7 Section 7.5).
The difficulty of this scenario is underscored by the F-TAC Hand result [22]: with 70% of the palmar surface covered at 0.1 mm resolution, the multi-object delivery adaptation rate climbs from 53.5% (no tactile) to nearly 100%. The gap quantifies the claim that "palm contact signals are central to collision avoidance and re-placement decisions" -- tactile is not merely observed but directly shapes the decision. TacTeleOp treats this as a reference point: can TacGlove's 24-channel tactile stream close an equivalent loop through a low-cost magnetic modality?
Fundamental Limitations of Existing Research
Three 2024-2025 multi-object grasping studies share common limitations:
| Study | Hand | Objects | Success Rate | Setting | Tactile | Orientation |
|---|---|---|---|---|---|---|
| MultiGrasp [PKU, RA-L 2024] | Shadow | 2 | 44% (sim) | Sim only | None | Palm-up |
| SeqMultiGrasp [USC, arXiv Mar 2025] | Allegro | 2 | 56.7% (real) | Sim+Real | None | Palm-up |
| SeqGrasp [KTH, arXiv Mar 2025] | Allegro | 3-4 | 50% (real) | Sim+Real | None | Palm-up |
Five shared limitations:
- No tactile sensing: All three use only vision and joint positions. They cannot detect whether objects are slipping within the hand.
- Palm-up constraint: All rely on objects falling into an upward-facing palm via gravity. This diverges from real-world scenarios of picking objects up from a table (palm-down).
- 4-finger limitation: Allegro and LEAP have 4 fingers, lacking a pinky to stabilize already-grasped objects. Even MultiGrasp's Shadow Hand shows limited finger coordination strategies.
- Simulation dependence: All learn via RL or optimization in simulation before sim-to-real transfer. None leverage human demonstration data.
- Rigid objects only: Generalization to flexible objects or diverse materials is unvalidated.
Single-Object Manipulation Predecessors
UMI [21] [#35] reported high success rates on single-object manipulation (cup arrangement 100%, dynamic tossing 87.5%) using handheld gripper-based in-the-wild data collection. However, it is limited to a 2-DoF gripper and cannot address multi-object sequential grasping or dexterous manipulation, and lacks tactile sensors. TacTeleOp extends this to 5-finger dexterous hand + 24-channel distributed tactile + multi-object sequential grasping.
TacTeleOp's Differentiation
| Dimension | Existing Work | TacTeleOp |
|---|---|---|
| Tactile | None | TacGlove 24ch whole-hand |
| Grasp orientation | Palm-up | Palm-down (tabletop pick) |
| Finger count | 4 (Allegro/LEAP) | 5 (HX5-D20) |
| Data source | Sim-to-real (RL/optimization) | Human demo + teleoperation |
| Theory basis | None or partial | Montana/Murray integration |
8.4 Teleoperation Challenges and TacGlove Solution
Three Reasons Teleoperation Is Difficult
- Kinematic mismatch: The human hand's 27 DOF must be mapped to a robot hand's 4-24 DOF (depending on platform). Even the HX5-D20 (20 DOF) does not correspond one-to-one to the human hand's 27 DOF, and its inter-finger coupling patterns differ.
- Control latency: Vision-based teleoperation incurs 50-200 ms latency. Glove-based systems reduce this to 10-30 ms, but precision tasks like multi-object grasping remain affected.
- No tactile feedback: Most teleoperation systems provide no tactile feedback to the operator, resulting in excessive force (crushing objects) or insufficient force (dropping objects).
TacGlove Solution: Bidirectional Tactile Bridge
TacGlove (Chapter 7) simultaneously mitigates all three problems:
- TacGlove on robot: Mounting TacGlove on HX5-D20 provides 24ch whole-hand tactile. This data serves as (1) an input modality during learning and (2) potential feedback to the operator during teleoperation.
- TacGlove on human: The same TacGlove worn by humans during daily tasks generates Data B. Same sensor = same tactile space (Embodiment Bridge, Chapter 6).
Tactile Feedback's Effect on Data Quality
OSMO [#18] demonstrated that tactile feedback improves not just teleoperation performance but also collected data quality: 72% with tactile vs 56% without on a wiping task (+16%p). DOGlove went further, showing that tactile feedback alone enables manipulation without vision.
These results imply two things for TacTeleOp: (1) tactile feedback improves Data A quality, and (2) higher-quality Data A elevates the entire co-training pipeline's performance.
8.5 Montana/Murray Grasp Theory Connection
Core Concepts of Classical Theory
Classical multi-fingered grasping theory comprises three key frameworks:
- Montana [2]: Contact kinematics -- mathematically models rolling/sliding contact between fingers and object surfaces, surface curvature, and contact state transitions.
- Murray, Li, Sastry [3]: Force closure and wrench space analysis -- analyzes the set of forces/torques that N contact points can exert on an object to determine grasp stability.
- Mason & Salisbury [4]: Grasp mechanics -- defines the geometric conditions for force/form closure and grasp quality metrics.
Connection to Learning-Based Approaches
Modern research has begun integrating classical theory into learning pipelines. SeqMultiGrasp [USC, 2025] incorporates differentiable force closure as a loss term, guiding the learner toward physically stable grasps. This approach yields higher sample efficiency than pure RL and reduces physically implausible grasps.
Tactile Sensors = the Signals Classical Theory Requires
Classical grasp theory requires contact location, normal force, shear force, and contact area as inputs. Existing multi-object grasping studies sidestep tactile sensing by relying on the simulator's idealized contact models to supply these quantities. In the real world, however, contact models are inaccurate, so the measured signals from tactile sensors become essential.
In TacTeleOp, TacGlove's 8 three-axis sensors are used to assess force closure condition satisfaction in real time. For example, detecting slip of already-grasped objects (shear force changes) and confirming sufficient grasp force on new objects (normal force thresholds) occur simultaneously.
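A minimal sketch of how a force-closure check can be computed from contact locations and normals, assuming a Coulomb friction model and a sampled (approximate, not exact) positive-span test. This is illustrative, not TacTeleOp's actual controller:

```python
import numpy as np

def friction_cone_wrenches(p, n, mu=0.5, k=8):
    """Discretize the Coulomb friction cone at contact point p (normal n
    pointing into the object) into k edge forces, returning the 6-D
    wrenches (force; torque about the object origin) they generate."""
    n = np.asarray(n, float) / np.linalg.norm(n)
    t1 = np.cross(n, [1.0, 0.0, 0.0])
    if np.linalg.norm(t1) < 1e-6:                  # n parallel to x-axis
        t1 = np.cross(n, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    thetas = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    edges = [n + mu * (np.cos(th) * t1 + np.sin(th) * t2) for th in thetas]
    return np.array([np.concatenate([f, np.cross(p, f)]) for f in edges])

def force_closure(points, normals, mu=0.5, n_dirs=500, eps=1e-6, seed=0):
    """Sampled force-closure test: the contact wrenches positively span
    R^6 iff, for every direction d, some wrench has a positive component
    along d. We test the 12 coordinate axes plus random directions --
    an approximation, not an exact certificate."""
    W = np.vstack([friction_cone_wrenches(p, n, mu)
                   for p, n in zip(points, normals)])
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 6))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    dirs = np.vstack([np.eye(6), -np.eye(6), dirs])
    best = (W @ dirs.T).max(axis=0)                # best wrench per direction
    return bool(best.min() > eps)

# Three symmetric frictional contacts on a disk: force closure.
# Two antipodal point contacts: cannot resist torque about their axis.
pts3 = np.array([[1.0, 0.0, 0.0], [-0.5, 0.866, 0.0], [-0.5, -0.866, 0.0]])
pts2 = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
print(force_closure(pts3, -pts3), force_closure(pts2, -pts2))  # True False
```

In the multi-object setting, `points`/`normals` would come from the tactile-estimated contact set over all held objects, and the check would run per object at each control step.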
8.6 Data Engine: Data A + Data B
Data B -- Human Demonstrations (Large-Scale)
Five cosmetics factory workers wear TacGlove + smart glasses while performing daily tasks including multi-object handling.
| Item | Value |
|---|---|
| Workers | 5 |
| Collection period | 20 days |
| Daily hours | 8 hours |
| Total hours | 800 |
| Estimated episodes | 50,000+ |
| Modalities | Joint angles + tactile (24ch) + egocentric RGB |
"Data B = scale + realism" -- includes natural distributions of per-object grasp strategies, contact sequences, and force profiles. In sequential multi-object grasping specifically, the human finger coordination strategy (which fingers stabilize held objects while which fingers pick up new ones) is knowledge not easily discovered by RL.
Data A -- Teleoperation (Small-Scale)
Robot data collected by controlling HX5-D20 via TacGlove-based teleoperation.
| Item | Value |
|---|---|
| Per process | 50-100 episodes |
| Total collection time | ~8 hours (1 week) |
| Robot | ROBOTIS HX5-D20 |
| Modalities | Robot joint + tactile (same TacGlove) + third-person RGB |
"Data A = executability" -- provides feasible trajectories and tactile patterns under the robot's kinematic constraints. AoE's "50 teleop + 200 human -> 45%->95% (Close Laptop task)" (Chapter 5) is the direct precedent for this ratio.
8.7 Co-training Pipeline
Stage 1: Data B Pretrain
Pretrain a visual-tactile foundation model on 800 hours of Data B. The model learns to simultaneously predict future hand positions + future tactile patterns. Following EgoScale's [9] flow-based VLA pretrain methodology with tactile modality added to inputs. No robot data is needed at this stage.
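The Stage 1 objective can be sketched as a simple multi-task loss over predicted future poses and tactile frames. The 0.5 tactile weight is an assumption; the chapter does not specify the balance:

```python
import numpy as np

def pretrain_loss(pred_pose, true_pose, pred_tac, true_tac, w_tac=0.5):
    """Stage-1 multi-task objective: MSE on future hand poses plus a
    weighted MSE on future 24-channel tactile frames. w_tac = 0.5 is an
    assumed weighting, tuned in practice."""
    l_pose = float(np.mean((pred_pose - true_pose) ** 2))
    l_tac = float(np.mean((pred_tac - true_tac) ** 2))
    return l_pose + w_tac * l_tac

# Example: perfect tactile prediction, unit pose error -> loss 1.0
print(pretrain_loss(np.zeros(3), np.ones(3), np.zeros(24), np.zeros(24)))  # 1.0
```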
Stage 2: Cross-Embodiment Retargeting
Map human joint angles and tactile patterns to HX5-D20 space.
- Joint angles: MANO parameter -> HX5-D20 joint command mapping. The 5-finger correspondence simplifies retargeting compared to Allegro/LEAP.
- Tactile: Same TacGlove (Embodiment Bridge) allows direct tactile transfer. Systematic biases from kinematic differences are corrected in Stage 3.
- Visual: Apply Mirage [17] or H2R [2025] visual gap solutions (Chapter 6).
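The joint-angle leg of Stage 2 can be sketched as a calibrated linear map with joint-limit clamping. The limits and the least-squares fit are illustrative assumptions; real retargeting typically adds per-finger scaling and an optimization step:

```python
import numpy as np

N_HUMAN, N_ROBOT = 27, 20        # MANO-style DOF -> HX5-D20 DOF
LIM_LO, LIM_HI = -1.57, 1.57     # hypothetical symmetric joint limits (rad)

def fit_retarget_map(Q_human, Q_robot):
    """Least-squares linear retargeting map A with Q_robot ~= Q_human @ A.
    Q_human: (n, 27) calibration poses; Q_robot: (n, 20) paired robot poses."""
    A, *_ = np.linalg.lstsq(Q_human, Q_robot, rcond=None)
    return A

def retarget(q_human, A):
    """Map one human pose to robot joint commands, clamped to joint limits."""
    return np.clip(q_human @ A, LIM_LO, LIM_HI)

# Synthetic check: generate paired poses from a known map and recover it.
rng = np.random.default_rng(0)
A_true = rng.normal(scale=0.2, size=(N_HUMAN, N_ROBOT))
Q_h = rng.uniform(-1.0, 1.0, size=(100, N_HUMAN))
Q_r = Q_h @ A_true
A_fit = fit_retarget_map(Q_h, Q_r)
```

The calibration pairs would in practice come from the operator mirroring a scripted set of robot poses while wearing TacGlove, which is what makes the 5-finger correspondence simpler than with 4-finger hands.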
Stage 3: Data A Fine-tune
Fine-tune with 50-100 teleop demos per process. Following EgoMimic's [10] co-training architecture, the Data B pretrained model receives additional Data A training. This stage secures robot executability.
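A minimal sketch of the per-batch domain balancing implied by the A/B ratio. The 50/50 sampling probability is an assumed starting point, not a value the chapter fixes:

```python
import random

def cotrain_batches(data_a, data_b, batch_size=32, p_a=0.5,
                    n_batches=100, seed=0):
    """Yield mini-batches where each sample comes from the small robot
    dataset (Data A) with probability p_a, else from the large human
    dataset (Data B). Oversampling A per batch, rather than mixing by
    raw size (|A| << |B|), follows the EgoMimic-style balancing recipe."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [rng.choice(data_a) if rng.random() < p_a else rng.choice(data_b)
               for _ in range(batch_size)]

# Example: 50 teleop episodes vs 50,000 human episodes.
data_a = [("A", i) for i in range(50)]
data_b = [("B", i) for i in range(50_000)]
samples = [s for batch in cotrain_batches(data_a, data_b) for s in batch]
frac_a = sum(1 for src, _ in samples if src == "A") / len(samples)
```

Without this balancing, Data A (roughly 0.1% of the pool by episode count) would be drowned out and the fine-tuned policy would drift back toward human kinematics.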
8.8 Cosmetics Process Application Scenario
Cosmetics manufacturing is the most natural application environment for TacTeleOp's multi-object grasping (Chapter 7 Section 7.5 for details). Picking up multiple small containers (lipstick, mascara, eyeliner) from a conveyor and placing them into packaging boxes is precisely sequential multi-object grasping. This task is currently mostly manual, and dedicated automation equipment has low ROI due to high-mix low-volume characteristics. A general-purpose robot hand + learning-based approach is the appropriate solution.
8.9 Differentiation from OSMO in Data/Scale
TacTeleOp uses the same data engine as TacGlove (Chapter 7), sharing the differentiation from OSMO:
| Dimension | OSMO | UMI | UMI-FT | TacTeleOp | Ratio |
|---|---|---|---|---|---|
| Data scale | ~2 hours, 140 demos | ~hundreds of demos | ~hundreds of demos | 800 hours, 50,000+ demos | ~400x |
| Task | 1 (wiping) | Single object | Single object | Multi-object grasping + industrial | Multiple |
| Robot hand | xArm (2-finger gripper) | Gripper (2-DoF) | Gripper (2-DoF) | HX5-D20 (5-finger, 20 DOF) | - |
| Tactile | Glove (12ch) | Vision only | Wrist F/T (2 sensors) | Distributed tactile (24ch) | - |
| Success rate | 72% (wiping) | 100% (cup) | 92% (wiping) | Target 90%+ | - |
| Glove material | Rigid | - | - | Stretchable | - |
| Co-training | Unvalidated | Unvalidated | Unvalidated | Core contribution | - |
| Grasp theory | Not applied | Not applied | Not applied | Montana/Murray integration | - |
| Gap solution | Physical identity | Physical identity | ACP | TacGlove co-design | - |
Critical note: TacTeleOp can appropriately claim "the first to address multi-object grasping with tactile sensing" -- all three existing studies operated without tactile. However, avoid claiming "the first tactile data engine" (OSMO preempted this).
8.10 Core Hypotheses and Expected Results
H1: Data B alone is conditionally viable
Based on X-Sim [12] and EgoZero [14], Data B with tactile is expected to achieve higher baseline performance than vision-only Data B. Target: 50-60% on 2-object grasping, 70-80% on single-object grasping with Data B only.
H2: A+B combination is superior
Based on EgoMimic [10] +34-228% and AoE [11] 45%->95% (Close Laptop task), Data A + Data B co-training should show significant improvement over either alone. Target: 50%->80%+ improvement on 2-object grasping.
H3: Tactile addition is significant
Based on OSMO +16%p, VTDexManip +20%, and DexUMI [#8] ablation (failure without tactile), tactile-inclusive co-training should show meaningful improvement over vision-only. Multi-object grasping particularly benefits from slip detection and force regulation, so tactile contribution is expected to be larger than for single-object grasping.
8.11 Limitations and Open Questions
- HX5-D20 is unproven: No learning-based manipulation research has been published using it, leaving sim-to-real gap, control stability, and sensor noise as unknowns. Compared to Allegro/LEAP's rich research ecosystem, this is a significant risk.
- High difficulty of multi-object grasping: The current state-of-the-art is only 56.7% (real) for 2 objects. Even with tactile and human data, reaching practical success rates for 3+ objects will require substantial engineering effort.
- Initial teleoperation quality limitations: The HX5-D20 teleoperation interface is not yet mature, so initial Data A quality may be low. This could reduce Stage 3 fine-tuning effectiveness.
- Additional difficulty of palm-down grasping: Transitioning from palm-up to palm-down means gravity works against grasp stability (pulling objects out of the hand), making stable multi-object grasping considerably harder.
- Learning 5-finger coordination strategies: Leveraging the 5th finger is the key differentiator, but learning this automatically requires sufficient 5-finger utilization data (Data B).
8.12 The Shift to Active Palm: An Enabling Factor for Palm-Down Multi-Object Grasping
Existing multi-object grasping studies all operate under a palm-up assumption — gravity pulls objects into the palm. Real industrial settings (picking small containers from a tabletop, assembly-line part handling), however, are dominated by palm-down configurations. Under palm-down, the palm is no longer a passive supporting surface that gravity helps; it becomes a decision site that must actively modulate contact area and force.
This transition is articulated by a recent npj Robotics paper [23]: "Most prior work has concentrated on fingertips, leaving the functional role of the palm largely overlooked." That work couples a high-resolution visuotactile active palm (the palm itself is actuated) with reconfigurable fingers, showing that palm-finger coupling is central to contact-rich manipulation success.
The design space of palm actuation is mapped in a comprehensive review [24], "Actuated Palms for Soft Robotic Hands: Review and Perspectives": passive vs. active, pneumatic / cable-driven / tendon, rigid / compliant / hybrid. Within this taxonomy the palm serves four functional roles: (i) force distribution, (ii) workspace extension, (iii) grasp stability, and (iv) conformability. F-TAC Hand's high-resolution passive palm and TacPalm SoftHand's ICA-based closed-loop trigger sit at different points along the active axis.
For TacTeleOp two implications follow. First, palm-down scenarios fundamentally require palm–finger coordination, so palm-side sensing (TacGlove's thenar/hypothenar/central triad) is not a "nice-to-have" but a determinant of success. Second, although the HX5-D20 itself has a fixed palm, TacGlove's 24-channel tactile stream exposes the palm-side decision signal to the upper-level policy, letting most functions of an active palm be emulated at the software layer. TacPlay (Chapter 9) explores this signal space autonomously, learning to reproduce the human palm–finger coordination strategies under the robot's kinematics.
References
- ROBOTIS (2025). HX5-D20 Dexterous Robot Hand. https://www.robotis.com/
- Montana, D. J. (1988). The Kinematics of Contact and Grasp. IJRR, 7(3).
- Murray, R. M., Li, Z., & Sastry, S. S. (1994). A Mathematical Introduction to Robotic Manipulation. CRC Press.
- Mason, M. T., & Salisbury, J. K. (1985). Robot Hands and the Mechanics of Manipulation. MIT Press.
- Li, Y., et al. (2024). MultiGrasp: Multi-Object Grasping with Dexterous Hands. IEEE RA-L. https://arxiv.org/abs/2310.15599
- Li, H., et al. (2025). SeqMultiGrasp: Sequential Multi-Object Grasping via Diffusion. arXiv. https://arxiv.org/abs/2503.12579
- Wan, W., et al. (2025). SeqGrasp: Sequential Grasping via Opposition Space. arXiv. https://arxiv.org/abs/2503.11806
- Yin, J., et al. (2025). OSMO: A Large-Scale Tactile Glove. arXiv. https://arxiv.org/abs/2512.08920 #18
- Zheng, R., et al. (2026). EgoScale: Egocentric Video Pretraining. arXiv.
- Kareer, S., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv.
- Yang, B., et al. (2026). AoE: Always-on Egocentric Data Collection. arXiv.
- Dan, P., et al. (2025). X-Sim: Cross-Embodiment Simulation. CoRL 2025 Oral.
- Liu, V., et al. (2025). EgoZero: Robot Policy from Egocentric Video. arXiv.
- Liu, Q., et al. (2025). VTDexManip: Visual-Tactile Dataset. ICLR 2025.
- Xu, M., et al. (2025). DexUMI: Universal Manipulation Interface. arXiv. #8
- Yang, R., et al. (2025). EgoVLA: Egocentric VLA with MANO. arXiv.
- Chen, L. Y., et al. (2024). Mirage: Cross-Painting Transfer. RSS 2024.
- Park, M., & Park, Y.-L., et al. (2024). Stretchable Glove for Hand Motion Estimation. Nature Communications. #6
- Sunday Robotics (2025). ACT-1: Skill Capture Glove & Skill Transform. #29
- Chi, C., et al. (2024). UMI: Universal Manipulation Interface. RSS 2024. #35
- Chi, C., et al. (2024). UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers. arXiv.
- Zhao, Z., et al. (2025). Embedding high-resolution touch across robotic hands enables adaptive human-like grasping (F-TAC Hand). Nature Machine Intelligence. https://arxiv.org/abs/2412.14482 #39
- Zhou, Y., Lee, W. S., Gu, Y., & She, Y. (2026). Tactile-reactive gripper with an active palm for dexterous manipulation. npj Robotics, 4, 13. https://www.nature.com/articles/s44182-026-00079-y
- Pozzi, M., Malvezzi, M., Prattichizzo, D., & Salvietti, G. (2024). Actuated Palms for Soft Robotic Hands: Review and Perspectives. IEEE/ASME Transactions on Mechatronics, 29(2):902-921.
- Zhang, N., Ren, J., Dong, Y., Gu, G., & Zhu, X. (2025). Soft Robotic Hand with Tactile Palm-Finger Coordination (TacPalm SoftHand). Nature Communications, 16:2395. https://doi.org/10.1038/s41467-025-57741-6 #40