Glossary

Key terms in A-Z order. `(Ch N)` marks the chapter where the term is introduced or discussed in depth.

Definitions are kept in sync with the monorepo master at `glossary/master_en.md`.

A

ACT (Action Chunking with Transformers): Transformer-based imitation-learning policy that predicts a chunk of future actions from demonstrations rather than a single step, which reduces compounding error on long-horizon tasks. (Ch1, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8)
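
The chunking idea can be illustrated with a minimal control loop. This is a sketch under stated assumptions, not the ACT implementation: `policy`, `get_observation`, and `apply_action` are hypothetical callables, and the policy is assumed to return a list of future actions per query. The Transformer itself and any smoothing across overlapping chunks are omitted.

```python
def run_with_action_chunks(policy, get_observation, apply_action,
                           horizon, chunk_size):
    """Run a chunk-predicting policy for `horizon` steps.

    The policy is queried once per chunk and the returned actions are
    executed open-loop until the chunk is used up. Returns the number
    of policy queries made.
    """
    queries = 0
    t = 0
    while t < horizon:
        # Hypothetical policy interface: observation -> list of actions.
        chunk = policy(get_observation())[:chunk_size]
        if not chunk:
            raise ValueError("policy returned an empty action chunk")
        queries += 1
        for action in chunk:
            if t >= horizon:
                break
            apply_action(action)
            t += 1
    return queries


# Toy usage: the "observation" is the number of actions executed so far,
# and the dummy policy proposes the next four step indices.
executed = []
n_queries = run_with_action_chunks(
    policy=lambda obs: [obs + i for i in range(4)],
    get_observation=lambda: len(executed),
    apply_action=executed.append,
    horizon=10,
    chunk_size=4,
)
```

With a chunk size of 4 and a horizon of 10 steps, the policy is queried three times instead of ten, which is the sense in which chunking shortens the effective decision horizon.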

C

Closed-loop: Architecture that feeds execution results back to update plans. (Ch3, Ch8)

Compliance: Mechanical yielding to external force — essential for contact-rich manipulation. (Ch3, Ch5, Ch11)

Co-training: Strategy of jointly training on human and robot data for complementary representations. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch8, Ch9, Ch10, Ch11)

Cross-embodiment gap: Transfer gap arising from kinematic, visual, and tactile differences across agents (e.g., human vs. robot, or robot vs. robot). (Ch3)

D

DEXOP: Framework for learning manipulation policies from human hand data. (Ch1, Ch5, Ch11)

Dexterous manipulation: Precise object manipulation with multi-fingered hands — in-hand rotation, assembly, etc. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11)

DexUMI: Framework for converting human demonstrations into robot policies. (Ch2, Ch3, Ch7, Ch8)

Diffusion Policy: Policy learning via conditional denoising diffusion over action distributions. (Ch1, Ch4, Ch5)

DoF (Degrees of Freedom): Number of independent joint axes. The human hand has ~27 DoF. (Ch2, Ch5, Ch6, Ch7, Ch8, Ch10, Ch11)

E

Embodiment Retargeting: Mapping motion from one embodiment (e.g., a human hand) into the joint space of another (e.g., a robot hand). (Ch5, Ch8)

F

ForceVLA: VLA extended with force sensing; uses MoE routing to branch on contact modes. (Ch3)

Foundation Model: Large-scale pretrained general-purpose model — e.g., Sparsh (tactile), pi0 (VLA). (Ch1, Ch4, Ch6, Ch8, Ch11)

F-TAC Hand: Robot hand platform integrating high-resolution tactile sensing. (Ch2, Ch3, Ch8, Ch11)

G

GelSight: Vision-based tactile sensor developed at MIT that reconstructs contact geometry using photometric stereo. (Ch3, Ch6, Ch11)

H

Human hand data: Datasets of human hand manipulation captured via gloves, video, or motion capture. (Ch1, Ch2, Ch5, Ch6)

I

In-hand manipulation: Changing the pose (position and orientation) of a grasped object within the hand, without releasing it. (Ch3, Ch7, Ch11)

M

MANO: Statistical human hand model trained on 1,000 3D scans (778 vertices, 16 joints). (Ch6, Ch7, Ch8)

Mechanoreceptor: Biological sensory receptor for contact, pressure, and vibration. The four main types in the skin are Merkel, Meissner, Ruffini, and Pacinian. (Ch3, Ch11)

MoE (Mixture of Experts): Architecture that dynamically routes inputs to multiple expert networks — ForceVLA is a representative example. (Ch3)
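
Dynamic routing can be sketched as a softmax gate that scores each expert and forwards the input only to the top-scoring ones. This is a generic illustration, not ForceVLA's router: the function names, the linear gate, and the toy experts are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input vector `x` to the `top_k` highest-scoring experts.

    gate_weights: one weight row per expert (hypothetical linear gate).
    experts: list of callables mapping a vector to a vector.
    Returns the gate-weighted sum of the chosen experts' outputs and
    the indices of the experts that were used.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i],
                    reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over chosen experts
    output = None
    for i in chosen:
        weight = probs[i] / norm
        y = experts[i](x)
        if output is None:
            output = [weight * v for v in y]
        else:
            output = [o + weight * v for o, v in zip(output, y)]
    return output, chosen


# Toy usage: two experts, gate keyed on which input feature dominates.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
gate = [[1.0, 0.0], [0.0, 1.0]]
out, used = moe_forward([3.0, 0.0], gate, experts, top_k=1)
```

Only the selected experts are evaluated, so compute per input stays roughly constant as more experts are added.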

O

OpenVLA: Open-source VLA foundation model (7B params, trained on Open X-Embodiment). (Ch1, Ch4)

OSMO: Learns and transfers robot policy from human hand motion. (Ch2, Ch3, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11)

P

pi0 (π₀): Physical Intelligence's flow-based VLA foundation model. (Ch1, Ch3, Ch5, Ch9, Ch11)

R

RL (Reinforcement Learning): Policy learning by maximizing reward. (Ch1, Ch2, Ch3, Ch4, Ch6, Ch8, Ch9, Ch10, Ch11)

RT-2: Google DeepMind's VLA model jointly trained on web VQA and robot manipulation. (Ch1)

S

Shear force: Force parallel to the contact surface — essential for slip detection. (Ch7, Ch8, Ch10)

SIMPLER: Benchmark that aligns simulation evaluation with real-world performance. (Ch3)

Sim-to-Real: Process/strategies for transferring policies trained in simulation to the real world. (Ch3, Ch4, Ch8, Ch11)

Slip detection: Detecting an object slipping from the hand — typically via shear-force monitoring. (Ch3, Ch8)
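
One common way to turn shear-force monitoring into a slip test is the Coulomb friction model: a contact is near slipping once the shear magnitude approaches the friction coefficient times the normal force. The sketch below assumes a sensor that reports one normal and two shear components. The function name, `friction_coeff`, and `margin` are hypothetical illustration values, not taken from any system in this glossary.

```python
import math


def is_slipping(normal_force, shear_x, shear_y,
                friction_coeff=0.5, margin=0.9):
    """Flag incipient slip with a Coulomb friction-cone check.

    Returns True when the shear magnitude exceeds
    margin * friction_coeff * normal_force. A margin below 1.0 triggers
    slightly before the theoretical slip boundary is reached.
    """
    shear = math.hypot(shear_x, shear_y)
    limit = margin * friction_coeff * max(normal_force, 0.0)
    return shear > limit


# Toy usage with forces in newtons.
firm_grip = is_slipping(normal_force=10.0, shear_x=1.0, shear_y=1.0)
near_slip = is_slipping(normal_force=10.0, shear_x=4.0, shear_y=3.0)
```

In the toy usage the first reading stays well inside the friction cone, while the second has a shear magnitude of 5 N against a 4.5 N limit and is flagged.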

Synergy: Coordinated motion pattern across joints — often captured via PCA-based dimensionality reduction. (Ch11)
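
A minimal sketch of PCA-based synergy extraction follows. It recovers only the first principal component of a set of joint-angle samples, using power iteration on the covariance matrix so that the example needs nothing beyond the standard library. The function name and toy data are assumptions for illustration; in practice a linear-algebra library would normally compute all components at once.

```python
import math


def first_synergy(samples, iterations=200):
    """Return (mean_pose, synergy_vector, explained_variance_ratio).

    samples: list of joint-angle vectors, one per time step.
    The synergy vector is the leading eigenvector of the sample
    covariance matrix, found by power iteration. The all-ones start
    vector is assumed not to be orthogonal to that eigenvector.
    """
    n, d = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]

    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break  # no variance in the data
        v = [a / norm for a in w]

    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    ratio = eigenvalue / total_variance if total_variance > 0 else 0.0
    return mean, v, ratio


# Toy usage: joints 0 and 1 flex together in a 1:2 ratio, joints 2-3 idle.
poses = [[0.1 * t, 0.2 * t, 0.0, 0.0] for t in range(20)]
mean_pose, synergy, explained = first_synergy(poses)
```

Here the four joint angles are driven by a single underlying variable, so one synergy direction accounts for essentially all of the variance in the toy data.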

T

TacGlove: High-resolution tactile glove hardware by the Park Y.-L. group that captures joint angles and tactile signals simultaneously. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11)

TacPlay: Proposal for Play-style policy learning from human tactile data. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11)

TacTeleOp: Proposal combining teleoperation with tactile co-training. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11)

Tactile foundation model: General-purpose representation pretrained on large-scale tactile data — e.g., Sparsh, UniTouch. (Ch8, Ch11)

Tactile skin: Large-area flexible tactile sensor array covering hands/arms. (Ch2, Ch3, Ch11)

Teleoperation (TeleOp): A human remotely operating a robot, used here mainly to collect demonstration data. (Ch1, Ch2, Ch3, Ch5, Ch6, Ch7, Ch8, Ch9, Ch11)

Torque control: Control mode that commands joint torques directly rather than positions or velocities; essential in contact-rich environments. (Ch3)

U

UV map: 2D planar unwrapping of a 3D surface — e.g., MANO UV map. (Ch5, Ch6)

V

VLA (Vision-Language-Action): Unified model that directly outputs robot actions from vision and language input. (Ch1, Ch3, Ch4, Ch5, Ch7, Ch8, Ch9, Ch11)