
GRIM: Task-Oriented Grasping with Conditioning on Generative Examples

¹IIT (ISM) Dhanbad   ²University of Bremen   ³IIIT Allahabad

Accepted to AAAI'26 (Oral)

Abstract

Task-Oriented Grasping (TOG) is a challenging problem that requires understanding task semantics, object affordances, and how an object must be held to serve a given purpose. To address these challenges, we introduce GRIM (Grasp Re-alignment via Iterative Matching), a novel training-free framework for task-oriented grasping. Given a scene object, GRIM first computes a coarse alignment against memory instances using geometric cues together with a matching score over PCA-reduced DINO features. The full grasp pose associated with the retrieved memory instance is then transferred to the aligned scene object and refined against a set of task-agnostic, geometrically stable grasps generated for the scene object, prioritizing task compatibility. Unlike prior training-based methods, our approach achieves strong generalization from only a few conditioning examples.
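As a rough illustration of the feature scoring mentioned above, the sketch below reduces per-point DINO descriptors with PCA so they can serve both as RGB visualization colors and as a low-dimensional matching signal. Function names and the exact normalization are our assumptions, not the paper's code.

```python
import numpy as np

def pca_reduce(features, n_components=3):
    """Project high-dimensional per-point descriptors (e.g. DINO
    features, shape (N, D)) onto their top principal components.

    The result is rescaled to [0, 1] per component so that, with
    n_components=3, it can double as RGB colors for visualization.
    """
    centered = features - features.mean(axis=0)
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    lo, hi = reduced.min(axis=0), reduced.max(axis=0)
    return (reduced - lo) / (hi - lo + 1e-8)

def feature_match_score(a, b):
    """Mean cosine similarity between two equally sized sets of reduced
    features; a simple stand-in for a scene-to-memory matching score."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return float((a * b).sum(axis=1).mean())
```

In a full pipeline, `pca_reduce` would be fit on the memory object's features and applied to both clouds so that scene and memory points live in the same reduced space.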

Video

Approach

Memory Creation Pipeline

The figure shows our memory creation pipeline. The Hand-Object Reconstruction block...

Retrieval, Alignment and Transfer Process

The figure illustrates our retrieval, alignment, and transfer process. The scene and memory objects are shown with DINOv2 PCA features as colors. In the feature-guided iterative alignment phase, the red point cloud retrieved from memory is overlaid on the scene object's point cloud.

Veo 2: Generated Synthetic Videos for Grasp Estimation

We use Veo 2 to generate 5- and 8-second, object-centered videos depicting human hand interactions for grasp pose estimation. Given an object image and name from the TaskGrasp dataset [Murali et al., 2020b], we first generate a detailed grasp description tailored to a specific task. This description is used to prompt the Veo 2 video generation model, which produces consistent, realistic grasping videos. We extract the middle frame, where the grasp is clearest, for downstream use. This approach enables scalable, high-quality data generation for learning task-conditioned grasps.
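The middle-frame selection above is simple to implement; the helper below is an illustrative sketch (its name and interface are our assumptions). It operates on any sequence of decoded frames, e.g. numpy arrays read with OpenCV's `cv2.VideoCapture` in a loop:

```python
def pick_middle(frames):
    """Return the temporally middle frame of a decoded video clip.

    `frames` is any sequence of frames. For an even frame count this
    takes the later of the two central frames; for short generated
    clips the choice between the two is immaterial.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("empty video")
    return frames[len(frames) // 2]
```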

Lift the strainer and place it over a container

Pick up the mug and place it somewhere else

Pick up the tongs and prepare them for storage

Pick up the pitcher and pour liquid into a container

Use the flashlight to look around an area

Pick up the coat hanger and hang it on a rack or rod

Pick up the spoon and stir contents

Pick up the sponge and clean a surface

Clear the ice scraper from the surface

Handle the bottle to place it carefully

Lift the charger to manage the cable

Crush garlic for cooking

Hand-Object Reconstruction

Feature-Guided Alignment

BibTeX

@misc{shailesh2025grimtaskorientedgraspingconditioning,
      title={GRIM: Task-Oriented Grasping with Conditioning on Generative Examples}, 
      author={Shailesh and Alok Raj and Nayan Kumar and Priya Shukla and Andrew Melnik and Michael Beetz and Gora Chand Nandi},
      year={2025},
      eprint={2506.15607},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.15607}, 
}