Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation

Institute for AI, Peking University
Dept of CSE, Hong Kong University of Science and Technology
Institute for AI Industry Research, Tsinghua University


Abstract

Object affordance reasoning, the ability to infer object functionalities from physical properties, is fundamental for task-oriented planning and acting in both humans and Artificial Intelligence (AI). This capability, required for planning and executing daily activities, relies on commonsense knowledge of object physics and functionality and extends well beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, which limits their applicability in novel scenarios. Meanwhile, comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are difficult to deploy on local devices for task-oriented manipulation. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Building on this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. The model achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods and a further 1.2% gain over our previous conference version, while maintaining a compact 187M-parameter size and running inference nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential of efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulation. We showcase Afford-X's effectiveness in enabling task-oriented manipulation for robots across diverse tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.

Article Teaser

Affordance reasoning for task-oriented manipulation

Afford-X provides efficient visual affordance reasoning through: (a) two comprehensive datasets—COCO-Aff (686k images, 1,144 tasks, 80 categories) and LVIS-Aff (897k images, 1,496 tasks, 1,064 categories); (b) real-time processing (2.38 FPS) with a compact 187M-parameter architecture generating bounding boxes and object masks; (c) robust generalization demonstrated through task-specific object selection and multi-object identification at a 0.7 confidence threshold; (d) integration with robotic systems for simulated task-oriented manipulation.

Dataset Statistics & Visualization

Dataset      Images    Tasks   Categories
COCO-Tasks   40,000    14      < 80
COCO-Aff     686,000   1,144   80
LVIS-Aff     897,000   1,496   1,064

Dataset Visualization GIF

Task-oriented Manipulation in Diverse Environments

clean electronic screens with
drink water with
protect items from rain with
spread butter with
stream video with

Task-oriented Manipulation for Long-Horizon Tasks

Long-horizon Task 1

Long-horizon Task 2

Key Contributions

  • Novel Framework: We introduce Afford-X, a slim, efficient model built on knowledge distillation that achieves real-time inference for task-oriented manipulation.
  • Advanced Modules: The Verb Attention and Bi-Fusion modules jointly strengthen focus on action-related cues and multimodal integration.
  • Large-Scale Datasets: Our automated pipeline generates extensive affordance reasoning corpora (COCO-Aff and LVIS-Aff) via GPT-4-based processing.
  • Empirical Validation: Experiments in simulated and real-world scenarios show that our approach outperforms conventional detection-based or heavyweight generative models across diverse dynamic tasks.

Framework


Our Afford-X pipeline processes visual features from images and textual features from prompts through two key components: a Bi-Fusion (BF) module that aligns text-visual representations via cross-attention, and a Verb Attention (VA) module that emphasizes action-related cues. The enhanced features feed into a transformer encoder-decoder, producing parallel outputs for object detection and segmentation, enabling task-specific object identification even without explicit object labels.
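To make the data flow concrete, here is a minimal PyTorch sketch of how a cross-attention Bi-Fusion step and a verb-weighted attention step could be wired together. The class names, dimensions, and the assumption of a pre-computed verb mask are illustrative choices of ours, not the released implementation.

```python
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    """Cross-attend visual tokens to text tokens and vice versa (illustrative sketch)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, N_vis, C) image tokens; txt: (B, N_txt, C) prompt tokens
        vis_f, _ = self.vis_to_txt(query=vis, key=txt, value=txt)   # text-conditioned visual features
        txt_f, _ = self.txt_to_vis(query=txt, key=vis, value=vis)   # vision-conditioned text features
        return vis + vis_f, txt + txt_f                             # residual fusion

class VerbAttention(nn.Module):
    """Re-weight text tokens so verb positions dominate the pooled task embedding (illustrative)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, txt, verb_mask):
        # verb_mask: (B, N_txt) with 1 at verb token positions, 0 elsewhere
        logits = self.score(txt).squeeze(-1)                        # (B, N_txt)
        logits = logits.masked_fill(verb_mask == 0, float("-inf"))  # keep only verb tokens
        weights = logits.softmax(dim=-1).unsqueeze(-1)              # (B, N_txt, 1)
        return (weights * txt).sum(dim=1)                           # (B, C) action-focused embedding

# Example: fuse 900 visual tokens with a 12-token prompt such as "drink water with something".
vis, txt = torch.randn(2, 900, 256), torch.randn(2, 12, 256)
verb_mask = torch.zeros(2, 12); verb_mask[:, 0] = 1                 # pretend token 0 is the verb
fused_vis, fused_txt = BiFusion()(vis, txt)
task_emb = VerbAttention()(fused_txt, verb_mask)
print(fused_vis.shape, task_emb.shape)                              # (2, 900, 256) and (2, 256)
```

In the actual pipeline, the fused features would then be passed to the transformer encoder-decoder for box and mask prediction.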

Method Pipeline

Our noun-pronoun distillation framework comprises teacher and student encoder-decoders. The teacher learns from explicit object labels (e.g., "couch"), storing noun features in a memory bank, while the student processes pronoun inputs (e.g., "something") guided by these stored prototypes. A soft binary target loss aligns their bounding box predictions, enabling category-agnostic affordance detection while preserving discriminative power.
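As a rough sketch of the distillation idea, and not the paper's exact formulation, the snippet below aligns a pronoun-conditioned student with a frozen noun-conditioned teacher via a softened binary target on per-query logits plus an L1 term on boxes. The output keys, the temperature, and the memory-bank update are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def soft_binary_target_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Match the student's per-query objectness to the teacher's softened targets (illustrative)."""
    soft_targets = torch.sigmoid(teacher_logits / temperature).detach()
    return F.binary_cross_entropy_with_logits(student_logits / temperature, soft_targets)

def distillation_step(teacher, student, image, noun_prompt, pronoun_prompt, memory_bank):
    # Teacher sees the explicit object label ("couch"); student only sees "something".
    with torch.no_grad():
        t_out = teacher(image, noun_prompt)                       # assumed keys: 'logits', 'boxes', 'noun_features'
        memory_bank.append(t_out["noun_features"].mean(dim=1))    # store noun prototypes for later guidance
    s_out = student(image, pronoun_prompt)

    loss_soft = soft_binary_target_loss(s_out["logits"], t_out["logits"])
    loss_box = F.l1_loss(s_out["boxes"], t_out["boxes"])          # align box predictions query-by-query
    return loss_soft + loss_box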

Dataset Construction


Our automated pipeline converts conventional object detection annotations into large-scale affordance datasets with minimal human intervention. The process starts by generating a set of diverse task prompts for each object category (step 1), then matches these prompts to relevant categories using commonsense preference rankings (step 2). Next, a quality inspection step filters out inappropriate or duplicative pairs (step 3). Finally, image sampling is conducted to ensure balanced coverage of different categories and layouts (step 4), producing comprehensive affordance-task pairs.
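A minimal skeleton of such a pipeline might look as follows. Here `query_llm` is a caller-supplied placeholder for whatever GPT-4 call is used, and the prompts, the top-3 filtering rule, and the sampling counts are illustrative assumptions rather than the released pipeline.

```python
import json
import random

def build_affordance_dataset(query_llm, categories, images_by_category,
                             tasks_per_category=10, images_per_pair=50):
    """Four-step skeleton: prompt generation, preference ranking, inspection, sampling (illustrative)."""
    pairs = []
    for cat in categories:
        # Step 1: generate diverse task prompts for this object category (GPT-4 call behind query_llm).
        tasks = json.loads(query_llm(
            f"List {tasks_per_category} everyday tasks that a '{cat}' affords, as a JSON list of strings."))

        for task in tasks:
            # Step 2: rank candidate categories by commonsense preference for this task.
            ranking = json.loads(query_llm(
                f"Rank these categories by suitability for the task '{task}': {list(categories)}. "
                f"Return a JSON list."))

            # Step 3: quality inspection, e.g. drop pairs where the category is not a preferred choice.
            if cat not in ranking[:3]:
                continue

            # Step 4: sample images for balanced coverage of categories and layouts.
            pool = images_by_category[cat]
            for img in random.sample(pool, min(images_per_pair, len(pool))):
                pairs.append({"image": img, "task": task, "category": cat})
    return pairs
```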

Embodied Affordance Reasoning


To deploy Afford-X in real-world or simulated robotics environments, we adopt a system architecture that begins with RGB image processing for object masks (a). Optionally, the robot may build a geometric map (b) for robust path planning. Based on these perceptual outputs, the robot positions itself to optimize viewpoint and executes the planned manipulation (c). For multiple sub-tasks (e.g., "build up a space for working"), the system recursively applies the affordance-driven selection and action cycle (d).
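The sketch below shows one way such a perception-plan-act cycle could be organized around an Afford-X-style detector. The `robot` and `camera` interfaces and the task dictionary format are hypothetical placeholders, not an API shipped with the paper.

```python
def run_task(robot, afford_x, camera, tasks, conf_thresh: float = 0.7):
    """Affordance-driven perception-plan-act cycle (illustrative; robot/camera APIs are assumed)."""
    for task in tasks:                                      # e.g. [{"prompt": "sit and work with", ...}]
        # (a) affordance reasoning: boxes + masks for objects that afford the task
        detections = afford_x(camera.capture(), prompt=task["prompt"])
        targets = [d for d in detections if d["score"] >= conf_thresh]

        # (b) optionally build a geometric map for robust path planning
        scene_map = robot.build_geometric_map()

        for obj in targets:
            # (c) reposition for a good viewpoint, then execute the planned manipulation
            robot.move_to(robot.plan_viewpoint(scene_map, obj["box"]))
            robot.manipulate(obj["mask"], action=task["action"])

        # (d) composite goals (e.g. "build up a space for working") recurse over their sub-tasks
        run_task(robot, afford_x, camera, task.get("subtasks", []), conf_thresh)
```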

Key Experimental Results

State-of-the-Art Comparison

Below, we compare Afford-X against existing methods on COCO-Tasks, COCO-Aff (686k images, 1,144 tasks), and LVIS-Aff (897k images, 1,496 tasks). Our model consistently achieves superior performance in both affordance understanding (mAP^box) and instance segmentation (mAP^mask); a sketch of how such metrics can be computed follows the table.
Bold values indicate the best performance, while underlined values indicate the second-best.

Comparison of Afford-X with state-of-the-art methods.
                                         COCO-Tasks           COCO-Aff             LVIS-Aff
Index  Method                            mAP^box  mAP^mask    mAP^box  mAP^mask    mAP^box  mAP^mask
(a)    Fast R-CNN + GGNN                 32.6     -           -        -           -        -
(b)    YOLO + GGNN                       33.2     -           -        -           -        -
(c)    MDETR (w/o pretraining) + GGNN    9.6      8.6         -        -           -        -
(d)    MDETR + GGNN                      36.8     30.3        -        -           -        -
(g)    ViTDet (ViT-H) + GGNN             33.8     25.9        31.5     26.1        8.5      7.4
(h)    MDETR                             41.3     35.2        44.7     41.0        25.1     22.7
(i)    MDETR (w/ VA & BF)                43.2     36.9        45.2     41.4        26.8     24.2
(j)    TOIST                             44.1     39.0        44.9     41.3        26.2     23.4
(k)    Afford-X (w/ VA & BF)             45.3     39.2        45.8     42.5        27.7     24.8

† Results from original papers.
Afford-X consistently outperforms baselines across all datasets, demonstrating robust affordance reasoning.
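For reference, mAP^box and mAP^mask of this kind can be computed with off-the-shelf tooling. The snippet below is a generic sketch using torchmetrics on a dummy prediction, not the evaluation code behind the table above.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Box mAP on a single toy image; for mAP^mask, use iou_type="segm" and add boolean "masks" tensors.
metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),   # predicted task-relevant object
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),                            # single "affords the task" class
}]
target = [{
    "boxes": torch.tensor([[12.0, 8.0, 98.0, 105.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, target)
print(metric.compute()["map"])                              # mAP averaged over IoU 0.50:0.95
```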


Computational Efficiency

We also compare runtime efficiency, reporting frames per second (FPS) and parameter counts for various LLM-based pipelines versus our approach on an NVIDIA RTX 3090 GPU (a measurement sketch follows the table):

Computational efficiency comparison
Index  Method                      FPS    Parameters
(a)    Detection + GPT-4           1.18   >369M
(b)    Detection + BLIP + GPT-4    0.27   >498M
(c)    GPT-4 + OpenSeeD            0.11   >116M
(d)    GPT-4V + OpenSeeD           0.04   >116M
(e)    SPHINX + OpenSeeD           0.11   1.2B
(f)    SPHINX (CoT)                0.49   1.1B
(g)    Afford-X (COCO-Aff, ours)   2.38   187M
(h)    Afford-X (LVIS-Aff, ours)   2.38   187M
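Throughput and parameter counts of this kind can be reproduced with a few lines of PyTorch; the helper below is a generic measurement sketch (single-sample batches, CUDA timing with explicit synchronization), not the exact benchmarking script used for the table.

```python
import time
import torch

def benchmark(model, example_inputs, device="cuda", warmup=10, iters=100):
    """Report parameter count and FPS for repeated forward passes (generic sketch)."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())

    with torch.no_grad():
        for _ in range(warmup):                   # warm up CUDA kernels before timing
            model(*example_inputs)                # example_inputs should already live on `device`
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(*example_inputs)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    print(f"params: {n_params / 1e6:.0f}M, FPS: {iters / elapsed:.2f}")
```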

Precision Comparison

Precision comparison of different methods

Violin plot comparing precision of LLM-based methods vs. Afford-X. Stars denote statistical significance. Afford-X achieves notably higher precision with fewer parameters and faster runtime.


BibTeX

If you find our work useful, please consider citing our paper:

@misc{zhu2025affordxgeneralizableslimaffordance,
      title={Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation},
      author={Xiaomeng Zhu and Yuyang Li and Leiyao Cui and Pengfei Li and Huan-ang Gao and Yixin Zhu and Hao Zhao},
      year={2025},
      eprint={2503.03556},
      archivePrefix={arXiv}
}