Institute for AI, Peking University · Dept. of CSE, Hong Kong University of Science and Technology · Institute for AI Industry Research, Tsinghua University
†Indicates Equal Contribution · Corresponding Author
Object affordance reasoning, the ability to infer object functionalities from physical properties, is fundamental to task-oriented planning and action in both humans and Artificial Intelligence (AI). This capability relies on commonsense knowledge of object physics and functionality, extending well beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, which limits their applicability in novel scenarios, while comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are difficult to deploy on local devices for task-oriented manipulation. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Building on this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. Afford-X achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods and a 1.2% improvement over our previous conference version, while maintaining a compact 187M-parameter size and running nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulation. We showcase Afford-X's effectiveness in enabling task-oriented manipulation for robots across diverse tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
Affordance reasoning for task-oriented manipulation
Afford-X provides efficient visual affordance reasoning through:
(a) two comprehensive datasets—COCO-Aff (686k images, 1,144 tasks, 80 categories)
and LVIS-Aff (897k images, 1,496 tasks, 1,064 categories);
(b) real-time processing (2.38 FPS) with a compact 187M-parameter architecture generating
bounding boxes and object masks;
(c) robust generalization demonstrated through task-specific object selection and
multi-object identification at a 0.7 confidence threshold;
(d) integration with robotic systems for simulated task-oriented manipulation.
Dataset Statistics & Visualization
| Dataset | Images | Tasks | Categories |
|---------|--------|-------|------------|
| COCO-Tasks | 40,000 | 14 | <80 |
| COCO-Aff | 686,000 | 1,144 | 80 |
| LVIS-Aff | 897,000 | 1,496 | 1,064 |
Task-oriented Manipulation in Diverse Environments
clean electronic screens with
drink water with
protect items from rain with
spread butter with
stream video with
Task-Oriented Manipulation for Long-Horizon Tasks
Long-horizon Task 1
Long-horizon Task 2
Key Contributions
Novel Framework: We introduce Afford-X, a slim and efficient model built on knowledge
distillation that achieves real-time inference for task-oriented manipulation.
Advanced Modules: Verb Attention and Bi-Fusion jointly enhance action-focus and
multimodal integration.
Large-Scale Datasets: Our automated pipeline generates extensive affordance reasoning
corpora (COCO-Aff and LVIS-Aff) via GPT-4-based processing.
Empirical Validation: Experiments in simulated and real-world scenarios show that our
approach outperforms conventional detection-based or heavyweight generative models across diverse
dynamic tasks.
Framework
Our Afford-X pipeline processes visual features from images and textual features from prompts
through two key components: a Bi-Fusion (BF) module that aligns text-visual
representations via cross-attention, and a Verb Attention (VA) module that
emphasizes action-related cues. The enhanced features feed into a transformer encoder-decoder,
producing parallel outputs for object detection and segmentation, enabling task-specific object
identification even without explicit object labels.
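For concreteness, below is a minimal PyTorch sketch of how a cross-attention Bi-Fusion module and a Verb Attention re-weighting could be wired together before the encoder-decoder. Layer sizes, the verb-masking strategy, and the module interfaces are illustrative assumptions, not the released Afford-X implementation.

```python
# Illustrative sketch of Bi-Fusion (cross-attention) and Verb Attention.
# Dimensions and the verb-masking strategy are assumptions for exposition,
# not a reproduction of the released Afford-X architecture.
import torch
import torch.nn as nn


class BiFusion(nn.Module):
    """Aligns visual and text tokens with two cross-attention passes."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # Visual tokens query the prompt; prompt tokens query the image.
        vis_fused, _ = self.txt2vis(query=vis, key=txt, value=txt)
        txt_fused, _ = self.vis2txt(query=txt, key=vis, value=vis)
        return vis + vis_fused, txt + txt_fused


class VerbAttention(nn.Module):
    """Re-weights text tokens so action (verb) cues dominate the prompt feature."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, verb_mask):
        # verb_mask: (B, L) boolean, True where a token belongs to the task verb.
        verb_tokens = txt * verb_mask.unsqueeze(-1)
        attended, _ = self.attn(query=txt, key=verb_tokens, value=verb_tokens)
        return txt + attended


# Usage: fuse features before feeding the transformer encoder-decoder.
B, N, L, D = 2, 100, 16, 256
vis, txt = torch.randn(B, N, D), torch.randn(B, L, D)
verb_mask = torch.zeros(B, L, dtype=torch.bool)
verb_mask[:, 1] = True  # e.g., the token for "drink" in "drink water with"
vis, txt = BiFusion(D)(vis, txt)
txt = VerbAttention(D)(txt, verb_mask)
```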
Our noun-pronoun distillation framework comprises teacher and student encoder-decoders. The
teacher learns from explicit object labels (e.g., "couch"), storing noun features in a memory
bank, while the student processes pronoun inputs (e.g., "something") guided by these stored
prototypes. A soft binary target loss aligns their bounding box predictions, enabling
category-agnostic affordance detection while preserving discriminative power.
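A hedged sketch of the distillation objective described above: sigmoid-softened teacher scores act as soft binary targets for the student, and an L1 term aligns the two branches' box predictions. The exact loss formulation in the paper may differ.

```python
# Hedged sketch of noun-pronoun distillation: a teacher queried with the explicit
# noun ("couch") supervises a student queried with a pronoun ("something").
# The soft-binary-target form below is a plausible simplification, not the paper's exact loss.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, student_boxes, teacher_boxes,
                      temperature: float = 2.0, box_weight: float = 1.0):
    """Align student predictions with teacher predictions query-by-query."""
    # Soften the teacher's per-query selection scores into binary-style targets.
    soft_targets = torch.sigmoid(teacher_logits / temperature).detach()
    score_loss = F.binary_cross_entropy_with_logits(
        student_logits / temperature, soft_targets)
    # Encourage the pronoun branch to localize the same regions as the noun branch.
    box_loss = F.l1_loss(student_boxes, teacher_boxes.detach())
    return score_loss + box_weight * box_loss


# Example with 100 decoder queries and (cx, cy, w, h) boxes per query.
s_logits, t_logits = torch.randn(2, 100), torch.randn(2, 100)
s_boxes, t_boxes = torch.rand(2, 100, 4), torch.rand(2, 100, 4)
loss = distillation_loss(s_logits, t_logits, s_boxes, t_boxes)
```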
Dataset Construction
Our automated pipeline converts conventional object detection annotations into large-scale
affordance datasets with minimal human intervention. The process starts by generating a set of
diverse task prompts for each object category (step 1), then matches these prompts to relevant
categories using commonsense preference rankings (step 2). Next, a quality inspection step
filters out inappropriate or duplicative pairs (step 3). Finally, image sampling is conducted
to ensure balanced coverage of different categories and layouts (step 4), producing
comprehensive affordance-task pairs.
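Such a pipeline could be scripted roughly as follows; the prompt wording, the `gpt-4` model choice, the top-5 filtering heuristic, and the 50-image sampling budget are all illustrative assumptions rather than the exact procedure used to build COCO-Aff and LVIS-Aff.

```python
# Illustrative sketch of the four-step annotation pipeline: (1) prompt generation,
# (2) preference matching, (3) quality inspection, (4) image sampling.
import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content


def build_affordance_pairs(categories, images_by_category, tasks_per_category=3):
    pairs = []
    for cat in categories:
        # Step 1: generate diverse task prompts for this object category.
        tasks = json.loads(ask(
            f"List {tasks_per_category} everyday tasks that a '{cat}' affords, "
            "as a JSON array of short verb phrases."))
        for task in tasks:
            # Step 2: rank all categories by commonsense preference for the task.
            ranking = json.loads(ask(
                f"Rank these objects by suitability for the task '{task}': "
                f"{categories}. Return a JSON array, best first."))
            # Step 3: quality inspection; drop pairs where the source category
            # is not actually among the preferred objects (top-5 is an assumption).
            if cat not in ranking[:5]:
                continue
            # Step 4: sample images to balance category and scene coverage.
            sampled = random.sample(images_by_category[cat],
                                    k=min(50, len(images_by_category[cat])))
            pairs.append({"task": task, "category": cat, "images": sampled})
    return pairs
```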
Embodied Affordance Reasoning
To deploy Afford-X in real-world or simulated robotics environments, we adopt a system
architecture that begins with RGB image processing for object masks (a). Optionally, the robot
may build a geometric map (b) for robust path planning. Based on these perceptual outputs, the
robot positions itself to optimize viewpoint and executes the planned manipulation (c). For
multiple sub-tasks (e.g., "build up a space for working"), the system recursively applies the
affordance-driven selection and action cycle (d).
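The control flow of panels (a)-(d) can be summarized as a perception-plan-act loop. In the sketch below, `affordx`, `robot`, and `planner` are placeholder interfaces used only to illustrate the cycle; they are not a released API.

```python
# Schematic perception-plan-act loop for embodied affordance reasoning.
# All interfaces are placeholders for illustration.
def execute_task(task_prompt, affordx, robot, planner, use_geometric_map=True):
    # (a) Affordance reasoning on the current RGB view: boxes, masks, and scores.
    rgb = robot.capture_rgb()
    detections = affordx.infer(rgb, task_prompt)
    target = max(detections, key=lambda d: d.score)  # most suitable object

    # (b) Optionally build a geometric map for robust, collision-aware planning.
    scene_map = robot.build_geometric_map() if use_geometric_map else None

    # (c) Re-position for a better viewpoint, then plan and execute the manipulation.
    robot.move_to(planner.best_viewpoint(target, scene_map))
    robot.execute(planner.plan_manipulation(target))


def execute_long_horizon(goal, decompose, **kwargs):
    # (d) For compound goals (e.g., "build up a space for working"), recursively
    # apply the affordance-driven selection and action cycle to each sub-task.
    for sub_task in decompose(goal):
        execute_task(sub_task, **kwargs)
```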
Key Experimental Results
State-of-the-Art Comparison
Below, we compare Afford-X against existing methods on COCO-Tasks, COCO-Aff (686k images, 1,144 tasks), and LVIS-Aff (897k images, 1,496 tasks). Our model consistently achieves superior performance in both affordance understanding (mAP^box) and instance segmentation (mAP^mask). Bold values indicate the best performance, while underlined values indicate the second-best.
Comparison of Afford-X with state-of-the-art methods.

| Index | Method | COCO-Tasks mAP^box | COCO-Tasks mAP^mask | COCO-Aff mAP^box | COCO-Aff mAP^mask | LVIS-Aff mAP^box | LVIS-Aff mAP^mask |
|-------|--------|--------------------|---------------------|------------------|-------------------|------------------|-------------------|
| (a) | Fast R-CNN + GGNN† | 32.6 | - | - | - | - | - |
| (b) | YOLO + GGNN† | 33.2 | - | - | - | - | - |
| (c) | MDETR (w/o pretraining) + GGNN† | 9.6 | 8.6 | - | - | - | - |
| (d) | MDETR + GGNN† | 36.8 | 30.3 | - | - | - | - |
| (g) | ViTDet (ViT-H) + GGNN | 33.8 | 25.9 | 31.5 | 26.1 | 8.5 | 7.4 |
| (h) | MDETR† | 41.3 | 35.2 | 44.7 | 41.0 | 25.1 | 22.7 |
| (i) | MDETR (w/ VA & BF) | 43.2 | 36.9 | <u>45.2</u> | <u>41.4</u> | <u>26.8</u> | <u>24.2</u> |
| (j) | TOIST† | <u>44.1</u> | <u>39.0</u> | 44.9 | 41.3 | 26.2 | 23.4 |
| (k) | Afford-X (w/ VA & BF) | **45.3** | **39.2** | **45.8** | **42.5** | **27.7** | **24.8** |

† Results from original papers.
Afford-X consistently outperforms baselines across all datasets, demonstrating robust affordance reasoning.
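For reference, mAP^box and mAP^mask follow the standard COCO-style evaluation; assuming the affordance benchmarks adopt the usual protocol (file names below are placeholders), the numbers can be reproduced with pycocotools:

```python
# Standard COCO-style evaluation of mAP^box and mAP^mask; paths are placeholders
# and the exact IoU-threshold protocol in the paper may differ.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("annotations/lvis_aff_val.json")               # hypothetical ground truth
dt = gt.loadRes("results/affordx_val_predictions.json")  # hypothetical predictions

for iou_type in ("bbox", "segm"):
    evaluator = COCOeval(gt, dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP over the standard COCO IoU thresholds
```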
Computational Efficiency
We also compare runtime efficiency, reporting frames per second (FPS) and parameter counts
for various LLM-based pipelines versus our approach on an NVIDIA RTX 3090 GPU (a minimal sketch of how such FPS numbers are typically measured follows the table):
Computational efficiency comparison.

| Index | Method | FPS | Parameters |
|-------|--------|-----|------------|
| (a) | Detection + GPT-4 | 1.18 | >369M |
| (b) | Detection + BLIP + GPT-4 | 0.27 | >498M |
| (c) | GPT-4 + OpenSeeD | 0.11 | >116M |
| (d) | GPT-4V + OpenSeeD | 0.04 | >116M |
| (e) | SPHINX + OpenSeeD | 0.11 | 1.2B |
| (f) | SPHINX (CoT) | 0.49 | 1.1B |
| (g) | COCO-Aff (Ours) | 2.38 | 187M |
| (h) | LVIS-Aff (Ours) | 2.38 | 187M |
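FPS figures of this kind are typically obtained by timing repeated forward passes with explicit GPU synchronization. A minimal measurement sketch, where `model`, `image`, and `prompt` are stand-ins for the actual benchmark setup:

```python
# Minimal GPU throughput measurement; `model`, `image`, and `prompt` are
# assumed to be a loaded network and preprocessed inputs already on the GPU.
import time
import torch


@torch.no_grad()
def measure_fps(model, image, prompt, warmup=10, iters=100):
    model.eval()
    for _ in range(warmup):          # warm up kernels and caches
        model(image, prompt)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image, prompt)
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    return iters / (time.perf_counter() - start)
```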
Precision Comparison
Violin plot comparing precision of LLM-based methods vs. Afford-X.
Stars denote statistical significance. Afford-X achieves notably higher precision
with fewer parameters and faster runtime.
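For completeness, a hedged sketch of how such a violin plot and its significance stars could be produced; the per-image precision values below are synthetic and the rank-sum test is an assumed choice, not necessarily the test used in the paper.

```python
# Synthetic example of a precision violin plot with significance stars.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ranksums

rng = np.random.default_rng(0)
llm_precision = rng.normal(0.55, 0.12, 200)      # placeholder per-image precision
affordx_precision = rng.normal(0.72, 0.08, 200)  # placeholder per-image precision

stat, p = ranksums(affordx_precision, llm_precision)
stars = "***" if p < 1e-3 else "**" if p < 1e-2 else "*" if p < 0.05 else "n.s."

fig, ax = plt.subplots()
ax.violinplot([llm_precision, affordx_precision], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["LLM-based", "Afford-X"])
ax.set_ylabel("Precision")
ax.set_title(f"Per-image precision ({stars}, p = {p:.1e})")
fig.savefig("precision_violin.png")
```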
BibTeX
If you find our work useful, please consider citing our paper:
@article{zhu2025affordxgeneralizableslimaffordance,
  title   = {Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation},
  author  = {Xiaomeng Zhu and Yuyang Li and Leiyao Cui and Pengfei Li and Huan-ang Gao and Yixin Zhu and Hao Zhao},
  journal = {arXiv preprint arXiv:2503.03556},
  year    = {2025}
}