Institute for AI, Peking University · Dept. of CSE, Hong Kong University of Science and Technology · Institute for AI Industry Research, Tsinghua University
†Indicates Equal Contribution · Corresponding Author
Object affordance reasoning, the ability to infer object functionalities from physical properties, is fundamental to task-oriented planning and action in both humans and Artificial Intelligence (AI). This capability relies on commonsense knowledge of object physics and functionality, extending well beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, which limits their applicability in novel scenarios, while comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are difficult to deploy on local devices for task-oriented manipulation. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Building on this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. Afford-X achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods and a 1.2% improvement over our previous conference version, while maintaining a compact 187M-parameter size and running nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulation. We showcase Afford-X's effectiveness in enabling task-oriented manipulation for robots across diverse tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
Affordance reasoning for task-oriented manipulation
Afford-X provides efficient visual affordance reasoning through:
(a) two comprehensive datasets—COCO-Aff (686k images, 1,144 tasks, 80 categories)
and LVIS-Aff (897k images, 1,496 tasks, 1,064 categories);
(b) real-time processing (2.38 FPS) with a compact 187M-parameter architecture generating
bounding boxes and object masks;
(c) robust generalization demonstrated through task-specific object selection and
multi-object identification at a 0.7 confidence threshold;
(d) integration with robotic systems for simulated task-oriented manipulation.
Dataset Statistics & Visualization
| Dataset | Images | Tasks | Categories |
|---------|--------|-------|------------|
| COCO-Tasks | 40,000 | 14 | <80 |
| COCO-Aff | 686,000 | 1,144 | 80 |
| LVIS-Aff | 897,000 | 1,496 | 1,064 |
Task-oriented Manipulation in Diverse Environments
clean electronic screens with
drink water with
protect items from rain with
spread butter with
stream video with
Task-Oriented Manipulation for Long-Horizon Tasks
Long-horizon Task 1
Long-horizon Task 2
Key Contributions
Novel Framework: We introduce Afford-X, a slim and efficient model built on knowledge
distillation that achieves real-time inference for task-oriented manipulation.
Advanced Modules: Verb Attention and Bi-Fusion jointly enhance action-focus and
multimodal integration.
Large-Scale Datasets: Our automated pipeline generates extensive affordance reasoning
corpora (COCO-Aff and LVIS-Aff) via GPT-4-based processing.
Empirical Validation: Experiments in simulated and real-world scenarios show that our
approach outperforms conventional detection-based or heavyweight generative models across diverse
dynamic tasks.
Framework
Our Afford-X pipeline processes visual features from images and textual features from prompts
through two key components: a Bi-Fusion (BF) module that aligns text-visual
representations via cross-attention, and a Verb Attention (VA) module that
emphasizes action-related cues. The enhanced features feed into a transformer encoder-decoder,
producing parallel outputs for object detection and segmentation, enabling task-specific object
identification even without explicit object labels.
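For concreteness, below is a minimal PyTorch sketch of how a cross-attention Bi-Fusion module and a Verb Attention re-weighting could be wired together before the encoder-decoder. Layer sizes, the verb-masking strategy, and the module interfaces are illustrative assumptions, not the released Afford-X implementation.

```python
# Illustrative sketch of Bi-Fusion (cross-attention) and Verb Attention.
# Dimensions and the verb-masking strategy are assumptions for exposition,
# not a reproduction of the released Afford-X architecture.
import torch
import torch.nn as nn


class BiFusion(nn.Module):
    """Aligns visual and text tokens with two cross-attention passes."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # Visual tokens query the prompt; prompt tokens query the image.
        vis_fused, _ = self.txt2vis(query=vis, key=txt, value=txt)
        txt_fused, _ = self.vis2txt(query=txt, key=vis, value=vis)
        return vis + vis_fused, txt + txt_fused


class VerbAttention(nn.Module):
    """Re-weights text tokens so action (verb) cues dominate the prompt feature."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, verb_mask):
        # verb_mask: (B, L) boolean, True where a token belongs to the task verb.
        verb_tokens = txt * verb_mask.unsqueeze(-1)
        attended, _ = self.attn(query=txt, key=verb_tokens, value=verb_tokens)
        return txt + attended


# Usage: fuse features before feeding the transformer encoder-decoder.
B, N, L, D = 2, 100, 16, 256
vis, txt = torch.randn(B, N, D), torch.randn(B, L, D)
verb_mask = torch.zeros(B, L, dtype=torch.bool)
verb_mask[:, 1] = True  # e.g., the token for "drink" in "drink water with"
vis, txt = BiFusion(D)(vis, txt)
txt = VerbAttention(D)(txt, verb_mask)
```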
Our noun-pronoun distillation framework comprises teacher and student encoder-decoders. The
teacher learns from explicit object labels (e.g., "couch"), storing noun features in a memory
bank, while the student processes pronoun inputs (e.g., "something") guided by these stored
prototypes. A soft binary target loss aligns their bounding box predictions, enabling
category-agnostic affordance detection while preserving discriminative power.
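A hedged sketch of the distillation objective described above: sigmoid-softened teacher scores act as soft binary targets for the student, and an L1 term aligns the two branches' box predictions. The exact loss formulation in the paper may differ.

```python
# Hedged sketch of noun-pronoun distillation: a teacher queried with the explicit
# noun ("couch") supervises a student queried with a pronoun ("something").
# The soft-binary-target form below is a plausible simplification, not the paper's exact loss.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, student_boxes, teacher_boxes,
                      temperature: float = 2.0, box_weight: float = 1.0):
    """Align student predictions with teacher predictions query-by-query."""
    # Soften the teacher's per-query selection scores into binary-style targets.
    soft_targets = torch.sigmoid(teacher_logits / temperature).detach()
    score_loss = F.binary_cross_entropy_with_logits(
        student_logits / temperature, soft_targets)
    # Encourage the pronoun branch to localize the same regions as the noun branch.
    box_loss = F.l1_loss(student_boxes, teacher_boxes.detach())
    return score_loss + box_weight * box_loss


# Example with 100 decoder queries and (cx, cy, w, h) boxes per query.
s_logits, t_logits = torch.randn(2, 100), torch.randn(2, 100)
s_boxes, t_boxes = torch.rand(2, 100, 4), torch.rand(2, 100, 4)
loss = distillation_loss(s_logits, t_logits, s_boxes, t_boxes)
```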
Dataset Construction
Our automated pipeline converts conventional object detection annotations into large-scale
affordance datasets with minimal human intervention. The process starts by generating a set of
diverse task prompts for each object category (step 1), then matches these prompts to relevant
categories using commonsense preference rankings (step 2). Next, a quality inspection step
filters out inappropriate or duplicative pairs (step 3). Finally, image sampling is conducted
to ensure balanced coverage of different categories and layouts (step 4), producing
comprehensive affordance-task pairs.
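Such a pipeline could be scripted roughly as follows; the prompt wording, the `gpt-4` model choice, the top-5 filtering heuristic, and the 50-image sampling budget are all illustrative assumptions rather than the exact procedure used to build COCO-Aff and LVIS-Aff.

```python
# Illustrative sketch of the four-step annotation pipeline: (1) prompt generation,
# (2) preference matching, (3) quality inspection, (4) image sampling.
import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content


def build_affordance_pairs(categories, images_by_category, tasks_per_category=3):
    pairs = []
    for cat in categories:
        # Step 1: generate diverse task prompts for this object category.
        tasks = json.loads(ask(
            f"List {tasks_per_category} everyday tasks that a '{cat}' affords, "
            "as a JSON array of short verb phrases."))
        for task in tasks:
            # Step 2: rank all categories by commonsense preference for the task.
            ranking = json.loads(ask(
                f"Rank these objects by suitability for the task '{task}': "
                f"{categories}. Return a JSON array, best first."))
            # Step 3: quality inspection; drop pairs where the source category
            # is not actually among the preferred objects (top-5 is an assumption).
            if cat not in ranking[:5]:
                continue
            # Step 4: sample images to balance category and scene coverage.
            sampled = random.sample(images_by_category[cat],
                                    k=min(50, len(images_by_category[cat])))
            pairs.append({"task": task, "category": cat, "images": sampled})
    return pairs
```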
Embodied Affordance Reasoning
To deploy Afford-X in real-world or simulated robotics environments, we adopt a system
architecture that begins with RGB image processing for object masks (a). Optionally, the robot
may build a geometric map (b) for robust path planning. Based on these perceptual outputs, the
robot positions itself to optimize viewpoint and executes the planned manipulation (c). For
multiple sub-tasks (e.g., "build up a space for working"), the system recursively applies the
affordance-driven selection and action cycle (d).
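The control flow of panels (a)-(d) can be summarized as a perception-plan-act loop. In the sketch below, `affordx`, `robot`, and `planner` are placeholder interfaces used only to illustrate the cycle; they are not a released API.

```python
# Schematic perception-plan-act loop for embodied affordance reasoning.
# All interfaces are placeholders for illustration.
def execute_task(task_prompt, affordx, robot, planner, use_geometric_map=True):
    # (a) Affordance reasoning on the current RGB view: boxes, masks, and scores.
    rgb = robot.capture_rgb()
    detections = affordx.infer(rgb, task_prompt)
    target = max(detections, key=lambda d: d.score)  # most suitable object

    # (b) Optionally build a geometric map for robust, collision-aware planning.
    scene_map = robot.build_geometric_map() if use_geometric_map else None

    # (c) Re-position for a better viewpoint, then plan and execute the manipulation.
    robot.move_to(planner.best_viewpoint(target, scene_map))
    robot.execute(planner.plan_manipulation(target))


def execute_long_horizon(goal, decompose, **kwargs):
    # (d) For compound goals (e.g., "build up a space for working"), recursively
    # apply the affordance-driven selection and action cycle to each sub-task.
    for sub_task in decompose(goal):
        execute_task(sub_task, **kwargs)
```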
Key Experimental Results
State-of-the-Art Comparison
Below, we compare Afford-X against existing methods on COCO-Tasks, COCO-Aff (686k images, 1,144 tasks), and LVIS-Aff (897k images, 1,496 tasks). Our model consistently achieves superior performance in both affordance understanding (mAP^box) and instance segmentation (mAP^mask). Bold values indicate the best performance, while underlined values indicate the second-best.
Comparison of Afford-X with state-of-the-art methods.

| Index | Method | COCO-Tasks mAP^box | COCO-Tasks mAP^mask | COCO-Aff mAP^box | COCO-Aff mAP^mask | LVIS-Aff mAP^box | LVIS-Aff mAP^mask |
|-------|--------|--------------------|---------------------|------------------|-------------------|------------------|-------------------|
| (a) | Fast R-CNN + GGNN† | 32.6 | - | - | - | - | - |
| (b) | YOLO + GGNN† | 33.2 | - | - | - | - | - |
| (c) | MDETR (w/o pretraining) + GGNN† | 9.6 | 8.6 | - | - | - | - |
| (d) | MDETR + GGNN† | 36.8 | 30.3 | - | - | - | - |
| (g) | ViTDet (ViT-H) + GGNN | 33.8 | 25.9 | 31.5 | 26.1 | 8.5 | 7.4 |
| (h) | MDETR† | 41.3 | 35.2 | 44.7 | 41.0 | 25.1 | 22.7 |
| (i) | MDETR (w/ VA & BF) | 43.2 | 36.9 | <u>45.2</u> | <u>41.4</u> | <u>26.8</u> | <u>24.2</u> |
| (j) | TOIST† | <u>44.1</u> | <u>39.0</u> | 44.9 | 41.3 | 26.2 | 23.4 |
| (k) | Afford-X (w/ VA & BF) | **45.3** | **39.2** | **45.8** | **42.5** | **27.7** | **24.8** |

† Results from original papers.
Afford-X consistently outperforms baselines across all datasets, demonstrating robust affordance reasoning.
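For reference, mAP^box and mAP^mask follow the standard COCO-style evaluation; assuming the affordance benchmarks adopt the usual protocol (file names below are placeholders), the numbers can be reproduced with pycocotools:

```python
# Standard COCO-style evaluation of mAP^box and mAP^mask; paths are placeholders
# and the exact IoU-threshold protocol in the paper may differ.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("annotations/lvis_aff_val.json")               # hypothetical ground truth
dt = gt.loadRes("results/affordx_val_predictions.json")  # hypothetical predictions

for iou_type in ("bbox", "segm"):
    evaluator = COCOeval(gt, dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP over the standard COCO IoU thresholds
```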
Computational Efficiency
We also compare runtime efficiency, reporting frames per second (FPS) and parameter counts
for various LLM-based pipelines versus our approach on an NVIDIA RTX 3090 GPU (a minimal sketch of how such FPS numbers are typically measured follows the table):
Computational efficiency comparison.

| Index | Method | FPS | Parameters |
|-------|--------|-----|------------|
| (a) | Detection + GPT-4 | 1.18 | >369M |
| (b) | Detection + BLIP + GPT-4 | 0.27 | >498M |
| (c) | GPT-4 + OpenSeeD | 0.11 | >116M |
| (d) | GPT-4V + OpenSeeD | 0.04 | >116M |
| (e) | SPHINX + OpenSeeD | 0.11 | 1.2B |
| (f) | SPHINX (CoT) | 0.49 | 1.1B |
| (g) | COCO-Aff (Ours) | 2.38 | 187M |
| (h) | LVIS-Aff (Ours) | 2.38 | 187M |
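FPS figures of this kind are typically obtained by timing repeated forward passes with explicit GPU synchronization. A minimal measurement sketch, where `model`, `image`, and `prompt` are stand-ins for the actual benchmark setup:

```python
# Minimal GPU throughput measurement; `model`, `image`, and `prompt` are
# assumed to be a loaded network and preprocessed inputs already on the GPU.
import time
import torch


@torch.no_grad()
def measure_fps(model, image, prompt, warmup=10, iters=100):
    model.eval()
    for _ in range(warmup):          # warm up kernels and caches
        model(image, prompt)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image, prompt)
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    return iters / (time.perf_counter() - start)
```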
Precision Comparison
Violin plot comparing precision of LLM-based methods vs. Afford-X.
Stars denote statistical significance. Afford-X achieves notably higher precision
with fewer parameters and faster runtime.
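For completeness, a hedged sketch of how such a violin plot and its significance stars could be produced; the per-image precision values below are synthetic and the rank-sum test is an assumed choice, not necessarily the test used in the paper.

```python
# Synthetic example of a precision violin plot with significance stars.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ranksums

rng = np.random.default_rng(0)
llm_precision = rng.normal(0.55, 0.12, 200)      # placeholder per-image precision
affordx_precision = rng.normal(0.72, 0.08, 200)  # placeholder per-image precision

stat, p = ranksums(affordx_precision, llm_precision)
stars = "***" if p < 1e-3 else "**" if p < 1e-2 else "*" if p < 0.05 else "n.s."

fig, ax = plt.subplots()
ax.violinplot([llm_precision, affordx_precision], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["LLM-based", "Afford-X"])
ax.set_ylabel("Precision")
ax.set_title(f"Per-image precision ({stars}, p = {p:.1e})")
fig.savefig("precision_violin.png")
```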
BibTeX
If you find our work useful, please consider citing our paper:
@article{zhu2025affordxgeneralizableslimaffordance,
  title   = {Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation},
  author  = {Xiaomeng Zhu and Yuyang Li and Leiyao Cui and Pengfei Li and Huan-ang Gao and Yixin Zhu and Hao Zhao},
  journal = {arXiv preprint arXiv:2503.03556},
  year    = {2025}
}