Xiaomeng Zhu / 朱晓萌

About Me

I am a Ph.D. student in the Department of Computer Science and Engineering at The Hong Kong University of Science and Technology (HKUST), supervised by Prof. Fangzhen Lin. Before my Ph.D. studies, I received my M.S. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA) and the University of Chinese Academy of Sciences (UCAS), and my B.E. degree from the University of Electronic Science and Technology of China (UESTC), where I ranked 1/138.

My research focuses on multimodal foundation models, embodied intelligence, proactive perception and decision-making, video understanding, scene graph and temporal reasoning, and task-oriented affordance reasoning. I am especially interested in building agents that can understand evolving scenes, anticipate useful actions, and convert raw videos into structured training data for embodied world models.

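I am currently a Ph.D. student in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology, supervised by Prof. Fangzhen Lin. My research areas include multimodal large models, embodied intelligence, proactive perception and decision-making, video understanding and prediction, scene graphs and temporal reasoning, and task-oriented object affordance reasoning.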

Publications

Main Contributions

ICML
ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response

Xiaomeng Zhu, F. Zhu, W. Zhou, Y. Tian, Z. Hu, Y. Huang, Y. Guo, X. Wu, Z. Zhang, et al.

International Conference on Machine Learning (ICML), 2026.

ProAct studies proactive embodied agents that continuously monitor video, decide when to intervene, and select actions under explicit task-graph constraints. The benchmark covers 75 tasks, 5,383 videos, and 91,581 step-level annotations.
```
@inproceedings{zhu2026proact,
  title={ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response},
  author={Zhu, Xiaomeng and Zhu, F. and Zhou, W. and Tian, Y. and others},
  booktitle={International Conference on Machine Learning},
  year={2026}
}
```

IEEE TMM
OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

Xiaomeng Zhu, C. Wang, H. Wang, X. Liu, F. Lin

IEEE Transactions on Multimedia (TMM), under minor revision.

OOTSM decouples scene graph anticipation into future object inference and object-oriented relation reasoning, using linguistic reasoning and temporal transition constraints to improve long-tail and future-dynamic predictions.
```
@article{zhu2026ootsm,
  title={OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation},
  author={Zhu, Xiaomeng and Wang, C. and Wang, H. and Liu, X. and Lin, F.},
  journal={IEEE Transactions on Multimedia},
  year={2026}
}
```

TPAMI
Afford-X: Generalizable and Slim Affordance Reasoning for Task-Oriented Manipulation

Xiaomeng Zhu*, Y. Li*, L. Cui, P. Li, H. Gao, Y. Zhu, H. Zhao (* equal contribution)

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), under major revision.

Afford-X asks which object can fulfill a given task or function rather than simply matching a noun query. It introduces lightweight VLM modules to reduce noun bias and improve generalization to unseen tasks.
```
@article{zhu2026affordx,
  title={Afford-X: Generalizable and Slim Affordance Reasoning for Task-Oriented Manipulation},
  author={Zhu, Xiaomeng and Li, Y. and Cui, L. and Li, P. and Gao, H. and Zhu, Y. and Zhao, H.},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2026}
}
```

Neuro
MGML: Momentum Group Meta-Learning for Few-Shot Image Classification

Xiaomeng Zhu, S. Li

Neurocomputing, Vol. 514, pp. 351-361, 2022.

MGML studies few-shot image classification through momentum group meta-learning, improving representation adaptation when only limited labeled examples are available.
```
@article{zhu2022mgml,
  title={MGML: Momentum Group Meta-Learning for Few-Shot Image Classification},
  author={Zhu, Xiaomeng and Li, S.},
  journal={Neurocomputing},
  volume={514},
  pages={351--361},
  year={2022}
}
```

Others

AAAI

A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind

F. Zhu, Y. Pan, Xiaomeng Zhu, F. Lin

AAAI 2025 Workshop on Multi-Agent AI.

arXiv

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

W. Zhou, X. Xiong, Z. Hu, Xiaomeng Zhu, C. Zhao, H. Dong, Z. Zhang, M. Tang, J. Wang

arXiv:2603.07966, 2026.

ARR

Plan Right, Then Plan Tight: Symbolic RL for Efficient Embodied Reasoning

X. Shi, Xiaomeng Zhu, Y. Huang, Y. Tian, Y. Guo, Z. Sun, L. Yin, Y. Zhou

ACL Rolling Review (ARR), May 2026 cycle; in submission to EMNLP.

TAES

Aeroengine Performance Prediction Using a Physical-Embedded Data-Driven Method

T. Mo, S. Dai, A. Fu, Xiaomeng Zhu, S. Li

IEEE Transactions on Aerospace and Electronic Systems, 2025.

Research

Proactive embodied agents: multimodal frameworks for recognizing intervention opportunities and selecting proactive actions from continuous video.
Video scene understanding for world models: action-segment extraction, boundary localization, structured sample generation, and quality inspection for embodied intelligence data production.
Affordance reasoning: task-oriented object-function grounding for manipulation and open-world visual reasoning.

Education

HKUST, Ph.D. student in Computer Science and Engineering, 2024.09 - Present.
UCAS / CASIA, M.S. in Pattern Recognition, 2021.09 - 2024.06.
UESTC, B.E. in Automation, Yingcai Honors Class, ranked 1/138, 2017.09 - 2021.06.

Experience

Tencent Robotics X, Research Intern, 2025.06 - Present.
Peking University, School of Artificial Intelligence, Algorithm Intern, 2023.05 - 2025.02.

Awards

National Scholarship (undergraduate, twice); UCAS Graduate Academic Scholarship.
National First Prize, National Undergraduate Intelligent Car Competition.
Sichuan Province Outstanding Graduate; Tang Lixin Scholarship.
