Xiaomeng Zhu

Xiaomeng Zhu / 朱晓萌

Ph.D. Student

Computer Science and Engineering
The Hong Kong University of Science and Technology

About Me

I am currently a Ph.D. student in the Department of Computer Science and Engineering at The Hong Kong University of Science and Technology (HKUST), supervised by Prof. Fangzhen Lin. Before starting my Ph.D., I received my M.S. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA) and the University of Chinese Academy of Sciences (UCAS), and my B.E. degree from the University of Electronic Science and Technology of China (UESTC), where I ranked 1st out of 138.

My research focuses on multimodal foundation models, embodied intelligence, proactive perception and decision-making, video understanding, scene graph and temporal reasoning, and task-oriented affordance reasoning. I am especially interested in building agents that can understand evolving scenes, anticipate useful actions, and convert raw videos into structured training data for embodied world models.

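I am currently a Ph.D. student in the Department of Computer Science and Engineering at The Hong Kong University of Science and Technology, advised by Prof. Fangzhen Lin. My research areas include multimodal large models, embodied intelligence, proactive perception and decision-making, video understanding and prediction, scene graphs and temporal reasoning, and task-oriented object affordance reasoning.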

Publications

Main Contributions

  1. Figure: ProAct-75 data and annotation pipeline (ICML)

    ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response

    Xiaomeng Zhu, F. Zhu, W. Zhou, Y. Tian, Z. Hu, Y. Huang, Y. Guo, X. Wu, Z. Zhang, et al.

    International Conference on Machine Learning (ICML), 2026.

    ProAct studies proactive embodied agents that continuously monitor video, decide when to intervene, and select actions under explicit task-graph constraints. The benchmark covers 75 tasks, 5,383 videos, and 91,581 step-level annotations.

    @inproceedings{zhu2026proact,
      title={ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response},
      author={Zhu, Xiaomeng and Zhu, F. and Zhou, W. and Tian, Y. and others},
      booktitle={International Conference on Machine Learning},
      year={2026}
    }
  2. Figure: OOTSM decoupled scene graph anticipation framework (IEEE TMM)

    OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

    Xiaomeng Zhu, C. Wang, H. Wang, X. Liu, F. Lin

    IEEE Transactions on Multimedia (TMM), under minor revision.

    OOTSM decouples scene graph anticipation into future object inference and object-oriented relation reasoning, using linguistic reasoning and temporal transition constraints to improve long-tail and future-dynamic predictions.

    @article{zhu2026ootsm,
      title={OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation},
      author={Zhu, Xiaomeng and Wang, C. and Wang, H. and Liu, X. and Lin, F.},
      journal={IEEE Transactions on Multimedia},
      year={2026}
    }
  3. Figure: Afford-X affordance reasoning examples and datasets (TPAMI)

    Afford-X: Generalizable and Slim Affordance Reasoning for Task-Oriented Manipulation

    Xiaomeng Zhu*, Y. Li*, L. Cui, P. Li, H. Gao, Y. Zhu, H. Zhao (* equal contribution)

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), under major revision.

    Afford-X asks which object can fulfill a given task or function rather than simply matching a noun query. It introduces lightweight VLM modules to reduce noun bias and improve generalization to unseen tasks.

    @article{zhu2026affordx,
      title={Afford-X: Generalizable and Slim Affordance Reasoning for Task-Oriented Manipulation},
      author={Zhu, Xiaomeng and Li, Y. and Cui, L. and Li, P. and Gao, H. and Zhu, Y. and Zhao, H.},
      journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
      year={2026}
    }
  4. Figure: MGML momentum group meta-learning framework (Neurocomputing)

    MGML: Momentum Group Meta-Learning for Few-Shot Image Classification

    Xiaomeng Zhu, S. Li

    Neurocomputing, Vol. 514, pp. 351-361, 2022.

    MGML studies few-shot image classification through momentum group meta-learning, improving representation adaptation when only limited labeled examples are available.

    @article{zhu2022mgml,
      title={MGML: Momentum Group Meta-Learning for Few-Shot Image Classification},
      author={Zhu, Xiaomeng and Li, S.},
      journal={Neurocomputing},
      volume={514},
      pages={351--361},
      year={2022}
    }

Others

  1. Figure: Plan Right BDDL data construction and symbolic verification pipeline (ACL ARR)

    Plan Right, Then Plan Tight: Symbolic RL for Efficient Embodied Reasoning

    X. Shi, Xiaomeng Zhu, Y. Huang, Y. Tian, Y. Guo, Z. Sun, L. Yin, Y. Zhou

    ACL Rolling Review (ARR), May 2026 cycle; in submission to EMNLP.

Research

  • Proactive embodied agents: multimodal frameworks for recognizing intervention opportunities and selecting proactive actions from continuous video.
  • Video scene understanding for world models: action-segment extraction, boundary localization, structured sample generation, and quality inspection for embodied intelligence data production.
  • Affordance reasoning: task-oriented object-function grounding for manipulation and open-world visual reasoning.

Education

  • HKUST, Ph.D. student in Computer Science and Engineering, 2024.09 - Present.
  • UCAS / CASIA, M.S. in Pattern Recognition, 2021.09 - 2024.06.
  • UESTC, B.E. in Automation, Yingcai Honors Class, ranked 1/138, 2017.09 - 2021.06.

Experience

  • Tencent Robotics X, Research Intern, 2025.06 - Present.
  • Peking University, School of Artificial Intelligence, Algorithm Intern, 2023.05 - 2025.02.

Awards

  • National Scholarship (undergraduate, awarded twice); UCAS Graduate Academic Scholarship.
  • National First Prize, National Undergraduate Intelligent Car Competition.
  • Sichuan Province Outstanding Graduate; Tang Lixin Scholarship.