I am a fifth-year PhD candidate at the School of Data Science, Fudan University. My supervisors are Prof. Zhongyu Wei and Prof. Jianqing Fan. Previously, I received my Bachelor’s degree in data science from Fudan University. I am currently visiting the V3ALab at the Australian Institute for Machine Learning (AIML), advised by Prof. Qi Wu.

My research aims to advance embodied intelligence🤖 by utilizing multimodal large language models (MLLMs) that can perceive👀, reason🤔, and act🚗 in rich, interactive environments. So far, I have explored the following topics:

  • Vision-Language Navigation and GUI Navigation: Curriculum Learning (NeurIPS’21), Unit-grained Hybrid Learning (arXiv’23), and Contrastive Learning (COLING’24) for VLN; Chain-of-Action Thought (EMNLP’24) and Multi-Screen Understanding (EMNLP’25) for GUI Navigation.

  • Large-scale Pre-training and Post-training for MLLMs: Grounding-oriented Pre-training (arXiv’24), SFT for UI-related Screen Stream Understanding (EMNLP’25), SFT for Visually Grounded Reasoning (NAACL’25), and RL for Adaptive Visual Reasoning (arXiv’25).

  • Benchmarking of Multimodal & Embodied Models: Unified Reformulation of Multimodal Tasks (ACM MM’24) and Agent-based Efficient Benchmarking (arXiv’25) for MLLMs, and Fundamental Abilities Evaluation of GUI Agents (EMNLP’25).

I expect to graduate in September 2026 and am currently seeking job and internship opportunities.

Contact Me

  • My email address is jiwenzhang21{at}m{dot}fudan{dot}edu{dot}cn
  • My full CV is here. (Updated July 2025)