Research

Preprints     -     Publications     -     Google Scholar Profile (Up to date)

* indicates equal contributions.

Preprints

[P3] “Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning” arXiv:2509.22746
    🔥Under Review.
    👤Authors: Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei.
    🔑Keywords: Chain-of-thought, visual reasoning, multi-modal large language models.

[P2] “AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs” arXiv:2505.21389
    🔥Under Review.
    👤Authors: Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei.
    🔑Keywords: Efficient benchmarking, multi-modal large language models, judging agent.

[P1] “Continuous or discrete, that is the question: A survey on large multi-modal models from the perspective of input-output space extension” Preprints:202411.0685
    🔥Under Review.
    👤Authors: Zejun Li, Jiwen Zhang, Dianyi Wang, Ye Wang, Xuanjing Huang, Zhongyu Wei.
    🔑Keywords: Survey, large multi-modal models, input-output space extension.

Publications

Conference Proceedings

[C6] “UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents” CameraReady
    📒Venue: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2025, November)
    👤Authors: Jiwen Zhang*, Ya-Qi Yu*, Minghui Liao, Wentao Li, Jihao Wu, Zhongyu Wei.
    🔑Keywords: GUI navigation, large multi-modal models, supervised fine-tuning.

[C5] “VoCoT: Unleashing visually grounded multi-step reasoning in large multi-modal models” CameraReady
    📒Venue: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) (2025, April)
    👤Authors: Zejun Li*, Ruipu Luo*, Jiwen Zhang, Minghui Qiu, Xuan-Jing Huang, Zhongyu Wei.
    🔑Keywords: Large multi-modal models, visual reasoning, chain-of-thought.

[C4] “ReForm-Eval: Evaluating large vision language models via unified re-formulation of task-oriented benchmarks” CameraReady
    📒Venue: Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM) (2024, October)
    👤Authors: Zejun Li*, Ye Wang*, Mengfei Du*, Qingwen Liu*, Binhao Wu*, Jiwen Zhang*, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen, Zhongyu Wei, Xuanjing Huang.
    🔑Keywords: Large multi-modal models, evaluation.

[C3] “Android in the Zoo: Chain-of-Action-Thought for GUI Agents” CameraReady
    📒Venue: Findings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP Findings) (2024, November)
    👤Authors: Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang.
    🔑Keywords: GUI navigation, prompt engineering, supervised fine-tuning.

[C2] “DELAN: Dual-level alignment for vision-and-language navigation by cross-modal contrastive learning” CameraReady
    📒Venue: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING) (2024, May)
    👤Authors: Mengfei Du*, Binhao Wu*, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuan-Jing Huang, Zhongyu Wei.
    🔑Keywords: Contrastive learning, vision-language navigation, cross-modal alignment.

[C1] “Curriculum Learning for Vision-and-Language Navigation” CameraReady
    📒Venue: Advances in Neural Information Processing Systems 34 (NeurIPS) (2021, December)
    👤Authors: Jiwen Zhang, Zhongyu Wei, Jianqing Fan, Jiajie Peng.
    🔑Keywords: Curriculum learning, vision-language navigation, embodied agent.

Technical Reports

[T2] “TextHawk2: A large vision-language model excels in bilingual OCR and grounding with 16x fewer tokens” arXiv:2410.05261
    👤Authors: Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu.
    🔑Keywords: Large multi-modal models, pre-training, text recognition and grounding.

[T1] “Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making” arXiv:2307.08016
    👤Authors: Ruipu Luo, Jiwen Zhang, Zhongyu Wei.
    🔑Keywords: Vision-language decision making, hybrid learning.