I am a fourth-year Ph.D. student at the School of Mathematical Sciences, Shanghai Jiao Tong University (SJTU). Before that, I received my Bachelor’s degree from Zhiyuan College of SJTU in 2021.
I am currently advised by Prof. Zhiqin Xu. My research interests lie in understanding deep learning through its training dynamics, loss landscape, generalization, and applications, as well as the interpretability of large language models. If you’re interested in my research, please feel free to contact me (WeChat).
🔥 News
- 2025.07: 🎉🎉 Our project WebSailor topped GitHub trending!
- 2025.05: 🎉🎉 One paper accepted to ICML 2025 as a Spotlight!
- 2025.02: 🎉🎉 One paper accepted to JCM (T1 journal in computational mathematics)!
- 2024.09: 🎉🎉 I won the 2024 China National Scholarship!
- 2024.09: 🎉🎉 One paper accepted to NeurIPS 2024!
- 2024.01: 🎉🎉 One paper accepted to ICLR 2024!
- 2024.01: 🎉🎉 One paper accepted to TPAMI!
📝 Publications
* denotes equal contribution, † denotes corresponding author; see the full list on Google Scholar.

WebSailor: Navigating Super-human Reasoning for Web Agent
Kuan Li*, Zhongwang Zhang*, Huifeng Yin*†, Liwen Zhang*, Litu Ou*, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang†, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
- This paper presents WebSailor, a framework for training superhuman web agents that excel at complex reasoning tasks. Key innovations include: (1) SailorFog-QA, a synthetic dataset of high-uncertainty questions generated via graph-based sampling and obfuscation; (2) reconstructed reasoning trajectories that distill expert solutions into concise action plans; and (3) Duplicating Sampling Policy Optimization (DUPO), an RL method that accelerates training for long-horizon tasks. WebSailor models outperform existing open-source agents and rival proprietary systems on highly challenging benchmarks such as BrowseComp.

An Analysis for Reasoning Bias of Language Models with Small Initialization
Junjie Yao, Zhongwang Zhang†, Zhi-Qin John Xu†
- This paper reveals how initialization scales shape transformer-based models’ task preferences: smaller scales induce reasoning bias through structured embeddings, while larger scales promote memorization. We attribute this to differential label-driven embedding dynamics, validated theoretically and empirically across architectures.

Implicit Regularization of Dropout
Zhongwang Zhang, Zhi-Qin John Xu†
- This paper derives an implicit regularization induced by dropout, validates the derivation experimentally, and studies it numerically to explain how dropout improves generalization during neural network training by promoting weight condensation and guiding training toward flatter solutions.

Stochastic Modified Equations and Dynamics of Dropout Algorithm
Zhongwang Zhang, Yuqing Li†, Tao Luo†, Zhi-Qin John Xu†
- This paper rigorously derives stochastic modified equations that approximate the discrete iterative process of dropout, and empirically investigates how dropout facilitates the identification of flatter minima through intuitive approximations that exploit structural analogies between the Hessian of the loss landscape and the covariance of the dropout noise.

Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu†
- This paper uses anchor functions to investigate how transformers behave on unseen compositional tasks, revealing that the parameter initialization scale determines whether the model learns inferential solutions that capture the underlying compositional primitives or symmetric solutions that simply memorize mappings, and clarifies how the initialization scale shapes the type of solution learned and its ability to generalize to compositional functions.

Embedding principle of loss landscape of deep neural networks
Yaoyu Zhang†, Zhongwang Zhang, Tao Luo, Zhi-Qin John Xu†
- This paper proves an embedding principle: the loss landscape of a deep neural network (DNN) contains all the critical points of narrower DNNs. It constructs a critical embedding under which any critical point of a narrower DNN maps to a critical point/affine subspace of the target DNN with higher degeneracy while preserving the DNN output function, providing a new perspective on why wide DNNs are easy to optimize and revealing a potential implicit low-complexity regularization during training.
🎖 Honors and Awards
- 2024.09, 2024 China National Scholarship.
📖 Education
- 2021.09 - present, Ph.D., School of Mathematical Sciences, Shanghai Jiao Tong University.
- 2017.09 - 2021.06, Undergraduate, Zhiyuan College, Shanghai Jiao Tong University.
💻 Internships
- 2025.04 - present, Tongyi Lab, Alibaba Group.
- 2024.04 - 2025.04, Institute for Advanced Algorithms Research.