Guanzhi Wang

I am a third-year Ph.D. student at Caltech, advised by Prof. Georgia Gkioxari and Prof. Yisong Yue.

I obtained my M.S. degree from Stanford University, where I was advised by Prof. Fei-Fei Li, Prof. Yuke Zhu, Dr. Jim Fan, and Dr. Shyamal Buch. I obtained my B.S. degree from the Hong Kong University of Science and Technology, where I was fortunate to work with Prof. Chi-Keung Tang and Prof. Yu-Wing Tai.

My research interests lie in foundation models, robotics, and embodied agents. I am passionate about building embodied foundation agents that are generally capable of discovering and pursuing complex, open-ended objectives, and that understand how the world works through massive pre-trained knowledge.

Email  /  Google Scholar  /  Twitter  /  GitHub  /  LinkedIn


* Equal contribution, † Equal advising

Eureka: Human-Level Reward Design via Coding Large Language Models
Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi "Jim" Fan, Anima Anandkumar
[paper]   [project page]   [code]  

We present Eureka, an open-ended LLM-powered agent that designs reward functions for robot dexterity at a super-human level.

Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang*, Ajay Mandlekar*, Chaowei Xiao, Yuke Zhu, Linxi "Jim" Fan, Anima Anandkumar
[paper]   [project page]   [code]  

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.

VIMA: General Robot Manipulation with Multimodal Prompts
Yunfan Jiang, Agrim Gupta*, Zichen "Charles" Zhang*, Guanzhi Wang*, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi "Jim" Fan
International Conference on Machine Learning (ICML), 2023
[paper]   [project page]   [code]  

We introduce a novel multimodal prompting formulation that converts diverse robot manipulation tasks into a uniform sequence modeling problem.

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Linxi "Jim" Fan, Guanzhi Wang*, Yunfan Jiang*, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, Anima Anandkumar
Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022
Outstanding Paper Award
[paper]   [project page]   [code]  

We introduce MineDojo, a new framework based on the popular Minecraft game for building generally capable, open-ended embodied agents.

SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies
Linxi "Jim" Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, Anima Anandkumar
International Conference on Machine Learning (ICML), 2021
[paper]   [project page]   [code]  

We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization.

iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes
Bokui Shen*, Fei Xia*, Chengshu Li*, Roberto Martín-Martín*, Linxi "Jim" Fan, Guanzhi Wang, Claudia D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Josiah Wong, Li Fei-Fei, Silvio Savarese
International Conference on Intelligent Robots and Systems (IROS), 2021
[paper]   [project page]   [code]  

We present iGibson, a novel simulation environment for developing interactive robotic agents in large-scale realistic scenes.

Deep Video Matting via Spatio-Temporal Alignment and Aggregation
Yanan Sun, Guanzhi Wang*, Qiao Gu*, Chi-Keung Tang, Yu-Wing Tai
Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[paper]   [code]   [dataset]

We propose a deep learning-based video matting framework which employs a novel and effective spatio-temporal feature aggregation module.

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
Linxi "Jim" Fan*, Shyamal Buch*, Guanzhi Wang, Ryan Cao, Yuke Zhu, Juan Carlos Niebles, Li Fei-Fei
European Conference on Computer Vision (ECCV), 2020
[paper]   [project page]   [video]   [supplementary]   [code]  

We propose RubiksNet, a new efficient architecture for video action recognition based on a proposed learnable 3D spatiotemporal shift operation (RubiksShift).

LADN: Local Adversarial Disentangling Network for Facial Makeup and De-Makeup
Qiao Gu*, Guanzhi Wang*, Mang Tik Chiu, Yu-Wing Tai, Chi-Keung Tang
International Conference on Computer Vision (ICCV), 2019
[paper]   [project page]   [code]   [dataset]

We propose a local adversarial disentangling network for facial makeup and de-makeup, using multiple and overlapping local discriminators in a content-style disentangling network.


CS231n: Convolutional Neural Networks for Visual Recognition (Spring 2021)

Teaching Assistant


CS129: Applied Machine Learning (Fall 2020)

CS229: Machine Learning (Spring 2020)

Teaching Assistant

Academic Services
Conference Reviewer: NeurIPS 2022, ICLR 2022, ICCV 2021, CVPR 2021, ECCV 2020

  • NeurIPS Outstanding Paper Award (2022)
  • Kortschak Scholar (2021)
  • Stanford Human-Centered AI Google Cloud Credits Grant (2021)
  • Stanford Human-Centered AI AWS Cloud Credits Award (2020)
  • HKUST Academic Achievement Medal (2019)
  • Talent Development Scholarship (2019)
  • Reaching Out Award (2018)
  • High Fashion Charitable Foundation Exchange Scholarship (2018)
  • Overseas Learning Experience Scholarship (2018)
  • Dean’s List (2015-2019)
  • University Recruitment Scholarship (2015-2019)
