My research focuses on Embodied AI, a frontier domain at the intersection of machine learning, robotics, and computer vision. I investigate the fundamental challenge of developing general-purpose robotic systems that can adapt and generalize their learned behaviors across diverse, unstructured real-world environments.
We demonstrate that a policy's ability to generalize to new objects, new environments, or both scales approximately as a power law with the number of training objects, training environments, or training environment-object pairs, respectively.
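Stated as a formula, this is the generic power-law form; the symbols and constants below are illustrative, not fitted values from the study:

```latex
% Illustrative power-law form: G is a measure of generalization performance,
% n is the number of training environments (or objects, or environment-object pairs),
% and G_0, \alpha are fitted constants.
G(n) \approx G_0 \, n^{\alpha}
\quad\Longleftrightarrow\quad
\log G(n) \approx \log G_0 + \alpha \log n .
```

On a log-log plot this relationship appears as a straight line, which is the usual signature of power-law scaling.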
We introduce ViLa, a novel approach for long-horizon robotic planning that leverages GPT-4V to generate a sequence of actionable steps. ViLa empowers robots to execute complex tasks with a profound understanding of the visual world.
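The core idea can be sketched as follows. This is a minimal illustration assuming access to the OpenAI Python SDK; the prompt wording, model name, and function name are placeholders rather than the actual ViLa implementation.

```python
# Minimal sketch: ask a vision-language model to decompose a task into steps,
# conditioned on the current camera image. Prompt and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def plan_steps(image_path: str, instruction: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4-class model with vision input
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {instruction}\n"
                         "List a numbered sequence of short, actionable steps "
                         "that a robot arm could execute, grounded in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: plan_steps("tabletop.png", "put the apple into the drawer")
```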
We present Semantic-Geometric Representation (SGR), a universal perception module for robotics that leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning.
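In spirit, this combines per-point semantic features lifted from a pre-trained 2D backbone with features from a 3D point-cloud encoder. The sketch below is a simplified, hypothetical fusion module in PyTorch; the module name and dimensions are illustrative, not the released SGR architecture.

```python
# Simplified sketch of semantic-geometric feature fusion (illustrative only).
import torch
import torch.nn as nn

class SemanticGeometricFusion(nn.Module):
    def __init__(self, semantic_dim=512, geometric_dim=128, out_dim=256):
        super().__init__()
        # Project concatenated per-point features into a shared representation.
        self.fuse = nn.Sequential(
            nn.Linear(semantic_dim + geometric_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, semantic_feats, geometric_feats):
        # semantic_feats:  (B, N, semantic_dim)  - e.g. 2D foundation-model features
        #                  projected onto the N points of the point cloud
        # geometric_feats: (B, N, geometric_dim) - e.g. from a PointNet-style encoder
        fused = torch.cat([semantic_feats, geometric_feats], dim=-1)
        return self.fuse(fused)  # (B, N, out_dim), fed to a downstream policy head

# Example shapes:
# fusion = SemanticGeometricFusion()
# out = fusion(torch.randn(2, 1024, 512), torch.randn(2, 1024, 128))
```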
We conduct the first thorough evaluation of pre-trained vision model performance across different downstream policy learning methods and environments. We discover that the effectiveness of pre-training is highly dependent on the choice of the downstream policy learning algorithm.
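Concretely, this kind of evaluation pairs a frozen pre-trained visual encoder with different downstream policy learners. The snippet below sketches the common behavior-cloning variant of that setup; the encoder choice and layer sizes are placeholders, not the specific models studied.

```python
# Sketch of one evaluation configuration: a frozen pre-trained visual encoder
# feeding a small behavior-cloning policy head. Encoder and sizes are placeholders.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

encoder = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
encoder.fc = nn.Identity()          # use the 2048-d pooled features
encoder.requires_grad_(False)       # the pre-trained representation stays frozen
encoder.eval()

policy_head = nn.Sequential(        # only this part is trained per method/environment
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 7),              # e.g. a 7-DoF action for a robot arm
)

def act(image_batch: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        feats = encoder(image_batch)   # (B, 2048)
    return policy_head(feats)          # (B, 7) predicted actions
```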
We show that fine-grained features learned with pixel-level self-supervised learning (SSL) objectives are complementary to semantic features from image-level SSL methods, and that fusing them can significantly improve performance on visual correspondence tasks.
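As a toy illustration of how such a fusion can work, the sketch below concatenates dense feature maps from two frozen encoders, one trained with a pixel-level SSL objective and one with an image-level SSL objective, and matches pixels by cosine similarity. The encoder callables are placeholders, not the actual models used in the work.

```python
# Toy sketch: fuse dense features from a pixel-level and an image-level SSL model,
# then match each source pixel to its most similar target pixel. The encoders are
# placeholders for frozen pre-trained models returning dense feature maps.
import torch
import torch.nn.functional as F

def correspondence(src_img, tgt_img, pixel_ssl_encoder, image_ssl_encoder):
    def fused(img):
        f_pixel = pixel_ssl_encoder(img)   # (1, C1, h, w) fine-grained features
        f_image = image_ssl_encoder(img)   # (1, C2, h', w') semantic features
        f_image = F.interpolate(f_image, size=f_pixel.shape[-2:],
                                mode="bilinear", align_corners=False)
        # L2-normalize channels so dot products become cosine similarities.
        return F.normalize(torch.cat([f_pixel, f_image], dim=1), dim=1)

    f_src, f_tgt = fused(src_img), fused(tgt_img)
    b, c, h, w = f_src.shape
    sim = torch.einsum("bchw,bcij->bhwij", f_src, f_tgt)   # pairwise similarities
    # For every source pixel, the index of its best-matching target pixel (flattened).
    return sim.reshape(b, h, w, -1).argmax(dim=-1)
```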