Professor and Vice Dean
The Institute of Trustworthy Embodied Artificial Intelligence
Fudan University
Biography
I am Professor and Vice Dean of the Institute of Trustworthy Embodied Artificial Intelligence at Fudan University, and a member of the Fudan Vision and Learning Laboratory. I received my Ph.D. in Computer Science from the University of Maryland, advised by Prof. Larry Davis. My research interests are in computer vision and deep learning, with a current focus on embodied AI, video generation, and efficient architectures.
I'm currently looking for students with strong coding skills who are excited to design algorithms for visual understanding. If you are interested in working with me, please send me an email.
Selected Publications
- SegDiff: Segmented Trajectory Diffusion for Consistent and Adaptive Robot
Manipulation.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026.
- Seeing Touch from Motion: A Unified Modality-Aware
Visuo-Tactile Policy with Tactile Motion Correlation.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026.
- VLZip: Unified Visual and Textual Compression for Interleaved Long-Context
Modeling.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026.
- Learning Accurate Segmentation Purely from
Self-Supervision.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026. code
- WeEdit: A Dataset, Benchmark and Glyph-Guided
Framework for Text-centric Image Editing.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026. code
- HAD: Combining Hierarchical Diffusion with
Metric-Decoupled RL for End-to-End Driving.
European Conference on Computer Vision (ECCV), Malmö, Sweden, Sept., 2026.
- Ask-to-Clarify: Resolving Instruction Ambiguity
through Multi-turn Dialogue.
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Pittsburgh, USA, Oct., 2026.
- Enabling Faithful Camera Control in Video Diffusion
through Geometry-Flow-Guided Noise Warping.
International Conference on Machine Learning (ICML), Seoul, South Korea, July, 2026. code
- VideoLoom: A Video Large Language Model for Joint
Spatial-Temporal Understanding.
International Conference on Machine Learning (ICML), Seoul, South Korea, July, 2026. code
- CaTok: Taming Mean Flows for One-Dimensional Causal
Image Tokenization.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Denver, USA, June, 2026. code
- FluxMem: Adaptive Hierarchical Memory for Streaming
Video Understanding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Denver, USA, June, 2026. code
- HandWorld:
Hand-Centric Unified Video Action Generation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Denver, USA, June, 2026.
- FlashMotion: Few-Step Controllable Video Generation
with Trajectory Guidance.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Denver, USA, June, 2026. code
- Human2Robot: Learning Robot Actions from Paired
Human-Robot Videos.
The AAAI Conference on Artificial Intelligence (AAAI), Singapore, Jan., 2026.
- DriveSuprim: Towards Precise Trajectory Selection for
End-to-End Planning.
The AAAI Conference on Artificial Intelligence (AAAI), Singapore, Jan., 2026. code
- Boosting Multimodal Instance Understanding via
Explicit Visual Prompt Instruction Tuning.
Advances in Neural Information Processing Systems (NeurIPS), San Diego, USA, Dec., 2025.
- ForgerySleuth: Empowering Multimodal Large Language
Models for Image Manipulation Detection.
Advances in Neural Information Processing Systems (NeurIPS), San Diego, USA, Dec., 2025.
- Seg2Any: Open-set Segmentation-Mask-to-Image
Generation with Precise Shape and Semantic Control.
Advances in Neural Information Processing Systems (NeurIPS), San Diego, USA, Dec., 2025.
- UniGen: Enhanced Training & Test-Time Strategies
for Unified Multimodal Understanding and Generation.
Advances in Neural Information Processing Systems (NeurIPS), San Diego, USA, Dec., 2025.
- OmniGen-AR: AutoRegressive Any-to-Image
Generation.
Advances in Neural Information Processing Systems (NeurIPS), San Diego, USA, Dec., 2025.
- MagicMotion: Controllable Video Generation with
Dense-to-Sparse Trajectory Guidance.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- MotionFollower: Editing Video Motion via Score-Guided
Diffusion.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- REDUCIO! Generating 1K Video within 16 Seconds using
Extremely Compressed Motion Latents.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- AID: Adapting Image2Video Diffusion Models for
Instruction-guided Video Prediction.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- VLABench: A Large-Scale Benchmark for
Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning
Tasks.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop
Training.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- Achieving More with Less: Additive Prompt Tuning for
Rehearsal-Free Class-Incremental Learning.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- CreatiLayout: Siamese Multimodal Diffusion
Transformer for Creative Layout-to-Image Generation.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- Rethinking Discrete Tokens: Treating Them as
Conditions for Continuous Autoregressive Image Synthesis.
International Conference on Computer Vision (ICCV), Hawaii, USA, Oct., 2025.
- EDEN:
Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, June, 2025.
- BlockDance:
Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion
Transformers.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, June, 2025.
- StableAnimator:
High-Quality Identity-Preserving Human Image Animation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, June, 2025.
- SEGIC: Unleashing the Emergent Correspondence for
In-Context Segmentation.
European Conference on Computer Vision (ECCV), Milano, Italy, Sept., 2024.
- PromptFusion: Decoupling Stability and Plasticity for
Continual Learning.
European Conference on Computer Vision (ECCV), Milano, Italy, Sept., 2024.
- MagDiff: Multi-Alignment Diffusion for High-Fidelity
Video Generation and Editing.
European Conference on Computer Vision (ECCV), Milano, Italy, Sept., 2024.
- OmniViD:
A Generative Framework for Universal Video Understanding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024.
- MotionEditor:
Editing Video Motion via Content-Aware Diffusion.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024. code
- SimDA:
Simple Diffusion Adapter for Efficient Video Generation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024.
- Learning
to Rank Patches for Unbiased Image Redundancy Reduction.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024.
- Synthesize,
Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024. code
- BEVNeXt:
Reviving Dense BEV Frameworks for 3D Object Detection.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June, 2024.
- Multi-Prompt Alignment for Multi-Source Unsupervised
Domain Adaptation.
Advances in Neural Information Processing Systems (NeurIPS), New Orleans, USA, Dec., 2023.
- Learning from Rich Semantics and Coarse Locations for
Long-tailed Object Detection.
Advances in Neural Information Processing Systems (NeurIPS), New Orleans, USA, Dec., 2023. code
- Implicit Temporal Modeling with Learnable Alignment
for Video Recognition.
International Conference on Computer Vision (ICCV), Paris, France, Oct., 2023 (Oral) code
- Open-VCLIP: Transforming CLIP to an Open-vocabulary
Video Model via Interpolated Weight Optimization.
International Conference on Machine Learning (ICML), Hawaii, USA, July, 2023
- ResFormer: Scaling ViTs with Multi-Resolution
Training.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- SVFormer: Semi-Supervised Video Transformer for
Action Recognition.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Detection Hub: Unifying Object Detection Datasets via
Query Adaptation on Language Embedding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Look Before You Match: Instance Understanding Matters
in Video Object Segmentation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Masked Video Distillation: Rethinking Masked Feature
Modeling for Self-supervised Video Representation Learning.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Prototypical Residual Networks for Anomaly Detection
and Localization.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Enhancing the Self-Universality for Transferable
Targeted Attacks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Vision Transformers are Good Mask
Auto-Labelers.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Towards Scalable Neural Representation for Diverse
Videos.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June, 2023
- Resolving Task Confusion in Dynamic Expansion
Architectures for Class Incremental Learning.
The AAAI Conference on Artificial Intelligence (AAAI), Washington DC, USA, Feb., 2023
- OmniVL: One Foundation Model for Image-Language and
Video-Language Tasks.
Advances in Neural Information Processing Systems (NeurIPS), New Orleans, USA, Dec., 2022.
- Semi-Supervised Vision Transformers.
European Conference on Computer Vision (ECCV), Tel Aviv, October, 2022. code
- Efficient Video Transformers with Spatial-Temporal
Token Selection.
European Conference on Computer Vision (ECCV), Tel Aviv, October, 2022. code
- Semi-Supervised Single-View 3D Reconstruction via Prototype Shape
Priors.
European Conference on Computer Vision (ECCV), Tel Aviv, October, 2022. code
- BEVT: BERT Pretraining of Video
Transformers.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, June, 2022 code
- Cross-Modal Transferable Adversarial Attacks from
Images to Videos.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, June, 2022
- AdaViT: Adaptive Vision Transformers for Efficient
Image Recognition.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, June, 2022
- ObjectFormer for Image Manipulation Detection and
Localization.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, June, 2022
- Flag: Adversarial data augmentation for graph neural
networks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, June, 2022
- Boosting the Transferability of Video Adversarial
Examples via Temporal Translation.
The AAAI Conference on Artificial Intelligence (AAAI), Virtual, Feb., 2022
- Attacking Video Recognition Models with Bullet-Screen
Comments.
The AAAI Conference on Artificial Intelligence (AAAI), Virtual, Feb., 2022
- Towards Transferable Adversarial Attacks on Vision
Transformers.
The AAAI Conference on Artificial Intelligence (AAAI), Virtual, Feb., 2022
- Rethinking Pseudo Labels for Semi-Supervised Object
Detection.
The AAAI Conference on Artificial Intelligence (AAAI), Virtual, Feb., 2022
- Encoding Robustness to Image Style via Adversarial
Feature Perturbations.
Advances in Neural Information Processing Systems (NeurIPS), Virtual, Dec., 2021.
- Deep Video Inpainting Detection.
British Machine Vision Conference (BMVC), Virtual, Oct., 2021
- GTA: Global Temporal Attention for Video Action
Understanding.
British Machine Vision Conference (BMVC), Virtual, Oct., 2021
- VideoLT: Large-scale Long-tailed Video
Recognition.
International Conference on Computer Vision (ICCV), Virtual, Oct., 2021
- Exploring Visual Engagement Signals for
Representation Learning.
International Conference on Computer Vision (ICCV), Virtual, Oct., 2021
- 2D or not 2D? Adaptive 3D Convolution Selection for
Efficient Video Recognition.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, June, 2021
- Intentonomy: a Dataset and Study towards Human Intent
Understanding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, June, 2021 (Oral) code
- Efficient
Object Embedding for Manipulated Image Retrieval.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, June, 2021
- Making an Invisibility Cloak: Real World Adversarial
Attacks on Object Detectors.
European Conference on Computer Vision (ECCV), Virtual, August, 2020. code
- Learning
from Noisy Anchors for One-stage Object Detection.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, June, 2020
- LiteEval:
A Coarse-to-Fine Framework for Resource Efficient Video Recognition.
Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, Dec., 2019. code
- FiNet:
Compatible and Diverse Fashion Image Inpainting.
International Conference on Computer Vision (ICCV), Seoul, Korea, Oct., 2019. (Oral)
- ACE:
Adapting to Changing Environments for Semantic Segmentation.
International Conference on Computer Vision (ICCV), Seoul, Korea, Oct., 2019
- AdaFrame: Adaptive Frame Selection for Fast Video
Recognition.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June, 2019
- The Regretful Agent: Heuristic-Aided Navigation
through Progress Estimation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June, 2019
- Visual Content Recognition by
Exploiting Semantic Feature Map with Attention and Multi-task Learning.
ACM Trans. Multimedia Comput. Commun. Appl. (ACM TOMM), vol. 15, issue 1, pp. 6:1-6:22, 2019.
- Self-Monitoring Navigation Agent via Auxiliary
Progress Estimation.
International Conference on Learning Representations (ICLR), New Orleans, USA, May, 2019
- DCAN:
Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation.
European Conference on Computer Vision (ECCV), Munich, Germany, September, 2018. code
- BlockDrop: Dynamic Inference Paths in Residual
Networks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, June, 2018. (Spotlight) code
- VITON: An Image-based Virtual Try-on
Network.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, June, 2018. (Spotlight) code
- Exploiting Feature and Class Relationships in
Video Categorization with Regularized Deep Neural Networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 40, Issue 2, pp. 352-364, 2018.
Fudan-Columbia Video Dataset (FCVID), one of the largest public Web video datasets with manual annotations.
- Deep Learning for Video Classification and Video
Captioning.
In Frontiers of Multimedia Research, Shih-Fu Chang (Ed.), ACM Morgan & Claypool, New York, NY, USA, pp. 3-29, 2018
Surveys 100+ recent papers on video classification and captioning with deep learning.
- Weakly-Supervised Spatial Context
Networks.
arXiv preprint arXiv:1704.02998
- Automatic
Spatially-aware Fashion Concept Discovery.
International Conference on Computer Vision (ICCV), Venice, Italy, Oct., 2017
- Learning Fashion Compatibility with Bidirectional
LSTMs.
ACM Multimedia (ACM MM), Mountain View, USA, Oct., 2017
- Learning Semantic Feature Map for
Visual Content Recognition.
ACM Multimedia (ACM MM), Mountain View, USA, Oct., 2017
- Harnessing Object and Scene Semantics for Large-Scale Video
Understanding.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June, 2016. (Spotlight)
Featured in Tech2, ACM Technews
- Multi-Stream Multi-Class Fusion of Deep Networks for Video
Classification.
ACM Multimedia (ACM MM), Amsterdam, the Netherlands, Oct., 2016. (Oral Paper)
- Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for
Video Classification.
ACM Multimedia (ACM MM), Brisbane, Australia, Oct., 2015. (Oral Paper)
Obtains 91.3% accuracy on the UCF-101 dataset.
- Evaluating Two-Stream CNN for Video Classification.
ACM International Conference on Multimedia Retrieval (ICMR), Shanghai, China, June, 2015 motion CNN model
- Exploring Inter-feature and Inter-class Relationships
with Deep Neural Networks for Video Classification.
ACM Multimedia (ACM MM), Orlando, USA, Nov., 2014. (Oral Paper)
Professional Service
- Associate Editor:
  - IEEE Transactions on Pattern Analysis and Machine Intelligence
  - IEEE Transactions on Image Processing
  - IEEE Transactions on Multimedia
- Area Chair:
  - Advances in Neural Information Processing Systems 2023-2025
  - IEEE Conference on Computer Vision and Pattern Recognition 2023-2026
- Senior Program Committee:
  - AAAI Conference on Artificial Intelligence 2023-2024
  - International Joint Conference on Artificial Intelligence 2023