* denotes equal contribution; highlighted papers are representative first-author works
ArXiv 2025 Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
project page / paper / youtube / code
Key Words: Long Video Generation; End-to-End Filming; Human Talking/Dancing Animation
Summary: Stable Video Infinity (SVI) generates videos of ANY length with high temporal consistency, plausible scene transitions, and controllable streaming storylines in ANY domain. SVI introduces Error-Recycling Fine-Tuning, a new, efficient training scheme that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging the DiT to actively correct its own errors.
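Below is a minimal, self-contained sketch of how I read the error-recycling idea from this summary (an assumption, not the released SVI code): the denoiser is conditioned on its own imperfect rollout of the previous clip rather than on clean frames, while still being supervised against ground truth, so it learns to correct its own drift. ToyDiT, error_recycling_step, and the one-step toy diffusion are hypothetical placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiT(nn.Module):                       # stand-in for a video DiT
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, noisy_target, t, context):
        x = torch.cat([noisy_target, context, t.expand(noisy_target.size(0), 1)], dim=-1)
        return self.net(x)                     # predicts the injected noise

def error_recycling_step(model, opt, clean_context, clean_target):
    with torch.no_grad():                      # self-rollout carries the model's own errors
        t0 = torch.rand(1)
        noise0 = torch.randn_like(clean_context)
        recycled_context = clean_context - model(clean_context + noise0, t0, clean_context)

    t = torch.rand(1)                          # random diffusion timestep in [0, 1)
    noise = torch.randn_like(clean_target)
    noisy_target = (1 - t) * clean_target + t * noise   # toy forward diffusion

    pred = model(noisy_target, t, recycled_context)      # condition on error-laden context
    loss = F.mse_loss(pred, noise)             # supervise toward the ground-truth trajectory
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

In practice the rollout would be a multi-step autoregressive extension of the previous clip; the point of the sketch is only that the conditioning comes from the model's own (error-laden) output rather than from clean data.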
ArXiv 2025
RAP: 3D Rasterization Augmented End-to-End Planning
Lan Feng, Yang Gao, Éloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, Alexandre Alahi
project page / paper / code
Key Words: End-to-End Planning; 3D Rasterization; Data Scaling
Summary: We propose RAP, a Raster-to-Real feature-space alignment that bridges the sim-to-real gap without requiring pixel-level realism. RAP ranks 1st in the Waymo Open Dataset Vision-based End-to-End Driving Challenge (2025, UniPlan entry), as well as on the Waymo Open Dataset Vision-based E2E Driving Leaderboard, NAVSIM v1 navtest, and NAVSIM v2 navhard.
NeurIPS 2025 Spotlight VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
Wuyang Li, Zhuy Yu, Alexandre Alahi
project page / paper / code
Key Words: 3D Semantic Occupancy Prediction; Dense Object Detection
Summary: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel without instance-level perception. In contrast, VoxDet addresses semantic occupancy prediction with an instance-centric formulation inspired by dense object detection, using a VoxNT trick to freely transfer voxel-level class labels to instance-level offset labels.
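The sketch below gives one plausible, self-contained reading of that label transfer (an assumption based on this summary, not the released VoxDet code): same-class voxels are grouped into connected components as pseudo-instances, and each occupied voxel is assigned the offset to its component's center.

import numpy as np
from scipy import ndimage

def voxel_labels_to_offsets(sem):            # sem: (X, Y, Z) int class ids, 0 = free space
    offsets = np.zeros(sem.shape + (3,), dtype=np.float32)
    for cls in np.unique(sem):
        if cls == 0:                         # skip empty voxels
            continue
        comps, n = ndimage.label(sem == cls) # connected components as pseudo-instances
        for inst in range(1, n + 1):
            idx = np.argwhere(comps == inst)       # (N, 3) voxel coordinates of one instance
            center = idx.mean(axis=0)
            offsets[tuple(idx.T)] = center - idx   # per-voxel offset to the instance center
    return offsets

Offset maps of this kind could then supervise a detection-style, instance-centric head alongside the usual per-voxel semantics, which is the formulation the summary describes.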
NeurIPS 2025
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, Hui Xiong
paper
Key Words: Spatial Understanding; Multimodal Large Language Model
Summary: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities such as depth or point clouds to improve spatial reasoning, purely visual spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction.
ICCV 2025 Highlight MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
Wuyang Li*, Wentao Pan*, Xiaoyuan Liu*, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan
project page / paper / code (coming)
Key Words: Metalens; Computational Photography; Endoscopy; Optical Imaging
Summary: Unlike conventional endoscopes limited by millimeter-scale thickness, metalenses
operate at the micron scale, serving as a promising solution for ultra-miniaturized endoscopy. However,
metalenses suffer from intensity decay and chromatic aberration. To address this, we developed MetaScope, an
optics-driven neural network for metalens-based endoscopy, offering a promising pathway for next-generation
ultra-miniaturized medical imaging devices.
NeurIPS 2025 IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyue Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
project page / paper / code
Key Words: 3D Scene Understanding; Vision-Language Model; Inverse Rendering
Summary: We propose IR3D-Bench, a benchmark that challenges VLMs to demonstrate real scene understanding by actively recreating 3D structures from images using tools. This "understanding-by-creating" approach probes the generative and tool-using capacity of vision-language agents (VLAs), moving beyond the descriptive or conversational capacity measured by traditional scene-understanding benchmarks.
AAAI 2025 Top-1 most influential paper U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation
Chenxin Li*, Xinyu Liu*, Wuyang Li*, Cheng Wang*, Hengyu Liu, Yifan Liu, Zhen Chen, Yixuan Yuan
project page / paper / code
Key Words: Kolmogorov-Arnold Networks; Medical Image Segmentation/Generation; Medical Backbone
Summary: We propose the first KAN-based medical backbone,
U-KAN, which can be seamlessly integrated with existing medical image segmentation
and generation models to boost their performance with minimal computational
overhead. This work has been cited more than 250 times in one year.