Sajad Hamzenejadi

I'm a PhD student in Information Systems at the University of Geneva (UNIGE). Previously, I was a research intern at Nokia Bell Labs in Paris, France, and I completed my MSc in Telecommunications Engineering at Politecnico di Milano (Polimi).

E-mail  /  Google Scholar  /  Github  /  LinkedIn

profile photo

Figure 1: Sajad in the AI world!

📑 Selected Publications

* denotes equal contribution; highlighted papers are representative first-author works.

arXiv 2025 Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
project page / paper / youtube / code
Key Words: Long Video Generation; End-to-end Filming; Human Talking/Dancing Animation
Summary: Stable Video Infinity (SVI) generates videos of any length with high temporal consistency, plausible scene transitions, and controllable streaming storylines in any domain. SVI introduces Error-Recycling Fine-Tuning, a new, efficient training scheme that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging the DiT to actively correct its own errors.
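A minimal sketch of the error-recycling idea as I read it from this summary (the toy `TinyDiT`, the conditioning interface, and the denoising objective below are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of Error-Recycling Fine-Tuning (not the authors' code).
# Idea from the summary: the model's own generation errors are fed back as an
# extra conditioning signal, so the DiT learns to correct its own mistakes.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Stand-in for a video Diffusion Transformer over latent clips."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 128), nn.GELU(), nn.Linear(128, dim))
    def forward(self, noisy_latent, error_cond):
        return self.net(torch.cat([noisy_latent, error_cond], dim=-1))

dit = TinyDiT()
opt = torch.optim.AdamW(dit.parameters(), lr=1e-4)

for step in range(100):
    clean = torch.randn(8, 64)                 # ground-truth clip latents (dummy data)
    noise = torch.randn_like(clean)
    noisy = clean + noise                      # toy "diffused" latents
    with torch.no_grad():                      # 1) roll the model out on its own
        self_gen = noisy - dit(noisy, torch.zeros(8, 64))
        error = self_gen - clean               # 2) its self-generated error
    pred_noise = dit(noisy, error)             # 3) recycle the error as conditioning
    loss = ((noisy - pred_noise) - clean).pow(2).mean()  # denoising objective
    opt.zero_grad(); loss.backward(); opt.step()
```

The point is the loop structure: the model's own rollout error becomes an extra input, so correcting it is part of the training signal.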
arXiv 2025 Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi
project page / paper / code (coming)
Key Words: Video Factorization; Text-to-Video Diffusion Models
Summary: We propose Factorized Video Generation (FVG), a simple yet effective pipeline that decomposes text-to-video generation into three stages: reasoning, composition, and temporal synthesis.
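A toy sketch of that three-stage decomposition under stated assumptions (the stage stubs below stand in for an LLM, a text-to-image model, and an image-to-video model; none of this is the released pipeline):

```python
# Hypothetical sketch of the FVG three-stage decomposition (placeholder models,
# not the authors' pipeline): an LLM plans the scene, a text-to-image model
# composes a keyframe, and an image-to-video model adds motion.
from dataclasses import dataclass

@dataclass
class ScenePlan:
    layout_prompt: str   # static scene description from the reasoning stage
    motion_prompt: str   # motion description left for temporal synthesis

def reason(prompt: str) -> ScenePlan:
    # Stage 1 (reasoning): split the user prompt into scene vs. motion.
    # A real system would call an LLM here; this stub uses a naive split.
    parts = prompt.split(" while ")
    return ScenePlan(layout_prompt=parts[0],
                     motion_prompt=parts[1] if len(parts) > 1 else "subtle camera motion")

def compose(plan: ScenePlan) -> str:
    # Stage 2 (composition): build the static keyframe with any T2I model.
    return f"<keyframe image for: {plan.layout_prompt}>"

def synthesize(keyframe: str, plan: ScenePlan) -> str:
    # Stage 3 (temporal synthesis): animate the keyframe with any I2V model.
    return f"<video: {keyframe} animated with '{plan.motion_prompt}'>"

plan = reason("a red fox sitting on a snowy hill while snow falls gently")
print(synthesize(compose(plan), plan))
```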
arXiv 2025 RAP: 3D Rasterization Augmented End-to-End Planning
Lan Feng, Yang Gao, Éloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, Alexandre Alahi
project page / paper / code
Key Words: End-to-End Planning; 3D Rasterization; Data Scaling
Summary: We propose RAP, a Raster-to-Real feature-space alignment that bridges the sim-to-real gap without requiring pixel-level realism. RAP ranks 1st on the Waymo Open Dataset Vision-based End-to-End Driving Challenge 2025 (UniPlan entry), the Waymo Open Dataset Vision-based E2E Driving Leaderboard, NAVSIM v1 navtest, and NAVSIM v2 navhard.
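A hedged sketch of what feature-space (rather than pixel-space) alignment can look like, with a toy encoder and a made-up loss weighting; the actual RAP alignment is defined in the paper and code:

```python
# Hypothetical sketch of Raster-to-Real feature-space alignment (my reading of
# the summary, not the released code): instead of making rasterized renders
# photorealistic, pull their *features* toward real-image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, 2, 1), nn.AdaptiveAvgPool2d(1),
                        nn.Flatten())                    # shared image encoder
planner = nn.Linear(32, 2)                               # toy planning head
opt = torch.optim.Adam(list(encoder.parameters()) + list(planner.parameters()), 1e-3)

for step in range(50):
    raster = torch.rand(4, 3, 64, 64)    # rasterized (simulated) frames
    real = torch.rand(4, 3, 64, 64)      # unpaired real frames
    target = torch.rand(4, 2)            # toy waypoint labels for raster data
    f_raster, f_real = encoder(raster), encoder(real)
    plan_loss = F.mse_loss(planner(f_raster), target)        # supervise on cheap sim data
    align_loss = F.mse_loss(f_raster.mean(0), f_real.mean(0))  # match feature statistics
    loss = plan_loss + 0.1 * align_loss  # weighting is a made-up hyperparameter
    opt.zero_grad(); loss.backward(); opt.step()
```

The first-moment matching here is only a stand-in for whatever alignment objective the paper actually uses; the design point is that supervision comes from cheap rasterized data while the feature distribution is tied to real images.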
NeurIPS 2025 Spotlight VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
Wuyang Li, Zhu Yu, Alexandre Alahi
project page / paper / code
Key Words: 3D Semantic Occupancy Prediction; Dense Object Detection
Summary: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. Given dense voxel labels, prior works typically formulate it as a dense segmentation task, classifying each voxel independently without instance-level perception. In contrast, VoxDet addresses semantic occupancy prediction with an instance-centric formulation inspired by dense object detection, using a VoxNT trick that freely transfers voxel-level class labels to instance-level offset labels.
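A small sketch of that label-transfer flavor, assuming connected components of each class act as pseudo-instances (the function name and the centroid-offset target are my illustration, not necessarily the paper's VoxNT):

```python
# Hypothetical sketch of turning voxel-level class labels into instance-level
# offset labels. Connected components of each class become pseudo-instances,
# and every voxel gets the offset to its instance centroid, as in dense detection.
import numpy as np
from scipy import ndimage

def voxel_labels_to_offsets(sem: np.ndarray, free_id: int = 0) -> np.ndarray:
    """sem: (X, Y, Z) integer class labels. Returns (X, Y, Z, 3) centroid offsets."""
    offsets = np.zeros(sem.shape + (3,), dtype=np.float32)
    coords = np.stack(np.meshgrid(*map(np.arange, sem.shape), indexing="ij"), -1)
    for cls in np.unique(sem):
        if cls == free_id:
            continue                              # skip empty space
        inst, n = ndimage.label(sem == cls)       # pseudo-instances per class
        for i in range(1, n + 1):
            mask = inst == i
            center = coords[mask].mean(axis=0)    # instance centroid
            offsets[mask] = center - coords[mask] # voxel -> center offset
    return offsets

sem = np.zeros((8, 8, 8), dtype=np.int64)
sem[1:3, 1:3, 1:3] = 2                            # a small "car" blob
print(voxel_labels_to_offsets(sem)[1, 1, 1])      # offset toward the blob center
```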
NeurIPS 2025 See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, Hui Xiong
paper
Key Words: Spatial Understanding; Multimodal Large Language Model
Summary: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction.
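A rough sketch of what a training-free prompt built on those two principles could look like; the frame-selection heuristic and prompt wording below are invented for illustration, not taken from the paper:

```python
# Hypothetical sketch in the SEE&TREK spirit (my reading of the two principles):
# pick visually diverse frames, then ask the MLLM to reconstruct camera/object
# motion before answering the spatial question. No training involved.
import numpy as np

def pick_diverse_frames(frames: np.ndarray, k: int = 4):
    """Greedy max-min selection on flattened-pixel distance (visual diversity)."""
    feats = frames.reshape(len(frames), -1).astype(np.float32)
    chosen = [0]
    while len(chosen) < k:
        d = np.min([np.linalg.norm(feats - feats[c], axis=1) for c in chosen], axis=0)
        chosen.append(int(d.argmax()))            # farthest from the chosen set
    return sorted(chosen)

def build_prompt(question: str, frame_ids) -> str:
    return (f"Frames {frame_ids} are ordered in time. First, describe how the "
            f"camera and objects move between them (motion reconstruction). "
            f"Then answer: {question}")

video = np.random.rand(32, 16, 16, 3)             # dummy 32-frame clip
ids = pick_diverse_frames(video)
print(build_prompt("Which object is closest to the camera?", ids))
```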
ICCV 2025 Highlight MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
Wuyang Li*, Wentao Pan*, Xiaoyuan Liu*, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan
project page / paper / code (coming)
Key Words: Metalens; Computational Photography; Endoscopy; Optical Imaging
Summary: Unlike conventional endoscopes limited by millimeter-scale thickness, metalenses operate at the micron scale, serving as a promising solution for ultra-miniaturized endoscopy. However, metalenses suffer from intensity decay and chromatic aberration. To address this, we developed MetaScope, an optics-driven neural network for metalens-based endoscopy, offering a promising pathway for next-generation ultra-miniaturized medical imaging devices.
NeurIPS 2025 IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyue Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
project page / paper / code
Key Words: 3D Scene Understanding; Vision-Language Model; Inverse Rendering
Summary: We propose IR3D-Bench, a benchmark that challenges VLMs to demonstrate real scene understanding by actively recreating 3D structures from images using tools. This "understanding-by-creating" approach probes the generative and tool-using capacity of vision-language agents (VLAs), moving beyond the descriptive or conversational capacity measured by traditional scene-understanding benchmarks.
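A toy sketch of an understanding-by-creating evaluation loop under heavy assumptions (the scene JSON format, the matching metric, and the VLM stub are all made up, not the benchmark's protocol):

```python
# Hypothetical sketch of "understanding-by-creating" evaluation: the agent must
# *recreate* the scene as a program, and fidelity of the recreation is the score.
import json

def vlm_recreate(image_desc: str) -> str:
    # Stand-in for a VLM tool-use call that emits a scene program as JSON.
    return json.dumps([{"shape": "cube", "color": "red", "pos": [0, 0, 0]}])

def score(pred_scene, gt_scene) -> float:
    # Toy metric: fraction of ground-truth objects matched by shape and color.
    matched = sum(any(p["shape"] == g["shape"] and p["color"] == g["color"]
                      for p in pred_scene) for g in gt_scene)
    return matched / len(gt_scene)

gt = [{"shape": "cube", "color": "red", "pos": [0, 0, 0]},
      {"shape": "sphere", "color": "blue", "pos": [1, 0, 0]}]
pred = json.loads(vlm_recreate("a red cube left of a blue sphere"))
print(f"recreation score: {score(pred, gt):.2f}")  # 0.50 with this stub
```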
AAAI 2025 Top-1 most influential paper U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation
Chenxin Li*, Xinyu Liu*, Wuyang Li*, Cheng Wang*, Hengyu Liu, Yifan Liu, Zhen Chen, Yixuan Yuan
project page / paper / code
Key Words: Kolmogorov-Arnold Networks; Medical Image Segmentation/Generation; Medical Backbone
Summary: We propose the first KAN-based medical backbone, U-KAN, which can be seamlessly integrated with existing medical image segmentation and generation models to boost their performance with minimal computational overhead. This work has been cited more than 250 times in one year.
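A minimal sketch of the KAN-layer idea and where a U-Net could swap one in, assuming a simplified Fourier basis in place of the paper's spline parameterization (the layer and class names here are hypothetical):

```python
# Hypothetical sketch of a KAN-style layer inside a U-Net bottleneck (simplified
# Fourier-basis variant, not U-KAN's implementation): each input-output edge
# learns its own 1-D function instead of a scalar weight.
import torch
import torch.nn as nn

class FourierKANLayer(nn.Module):
    """y_j = sum_i f_ij(x_i), with each f_ij a learnable Fourier series."""
    def __init__(self, d_in, d_out, n_freq=4):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_freq + 1).float())
        self.coef = nn.Parameter(torch.randn(d_in, d_out, 2 * n_freq) * 0.1)
    def forward(self, x):                          # x: (B, d_in)
        ang = x.unsqueeze(-1) * self.freqs         # (B, d_in, n_freq)
        basis = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        return torch.einsum("bif,iof->bo", basis, self.coef)

class TinyUKANBottleneck(nn.Module):
    """U-Net bottleneck where the usual MLP block is swapped for KAN layers."""
    def __init__(self, channels=32):
        super().__init__()
        self.kan = nn.Sequential(FourierKANLayer(channels, channels),
                                 FourierKANLayer(channels, channels))
    def forward(self, feat):                       # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2).reshape(-1, c)  # per-pixel tokens
        out = self.kan(tokens).reshape(b, h * w, c).transpose(1, 2)
        return out.reshape(b, c, h, w) + feat      # residual connection

print(TinyUKANBottleneck()(torch.randn(2, 32, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```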

I stole this guy’s source code! See the original.