
Infinite-Length Video Generation with Error Recycling
PS: The background video is generated by our Stable Video Infinity.
⚠️ Note
All videos on this website have been compressed for web delivery, which may reduce visual quality compared to the original generated content; the compression keeps loading times and bandwidth usage reasonable. All videos have also been sped up from 16 FPS to 24 FPS for a smoother viewing experience.
Stable Video Infinity (SVI) can generate videos of ANY length with high temporal consistency, plausible scene transitions, and controllable streaming storylines in ANY domain. All of the following videos are generated end-to-end by SVI from a prompt stream, e.g., the 8-minute Tom and Jerry episode.
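Conceptually, this streaming setup can be pictured as a chunk-by-chunk autoregressive loop: each chunk is conditioned on the model's own previous output, so self-generated errors are fed back (recycled) rather than hidden behind ground-truth frames. The Python sketch below is purely illustrative of that loop; `VideoChunk`, `generate_chunk`, and `generate_stream` are hypothetical placeholders, not SVI's actual API.

```python
# Illustrative sketch of prompt-stream, chunk-by-chunk video generation.
# All names (VideoChunk, generate_chunk, generate_stream) are hypothetical
# placeholders standing in for a real diffusion model; they are NOT SVI's API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoChunk:
    frames: List[str]  # stand-in for decoded frames or latents


def generate_chunk(prompt: str, context: Optional[VideoChunk]) -> VideoChunk:
    """Placeholder for one model rollout conditioned on the previous chunk."""
    tail = context.frames[-1] if context else "<start>"
    # A real model would denoise latents here; we just fabricate frame labels.
    return VideoChunk(frames=[f"{tail} | {prompt} | frame {i}" for i in range(4)])


def generate_stream(prompt_stream: List[str]) -> List[VideoChunk]:
    """Autoregressive loop: each chunk conditions on the model's OWN previous
    output, so accumulated errors propagate into the next step instead of
    being masked by ground-truth context."""
    chunks: List[VideoChunk] = []
    context: Optional[VideoChunk] = None
    for prompt in prompt_stream:
        chunk = generate_chunk(prompt, context)
        chunks.append(chunk)
        context = chunk  # recycle the possibly-imperfect output as next context
    return chunks


if __name__ == "__main__":
    story = ["Tom chases Jerry", "Jerry hides in a mouse hole", "Tom sets a trap"]
    for i, chunk in enumerate(generate_stream(story)):
        print(f"chunk {i}: {chunk.frames[0]} ... {chunk.frames[-1]}")
```

Feeding the model's own output back as context is exactly where drift comes from in naive extensions; exposing the generator to such self-made errors, rather than only clean ground truth, is the intuition behind the "error recycling" in the title.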
This setting targets the needs of vloggers (e.g., on TikTok) for short-video creation, emphasizing moderate scene transitions.
This setting targets vlogger use cases (e.g., TikTok), emphasizing storytelling with plausible scene transitions and engaging content. All methods are conditioned on the same prompts. In the compared methods, accumulated errors manifest as (1) failure to follow the text, (2) degraded motion, and (3) visual artifacts.
This setting aims to generate temporally coherent videos within a single homogeneous scene controlled by one text prompt, which aligns with the objective of prior long-video generation work.
[1] Henschel, R., Khachatryan, L., Poghosyan, H., Hayrapetyan, D., Tadevosyan, V., Wang, Z., ... & Shi, H. (2025). StreamingT2V: Consistent, dynamic, and extendable long video generation from text. In CVPR 2025.
[2] Zhang, L., & Agrawala, M. (2025). Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626.
[3] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C. W., ... & Liu, Z. (2025). Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.
[4] Kong, Z., Gao, F., Zhang, Y., Kang, Z., Wei, X., Cai, X., ... & Luo, W. (2025). Let them talk: Audio-driven multi-person conversational video generation. In NeurIPS 2025.
[5] Wang, X., Zhang, S., Tang, L., Zhang, Y., Gao, C., Wang, Y., & Sang, N. (2025). UniAnimate-DiT: Human image animation with large-scale video diffusion transformer. arXiv preprint arXiv:2504.11289.