[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72331":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},72331,"Step-Video-T2V","stepfun-ai\u002FStep-Video-T2V","stepfun-ai",null,"Python",3185,338,41,0,3,56.89,"MIT License",false,"main",[],"2026-06-12 04:01:04","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\"  height=100>\n\u003C\u002Fp>\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fyuewen.cn\u002Fvideos\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Step-Video&message=Web&color=green\">\u003C\u002Fa> &ensp;\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10248\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Tech Report&message=Arxiv&color=red\">\u003C\u002Fa> &ensp;\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002FStepFun_ai\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=X.com&message=Web&color=blue\">\u003C\u002Fa> &ensp;\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Step-Video-T2V&message=HuggingFace&color=yellow\">\u003C\u002Fa> &ensp;\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v-turbo\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Step-Video-T2V-Turbo&message=HuggingFace&color=yellow\">\u003C\u002Fa> &ensp;\n\u003C\u002Fdiv>\n\n## 🔥🔥🔥 News!!\n* Mar 17, 2025: 👋 We release the [Step-Video-TI2V](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-Ti2V), an image-to-video model based on Step-Video-T2V.\n* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V. [Download](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v)\n* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V-Turbo. [Download](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v-turbo)\n* Feb 17, 2025: 🎉 We have made our technical report available as open source. [Read](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10248)\n\n## Video Demos\n\n\u003Ctable border=\"0\" style=\"width: 100%; text-align: center; margin-top: 1px;\">\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9274b351-595d-41fb-aba3-f58e6e91603a\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2f6b3ad5-e93b-436b-98bc-4701182d8652\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F67d20ee7-ad78-4b8f-80f6-3fdb00fb52d8\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9abce409-105d-4a8a-ad13-104a98cc8a0b\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8d1e1a47-048a-49ce-85f6-9d013f2d8e89\" 
width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F32cf4bd1-ec1f-4f77-a488-cd0284aa81bb\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ff95a7a49-032a-44ea-a10f-553d4e5d21c6\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3534072e-87d9-4128-a87f-28fcb5d951e0\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6d893dad-556d-4527-a882-666cba3d10e9\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\n\u003C\u002Ftable>\n\n## Table of Contents\n\n1. [Introduction](#1-introduction)\n2. [Model Summary](#2-model-summary)\n3. [Model Download](#3-model-download)\n4. [Model Usage](#4-model-usage)\n5. [Benchmark](#5-benchmark)\n6. [Online Engine](#6-online-engine)\n7. [Citation](#7-citation)\n8. [Acknowledgement](#8-acknowledgement)\n\n## 1. Introduction\nWe present **Step-Video-T2V**, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. 
Step-Video-T2V's performance is evaluated on a novel video generation benchmark, **Step-Video-T2V-Eval**, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.\n\n## 2. Model Summary\nIn Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"80%\" src=\"assets\u002Fmodel_architecture.png\">\n\u003C\u002Fp>\n\n### 2.1. Video-VAE\nA deep compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"70%\" src=\"assets\u002Fdcvae.png\">\n\u003C\u002Fp>\n\n### 2.2. DiT w\u002F 3D Full Attention\nStep-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head’s dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. 
Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"80%\" src=\"assets\u002Fdit.png\">\n\u003C\u002Fp>\n\n### 2.3. Video-DPO\nIn Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"100%\" src=\"assets\u002Fdpo_pipeline.png\">\n\u003C\u002Fp>\n\n\n\n## 3. Model Download\n| Models   | 🤗Huggingface    |  🤖Modelscope |\n|:-------:|:-------:|:-------:|\n| Step-Video-T2V | [download](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v) | [download](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fstepfun-ai\u002Fstepvideo-t2v) |\n| Step-Video-T2V-Turbo (Inference Step Distillation) | [download](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v-turbo) | [download](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fstepfun-ai\u002Fstepvideo-t2v-turbo) |\n\n\n## 4. 
Model Usage\n### 📜 4.1  Requirements\n\nThe following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w\u002Fo cfg distillation) to generate videos:\n\n|     Model    |  height\u002Fwidth\u002Fframe |  Peak GPU Memory | 50 steps w flash-attn | 50 steps w\u002Fo flash-attn |\n|:------------:|:------------:|:------------:|:------------:|:------------:|\n| Step-Video-T2V   |        768px768px204f      |  78.55 GB | 860 s | 1437 s |\n| Step-Video-T2V   |        544px992px204f      |  77.64 GB | 743 s | 1232 s |\n| Step-Video-T2V   |        544px992px136f      |  72.48 GB | 408 s | 605 s |\n\n* An NVIDIA GPU with CUDA support is required. \n  * The model is tested on four GPUs.\n  * **Recommended**: We recommend GPUs with 80GB of memory for better generation quality.\n* Tested operating system: Linux\n* The self-attention in the text encoder (step_llm) only supports CUDA capabilities sm_80, sm_86, and sm_90.\n\n### 🔧 4.2 Dependencies and Installation\n- Python >= 3.10.0 (we recommend [Anaconda](https:\u002F\u002Fwww.anaconda.com\u002Fdownload\u002F#linux) or [Miniconda](https:\u002F\u002Fdocs.conda.io\u002Fen\u002Flatest\u002Fminiconda.html))\n- [PyTorch >= 2.3-cu121](https:\u002F\u002Fpytorch.org\u002F)\n- [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)\n- [FFmpeg](https:\u002F\u002Fwww.ffmpeg.org\u002F) \n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-T2V.git\nconda create -n stepvideo python=3.10\nconda activate stepvideo\n\ncd Step-Video-T2V\npip install -e .\npip install flash-attn --no-build-isolation  ## flash-attn is optional\n```\n\n###  🚀 4.3 Inference Scripts\n\n#### Multi-GPU Parallel Deployment\n\n- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization by DiT. 
As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding.\n```bash\npython api\u002Fcall_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command will return the URL for both the caption API and the VAE API. Please use the returned URL in the following command.\n\nparallel=4  # or parallel=8\nurl='127.0.0.1'\nmodel_dir=where_you_download_dir\n\ntp_degree=2\nulysses_degree=2\n\n# make sure tp_degree x ulysses_degree = parallel\ntorchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url  --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt \"一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光\" --infer_steps 50  --cfg_scale 9.0 --time_shift 13.0\n```\n\n#### Single-GPU Inference and Quantization\n\n- The open-source project DiffSynth-Studio by ModelScope offers single-GPU inference and quantization support, which can significantly reduce the VRAM required. Please refer to [their examples](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio\u002Ftree\u002Fmain\u002Fexamples\u002Fstepvideo) for more information.\n\n###  🚀 4.4 Best-Practice Inference Settings\nStep-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. 
To achieve optimal results, we recommend the following best practices for tuning inference parameters:\n\n| Models   | infer_steps   | cfg_scale  | time_shift | num_frames |\n|:-------:|:-------:|:-------:|:-------:|:-------:|\n| Step-Video-T2V | 30-50 | 9.0 |  13.0 | 204 |\n| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |\n\nFor more performance results, please refer to the [benchmark metrics](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT\u002Fblob\u002Fmain\u002Fdocs\u002Fperformance\u002Fstepvideo.md) from the xDiT team.\n\n## 5. Benchmark\nWe are releasing [Step-Video-T2V Eval](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-T2V\u002Fblob\u002Fmain\u002Fbenchmark\u002FStep-Video-T2V-Eval) as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.\n\n## 6. Online Engine\nThe online version of Step-Video-T2V is available on [跃问视频](https:\u002F\u002Fyuewen.cn\u002Fvideos), where you can also explore some impressive examples.\n\n## 7. 
Citation\n```\n@misc{ma2025stepvideot2vtechnicalreportpractice,\n      title={Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model}, \n      author={Guoqing Ma and Haoyang Huang and Kun Yan and Liangyu Chen and Nan Duan and Shengming Yin and Changyi Wan and Ranchen Ming and Xiaoniu Song and Xing Chen and Yu Zhou and Deshan Sun and Deyu Zhou and Jian Zhou and Kaijun Tan and Kang An and Mei Chen and Wei Ji and Qiling Wu and Wen Sun and Xin Han and Yanan Wei and Zheng Ge and Aojie Li and Bin Wang and Bizhu Huang and Bo Wang and Brian Li and Changxing Miao and Chen Xu and Chenfei Wu and Chenguang Yu and Dapeng Shi and Dingyuan Hu and Enle Liu and Gang Yu and Ge Yang and Guanzhe Huang and Gulin Yan and Haiyang Feng and Hao Nie and Haonan Jia and Hanpeng Hu and Hanqi Chen and Haolong Yan and Heng Wang and Hongcheng Guo and Huilin Xiong and Huixin Xiong and Jiahao Gong and Jianchang Wu and Jiaoren Wu and Jie Wu and Jie Yang and Jiashuai Liu and Jiashuo Li and Jingyang Zhang and Junjing Guo and Junzhe Lin and Kaixiang Li and Lei Liu and Lei Xia and Liang Zhao and Liguo Tan and Liwen Huang and Liying Shi and Ming Li and Mingliang Li and Muhua Cheng and Na Wang and Qiaohui Chen and Qinglin He and Qiuyan Liang and Quan Sun and Ran Sun and Rui Wang and Shaoliang Pang and Shiliang Yang and Sitong Liu and Siqi Liu and Shuli Gao and Tiancheng Cao and Tianyu Wang and Weipeng Ming and Wenqing He and Xu Zhao and Xuelin Zhang and Xianfang Zeng and Xiaojia Liu and Xuan Yang and Yaqi Dai and Yanbo Yu and Yang Li and Yineng Deng and Yingming Wang and Yilei Wang and Yuanwei Lu and Yu Chen and Yu Luo and Yuchu Luo and Yuhe Yin and Yuheng Feng and Yuxiang Yang and Zecheng Tang and Zekai Zhang and Zidong Yang and Binxing Jiao and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Heung-Yeung Shum and Daxin Jiang},\n      year={2025},\n      eprint={2502.10248},\n      archivePrefix={arXiv},\n      
primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10248}, \n}\n```\n\n## 8. Acknowledgement\n- We would like to express our sincere thanks to the [xDiT](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT) team for their invaluable support and parallelization strategy. \n- Our code will be integrated into the official repository of [Huggingface\u002FDiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers).\n- We thank the [FastVideo](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FFastVideo) team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.\n","Step-Video-T2V is an AI project for generating video from text. It uses deep learning to turn a given text description into corresponding video content. Its core capabilities are efficient text-to-video conversion and high-quality video generation, and it is offered in two versions: a standard version and a Turbo accelerated version, the latter improving processing speed while maintaining generation quality. The project has also open-sourced its technical report and model weights for use by researchers and developers. Step-Video-T2V suits scenarios such as rapid prototyping, creative content production, and multimedia application development.",2,"2026-06-11 03:41:22","high_star"]