[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72418":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},72418,"SkyReels-V1","SkyworkAI\u002FSkyReels-V1","SkyworkAI","SkyReels V1: The first and most advanced open-source human-centric video foundation model","https:\u002F\u002Fwww.skyreels.ai",null,"Python",2685,310,35,53,0,3,13,9,68.28,"Other",false,"main",[25,26,27],"i2v","t2v","video-diffusion-transformers","2026-06-12 04:01:05","\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Flogo2.png\" alt=\"SkyReels Logo\" width=\"50%\">\n\u003C\u002Fp>\n\n# SkyReels V1: Human-Centric Video Foundation Model\n\n\u003Cp align=\"center\">\n🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FSkywork\u002Fskyreels-v1-67b34676ff65b4ec02d16307\" target=\"_blank\">Hugging Face\u003C\u002Fa> · 👋 \u003Ca href=\"https:\u002F\u002Fwww.skyreels.ai\u002Fhome?utm_campaign=github_V1\" target=\"_blank\">Playground\u003C\u002Fa> · 💬 \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FPwM6NYtccQ\" target=\"_blank\">Discord\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\nWelcome to the SkyReels V1 repository! 
Here, you'll find the Text-to-Video & Image-to-Video model weights and inference code for our groundbreaking video foundation model.\n\n## 🔥🔥🔥 News!!\n\n* Feb 18, 2025: 👋 We release the inference code and model weights of [SkyReels-V1 Text2Video Model](https:\u002F\u002Fhuggingface.co\u002FSkywork\u002FSkyReels-V1-Hunyuan-T2V).\n* Feb 18, 2025: 👋 We release the inference code and model weights of [SkyReels-V1 Image2Video Model](https:\u002F\u002Fhuggingface.co\u002FSkywork\u002FSkyReels-V1-Hunyuan-I2V).\n* Feb 18, 2025: 🔥 We also release [SkyReels-A1](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkyReels-A1), an effective open-source framework for portrait image animation.\n\n## 🎥 Demos\n\u003Cdiv align=\"center\">\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fdocs\u002Fassets\u002F2dbd116a-033d-4f7e-bd90-78a3da47cd9c\" width=\"70%\"> \u003C\u002Fvideo>\n\u003C\u002Fdiv>\n\n## 📑 TODO List\n\n- SkyReels-V1 (Text2Video Model)\n  - [x] Checkpoints\n  - [x] Inference Code\n  - [x] Web Demo (Gradio)\n  - [x] User-Level GPU Inference on RTX4090\n  - [x] Parallel Inference on Multi-GPUs\n  - [ ] Prompt Rewrite && Prompt Guidance\n  - [ ] CFG-distilled Model\n  - [ ] Lite Model\n  - [ ] 720P Version\n  - [ ] ComfyUI\n\n- SkyReels-V1 (Image2Video Model)\n  - [x] Checkpoints\n  - [x] Inference Code\n  - [x] Web Demo (Gradio)\n  - [x] User-Level GPU Inference on RTX4090\n  - [x] Parallel Inference on Multi-GPUs\n  - [ ] Prompt Rewrite && Prompt Guidance\n  - [ ] CFG-distilled Model\n  - [ ] Lite Model\n  - [ ] 720P Version\n  - [ ] ComfyUI\n\n## 🌟 Overview\n\nSkyReels V1 is the first and most advanced open-source human-centric video foundation model. By fine-tuning \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftencent\u002FHunyuanVideo\">HunyuanVideo\u003C\u002Fa> on the order of 10 million (O(10M)) high-quality film and television clips, SkyReels V1 offers three key advantages:\n\n1. 
**Open-Source Leadership**: Our Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.\n2. **Advanced Facial Animation**: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.\n3. **Cinematic Lighting and Aesthetics**: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.\n\n## 🔑 Key Features\n\n### 1. Self-Developed Data Cleaning and Annotation Pipeline\n\nOur model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.\n\n- **Expression Classification**: Categorizes human facial expressions into 33 distinct types.\n- **Character Spatial Awareness**: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.\n- **Action Recognition**: Constructs over 400 action semantic units to achieve a precise understanding of human actions.\n- **Scene Understanding**: Conducts cross-modal correlation analysis of clothing, scenes, and plots.\n\n### 2. Multi-Stage Image-to-Video Pretraining\n\nOur multi-stage pretraining pipeline, inspired by the \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftencent\u002FHunyuanVideo\">HunyuanVideo\u003C\u002Fa> design, consists of the following stages:\n\n- **Stage 1: Model Domain Transfer Pretraining**: We use a large dataset (O(10M) of film and television content) to adapt the text-to-video model to the human-centric video domain.\n- **Stage 2: Image-to-Video Model Pretraining**: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. 
This new model is then pretrained on the same dataset used in Stage 1.\n- **Stage 3: High-Quality Fine-Tuning**: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.\n\n## 📊 Benchmark Results\nWe evaluate the performance of our text-to-video model using \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FVchitect\u002FVBench\">VBench\u003C\u002Fa>, comparing it with other outstanding open-source models.\n\nBased on the benchmark results, SkyReels V1 demonstrates SOTA performance among open-source Text-to-Video (T2V) models. Specifically, our model achieves an overall score of 82.43, which is higher than other open-source models such as VideoCrafter-2.0 VEnhancer (82.24) and CogVideoX1.5-5B (82.17). Additionally, our model achieves the highest scores in several key metrics, including Dynamic Degree and Multiple Objects, indicating our model's superior ability to handle complex video generation tasks.\n| Models                    | Overall | Quality Score | Semantic Score | Image Quality | Dynamic Degree | Multiple Objects | Spatial Relationship |  \n|---------------------------|---------|---------------|----------------|---------------|----------------|------------------|----------------------|\n| OpenSora V1.3             | 77.23   | 80.14         | 65.62          | 56.21         | 30.28          | 43.58            | 51.61                |\n| AnimateDiff-V2            | 80.27   | 82.90         | 69.75          | 70.1          | 40.83          | 36.88            | 34.60                |\n| VideoCrafter-2.0 VEnhancer| 82.24   | 83.54         | 77.06          | 65.35         | 63.89          | 68.84            | 57.55                |\n| CogVideoX1.5-5B           | 82.17   | 82.78         | 79.76          | 65.02         | 50.93          | 69.65            | 80.25                |\n| HunyuanVideo 540P         | 81.23   | 83.49         | 72.22          | 66.31         | 51.67          | 70.45    
        | 63.46                |\n| SkyReels V1 540P (Ours)   | **82.43** | **84.62**     | 73.68          | 67.15         | **72.5**       | **71.61**        | 70.83                |    \n\n\n## 📦 Model Introduction\n\n| Model Name      | Resolution | Video Length (frames) | FPS | Download Link |\n|-----------------|------------|--------------|-----|---------------|\n| SkyReels-V1-Hunyuan-I2V | 544 × 960 px  | 97           | 24  | 🤗 [Download](https:\u002F\u002Fhuggingface.co\u002FSkywork\u002FSkyReels-V1-Hunyuan-I2V) |\n| SkyReels-V1-Hunyuan-T2V | 544 × 960 px  | 97           | 24  | 🤗 [Download](https:\u002F\u002Fhuggingface.co\u002FSkywork\u002FSkyReels-V1-Hunyuan-T2V) |\n\n\n## 🚀 SkyReels Infer Introduction\n\nSkyReelsInfer is an efficient video generation inference framework that produces high-quality videos quickly, speeding up inference without any loss in quality.\n\n**Multi-GPU Inference Support**: The framework supports Context Parallel, CFG Parallel, and VAE Parallel, enabling fast, lossless video generation that meets the low-latency demands of online serving.\n\n**User-Level GPU Deployment**: Model quantization and parameter-level offloading significantly reduce GPU memory requirements, so the model fits on consumer-grade graphics cards with limited VRAM.\n\n**Superior Inference Performance**: On four RTX 4090 GPUs, the framework runs 58.3% faster end to end than HunyuanVideo XDiT (293.3s vs. 464.3s).\n\n**Excellent Usability**: The framework is built on the open-source Diffusers library and adds parallelism non-intrusively, so existing Diffusers workflows need little change.\n\n## 🛠️ Running Guide\n\nBegin by cloning the repository:\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkyReels-V1\ncd SkyReels-V1\n```\n\n### 
Installation Guide for Linux\n\nWe recommend Python 3.10 and CUDA version 12.2 for the manual installation.\n\n```shell\n# Install pip dependencies\npip install -r requirements.txt\n```\n\nWhen sufficient VRAM is available (e.g., on A800), the lossless version can be run directly.\n\n**Note: When generating videos, the prompt should start with \"FPS-24, \" because training followed the FPS-control method described in \u003Ca href=\"https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fmovie-gen-a-cast-of-media-foundation-models\">Movie Gen\u003C\u002Fa>.**\n\n```shell\n# Shell assignments must not have spaces around \"=\"\nSkyReelsModel=\"Skywork\u002FSkyReels-V1-Hunyuan-T2V\"\npython3 video_generate.py \\\n    --model_id ${SkyReelsModel} \\\n    --task_type t2v \\\n    --guidance_scale 6.0 \\\n    --height 544 \\\n    --width 960 \\\n    --num_frames 97 \\\n    --prompt \"FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool\" \\\n    --embedded_guidance_scale 1.0\n```\n\n### User-Level GPU Inference (RTX4090)\n\nThe following table lists the recommended height\u002Fwidth\u002Fframe settings.\n\n|      Resolution       |           h\u002Fw=9:16           |    h\u002Fw=16:9     |     h\u002Fw=1:1     |\n|:---------------------:|:----------------------------:|:---------------:|:---------------:|\n|         544p          |        544px960px97f        |  960px544px97f |  720px720px97f |\n\n#### Using Command Line\n\n```shell\n# SkyReelsModel: If using i2v, switch to Skywork\u002FSkyReels-V1-Hunyuan-I2V.\n# quant: Enable FP8 weight-only quantization\n# offload: Enable model offloading\n# high_cpu_memory: Enable pinned memory to reduce the overhead of model offloading.\n# parameters_level: Further reduce GPU VRAM usage.\n# task_type: Selects the task; both t2v and i2v are supported. 
For the execution of an i2v task, it is necessary to input --image.\nSkyReelsModel=\"Skywork\u002FSkyReels-V1-Hunyuan-T2V\"\npython3 video_generate.py \\\n    --model_id ${SkyReelsModel} \\\n    --task_type t2v \\\n    --guidance_scale 6.0 \\\n    --height 544 \\\n    --width 960 \\\n    --num_frames 97 \\\n    --prompt \"FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool\" \\\n    --embedded_guidance_scale 1.0 \\\n    --quant \\\n    --offload \\\n    --high_cpu_memory \\\n    --parameters_level\n```\n\nThe example above generates a 544px960px97f (4s) video on a single RTX 4090 with full VRAM optimization, peaking at 18.5 GB of VRAM. At maximum VRAM capacity, a 544px960px289f (12s) video can be produced (using `--sequence_batch`, taking ~1.5h on one RTX 4090; adding GPUs greatly reduces the time).\n\n#### 🚀 Parallel Inference on Multiple GPUs\n\n```shell\n# SkyReelsModel: If using i2v, switch to Skywork\u002FSkyReels-V1-Hunyuan-I2V.\n# quant: Enable FP8 weight-only quantization\n# offload: Enable model offloading\n# high_cpu_memory: Enable pinned memory to reduce the overhead of model offloading.\n# gpu_num: Number of GPUs used.\nSkyReelsModel=\"Skywork\u002FSkyReels-V1-Hunyuan-T2V\"\npython3 video_generate.py \\\n    --model_id ${SkyReelsModel} \\\n    --guidance_scale 6.0 \\\n    --height 544 \\\n    --width 960 \\\n    --num_frames 97 \\\n    --prompt \"FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool\" \\\n    --embedded_guidance_scale 1.0 \\\n    --quant \\\n    --offload \\\n    --high_cpu_memory \\\n    --gpu_num $GPU_NUM\n```\n\n## Performance Comparison\n\nThis test compares the end-to-end latency of SkyReelsInfer and HunyuanVideo XDiT for 544p video generation on both the A800 (high-performance computing GPU) and RTX 4090 (consumer-grade GPU). 
The results demonstrate the superior inference performance of SkyReelsInfer in terms of speed and efficiency.\n\n### Testing Parameters\n\n|      Resolution       |           video size           |    transformer step    |     guidance_scale     |\n|:---------------------:|:----------------------------:|:---------------:|:---------------:|\n|         540p          |        544px960px97f        |  30 |  6 |\n\n\n### User-Level GPU Inference (RTX4090)\n\nIn practice, HunyuanVideo XDiT cannot perform inference on the RTX 4090 due to insufficient VRAM. To address this issue, we applied fixes on top of the official offloading, FP8 model weights, and VAE tiling. These include:  \na) Optimizing the model loading and initialization logic to avoid fully loading the FP16 model into memory.  \nb) Reducing the VAE tiling size to alleviate memory usage.\n\nFor the deployment of SkyReelsInfer on the RTX 4090, the following measures are applied to ensure sufficient VRAM availability and efficient inference:  \na) **Model Quantization**: Apply FP8 weight-only quantization to ensure the model can be fully loaded into memory.  \nb) **Offload Strategy**: Enable parameter-level offloading to further reduce VRAM usage.  \nc) **Multi-GPU Parallelism**: Activate context parallelism, CFG parallelism, and VAE parallelism for distributed processing.  
\nd) **Computation Optimization**: Optimize attention layer calculations using SageAttn and enable torch.compile for transformer compilation optimization (supporting both 4-GPU and 8-GPU configurations).\n\n\n|      GPU NUM      |           hunyuanvideo + xdit    |           SkyReelsInfer   | \n|:---------------------:|:----------------------------:|:----------------------------:|\n|         1          |        VRAM OOM        |        889.31s        |\n|         2          |        VRAM OOM        |        453.69s        |\n|         4          |        464.3s        |        293.3s        |\n|         8          |        Cannot split video sequence into ulysses_degree x ring_degree        |        159.43s        |\n\nThe table above summarizes the end-to-end latency test results for generating 544p 4-second videos on the RTX 4090 using HunyuanVideo XDiT and SkyReelsInfer. The following conclusions can be drawn:  \n- Under the same RTX 4090 resource conditions (4 GPUs), the SkyReelsInfer version is **58.3%** faster than HunyuanVideo XDiT (293.3s vs. 464.3s, a 36.8% reduction in end-to-end latency).  
\n- The SkyReelsInfer version features a more robust deployment strategy, supporting inference across **1 to 8 GPUs** at the user level.\n\n\n### A800\nOn the A800 (80 GB), testing compared the performance of HunyuanVideo XDiT and SkyReelsInfer without compromising output quality.\n\n|      GPU NUM      |           hunyuanvideo + xdit    |           SkyReelsInfer   | \n|:---------------------:|:----------------------------:|:----------------------------:|\n|         1          |        884.20s       |        771.03s        |\n|         2          |        487.22s        |        387.01s        |\n|         4          |        263.48s        |        205.49s        |\n|         8          |        Cannot split video sequence into ulysses_degree x ring_degree        |        107.41s        |\n\nThe table above summarizes the end-to-end latency test results for generating 544p 4-second videos on the A800 using HunyuanVideo XDiT and SkyReelsInfer. 
The following conclusions can be drawn:\n\nUnder the same A800 resource conditions, the SkyReelsInfer version is 14.7% to 28.2% faster than the official HunyuanVideo version, which corresponds to a 12.8% to 22.0% reduction in end-to-end latency across the 1-, 2-, and 4-GPU rows above.\n\nThe SkyReelsInfer version features a more robust multi-GPU deployment strategy.\n\n## Acknowledgements\nWe would like to thank the contributors of the \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftencent\u002FHunyuanVideo\">HunyuanVideo\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fchengzeyi\u002FParaAttention\">ParaAttention\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\">Diffusers\u003C\u002Fa> repositories for their open research and contributions.\n\n## Citation\n\n```bibtex\n@misc{SkyReelsV1,\n  author = {SkyReels-AI},\n  title = {SkyReels V1: Human-Centric Video Foundation Model},\n  year = {2025},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkyReels-V1}}\n}\n```\n","SkyReels V1 is an open-source human-centric video foundation model focused on text-to-video and image-to-video generation. Its core features include text-to-video and image-to-video models trained on high-quality film and television clips, which can generate realistic videos with 33 distinct facial expressions and more than 400 natural movement combinations, along with cinematic lighting and aesthetics. The project is written in Python, provides model weights and inference code on Hugging Face, and supports parallel inference on user-level GPUs such as the RTX 4090. SkyReels V1 suits applications that need high-quality, highly realistic human animation video generation, such as virtual character creation and film and television visual effects.",2,"2026-06-11 03:41:58","high_star"]