[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72328":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72328,"Pyramid-Flow","jy0205\u002FPyramid-Flow","jy0205","[ICLR 2025] Pyramidal Flow Matching for Efficient Video Generative Modeling","https:\u002F\u002Fpyramid-flow.github.io\u002F",null,"Python",3193,300,43,68,0,6,60.04,"MIT License",false,"main",[23,24,25],"diffusion-models","flow-matching","video-generation","2026-06-12 04:01:04","\u003Cdiv align=\"center\">\n\n# ⚡️Pyramid Flow⚡️\n\n[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05954) [[Project Page ✨]](https:\u002F\u002Fpyramid-flow.github.io) [[miniFLUX Model 🚀]](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-miniflux) [[SD3 Model ⚡️]](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-sd3) [[demo 🤗](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPyramid-Flow\u002Fpyramid-flow)]\n\n\u003C\u002Fdiv>\n\nThis is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. 
By training only on **open-source datasets**, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.\n\n\u003Ctable class=\"center\" border=\"0\" style=\"width: 100%; text-align: left;\">\n\u003Ctr>\n  \u003Cth>10s, 768p, 24fps\u003C\u002Fth>\n  \u003Cth>5s, 768p, 24fps\u003C\u002Fth>\n  \u003Cth>Image-to-video\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9935da83-ae56-4672-8747-0f46e90f7b2b\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3412848b-64db-4d9e-8dbf-11403f6d02c5\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3bd7251f-7b2c-4bee-951d-656fdb45f427\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## News\n* `2024.11.13`  🚀🚀🚀 We release the [768p miniFLUX checkpoint](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-miniflux) (up to 10s).\n\n  > We have switched the model structure from SD3 to a mini FLUX to fix human structure issues, please try our 1024p image checkpoint, 384p video checkpoint (up to 5s) and 768p video checkpoint (up to 10s). The new miniflux model shows great improvement on human structure and motion stability\n\n* `2024.10.29` ⚡️⚡️⚡️ We release [training code for VAE](#1-training-vae), [finetuning code for DiT](#2-finetuning-dit) and [new model checkpoints](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-miniflux) with FLUX structure trained from scratch.\n\n\n* `2024.10.13`  ✨✨✨ [Multi-GPU inference](#3-multi-gpu-inference) and [CPU offloading](#cpu-offloading) are supported. 
Use it with **less than 8GB** of GPU memory, with great speedup on multiple GPUs.\n\n* `2024.10.11`  🤗🤗🤗 [Hugging Face demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPyramid-Flow\u002Fpyramid-flow) is available. Thanks to [@multimodalart](https:\u002F\u002Fhuggingface.co\u002Fmultimodalart) for the commit! \n\n* `2024.10.10`  🚀🚀🚀 We release the [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05954), [project page](https:\u002F\u002Fpyramid-flow.github.io) and [model checkpoint](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-sd3) of Pyramid Flow.\n\n## Table of Contents\n\n* [Introduction](#introduction)\n* [Installation](#installation)\n* [Inference](#inference)\n  1. [Quick Start with Gradio](#1-quick-start-with-gradio)\n  2. [Inference Code](#2-inference-code)\n  3. [Multi-GPU Inference](#3-multi-gpu-inference)\n  4. [Usage Tips](#4-usage-tips)\n* [Training](#training)\n  1. [Training VAE](#1-training-vae)\n  2. [Finetuning DiT](#2-finetuning-dit)\n* [Gallery](#gallery)\n* [Comparison](#comparison)\n* [Acknowledgement](#acknowledgement)\n* [Citation](#citation)\n\n## Introduction\n\n![motivation](assets\u002Fmotivation.jpg)\n\nExisting video diffusion models operate at full resolution, spending a lot of computation on very noisy latents. By contrast, our method harnesses the flexibility of flow matching ([Lipman et al., 2023](https:\u002F\u002Fopenreview.net\u002Fforum?id=PqvMRDCJT9t); [Liu et al., 2023](https:\u002F\u002Fopenreview.net\u002Fforum?id=XVjTT1nw5z); [Albergo & Vanden-Eijnden, 2023](https:\u002F\u002Fopenreview.net\u002Fforum?id=li7qeBbCR1t)) to interpolate between latents of different resolutions and noise levels, allowing for simultaneous generation and decompression of visual content with better computational efficiency. 
The entire framework is end-to-end optimized with a single DiT ([Peebles & Xie, 2023](http:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fhtml\u002FPeebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html)), generating high-quality 10-second videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.\n\n## Installation\n\nWe recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2 ([guide](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F#v212)), and we are actively working to support a wider range of versions.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\ncd Pyramid-Flow\n\n# create env using conda\nconda create -n pyramid python==3.8.10\nconda activate pyramid\npip install -r requirements.txt\n```\n\nThen, download the model from [Huggingface](https:\u002F\u002Fhuggingface.co\u002Frain1011) (there are two variants: [miniFLUX](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-miniflux) or [SD3](https:\u002F\u002Fhuggingface.co\u002Frain1011\u002Fpyramid-flow-sd3)). The miniFLUX models support 1024p image, 384p and 768p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second video at 24FPS, while the 768p checkpoint generates up to 10-second video at 24FPS.\n\n```python\nfrom huggingface_hub import snapshot_download\n\nmodel_path = 'PATH'   # The local directory to save downloaded checkpoint\nsnapshot_download(\"rain1011\u002Fpyramid-flow-miniflux\", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')\n```\n\n## Inference\n\n### 1. 
Quick start with Gradio\n\nTo get started, first install [Gradio](https:\u002F\u002Fwww.gradio.app\u002Fguides\u002Fquickstart), set your model path at [#L36](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002F3777f8b84bddfa2aa2b497ca919b3f40567712e6\u002Fapp.py#L36), and then run on your local machine:\n\n```bash\npython app.py\n```\n\nThe Gradio demo will open in a browser. Thanks to [@tpc2233](https:\u002F\u002Fgithub.com\u002Ftpc2233) for the commit, see [#48](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fpull\u002F48) for details.\n\nOr, try it out effortlessly on [Hugging Face Space 🤗](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPyramid-Flow\u002Fpyramid-flow) created by [@multimodalart](https:\u002F\u002Fhuggingface.co\u002Fmultimodalart). Due to GPU limits, this online demo can only generate 25 frames (export at 8FPS or 24FPS). Duplicate the space to generate longer videos.\n\n#### Quick Start on Google Colab\n\nTo quickly try out Pyramid Flow on Google Colab, run the code below:\n\n```\n# Setup\n!git clone https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\n%cd Pyramid-Flow\n!pip install -r requirements.txt\n!pip install gradio\n\n# This code downloads miniFLUX\nfrom huggingface_hub import snapshot_download\n\nmodel_path = '\u002Fcontent\u002FPyramid-Flow'\nsnapshot_download(\"rain1011\u002Fpyramid-flow-miniflux\", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')\n\n# Start\n!python app.py\n```\n\n### 2. Inference Code\n\nTo use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002Fmain\u002Fvideo_generation_demo.ipynb). We strongly recommend trying the latest published pyramid-miniflux, which shows great improvement on human structure and motion stability. Set the `model_name` parameter to `pyramid_flux` to use it. 
We further simplify it into the following two-step procedure. First, load the downloaded model:\n\n```python\nimport torch\nfrom PIL import Image\nfrom pyramid_dit import PyramidDiTForVideoGeneration\nfrom diffusers.utils import load_image, export_to_video\n\ntorch.cuda.set_device(0)\nmodel_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (fp16 is not supported yet)\n\nmodel = PyramidDiTForVideoGeneration(\n    'PATH',                                         # The downloaded checkpoint dir\n    model_name=\"pyramid_flux\",\n    model_dtype=model_dtype,\n    model_variant='diffusion_transformer_768p',\n)\n\nmodel.vae.enable_tiling()\n# model.vae.to(\"cuda\")\n# model.dit.to(\"cuda\")\n# model.text_encoder.to(\"cuda\")\n\n# If you are not using sequential offloading below, uncomment the lines above ^\nmodel.enable_sequential_cpu_offload()\n```\n\nThen, you can try text-to-video generation on your own prompts. Note that the 384p version currently supports only 5s videos (set `temp` up to 16)! \n\n```python\nprompt = \"A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors\"\n\n# used for 384p model variant\n# width = 640\n# height = 384\n\n# used for 768p model variant\nwidth = 1280\nheight = 768\n\nwith torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):\n    frames = model.generate(\n        prompt=prompt,\n        num_inference_steps=[20, 20, 20],\n        video_num_inference_steps=[10, 10, 10],\n        height=height,     \n        width=width,\n        temp=16,                    # temp=16: 5s, temp=31: 10s\n        guidance_scale=7.0,         # The guidance for the first frame, set it to 7 for 384p variant\n        video_guidance_scale=5.0,   # The guidance for the other video latent\n        output_type=\"pil\",\n        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae 
decoding speed\n    )\n\nexport_to_video(frames, \".\u002Ftext_to_video_sample.mp4\", fps=24)\n```\n\nAs an autoregressive model, our model also supports (text conditioned) image-to-video generation:\n\n```python\n# used for 384p model variant\n# width = 640\n# height = 384\n\n# used for 768p model variant\nwidth = 1280\nheight = 768\n\nimage = Image.open('assets\u002Fthe_great_wall.jpg').convert(\"RGB\").resize((width, height))\nprompt = \"FPV flying over the Great Wall\"\n\nwith torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):\n    frames = model.generate_i2v(\n        prompt=prompt,\n        input_image=image,\n        num_inference_steps=[10, 10, 10],\n        temp=16,\n        video_guidance_scale=4.0,\n        output_type=\"pil\",\n        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed\n    )\n\nexport_to_video(frames, \".\u002Fimage_to_video_sample.mp4\", fps=24)\n```\n\n#### CPU offloading\n\nWe also support two types of CPU offloading to reduce GPU memory requirements. Note that they may sacrifice efficiency.\n* Adding a `cpu_offloading=True` parameter to the generate function allows inference with **less than 12GB** of GPU memory. This feature was contributed by [@Ednaordinary](https:\u002F\u002Fgithub.com\u002FEdnaordinary), see [#23](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fpull\u002F23) for details.\n* Calling `model.enable_sequential_cpu_offload()` before the above procedure allows inference with **less than 8GB** of GPU memory. This feature was contributed by [@rodjjo](https:\u002F\u002Fgithub.com\u002Frodjjo), see [#75](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fpull\u002F75) for details.\n\n#### MPS backend\n\nThanks to [@niw](https:\u002F\u002Fgithub.com\u002Fniw), Apple Silicon users (e.g. MacBook Pro with M2 24GB) can also try our model using the MPS backend! 
Please see [#113](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fpull\u002F113) for the details.\n\n### 3. Multi-GPU Inference\n\nFor users with multiple GPUs, we provide an [inference script](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002Fmain\u002Fscripts\u002Finference_multigpu.sh) that uses sequence parallelism to save memory on each GPU. This also brings a big speedup, taking only 2.5 minutes to generate a 5s, 768p, 24fps video on 4 A100 GPUs (vs. 5.5 minutes on a single A100 GPU). Run it on 2 GPUs with the following command:\n\n```bash\nCUDA_VISIBLE_DEVICES=0,1 sh scripts\u002Finference_multigpu.sh\n```\n\nIt currently supports 2 or 4 GPUs (For SD3 Version), with more configurations available in the original script. You can also launch a [multi-GPU Gradio demo](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002Fmain\u002Fscripts\u002Fapp_multigpu_engine.sh) created by [@tpc2233](https:\u002F\u002Fgithub.com\u002Ftpc2233), see [#59](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fpull\u002F59) for details.\n\n  > Spoiler: We didn't even use sequence parallelism in training, thanks to our efficient pyramid flow designs.\n\n### 4. Usage tips\n\n* The `guidance_scale` parameter controls the visual quality. We suggest using a guidance within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.\n* The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates the autoregressive generation degradation, while a smaller value stabilizes the video.\n* For 10-second video generation, we recommend using a guidance scale of 7 and a video guidance scale of 5.\n\n## Training\n\n### 1. Training VAE\n\nThe hardware requirements for training VAE are at least 8 A100 GPUs. 
Please refer to [this document](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002Fmain\u002Fdocs\u002FVAE.md). This is a [MAGVIT-v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05737) like continuous 3D VAE, which should be quite flexible. Feel free to build your own video generative model on this part of VAE training code.\n\n### 2. Finetuning DiT\n\nThe hardware requirements for finetuning DiT are at least 8 A100 GPUs. Please refer to [this document](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FPyramid-Flow\u002Fblob\u002Fmain\u002Fdocs\u002FDiT.md). We provide instructions for both autoregressive and non-autoregressive versions of Pyramid Flow. The former is more research oriented and the latter is more stable (but less efficient without temporal pyramid).\n\n## Gallery\n\nThe following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https:\u002F\u002Fpyramid-flow.github.io).\n\n\u003Ctable class=\"center\" border=\"0\" style=\"width: 100%; text-align: left;\">\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5b44a57e-fa08-4554-84a2-2c7a99f2b343\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5afd5970-de72-40e2-900d-a20d18308e8e\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1d44daf8-017f-40e9-bf18-1e19c0a8983b\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7f5dd901-b7d7-48cc-b67a-3c5f9e1546d2\" autoplay muted loop playsinline>\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Comparison\n\nOn VBench ([Huang et al., 
2024](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVchitect\u002FVBench_Leaderboard)), our method surpasses all the compared open-source baselines. Even with only public video data, it achieves comparable performance to commercial models like Kling ([Kuaishou, 2024](https:\u002F\u002Fkling.kuaishou.com\u002Fen)) and Gen-3 Alpha ([Runway, 2024](https:\u002F\u002Frunwayml.com\u002Fresearch\u002Fintroducing-gen-3-alpha)), especially in the quality score (84.74 vs. 84.11 of Gen-3) and motion smoothness.\n\n![vbench](assets\u002Fvbench.jpg)\n\nWe conduct an additional user study with 20+ participants. As can be seen, our method is preferred over open-source models such as [Open-Sora](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora) and [CogVideoX-2B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo) especially in terms of motion smoothness.\n\n![user_study](assets\u002Fuser_study.jpg)\n\n## Acknowledgement\n\nWe are grateful for the following awesome projects when implementing Pyramid Flow:\n\n* [SD3 Medium](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-diffusion-3-medium) and [Flux 1.0](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-dev): State-of-the-art image generation models based on flow matching.\n* [Diffusion Forcing](https:\u002F\u002Fboyuan.space\u002Fdiffusion-forcing) and [GameNGen](https:\u002F\u002Fgamengen.github.io): Next-token prediction meets full-sequence diffusion.\n* [WebVid-10M](https:\u002F\u002Fgithub.com\u002Fm-bain\u002Fwebvid), [OpenVid-1M](https:\u002F\u002Fgithub.com\u002FNJU-PCALab\u002FOpenVid-1M) and [Open-Sora Plan](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan): Large-scale datasets for text-to-video generation.\n* [CogVideoX](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo): An open-source text-to-video generation model that shares many training details.\n* [Video-LLaMA2](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2): An open-source 
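video LLM for our video recaptioning.\n\n## Citation\n\nConsider giving this repository a star and citing Pyramid Flow in your publications if it helps your research.\n```\n@article{jin2024pyramidal,\n  title={Pyramidal Flow Matching for Efficient Video Generative Modeling},\n  author={Jin, Yang and Sun, Zhicheng and Li, Ningyuan and Xu, Kun and Xu, Kun and Jiang, Hao and Zhuang, Nan and Huang, Quzhe and Song, Yang and Mu, Yadong and Lin, Zhouchen},\n  journal={arXiv preprint arXiv:2410.05954},\n  year={2024}\n}\n```\n","Pyramid Flow is an autoregressive video generation method based on flow matching, designed to generate high-quality video efficiently. The project is trained on open-source datasets and can generate videos up to 10 seconds long at 768p resolution and 24 FPS, and it supports image-to-video generation. Its core features include using the FLUX structure to improve human structure and motion stability, and supporting multi-GPU inference and CPU offloading to lower hardware requirements. It suits scenarios that need high-resolution video generation, such as creative content production and virtual and augmented reality applications.",2,"2026-06-11 03:41:22","high_star"]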