[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83893":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":10,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":34,"discoverSource":35},83893,"LongLive-RAG","qixinhu11\u002FLongLive-RAG","qixinhu11","Official Implementation of LongLive-RAG: A general retrieval-augmented framework for long video generation.","",null,"Python",62,0,2,1,10,48,"Apache License 2.0",false,"main",[22,23,24,25,26,27,28,29,30,31],"autoregressive","diffusion-models","generative-ai","long-video-generation","pytorch","retrieval-augmented-generation","video-diffusion","video-diffusion-model","video-generation","wan","2026-06-12 04:01:42","\u003Cimg src=\"assets\u002Flongliverag-logo.png\" width=\"560\" alt=\"LongLive-RAG logo\"\u002F>\n\n# 🔍 LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-Paper-brown)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02553)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Code-blue)](https:\u002F\u002Fgithub.com\u002Fqixinhu11\u002FLongLive-RAG)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-Page-brightgreen)](https:\u002F\u002Flonglive-rag.github.io\u002F)\n[![DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Fqixinhu11\u002FLongLive-RAG)\n[![Model 
Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Model_Card-yellow)](https:\u002F\u002Fhuggingface.co\u002Fqixinhu11\u002FLongLive-RAG)\n[![HF Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Paper-yellow)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2606.02553)\n\n[Qixin Hu](https:\u002F\u002Fqixinhu11.github.io\u002F) · [Shuai Yang](https:\u002F\u002Fandysonys.github.io\u002F) · [Wei Huang](https:\u002F\u002Faaron-weihuang.com\u002F) · [Song Han](https:\u002F\u002Fhanlab.mit.edu\u002Fsonghan) · [Yukang Chen](https:\u002F\u002Fyukangchen.com\u002F)\n\n## 💡 TL;DR\n\nLongLive-RAG turns long video generation into a **retrieval problem**. Instead of attending only to the most recent sliding window, an autoregressive (AR) video generator looks back over the video it has *already* generated and pulls in the most relevant past latents as extra context. This cuts error accumulation, identity drift, and background flicker over long horizons, **without retraining the base generator**.\n\n![LongLive-RAG framework overview](assets\u002Ffig_framework.png)\n\n## 📰 News\n\n- 🔥 [2026.06] We release the **LongLive-RAG** paper and code!\n\n## 🎬 Demo\n\n> 🌐 More results and video comparisons on the [**project page**](https:\u002F\u002Flonglive-rag.github.io\u002F).\n\nLong-horizon comparisons. 
The **native** sliding-window baseline (left) accumulates errors and drifts over time, while adding **LongLive-RAG** (right) preserves subject identity and visual quality.\n\n\u003Ctable>\n\u003Ctr>\n  \u003Cth align=\"center\">Native (baseline)\u003C\u002Fth>\n  \u003Cth align=\"center\">Native + LongLive-RAG (Ours)\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7049f1ec-36aa-41ee-a219-19319f23166a\" controls width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F63a15dea-2e03-4b46-a778-96da12d7c3ab\" controls width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fec3f48c8-e014-4032-a52d-20b41d71ee7b\" controls width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7db25e60-ebac-4c39-8259-6eb14f85b09e\" controls width=\"100%\">\u003C\u002Fvideo>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## ✨ Highlights\n\n- **🥇 First of its kind.** Among open-ended AR long video generation methods, the first to formulate self-generated latent history as content-addressable retrieval memory.\n- **🔌 Plug-and-play.** Works across Causal-Forcing, Self-Forcing, and LongLive with the base generator frozen.\n- **🔎 Searchable history.** Retrieves the most relevant past latents as extra context for each new block.\n- **📐 Window Temporal Delta Loss.** Makes embeddings capture meaningful temporal change, not redundant local similarity.\n- **⚡ Consistent wins.** Best average VBench-Long rank across lengths and backbones.\n\n## 🔬 Method Overview\n\nAt block `t`, a standard AR model attends to a sliding-window context. 
LongLive-RAG inserts **retrieved historical entries** `M_t` between the sink and local windows:\n\n```\nSliding window:   A_sw  = [ C_sink ‖           C_loc ]\nLongLive-RAG:     A_rag = [ C_sink ‖  M_t  ‖   C_loc ]\n```\n\n| Stage | What happens |\n|---|---|\n| **1. Indexing** | Encode each completed latent block into a compact embedding and store it. |\n| **2. Retrieval** | Match the current block against past embeddings and pull in the top-K as extra context. |\n| **3. Embedding training** | Train the encoder offline on self-generated latents, with the base generator frozen. |\n\n## 🏁 Getting Started\n\n### 📦 Installation\n\nLongLive-RAG shares its environment with LongLive. Just follow the upstream [**LongLive installation guide**](https:\u002F\u002Fnvlabs.github.io\u002FLongLive\u002FLongLive2\u002Fdocs\u002F#installation).\n\n### 🚀 Inference\n\n**1. Download everything — two commands.** All LongLive-RAG assets (AR backbones, retrieval AE, prompt files, and the toy latent set) live in a single Hugging Face repo; the base WAN VAE comes from Wan:\n\n```bash\n# Base WAN VAE — LongLive-RAG operates in its latent space\nhf download Wan-AI\u002FWan2.1-T2V-1.3B --local-dir wan_models\u002FWan2.1-T2V-1.3B\n\n# All LongLive-RAG assets — restores checkpoints\u002F and toydatasets\u002F in place\nhf download qixinhu11\u002FLongLive-RAG --local-dir . 
--include \"checkpoints\u002F*\" \"toydatasets\u002F*\"\n```\n\n> Older setups can swap `hf download` for `huggingface-cli download` (same arguments).\n\nThe second command lays out:\n\n```\ncheckpoints\u002F\n├── causal_forcing.pt              # Causal-Forcing AR backbone\n├── self_forcing.pt                # Self-Forcing AR backbone\n├── longlive_base.pt               # LongLive AR backbone\n├── longlive_lora.pt               # LongLive LoRA (paired with longlive_base.pt)\n├── ae_latent_mem.pt               # Retrieval autoencoder (default for inference)\n├── moviegenbench_128_refined.txt  # 128 MovieGenBench prompts\n└── vidprom_filtered_extended.txt  # Self-Forcing prompt pool (for generate_latent.py)\ntoydatasets\u002F\n└── latent_0000xx.pt               # tiny example latent set for the training demo\n```\n\nTo train your own retrieval AE instead of using `ae_latent_mem.pt`, see [Training](#️-training).\n\n**2. Run.** The repo ships a 3 × 2 grid (three backbones × two context-assembly methods) in [configs\u002F](configs\u002F):\n\n| Backbone \\ Method | `native` (sliding-window) | `latentmem` (LongLive-RAG, ours) |\n|---|---|---|\n| **causal_forcing** | [causal_forcing_native.yaml](configs\u002Fcausal_forcing_native.yaml) | [causal_forcing_latentmem.yaml](configs\u002Fcausal_forcing_latentmem.yaml) |\n| **self_forcing** | [self_forcing_native.yaml](configs\u002Fself_forcing_native.yaml) | [self_forcing_latentmem.yaml](configs\u002Fself_forcing_latentmem.yaml) |\n| **longlive** | [longlive_native.yaml](configs\u002Flonglive_native.yaml) | [longlive_latentmem.yaml](configs\u002Flonglive_latentmem.yaml) |\n\n```bash\n# Main result: Causal-Forcing backbone + LongLive-RAG retrieval\nbash inference.sh causal_forcing latentmem\n\n# Baseline: native sliding-window\nbash inference.sh causal_forcing native\n\n# GPU \u002F port overrides\nGPU=4 PORT=29510 bash inference.sh causal_forcing latentmem\n```\n\n### 🔁 Reproducibility\n\nFor deterministic inference, 
[inference.py](inference.py) sets a fixed seed (`config.seed`) across `random` \u002F `numpy` \u002F `torch` and enables deterministic backends:\n\n```python\nos.environ[\"CUBLAS_WORKSPACE_CONFIG\"] = \":16:8\"\nos.environ.setdefault(\"PYTHONHASHSEED\", str(config.seed))\ntorch.backends.cudnn.deterministic = True\ntorch.backends.cudnn.benchmark = False\ntorch.use_deterministic_algorithms(True, warn_only=True)\n```\n\n> ⚠️ **Bit-exact cross-machine reproduction is hard to guarantee.** Even with the settings above, identical outputs across *different* machines require the **same GPU model**, the **same PyTorch \u002F CUDA \u002F cuDNN versions**, and matching checkpoints\u002Fconfigs. Differences in GPU architecture (e.g. A100 vs. H100), TF32 behavior, or the `torch.compile` autotuned attention kernels can still produce small numerical drift. For `PYTHONHASHSEED` to take full effect, set it in the environment before launching: `PYTHONHASHSEED=0 bash inference.sh ...`.\n\n> ✅ **The more reliable way to validate our gains is a same-machine A\u002FB comparison.** Run the **native** baseline and **LongLive-RAG (`latentmem`)** back-to-back on the *same* GPU with the *same* prompts and seed, then compare the outputs directly. This isolates the effect of retrieval from any hardware\u002Fsoftware-stack variance:\n>\n> ```bash\n> # Same backbone, same machine — compare baseline vs. ours\n> bash inference.sh causal_forcing native\n> bash inference.sh causal_forcing latentmem\n> ```\n\n### 🏋️ Training\n\nThe base generator stays **frozen**; the only trainable component is the retrieval encoder (a small latent autoencoder). Training has two steps:\n\n**Step 1: Build a latent corpus.** Run a frozen generator over a prompt pool to collect the clean latent blocks it produces; these become the training samples. 
The launcher shards generation across multiple GPUs.\n\n```bash\nbash generate_latent.sh\n```\n\n**Step 2: Train the retrieval autoencoder.** Fit the encoder on the collected latents with a reconstruction loss plus the Window Temporal Delta and trajectory-smoothing terms. Default hyperparameters live in [ae\u002Fconfigs\u002F](ae\u002Fconfigs\u002F).\n\n```bash\nbash train_ae_delta.sh\n```\n\nRetraining the base AR backbones is out of scope; backbone checkpoints are consumed as-is. See upstream [LongLive](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLongLive) \u002F [Self-Forcing](https:\u002F\u002Fgithub.com\u002Fguandeh17\u002FSelf-Forcing) to train one from scratch.\n\n## 🗂️ Repository Layout\n\n```\n├── ae\u002F               # Retrieval autoencoder (model, configs, training)\n├── checkpoints\u002F      # AR backbones, AE checkpoint, prompt .txt files (gitignored)\n├── configs\u002F          # Inference YAMLs (3 backbones × 2 methods) + generate_latent\n├── datasets\u002F         # AE training latents (output of generate_latent.sh, gitignored)\n├── toydatasets\u002F      # Tiny example latent set for the training demo (from HF, gitignored)\n├── pipeline\u002F         # Causal inference pipeline (drives all backbones)\n├── utils\u002F            # Dataset, memory, scheduler, lora, wan-wrapper utilities\n├── wan\u002F, wan_models\u002F # WAN VAE backbone (T2V-1.3B)\n├── inference.py      # Inference entry point\n├── inference.sh      # Launcher: bash inference.sh \u003Cbackbone> \u003Cmethod>\n├── generate_latent.py \u002F .sh  # Latent corpus generation (multi-GPU sharded)\n└── train_ae_delta.sh         # Retrieval AE launcher\n```\n\n## 📄 Citation\n\n📜 Paper: [arXiv:2606.02553](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02553)\n\n```bibtex\n@article{longliverag2026,\n  title         = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},\n  author        = {Hu, Qixin and Yang, Shuai and Huang, Wei and Han, Song and Chen, 
Yukang},\n  journal       = {arXiv preprint arXiv:2606.02553},\n  archivePrefix = {arXiv},\n  eprint        = {2606.02553},\n  year          = {2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nLongLive-RAG builds on the codebases and ideas of:\n\n- [LongLive](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLongLive): the AR long-video framework this codebase forks from.\n- [Self-Forcing](https:\u002F\u002Fgithub.com\u002Fguandeh17\u002FSelf-Forcing): causal AR training recipe and prompt pool.\n- [Causal-Forcing](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FCausal-Forcing): one of the AR backbones evaluated in this work.\n- [Wan](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1): the base video generation model and VAE latent space.\n\n## 📝 License\n\nReleased under the [Apache 2.0](LICENSE) license.\n","2026-06-11 04:11:46","CREATED_QUERY"]