[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80990":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":13,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":13,"starSnapshotCount":13,"syncStatus":15,"lastSyncTime":28,"discoverSource":29},80990,"Lightning-Unified-Video-Editor-via-In-Context-Sparse-Attention","xie-lab-ml\u002FLightning-Unified-Video-Editor-via-In-Context-Sparse-Attention","xie-lab-ml","[ICML 2026] The official code for our work \"LIVEditor-14B: Lightning Unified Video Editor via In-Context Sparse Attention\".","https:\u002F\u002Fxie-lab-ml.github.io\u002Fliveditor-page\u002F",null,"Python",33,0,30,2,3,6,false,"main",[21,22,23,24],"aigc","attention-mechanism","video-generation","vidio-editing","2026-06-12 02:04:09","# LIVEditor-14B\n\n### Lightning Unified Video Editing via In-Context Sparse Attention\n\n\u003Cdiv align=\"center\">\n\n**Shitong Shao** · **Zikai Zhou** · **Haopeng Li** · **Yingwei Song** · **Wenliang Zhong** · **Lichen Bai** · **Zeke Xie**\n\n[![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-green)](https:\u002F\u002Fxie-lab-ml.github.io\u002Fliveditor-page\u002F)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.04569)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Hub-Model-yellow)](https:\u002F\u002Fhuggingface.co\u002Fsst12345\u002Fliveditor)\n\n\u003Cp align=\"center\">\n  \u003Cimg 
src=\".\u002Fassets\u002Flive_visualization.png\" alt=\"Teaser\" width=\"90%\">\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n---\n\n## 📖 Introduction\n\nVideo editing with diffusion transformers suffers from the quadratic complexity of full self-attention, O(S²) in the total token count S, which makes it prohibitively expensive when both source and generated video tokens must be processed jointly.\n\n**LIVEditor-14B** addresses this with **In-Context Sparse Attention**: a lightweight, training-free block-retrieval mechanism that efficiently selects the most relevant source-video tokens for each query block, avoiding the need for dense attention over the full sequence.\n\n> **Key Idea**: store compressed KV representations of the source video, retrieve only the top-*k* most relevant blocks via compressed attention scores, and apply sparse piecewise attention to the diffuse query blocks while using FlashAttention only for the most peaked ones.\n\n**Key results**:\n- A strong open-source video editing model that leads the compared baselines on all four EditVerse metrics reported below (CLIP-T, PickScore, FlowSim, TemCon).\n- The first sparse attention mechanism for video editing.\n- ⚡ **2.8× faster** than FlashAttention-2 at 65K tokens on RTX 4090\n- 🎯 Lightweight fine-tuning: only **80 steps** on ~100K video pairs\n- 🔌 Pluggable backend: supports **TileLang** and **Triton** sparse kernels\n\n---\n\n## 🚀 Quick Start\n\n### 1. Clone & Install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fxie-lab-ml\u002FLightning-Unified-Video-Editor-via-In-Context-Sparse-Attention.git\n# git names the clone directory after the repository\ncd Lightning-Unified-Video-Editor-via-In-Context-Sparse-Attention\npip install -r requirements.txt\n```\n\n### 2. 
Download Weights\n\n| Component | Source | Path |\n|-----------|--------|------|\n| Wan 2.2-T2V-A14B | [Official](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B) | `pretrained_weights\u002FWan2.2-T2V-A14B\u002F` |\n| LIVEditor-14B checkpoint | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fsst12345\u002Fliveditor) | `liveditor_ckpt.bin` |\n\n```bash\n# Download fine-tuned checkpoint from Hugging Face\npip install huggingface_hub\nhuggingface-cli download sst12345\u002Fliveditor liveditor_ckpt.bin --local-dir .\n```\n\nConfigure paths in `inference.yaml`:\n\n```yaml\nbase_model_path: pretrained_weights\u002FWan2.2-T2V-A14B\u002F\n# resume_ckpt is passed via CLI:  --checkpoint liveditor_ckpt.bin\n```\n\n### 3. Run Demo\n\n```bash\npython inference.py \\\n    --config inference.yaml \\\n    --checkpoint liveditor_ckpt.bin \\\n    --input assets\u002Finput.mp4 \\\n    --prompt \"Add a small golden crown with delicate jewels on top of the girl's head...\" \\\n    --output result.mp4\n```\n\n\u003Ctable>\n\u003Ctr>\u003Cth>Input\u003C\u002Fth>\u003Cth>Output (TileLang)\u003C\u002Fth>\u003Cth>Output (Triton)\u003C\u002Fth>\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd>\u003Cvideo src=\"assets\u002Finput.mp4\" width=\"200\"\u002F>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"assets\u002Foutput_tilelang.mp4\" width=\"200\"\u002F>\u003C\u002Ftd>\n  \u003Ctd>\u003Cvideo src=\"assets\u002Foutput_triton.mp4\" width=\"200\"\u002F>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## 🔧 Usage\n\n### Inference\n\n```bash\n# Required arguments:\n#   --config      config file\n#   --checkpoint  fine-tuned checkpoint\n#   --input       source video\n#   --prompt      text editing instruction\n#   --output      output path\n# Optional flags:\n#   --guidance    CFG scale (default: 2.5)\n#   --steps       denoising steps (default: 32)\n#   --seed        random seed\n#   --backend     sparse kernel: tilelang (default) | triton\npython inference.py \\\n    --config inference.yaml \\\n    --checkpoint \u003Cpath-to-ckpt> \\\n    --input \u003Cinput-video.mp4> \\\n    --prompt \"\u003Cediting-instruction>\" \\\n    --output \u003Coutput.mp4> \\\n    --guidance 2.5 \\\n    --steps 32 \\\n    --seed 42 \\\n    --backend tilelang\n```\n\n### Switching the Sparse Backend\n\n```yaml\n# inference.yaml\nattention:\n  backend: tilelang      # or 'triton'\n```\n\nBoth backends produce visually identical results (mean absolute error \u003C 1e-4).\n\n---\n\n## 🧠 Method\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fin_context_sparse_attention.png\" alt=\"Architecture\" width=\"90%\">\n\u003C\u002Fdiv>\n\nLIVEditor-14B introduces three key components on top of the Wan 2.2 diffusion backbone:\n\n### 1. Block-wise Compression\n\n| Parameter | Value | Description |\n|-----------|-------|-------------|\n| `BLOCK_M` | 64 | Block size for Q\u002FK\u002FV partitioning |\n| Compressed dim | `[B, H, NUM_BLOCKS, D]` | Average-pooled per-block representation |\n\nRaw Q, K, V tensors of shape `[B, H, S, D]` are divided into blocks of 64 tokens and averaged, yielding compact proxies for efficient relevance scoring.\n\n### 2. In-Context Top-K Retrieval\n\nFor each query block, a compressed attention score matrix `Q_c @ K_c^T` is computed. The top-*k* source-video blocks (from the `s_part`) with the highest scores are selected and appended to the target-video KV cache:\n\n```\nnew_KV = [K_target | K_selected_source_blocks]\n```\n\nThis caps the KV length at `t_seq + topK × 64` tokens, far smaller than the full `t_seq + s_seq`.\n\n### 3. Sharpness-Aware Split\n\nQuery blocks are ranked by attention sharpness (the sum of top-*k* compressed attention weights). 
The most peaked blocks receive **full FlashAttention**, while the diffuse blocks use **sparse piecewise attention**:\n\n| Split | Ratio | Attention | Cost |\n|-------|-------|-----------|------|\n| Flat (peaked) | 50% | FlashAttention-3 | O(T²) — accurate |\n| Sharp (diffuse) | 50% | Sparse Top-K + Interval Approx. | O(T·(t+topK)) — fast |\n\n### Inference Hyperparameters\n\n| Parameter | Value | Description |\n|-----------|-------|-------------|\n| `SAR` (Sparsity) | 0.0625 | Fraction of KV blocks for exact Top-K |\n| `SRR` (Select Ratio) | 0.125 | Block selection ratio |\n| `Flat Ratio` | 0.5 | Fraction of Q blocks using full attention |\n| `Block Size` | 64 | Token block granularity |\n| Sampling Steps | 32 | Denoising steps (Flow UniPC) |\n| Guidance Scale | 2.5 | Classifier-free guidance |\n| Shift | 6.0 | Flow-matching timestep shift |\n\n---\n\n## 📊 Benchmark\n\n### EditVerse Benchmark\n\n| Method | CLIP-T ↑ | PickScore ↑ | FlowSim ↑ | TemCon ↑ |\n|--------|----------|-------------|-----------|----------|\n| TokenFlow | 0.261 | 0.193 | 0.883 | 0.972 |\n| CoDeF | 0.252 | 0.187 | 0.854 | 0.965 |\n| FateZero | 0.249 | 0.188 | 0.861 | 0.968 |\n| Pix2Video | 0.262 | 0.192 | 0.887 | 0.971 |\n| AnyV2V | 0.268 | 0.198 | 0.891 | 0.973 |\n| InsV2V | 0.272 | 0.201 | 0.895 | 0.974 |\n| UniEdit | 0.276 | 0.204 | 0.898 | 0.976 |\n| I2VEdit | 0.280 | 0.208 | 0.902 | 0.977 |\n| **LIVEditor-14B** | **0.289** | **0.215** | **0.911** | **0.981** |\n\n> Detailed benchmark results and comparisons on VBench, VIPSeg, and DAVIS are available in the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.04569).\n\n---\n\n## 📁 Project Structure\n\n```\nLIVEditor-14B\u002F\n├── inference.py                       # Main inference entrypoint\n├── inference.yaml                     # Default config\n├── model.py                           # WanModel with in-context sparse attention\n├── scheduler.py                       # Flow UniPC scheduler\n├── fm_solvers.py                      # 
Flow matching utilities\n├── requirements.txt                   # Python dependencies\n├── infer.sh                           # Inference launch script\n├── in_context_sparse_attention\u002F       # Pluggable sparse kernels\n│   ├── editing_sparse_attention.py    # Main attention function (backend-agnostic)\n│   ├── tilelang_kernels.py            # TileLang kernel definitions\n│   ├── tilelang_host.py               # TileLang host wrapper\n│   ├── triton_kernels.py              # Triton kernel definitions\n│   └── triton_host.py                 # Triton host wrapper\n├── wanx\u002F                              # Wan model components\n│   ├── model.py                       # WanModel class\n│   ├── vae.py                         # WanVAE (encode\u002Fdecode)\n│   ├── t5.py                          # T5 text encoder\n│   ├── scheduler.py                   # Scheduler (legacy path)\n│   ├── attention.py                   # FlashAttention wrapper (FA3)\n│   └── utils.py                       # Video I\u002FO utilities\n├── model_dit\u002F                         # Distributed training stubs\n├── configs\u002F                           # Experiment configs\n├── assets\u002F                            # Demo assets\n│   ├── input.mp4                      # Example input video\n│   ├── output_tilelang.mp4            # TileLang backend output\n│   ├── output_triton.mp4              # Triton backend output\n│   └── prompt.txt                     # Example prompt\n└── README.md\n```\n\n---\n\n## 🛠 Backend Details\n\n| | TileLang | Triton |\n|---|---|---|\n| Kernel framework | TVM-based TileLang | Triton 3.5 |\n| K\u002FV alignment | Manual pad to 64× | Auto boundary handling |\n| Forward precision | bf16 (mean abs err \u003C 1e-4 vs Triton) | bf16 (reference) |\n| Recommended GPU | H100 \u002F RTX 4090 | H100 \u002F RTX 4090 |\n\n---\n\n## 📝 Citation\n\n```bibtex\n@inproceedings{shao2026liveditor,\n  title   = {LIVEditor-14B: Lightning Unified Video Editing via In-Context 
Sparse Attention},\n  author  = {Shitong Shao and Zikai Zhou and Haopeng Li and Yingwei Song and Wenliang Zhong and Lichen Bai and Zeke Xie},\n  booktitle = {The Forty-Third International Conference on Machine Learning},\n  year    = {2026},\n}\n```\n\n## 📄 License & Acknowledgements\n\nThis project is built upon [Wan 2.2](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2) and [Wan2.2-T2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B) by Alibaba. The in-context sparse attention kernels are powered by [TileLang](https:\u002F\u002Fgithub.com\u002Ftilelang\u002Ftilelang) and [Triton](https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton). FlashAttention is provided by [flash-attn](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention). We thank the authors for their open-source contributions.\n","LIVEditor-14B is a unified video editing tool built on in-context sparse attention. By introducing a lightweight, training-free block-retrieval mechanism, the project addresses the quadratic complexity that full self-attention imposes on conventional diffusion transformers when processing video, significantly improving video editing efficiency. Its core techniques include storing the source video as compressed KV representations and retrieving only the most relevant blocks for sparse piecewise attention, which makes it 2.8× faster than FlashAttention-2. LIVEditor-14B also supports TileLang and Triton sparse kernels, giving a flexible choice of backend. The project suits applications that need efficient, high-quality video editing, such as film and television post-production and creative content generation.","2026-06-11 04:03:06","CREATED_QUERY"]