[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82878":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},82878,"Bernini","bytedance\u002FBernini","bytedance","Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.","https:\u002F\u002Fbernini-ai.github.io\u002F",null,"Python",704,53,10,0,37,334,572,175,99.17,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:39","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fbernini-icon.png\" width=\"560\" alt=\"Bernini\"\u002F>\n\n\u003Ch4 align=\"center\">Latent Semantic Planning for Video Diffusion\u003C\u002Fh4>\n\n**Chenchen Liu\u003Csup>\\*\u003C\u002Fsup>, Junyi Chen\u003Csup>\\*\u003C\u002Fsup>, Lei Li\u003Csup>\\*\u003C\u002Fsup>, Lu Chi\u003Csup>\\*,§\u003C\u002Fsup>, Mingzhen Sun\u003Csup>\\*\u003C\u002Fsup>, Zhuoying Li\u003Csup>\\*\u003C\u002Fsup>, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan\u003Csup>✉\u003C\u002Fsup>**\n\n\u003Csup>\\*\u003C\u002Fsup> Equal contribution&nbsp;&nbsp;\u003Csup>✉\u003C\u002Fsup> Corresponding author&nbsp;&nbsp;\u003Csup>§\u003C\u002Fsup> Project lead\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.22344-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22344)\n[![Project 
Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue.svg)](https:\u002F\u002Fbernini-ai.github.io\u002F)\n[![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Models-yellow)](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FBernini)\n\n\u003C\u002Fdiv>\n\n## 🎉 News\n\n- **[2026-06-01]** We open-sourced the inference code and model weights of the Bernini Renderer (**Bernini-R**).\n- **[2026-05-22]** We released our paper [Bernini: Latent Semantic Planning for Video Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22344).\n\n## ✨ Highlights\n\nBernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.\n\nOn video editing, Bernini reaches the first tier among leading closed-source\ncommercial models. The leaderboard below comes from our self-built arena\nplatform, where human annotators blindly vote on paired edits and the votes are\naggregated into a Bradley-Terry score and a pairwise win-rate matrix.\n\n\u003Cimg src=\"assets\u002Farena.png\" width=\"900\" alt=\"Video editing arena: Bradley-Terry leaderboard and pairwise win-rate matrix\"\u002F>\n\n## 📦 Installation\n\n### Requirements\n\n- **Python** 3.11.2.\n- **CUDA GPU** — a Hopper GPU (H100\u002FH800\u002FH200) is recommended so FlashAttention-3\n  can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.\n- **CUDA toolkit** 12.4 (matches the pinned `torch==2.5.1+cu124`; 12.3+ is the\n  minimum if you build FlashAttention-3).\n- Pinned in `requirements.txt`: `torch==2.5.1+cu124`, `diffusers==0.35.2`,\n  `accelerate==0.34.2`, `transformers==4.57.3`.\n\nReference environment (Bernini-R is developed and tested on this setup):\n\n| Component | Version      |\n|-----------|--------------|\n| GPU       | NVIDIA H100  |\n| CUDA      | 12.4         |\n| Python    | 3.11.2       |\n| PyTorch   | 2.5.1+cu124  |\n\n### Install\n\n```bash\ngit clone 
https:\u002F\u002Fgithub.com\u002Fbytedance\u002FBernini.git bernini && cd bernini\npip install -r requirements.txt\n```\n\nOptional extras:\n\n- **Multi-GPU sequence parallel** needs [Open-VeOmni](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni)\n  (Apache-2.0, Python 3.11). Use `--no-deps` so VeOmni does not pull in a\n  different torch build and override the pinned `torch==2.5.1+cu124`:\n  `pip install --no-deps git+https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni.git@v0.1.10`.\n  Single-GPU inference does not need it.\n- **Faster attention** (auto-detected if installed; otherwise PyTorch SDPA is used):\n  - FlashAttention-2 — general CUDA GPUs (incl. A100\u002FA800): `pip install flash-attn==2.8.3`.\n  - FlashAttention-3 — Hopper only (H100\u002FH800\u002FH200, CUDA ≥ 12.3, PyTorch ≥ 2.4).\n    `flash_attn_interface` is not on PyPI; build it from the\n    [flash-attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) repo's\n    `hopper\u002F` directory at tag `v2.8.3`:\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention.git\n    cd flash-attention && git checkout v2.8.3\n    cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user\n    ```\n\n### Weights\n\nBernini-R provides two ways to obtain the renderer weights. 
The **diffusers\nformat is recommended** — it is a self-contained diffusers-format directory whose\n`transformer` \u002F `transformer_2` already hold the Bernini-R weights, so you point\n`--config` at it and the weights load directly, with **no** `--high_noise_ckpt` \u002F\n`--low_noise_ckpt` needed.\n\n#### Option A — diffusers format (recommended)\n\nA single ready-to-use diffusers-format model from\n[`ByteDance\u002FBernini-R-Diffusers`](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FBernini-R-Diffusers).\nIt bundles the Wan2.2 base components (VAE, UMT5 text encoder, tokenizer) together\nwith the Bernini-R transformer weights, so nothing else is downloaded at runtime.\n\n```bash\npip install -U \"huggingface_hub\"\nhf download ByteDance\u002FBernini-R-Diffusers --local-dir Bernini-R-Diffusers\n```\n\nThen pass it via `--config` and omit the checkpoint flags, e.g.:\n\n```bash\npython infer_single_gpu.py --config Bernini-R-Diffusers \\\n    --case assets\u002Ftestcases\u002Ft2i\u002Ft2i.json --num_frames 1\n```\n\n#### Option B — separate checkpoints\n\nThe original layout, where Bernini-R uses two sets of weights loaded separately:\n\n1. **Wan2.2 base** — [`Wan-AI\u002FWan2.2-T2V-A14B-Diffusers`](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B-Diffusers) on Hugging Face. Supplies the\n   VAE, UMT5 text encoder, tokenizer, and the transformer architecture\u002Fbase weights.\n   It is downloaded automatically on first run (configured by `wan22_base` in\n   `configs\u002Fbernini_renderer_wan22\u002Fconfig.json`).\n2. **Bernini-R checkpoint** — the trained high-noise \u002F low-noise transformer weights\n   (safetensors) from [ByteDance\u002FBernini-R](https:\u002F\u002Fhuggingface.co\u002FByteDance\u002FBernini-R), passed with\n   `--high_noise_ckpt` \u002F `--low_noise_ckpt`. 
Both a local directory and a Hugging\n   Face repo id are accepted.\n\nDownload both with the `hf` CLI from `huggingface_hub`:\n\n```bash\npip install -U \"huggingface_hub\"\nhf download Wan-AI\u002FWan2.2-T2V-A14B-Diffusers --local-dir Wan2.2-T2V-A14B-Diffusers\nhf download ByteDance\u002FBernini-R --local-dir Bernini-R\n```\n\n## 🚀 Usage\n\nA run is described by a **case file** — a small JSON under\n[`assets\u002Ftestcases\u002F`](assets\u002Ftestcases\u002F) that bundles one task's routing and\ninputs (`task_type`, `guidance_mode`, `prompt`, source media, `output`). This\nkeeps long prompts out of the command line. Each task has a directory under\n`assets\u002Ftestcases\u002F` holding one or more case files; see\n[`assets\u002Ftestcases\u002F`](assets\u002Ftestcases\u002F) for the format and the bundled\n`t2i` \u002F `i2i` \u002F `t2v` \u002F `v2v` \u002F `rv2v` \u002F `r2v` examples.\n\n### Prompt enhancer (highly recommended)\n\n`--use_pe` enhances the prompt through an OpenAI-compatible endpoint and is\nrecommended for best generation quality. The `openai` SDK is installed by\n`requirements.txt`; configure the endpoint with environment variables:\n\n```bash\nexport BERNINI_PE_API_KEY=...      # or OPENAI_API_KEY\nexport BERNINI_PE_BASE_URL=...     # or OPENAI_BASE_URL\nexport BERNINI_PE_MODEL=...        # vision-capable chat model\n```\n\n### Examples by task type\n\nUnless an example specifies otherwise, inference outputs **480p \u002F 16fps** (the\ndefaults — `--max_image_size 848`, `--fps 16`).\n\nEach example runs a bundled case in\n[`assets\u002Ftestcases\u002F`](assets\u002Ftestcases\u002F) — replace `\u003Chi>` \u002F `\u003Clo>` with your\nhigh-\u002Flow-noise checkpoint paths. The image tasks (`t2i`, `i2i`) are shown on a\nsingle GPU; the video tasks on 8 GPUs via `torchrun`, where `--ulysses N` gives\nN-way Ulysses sequence parallel per sample and the remaining `world_size \u002F N`\nranks run data parallel over the task list. 
The two scripts take the same\ninputs, so any example can be run either way.\n\nInputs can also be passed directly as flags instead of `--case` (`--prompt`,\n`--task_type`, `--guidance_mode`, `--video`, `--image`, `--images`,\n`--output`); generation parameters (`--seed`, `--num_frames`, ...) are always\ncommand-line flags.\n\n**Text-to-image** (`t2i`) — single GPU; generates one frame, so pass `--num_frames 1`\n\n```bash\npython infer_single_gpu.py --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> \\\n    --case assets\u002Ftestcases\u002Ft2i\u002Ft2i.json --num_frames 1\n```\n\n**Image editing** (`i2i`) — single GPU; generates one frame, so pass `--num_frames 1`\n\n```bash\npython infer_single_gpu.py --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> \\\n    --case assets\u002Ftestcases\u002Fi2i\u002Fi2i.json --num_frames 1\n```\n\n**Text-to-video** (`t2v`)\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Ft2v\u002Ft2v.json\n```\n\n**Video editing** (`v2v` \u002F `mv2v`) — two cases are provided.\n\nFor edits where the main subject keeps its ordinary motion (case 1 adds a\nsnowman to the scene), the `v2v` task type is enough:\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Fv2v\u002Fv2v_case1.json\n```\n\nFor edits that need to change the subject's motion (case 2 makes the person\ncrouch down), the `mv2v` task type gives better results:\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Fv2v\u002Fv2v_case2.json\n```\n\n**Reference + video editing** (`rv2v`) — two cases are provided.\n\nCase 1 is reference-image-guided video editing — replacing a garment in the\nsource video with one 
from a reference image:\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Frv2v\u002Frv2v_case1.json\n```\n\nCase 2 is a video-insertion example — inserting content into the source video.\nIt is run at 720p \u002F 24fps to show the insertion result more clearly:\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Frv2v\u002Frv2v_case2.json \\\n    --num_frames 121 --fps 24 --max_image_size 1280\n```\n\n**Reference-to-video** (`r2v`) — drives a video from one or more reference images\n\n```bash\ntorchrun --nproc-per-node 8 infer_multi_gpu.py \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --ulysses 8 \\\n    --case assets\u002Ftestcases\u002Fr2v\u002Fr2v.json\n```\n\nSee `python infer_single_gpu.py --help` for the full argument list.\n\n### Gradio demo\n\n`gradio_demo.py` exposes the same pipeline through a Gradio UI: the task-type\ndropdown auto-fills `guidance_mode` (still user-editable), uploaded media is\nrouted to the matching slot, and the result is rendered inline.\n\n```bash\n# Single GPU\npython gradio_demo.py --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --port 7860\n\n# 8 GPUs, 8-way Ulysses sequence parallel\ntorchrun --nproc-per-node 8 gradio_demo.py --ulysses 8 \\\n    --high_noise_ckpt \u003Chi> --low_noise_ckpt \u003Clo> --port 7860 --share\n```\n\nAdd `--use_pe` (and `export OPENAI_API_KEY=...` \u002F `BERNINI_PE_API_KEY=...`) to\nenable GPT prompt enhancement; the in-UI checkbox is a per-request switch on\ntop of this flag.\n\n## 📑 Citation\n\nIf you use Bernini in your research, please cite:\n\n```bibtex\n@article{bernini,\n  title   = {Bernini: Latent Semantic Planning for Video Diffusion},\n  author  = {Chenchen Liu and Junyi Chen and Lei Li and Lu Chi and Mingzhen Sun and 
Zhuoying Li and Yi Fu and Ruoyu Guo and Yiheng Wu and Ge Bai and Zehuan Yuan},\n  journal = {arXiv preprint arXiv:2605.22344},\n  year    = {2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nBernini builds on several outstanding open-source projects:\n\n- [Wan2.2-T2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B)\n- [Qwen2.5-VL-7B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-7B-Instruct)\n- [VeOmni](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni)\n\nWe thank the authors and communities of these projects for their contributions.\n\n## 📄 License\n\nApache License 2.0. See [LICENSE](LICENSE).","Bernini is a unified video generation and editing framework that combines a semantic planner based on a multimodal large language model with a DiT-based renderer. Its core capability is creating and modifying high-quality video content through advanced semantic understanding and image generation techniques. The project is written in Python, supports CUDA GPU acceleration, and recommends high-performance GPUs such as the NVIDIA H100 for best performance. Bernini suits scenarios that call for creative video production, video editing, and AI-driven content generation, such as advertising production and film and television post-production.",2,"2026-06-11 04:09:29","CREATED_QUERY"]