[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82158":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":15,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},82158,"NAVA","ernie-research\u002FNAVA","ernie-research","Official Code of NAVA: Native Audio-Visual Alignment for Generation.","",null,"Python",183,20,3,7,0,73,152,45,3.97,false,"main",true,[],"2026-06-12 02:04:23","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\" alt=\"NAVA\" width=\"180\">\n\u003C\u002Fp>\n\n# NAVA — Native Audio-Visual Alignment for Generation\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fernie-research.github.io\u002FNAVA\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-1e88e5?style=flat-square&logo=googlechrome&logoColor=white\" alt=\"Project Page\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30073\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-B31B1B?style=flat-square&logo=arxiv&logoColor=white\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fernie-research\u002FNAVA\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97_HuggingFace-Models-FFD21E?style=flat-square\" alt=\"HuggingFace Models\">\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-4c1?style=flat-square\" alt=\"License\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nNAVA is a Native Audio-Visual Alignment framework that formulates joint audio-video generation as *context-conditioned native audio-visual alignment*. NAVA first establishes audio-video correspondence in a dedicated alignment space and then applies context as external conditioning to guide the aligned representation. It is instantiated with an Align-then-Fuse MMDiT architecture, which progressively bridges modality-aware alignment and unified audio-video denoising. To support controllable speech generation, NAVA further introduces Timbre-in-Context Conditioning, which binds reference timbre cues to corresponding speech spans through the context pathway. With only **6.3B** parameters, NAVA achieves superior audio-visual synchronization and video quality, competitive audio quality, and substantially improved reference-timbre controllability.\n\n> [!IMPORTANT]\n> **This repository is a complete open-source release of the NAVA codebase.**\n> It ships end-to-end: full inference pipeline, interactive Gradio demo, and training code — everything you need to run, fine-tune, and build on NAVA.\n\n## Demo\n\n\u003Cdiv align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa02cc83d-b5a3-42ac-9a77-952e0c3bd0fe\n\n\u003C\u002Fdiv>\n\n---\n\n## Features\n\n- **720p in ~1 Minute** — Generate synchronized 720p audio-video in about one minute on 8 GPUs with Ulysses sequence parallelism.\n- **Native Stereo Audio** — Jointly generate scene sounds and speech with video, no post-hoc vocoder alignment required.\n- **Multi-Timbre Voice Control** — Bind reference WAVs to speech spans for precise per-speaker voice identity.\n- **Powerful TTS Synthesis** — High-quality speech generation, including long, complex sentences in English; support for other languages is limited.\n- **Text-Driven Camera Control** — Specify shot 
composition, camera motion, and pacing directly in the prompt.\n- **Flexible Aspect Ratios** — Generate landscape, portrait, and square videos from the same checkpoint.\n\n## Quick Start\n\n**1. Install dependencies**\n\n```bash\n# Install PyTorch matching your CUDA build first\npip install torch torchvision torchaudio\n\n# Core + I\u002FO + sequence parallel + (optional) vLLM rewrite + Gradio\npip install -r requirements.txt\n\n# Flash-Attention has to be built with --no-build-isolation\npip install flash-attn --no-build-isolation\n```\n\n> The vLLM and Gradio entries in `requirements.txt` are only needed for the prompt-rewrite server (`pe_src\u002F`) and the interactive Web UI (`gradio_demo\u002F`); comment them out if you don't use those paths.\n\n**2. Download weights** (one command pulls `NAVA.ckpt` and all dependencies into the project root):\n\n```bash\nhuggingface-cli download ernie-research\u002FNAVA --local-dir .\u002F\n```\n\n**3. Run inference** (8 GPUs with sequence parallel) — first pick the script for your **task**:\n\n```bash\n# General T2AV (text-only)\nbash scripts\u002Finference.sh\n\n# I2AV + Timbre Control (first-frame image + reference voice)\nbash scripts\u002Finference_timbre.sh\n\n# T2A (audio-only, with or without timbre reference)\nbash scripts\u002Finference_t2a.sh\n```\n\nThe three scripts above keep the full model resident on GPU and require **80 GB peak VRAM**. If your hardware can't afford that, two extra scripts demonstrate how to trade speed for VRAM — copy the relevant flags (`--t5_offload`, `--group_offload`, `--vae_tiling`, etc.) 
into the task script of your choice:\n\n| Reference script | Peak VRAM | Speed | What it offloads |\n|---|---|---|---|\n| `scripts\u002Finference.sh` (baseline) | **80 GB** | **1 s \u002F step** | Nothing — full model resident on GPU throughout |\n| `scripts\u002Finference_offload_t5.sh` | **48 GB** | **1 s \u002F step** | T5 text encoder (~11 GB) moved to CPU after text encoding; zero cost during denoising |\n| `scripts\u002Finference_group_offload_t5.sh` | **42 GB** | **3.5 s \u002F step** | T5 offload + DiT backbone blocks paged CPU↔GPU one group at a time (pinned memory, async stream) + VAE spatial tiling (decode one 22×40 latent tile at a time, blend on CPU; latent is 44×80 for 704×1280) |\n\nAll numbers measured at 704×1280, 37 frames, 50 steps, **8×H100** with sequence parallel. `inference_group_offload_t5.sh` exposes `OFFLOAD_GROUP_SIZE` (default `10`, range `1–30`) — smaller values keep fewer DiT blocks on GPU simultaneously, lowering peak VRAM in exchange for more CPU↔GPU transfers per step. Pass it as an env var when launching the script.\n\nFor batch runs, custom prompts, or other modes, see [Inference](#inference). For the full weight manifest, see [Model Weights](#model-weights).\n\n**4. 
Batch rewrite your prompts** (recommended before any inference):\n\n```bash\n# Start the vLLM rewrite server once (stays in the background)\ncd pe_src && bash start_server.sh --gpu 0 && cd ..\n\n# Rewrite — input is one prompt per line, output is line-aligned\npython pe_src\u002Frewrite.py \\\n    --input my_prompts.txt \\\n    --output my_prompts_rewritten.txt \\\n    --concurrency 32\n\n# Convert to JSONL and run\nawk '{print \"{\\\"prompt\\\": \\\"\"$0\"\\\"}\"}' my_prompts_rewritten.txt > my_prompts.jsonl\nDATA_FILE=my_prompts.jsonl bash scripts\u002Finference.sh\n```\n\n> [!TIP]\n> **Always rewrite prompts before inference.** NAVA is trained on high-quality Chinese dense captions; the rewriter expands a short description into a single-paragraph cinematic prompt with explicit scene \u002F motion \u002F audio design — the format that activates the model's full potential. For single prompts or interactive use, see [Prompt Engineering](#prompt-engineering-rewrite).\n\n## Model Architecture\n\nNAVA uses a **30-layer Align-then-Fuse MMDiT** backbone with flow matching:\n\n- **10 Hierarchical Alignment Layers**: dedicated audio\u002Fvideo paths establish fine-grained AV correspondence in a native alignment space — independent QKV per modality, joint self-attention over concatenated video + audio tokens, and per-stream text cross-attention.\n- **20 Unified Fusion Layers**: a single shared transformer stack performs context-conditioned denoising on the aligned representation — shared QKV\u002FFFN, joint self-attention across all tokens, unified text cross-attention.\n- **Timbre-in-Context Conditioning**: reference-WAV speaker embeddings are bound to `\u003CS>...\u003CE>` speech spans through the context pathway, enabling per-speaker timbre control without entangling identity into the alignment space.\n- **RoPE**: 3D rotary embeddings for video (T + H + W), 1D for audio; **AdaLN-Zero** timestep modulation per block.\n\n## Evaluation\n\n### General Capability on 
VerseBench\n\nNAVA achieves the best AV synchronization (Sync-C \u002F Sync-D \u002F IB) and video quality with the smallest parameter budget.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fverse-bench.png\" alt=\"VerseBench Results\" width=\"100%\">\n\u003C\u002Fp>\n\n### Timbre-Control Speech Performance (SeedTTS-Eval-EN)\n\nAudio-only models are listed as *reference* only — they are dedicated speech systems and not directly comparable. Among joint audio-video models, NAVA delivers speech quality close to dedicated audio-only systems.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fseedtts-eval.png\" alt=\"SeedTTS Evaluation Results\" width=\"100%\">\n\u003C\u002Fp>\n\n### User Study\n\nWe conduct human GSB (Win \u002F Tie \u002F Lose) preference studies on both T2AV and TI2AV against open-source baselines (Ovi-1.1, LTX-2.3, MoVA, daVinci). NAVA achieves competitive **Overall Quality** across all comparisons and wins on **Audio-Visual Alignment** against all baselines.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fgsb_combined.png\" alt=\"User Study GSB Results\" width=\"100%\">\n\u003C\u002Fp>\n\n## Inference\n\n### Input Format (JSONL)\n\nAll inference modes use a unified **JSONL** format (one JSON object per line):\n\n```jsonl\n{\"prompt\": \"一位男子在海边奔跑，镜头跟随。写实电影感，自然光。背景是海浪声和风声。\"}\n{\"prompt\": \"描述文本...\", \"image_path\": \"\u002Fabs\u002Fpath\u002Fto\u002Ffirst_frame.png\"}\n{\"prompt\": \"两人对话\u003CS>Hello\u003CE>\u003CS>Hi there\u003CE>\", \"spk_wavs\": [\"\u002Fpath\u002Fto\u002Fspk1.wav\", \"\u002Fpath\u002Fto\u002Fspk2.wav\"]}\n{\"prompt\": \"...\", \"image_path\": \"\u002Fpath\u002Fto\u002Fimg.png\", \"spk_wavs\": [\"\u002Fpath\u002Fto\u002Fspk.wav\"]}\n```\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `prompt` | Yes | Text caption (also accepts legacy `text` field name) |\n| `image_path` | No | Absolute path to first frame image → auto-enables I2V mode for this sample |\n| 
`spk_wavs` | No | List of absolute paths to speaker reference WAVs (max 2) for timbre control |\n\nA single JSONL file can mix text-only, I2V, and timbre-control entries.\n\n### Batch Inference\n\nEach GPU independently processes a slice of the input JSONL, which is best for many-prompt throughput. The script defaults to `infer_cases\u002Fgeneral\u002Fprompts.jsonl`; override it with environment variables.\n\n```bash\nbash scripts\u002Finference_batch.sh\n\n# Custom paths:\nCKPT=\u002Fpath\u002Fto\u002Fyour.ckpt \\\nDATA_FILE=\u002Fpath\u002Fto\u002Fprompts.jsonl \\\nOUT_DIR=eval_results\u002Fbatch_run1 \\\nbash scripts\u002Finference_batch.sh\n```\n\n### Sequence Parallel (SP=8, Recommended for Single-Sample)\n\nAll 8 GPUs cooperatively process the same sample for faster inference:\n\n```bash\nSETUPTOOLS_USE_DISTUTILS=stdlib torchrun \\\n    --nnodes=1 \\\n    --nproc_per_node=8 \\\n    --master_addr=127.0.0.1 \\\n    --master_port=29507 \\\n    inference_nava.py \\\n    --config configs\u002Fnava.yaml \\\n    --ckpt your_nava_checkpoint.ckpt \\\n    --out_dir .\u002Feval_results_sp \\\n    --data_format json \\\n    --data_file your_data.jsonl \\\n    --width 1280 \\\n    --height 704 \\\n    --frames 37 \\\n    --fps 24 \\\n    --steps 50 \\\n    --save_sample \\\n    --gen_turn 1 \\\n    --use_sp\n```\n\n### T2A (Audio-Only, with optional Timbre Control)\n\nGenerate audio without video using the same NAVA checkpoint. 
Supports both pure sound-design prompts and timbre-controlled speech — the distinction is simply whether `spk_wavs` is present in the JSONL entry.\n\n```jsonl\n{\"prompt\": \"清晨山间，远处溪流潺潺，鸟鸣声此起彼伏。画面中没有人物对白，也没有任何旁白。\"}\n{\"prompt\": \"...\u003CS>Hello, it's great to meet you.\u003CE>...\", \"spk_wavs\": [\"\u002Fpath\u002Fto\u002Fspk.wav\"]}\n{\"prompt\": \"...\u003CS>First speaker line.\u003CE>...\u003CS>Second speaker line.\u003CE>...\", \"spk_wavs\": [\"\u002Fpath\u002Fspk1.wav\", \"\u002Fpath\u002Fspk2.wav\"]}\n```\n\n`spk_wavs[i]` binds to the i-th `\u003CS>...\u003CE>` span in order. Omit `spk_wavs` entirely for pure scene-audio generation.\n\n```bash\nbash scripts\u002Finference_t2a.sh\n\n# Override defaults:\nDURATION=8.0 \\\nDATA_FILE=\u002Fpath\u002Fto\u002Fprompts.jsonl \\\nOUT_DIR=eval_results\u002Fmy_t2a \\\nTIMBRE_SCALE=3.0 \\\nbash scripts\u002Finference_t2a.sh\n```\n\nOutputs land at `$OUT_DIR\u002F{save_name}-0.wav`. Config: `configs\u002Fnava_seedtts.yaml` (`modality: audio`). The `--timbre_cfg` flag is always on — it has no effect when `spk_wavs` is absent.\n\n### SeedTTS Benchmark (Audio-Only)\n\nEvaluate zero-shot speech synthesis on the [SeedTTS test set](https:\u002F\u002Fgithub.com\u002FBytedanceSpeech\u002Fseed-tts-eval). Drop the official testset under `infer_cases\u002Fseedtts\u002F{zh,en}\u002F` (see [`infer_cases\u002Fseedtts\u002FREADME.md`](infer_cases\u002Fseedtts\u002FREADME.md) for the expected layout), then:\n\n```bash\n# Chinese split (default)\nbash scripts\u002Finference_seedtts.sh\n\n# English split\nLANG=en bash scripts\u002Finference_seedtts.sh\n```\n\nEach line of `meta.lst` is `utt_id|prompt_text|prompt_wav|infer_text`; outputs land at `eval_results\u002Fseedtts\u002F{lang}\u002F{utt_id}.wav`. 
Uses `configs\u002Fnava_seedtts.yaml` (audio-only) and runs the same NAVA checkpoint with `--seedtts_mode --timbre_cfg` enabled.\n\n### Gradio Interactive Demo (SP=8)\n\nWeb UI with prompt rewriting, image upload, and speaker reference:\n\n```bash\ncd gradio_demo\nbash start_gradio.sh\n```\n\nOr with custom paths:\n\n```bash\nbash gradio_demo\u002Fstart_gradio.sh \\\n    --config \u002Fpath\u002Fto\u002Fconfig.yaml \\\n    --ckpt \u002Fpath\u002Fto\u002Fcheckpoint.ckpt \\\n    --rewrite_model \u002Fpath\u002Fto\u002FQwen3-4B-Thinking-2507 \\\n    --port 8000 \\\n    --nproc 8 \\\n    --share\n```\n\nDebug mode (no models, UI only):\n```bash\npython gradio_demo\u002Fgradio_server.py --debug --port 8000\n```\n\n### Prompt Engineering (Rewrite)\n\nFor optimal generation quality, **always rewrite your prompt before inference** — especially if the input is in English or short. NAVA is primarily trained on **high-quality Chinese dense captions**; the rewriter expands a brief description into a single-paragraph cinematic prompt with explicit subject \u002F scene \u002F motion timeline \u002F camera language \u002F audio design — the format that activates the model's full potential.\n\nWe ship three rewrite pathways. **Pick by use case:**\n\n| Pathway | Backend | Speed (per prompt) | Best for |\n|---|---|---|---|\n| **A. vLLM batch server** (`pe_src\u002F`) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP, concurrency=32 | \u003C 2 s | Offline batches (10s ~ 10000s of prompts) |\n| **B. Local transformers, single** (`gradio_demo\u002Frewrite_single.py`) | Same model, loaded in-process via `transformers` | 40 ~ 80 s | One-off CLI test, small batches |\n| **C. 
Gradio \"Rewrite\" button** | Same as B, hosted inside the Gradio worker | 40 ~ 80 s | Interactive UI sessions |\n\nAll three share the **same system prompt** (`pe_src\u002Fprompts\u002Frewrite_template.txt` ≡ `gradio_demo\u002Frewrite_single.py:SYSTEM_PROMPT`) and the same sampling profile (temperature 0.3, top_p 0.75, top_k 20, repetition_penalty 1.05), so output style is consistent across paths. **Speech spans wrapped in `\u003CS>...\u003CE>` are preserved verbatim** — the rewriter is instructed to never translate or split them, and `pe_src\u002Frewrite.py` post-checks `\u003CS>\u003CE>` pair counts between input and output.\n\n#### A. Batch rewrite via vLLM server  ★ recommended\n\n**Step 1 — start the vLLM server** (one-time, runs in background, writes `server.log` + `server.pid`):\n\n```bash\ncd pe_src\n\n# Standalone GPU (full speed, ~14 GB):\nbash start_server.sh --gpu 0\n\n# Sharing GPU 0 with the 8-GPU NAVA backbone (~14 GB ceiling, eager mode,\n# backbone sees ~10–15% slowdown):\nbash start_server.sh --gpu 0 --low-footprint\n```\n\nThe launcher polls `http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fmodels` and exits 0 once the server is ready. Stop it any time with `bash stop_server.sh`.\n\n**Step 2 — run batch rewrite**:\n\n```bash\n# Input: one prompt per line (literal \"\\n\" allowed, will be unescaped)\ncat > my_prompts.txt \u003C\u003C'EOF'\nA man surfing a huge wave at sunset, cinematic.\n两个人在咖啡馆对话\u003CS>How are you\u003CE>\u003CS>I'm good, thanks\u003CE>\nEOF\n\npython pe_src\u002Frewrite.py \\\n    --input my_prompts.txt \\\n    --output my_prompts_rewritten.txt \\\n    --concurrency 32\n```\n\nOutputs are line-aligned with the input. Failed rows are written as `[ERROR] ...` instead of crashing the batch — re-run those individually after fixing the underlying issue. 
Use `--format jsonl` to emit `{\"text\": \"...\"}` lines instead of plain text.\n\n**Step 3 — feed into inference**: convert the rewritten text file into the JSONL format expected by `inference_nava.py` (preserving any `image_path` \u002F `spk_wavs` from your original data), then run as in [Quick Start](#quick-start).\n\n> **Tuning knobs** in `pe_src\u002Fconfig.yaml`: `concurrency` (default 32), `temperature` (0.3), `max_tokens` (4096 — bumped to fit the thinking model's chain-of-thought + the rewrite). All overridable via CLI flags `--concurrency` \u002F `--temperature`.\n\n#### B. Single-prompt rewrite via local transformers\n\nFor ad-hoc testing without spinning up a server:\n\n```bash\npython gradio_demo\u002Frewrite_single.py \"A man surfing a huge wave at sunset\"\n\n# Or batch from a file (sequential, slow):\npython gradio_demo\u002Frewrite_single.py \\\n    --input my_prompts.txt \\\n    --output my_prompts_rewritten.txt \\\n    --model pe_src\u002FQwen3-4B-Thinking-2507\n```\n\nLoads the rewriter model into the current process — no server needed, but ~40–80 s per prompt because thinking is sequential. Add `--4bit` to fit on a smaller GPU.\n\n#### C. Click-to-rewrite inside Gradio\n\nThe Gradio demo (`gradio_demo\u002Fstart_gradio.sh`) embeds a **\"Rewrite Prompt\"** button next to the prompt textbox. Clicking it calls the same backend as path B, with the rewriter automatically offloaded to CPU during NAVA inference to free GPU memory. Speech-tag pair counts are validated; mismatches surface a warning in the UI.\n\nBest for interactive iteration; for any batch of more than 5 prompts, switch to path A.\n\n## Training\n\nNAVA supports training from scratch, SFT \u002F fine-tuning from a pretrained checkpoint, and mixed audio-video training. 
Full training documentation is in [`train\u002FREADME.md`](train\u002FREADME.md); below is a quick-start reference.\n\n### Data Format\n\nEach dataset is a JSONL file, one sample per line:\n\n```json\n{\n  \"data_id\": \"unique_id\",\n  \"video_info\": [{\"data_path\": \"\u002Fabs\u002Fpath\u002Fvideo.mp4\", \"fps\": 25.0, \"duration\": 3.0, \"image_width\": 1920, \"image_height\": 1080}],\n  \"text_list\": [{\"text\": \"描述文本，台词用 \u003CS>...\u003CE> 包裹\", \"text_type\": \"caption\", \"speech_start\": [0.0], \"speech_end\": [2.76]}],\n  \"audio_splits_info_tagging\": [{\"audio_duration\": 3.0, \"audio_info\": {\"caption_data\": {}}}]\n}\n```\n\nDatasets are referenced via a `.list` file and sampled according to a `.weight` file that assigns per-dataset weights and training modalities (`text_to_av` \u002F `text_to_audio` \u002F `text_to_video` \u002F `text_to_image`).\n\n### Scripts\n\n| Script | Purpose |\n|--------|---------|\n| `train\u002Ftrain_nava_scarch_mix.sh` | Train with mixed AV + audio-only tasks, warm-started from Wan2.2-5B weights (`configs\u002Fnava_mixtrain.yaml`) |\n| `train\u002Ftrain_nava_sft.sh` | SFT \u002F fine-tune: load weights from an existing checkpoint, reset step and data cursor |\n\n```bash\n# Train from Wan2.2-5B warm start (mixed AV + audio)\nbash train\u002Ftrain_nava_scarch_mix.sh\n\n# Fine-tune from a checkpoint\nbash train\u002Ftrain_nava_sft.sh\n```\n\nBoth scripts auto-generate an FSDP config (`fsdp_config_auto.yaml`) and launch via `accelerate launch` with `FULL_SHARD` bf16 on 8 GPUs. The `train_nava_scarch_mix.sh` script warm-starts from `Wan_5B.ckpt` (weights only, step counter reset) via `--load_ckpt_only` — download it from [Wan-AI\u002FWan2.2-TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B) and place it in the project root before running.\n\n### Resume\n\nCheckpoints are saved every `save_every` steps (default 2500) at `{out_dir}\u002Fstep{N}.ckpt`. 
They store model weights, EMA weights, the global step counter, and per-worker data cursors for exact resume.\n\n```bash\n# Full resume (weights + step + data position)\naccelerate launch --config_file fsdp_config_auto.yaml \\\n    train_nava.py --config configs\u002Fnava.yaml \\\n    --resume outputs\u002Fyour_run\u002Fstep5000.ckpt\n\n# Weights only — reset step to 0 (for fine-tuning)\naccelerate launch --config_file fsdp_config_auto.yaml \\\n    train_nava.py --config configs\u002Fnava.yaml \\\n    --resume NAVA.ckpt --load_ckpt_only\n```\n\nBoth `.ckpt` and `.safetensors` checkpoints are supported. When a `.ckpt` path is given but not found, the loader automatically falls back to the matching `.safetensors` file. Safetensors files contain weights only and always behave like `--load_ckpt_only`.\n\n### Key Hyperparameters\n\nAll hyperparameters are controlled via the YAML config — no CLI overrides. Edit or copy `configs\u002Fnava.yaml` \u002F `configs\u002Fnava_mixtrain.yaml` to change settings.\n\n| Hyperparameter | YAML key | Default |\n|----------------|----------|---------|\n| Learning rate | `lr` | `1e-4` |\n| Batch size (per GPU) | `batch_size` | — |\n| Gradient accumulation | `grad_accum_steps` | `1` (`4` for mixed training) |\n| Max steps | `max_steps` | — |\n| Save interval | `save_every` | `2500` |\n| Output dir | `out_dir` | — |\n| Target frames | `data.video_tgt_frames` | `121` (4N+1) |\n| Video FPS | `data.video_fps` | `24` |\n| Max audio duration | `data.max_audio_duration` | `10.0` |\n| Length bucketing | `data.use_length_buckets` | `false` |\n| Audio loss weight | `audio_loss_coff` | `0.2` |\n| Video loss weight | `vision_loss_coff` | `1.0` |\n\nSee [`train\u002FREADME.md`](train\u002FREADME.md) for the full reference including async dataloader tuning (`io_workers`, `queue_size`) and multi-node setup.\n\n## Configuration\n\nThe audio-only paths (T2A, SeedTTS) use `configs\u002Fnava_seedtts.yaml`; audio-video inference uses a single config, `configs\u002Fnava.yaml`, shared by the corresponding scripts (`scripts\u002Finference.sh`, 
`scripts\u002Finference_timbre.sh`, `gradio_demo\u002Fstart_gradio.sh`).\n\n### Key Config Options\n\n```yaml\nmodality: audio_video          # audio_video \u002F audio \u002F video\npipeline: nava_src.pipeline_nava.AudioVideoPipeline\nuse_bf16: true\nscheduler_unipc: true          # UniPC multi-step scheduler (faster)\nuse_mmdit_model: true          # Use unified MMDiT (vs older FusionModel)\nalign_3d_cfg: true             # 3D cross-modal CFG for AV alignment\n\n# Guidance scales\nvideo_guidance_scale: 3.0      # Video CFG strength\naudio_guidance_scale: 2.0      # Audio CFG strength\nvideo_align_guidance_scale: 3.0  # Video cross-modal alignment\naudio_align_guidance_scale: 2.0  # Audio cross-modal alignment\n\n# Timbre CFG (used together with --timbre_cfg + spk_wavs in JSONL)\ntimbre_cfg: true                 # Master switch (CLI --timbre_cfg overrides)\ntimbre_align_guidance_scale: 3.0 # Strength of speaker-reference steering;\n                                 # ↑ tighter timbre match, ↓ more model freedom\n\n# Model architecture\nmodel:\n  joint_config: nava_src\u002Fmodels\u002Fnava\u002Fconfigs\u002Fmodel\u002Fdit\u002FNAVA_6B.json\n  ckpt_dir: .\u002F                 # Wan2.2-TI2V-5B weights directory\n  # audio_vae_ckpt_dir: \u002Fpath\u002Fto\u002Faudio_vae\u002Fparams   # optional override\n\n# Data\ndata:\n  audio_tokens_per_sec: 25\n  video_fps: 24\n  add_spk_emb: true            # Enable speaker embeddings\n  spk_emb_prob: 0.9            # Speaker embedding injection probability\n```\n\n## Model Weights\n\nThe single `huggingface-cli download` in [Quick Start](#quick-start) pulls everything below — listed here for reference and licensing transparency.\n\n| Path | Description |\n|---|---|\n| `NAVA.ckpt` | 24 GB — NAVA model weights |\n| `nava.yaml` | Inference config (drop-in replacement for `configs\u002Fnava.yaml`) |\n| `config.json` | Model architecture config |\n| `example_prompts.jsonl` | Example JSONL prompts covering T2AV, T2A, timbre control, 
and I2AV |\n| `Wan2.2-TI2V-5B\u002FWan2.2_VAE.pth` | 2.7 GB — mirrored from [Wan-AI\u002FWan2.2-TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B) |\n| `Wan2.2-TI2V-5B\u002Fmodels_t5_umt5-xxl-enc-bf16.pth` | 11 GB — mirrored from Wan-AI\u002FWan2.2-TI2V-5B |\n| `Wan2.2-TI2V-5B\u002Fgoogle\u002Fumt5-xxl\u002F{spiece.model,tokenizer.json}` | 21 MB — T5 tokenizer |\n| `params\u002FLTX2\u002Fltx-2.3-22b-dev_audio_vae.safetensors` | 348 MB — mirrored from [Lightricks\u002FLTX-Video](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-Video) (LTX-2 Community License — see `params\u002FLTX2\u002FLICENSE`) |\n\nThe LTX audio-VAE Python code is vendored under `nava_src\u002Fvendor\u002Fltx_core\u002F` (see its `NOTICE.md` and `LICENSE`), so no separate clone of the LTX repo is needed. The ReDimNet speaker embedder is fetched automatically via `torch.hub` on first run.\n\n## License\n\nThe source code in this repository is released under the Apache License 2.0.\n\nModel weights, pretrained backbones, tokenizers, audio VAEs, speaker encoders, and prompt-rewriting models may be subject to different licenses from their original providers. This includes, but is not limited to, Wan2.2, LTX-Video, Qwen3, and ReDimNet. 
Users are responsible for complying with the corresponding licenses of all third-party components.\n\n## Citation\n\nIf you find NAVA useful in your research, please cite:\n\n```bibtex\n@misc{ji2026nava,\n      title         = {Native Audio-Visual Alignment for Generation},\n      author        = {Longbin Ji and Guan Wang and Xuan Wei and Chenye Yang and Xiangrui Liu and Zhenyu Zhang and Shuohuan Wang and Yu Sun and Jingzhou He},\n      year          = {2026},\n      eprint        = {2605.30073},\n      archivePrefix = {arXiv},\n      primaryClass  = {cs.CV},\n      url           = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30073},\n}\n```\n\n## Acknowledgements\n\nWe would like to thank the contributors to [Wan2.2-TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B), [LTX-Video](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-Video), [ReDimNet](https:\u002F\u002Fgithub.com\u002FIDRnD\u002FReDimNet), [Qwen3](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B-Thinking-2507), and [Ovi](https:\u002F\u002Fgithub.com\u002Fcharacter-ai\u002FOvi) for their great open-source work, which has been very helpful to this project.\n\n## Contact\n\nFor questions, issues, or collaborations, please contact [Longbin Ji](mailto:robingg1100@gmail.com) and [Guan Wang](mailto:guanw.pku@gmail.com).\n\n## NAVA Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=ernie-research\u002FNAVA&type=Date)](https:\u002F\u002Fstar-history.com\u002F#ernie-research\u002FNAVA&Date)\n","NAVA is a framework for generating synchronized audio and video; it performs joint generation through context-conditioned native audio-visual alignment. The project uses an Align-then-Fuse MMDiT architecture that progressively bridges modality-aware alignment and unified audio-video denoising. NAVA also introduces Timbre-in-Context Conditioning, which binds reference timbre cues to the corresponding speech spans and improves controllability. With only 6.3B parameters, the framework achieves excellent audio-visual synchronization and video quality, and it supports multi-timbre voice control. It suits scenarios that need high-quality, synchronized audio-video generation, such as virtual character production and game development. It provides a complete end-to-end solution, including an inference pipeline, an interactive demo, and training code, so users can get started quickly and build on it.",2,"2026-06-11 04:07:53","CREATED_QUERY"]