[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74203":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":8,"pushedAt":8,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":28,"discoverSource":29},74203,"void-model","Netflix\u002Fvoid-model","Netflix",null,"Python",1891,178,23,2,0,8,18,89,24,19.76,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:23","\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fvoid-logo-web.png\" width=\"195\" \u002F>\n\u003C\u002Fdiv>\n\n# VOID: Video Object and Interaction Deletion\n\n\u003Cdiv style=\"line-height: 1;\">\n  \u003Ca href=\"https:\u002F\u002Fvoid-model.github.io\u002F\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Website\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-VOID-4285F4\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.02296\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"arXiv\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-VOID-FBBC06\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n  \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fsam-motamed\u002FVOID\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Data\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗Gradio-Demo-AB8165\" style=\"display: inline-block; vertical-align: 
middle;\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fnetflix\u002Fvoid-model\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Models\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Models-orange\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2604.02296\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Models\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Paper-yellow\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fnetflix\u002Fvoid-model\u002Fblob\u002Fmain\u002Fnotebook.ipynb\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Open in Colab\" src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\u003Ch4>\n\n[Saman Motamed](https:\u002F\u002Fsam-motamed.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup>,\n[William Harvey](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=kDd7nBkAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup>,\n[Benjamin Klein](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=xkX9W9QAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup>,\n[Luc Van Gool](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=TwMib_QAAAAJ&hl=en)\u003Csup>2\u003C\u002Fsup>,\n[Zhuoning Yuan](https:\u002F\u002Fzhuoning.cc\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Ta-Ying Cheng](https:\u002F\u002Fttchengab.github.io\u002F)\u003Csup>1\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup>Netflix &nbsp;&nbsp; \u003Csup>2\u003C\u002Fsup>INSAIT, Sofia University \"St. 
Kliment Ohridski\"\n\n\u003C\u002Fh4>\n\n\u003Chr>\n\nVOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of [CogVideoX](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo) and fine-tuned for video inpainting with interaction-aware mask conditioning.\n\n> **Example:** If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fad174ca0-2feb-45f9-9405-83167037d9be\" width=\"100%\" controls autoplay loop muted>\u003C\u002Fvideo>\n\n---\n\n## TODO 📋\n\n- [ ] 🤗 Diffusers pipeline support \n\n---\n\n## 🤖 Models\n\nVOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.\n\n| Model | Description | HuggingFace |\n|-------|-------------|-------------|\n| **VOID Pass 1** | Base inpainting model | [Download](https:\u002F\u002Fhuggingface.co\u002Fnetflix\u002Fvoid-model\u002Fblob\u002Fmain\u002Fvoid_pass1.safetensors) |\n| **VOID Pass 2** | Warped-noise refinement model | [Download](https:\u002F\u002Fhuggingface.co\u002Fnetflix\u002Fvoid-model\u002Fblob\u002Fmain\u002Fvoid_pass2.safetensors) |\n\nPlace checkpoints anywhere and pass the path via `--config.video_model.transformer_path` (Pass 1) or `--model_checkpoint` (Pass 2).\n\n---\n\n## ▶️ Quick Start\n\nThe fastest way to try VOID is the included notebook — it handles setup, downloads the models, runs inference on a sample video, and displays the result:\n\n[![Open in Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fnetflix\u002Fvoid-model\u002Fblob\u002Fmain\u002Fnotebook.ipynb)\n\n> **Note:** 
Requires a GPU with 40GB+ VRAM (e.g., A100).\n\nFor more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.\n\n---\n\n## ⚙️ Setup\n\n```bash\npip install -r requirements.txt\n```\n\nStage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:\n\n```bash\nexport GEMINI_API_KEY=your_key_here\n```\n\nAlso install [SAM2+3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam2?tab=readme-ov-file#installation) separately (required for mask generation):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam2.git\ncd sam2 && pip install -e .\n\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam3.git\ncd sam3 && pip install -e .\n```\n\nDownload the pretrained base inpainting model from HuggingFace:\n\n```bash\nhf download alibaba-pai\u002FCogVideoX-Fun-V1.5-5b-InP \\\n    --local-dir .\u002FCogVideoX-Fun-V1.5-5b-InP\n```\n\nThe inference and training scripts expect it at `.\u002FCogVideoX-Fun-V1.5-5b-InP` relative to the repo root by default.\n\nIf `ffmpeg` is not available on your system, you can use the binary bundled with `imageio-ffmpeg`:\n\n```bash\nln -sf $(python -c \"import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())\") ~\u002F.local\u002Fbin\u002Fffmpeg\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>📁 Expected directory structure\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nAfter cloning the repo and downloading all assets, your directory should look like this:\n\n```\nVOID\u002F\n├── config\u002F\n├── datasets\u002F\n│   └── void_train_data.json\n├── inference\u002F\n├── sample\u002F                         # included sample sequences for inference\n├── scripts\u002F\n├── videox_fun\u002F\n├── VLM-MASK-REASONER\u002F\n├── README.md\n├── requirements.txt\n│\n├── CogVideoX-Fun-V1.5-5b-InP\u002F     # hf download alibaba-pai\u002FCogVideoX-Fun-V1.5-5b-InP\n├── void_pass1.safetensors          # download from 
huggingface.co\u002Fnetflix\u002Fvoid-model (see Models above)\n├── void_pass2.safetensors          # download from huggingface.co\u002Fnetflix\u002Fvoid-model (see Models above)\n├── training_data\u002F                  # generated via data_generation\u002F pipeline (see Training section)\n└── data_generation\u002F                # data generation code (HUMOTO + Kubric pipelines)\n```\n\n\u003C\u002Fdetails>\n\n---\n\n## 📂 Input Format\n\nEach video sequence lives in its own folder under a root data directory:\n\n```\ndata_rootdir\u002F\n└── my-video\u002F\n    ├── input_video.mp4      # source video\n    ├── quadmask_0.mp4       # quadmask (4-value mask video, see below)\n    └── prompt.json          # {\"bg\": \"background description\"}\n```\n\nThe `prompt.json` contains a single `\"bg\"` key describing the scene **after** the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.\n\n```json\n{ \"bg\": \"A table with a cup on it.\" }         \u002F\u002F ✅ describes the clean background\n{ \"bg\": \"A person being removed from scene.\" } \u002F\u002F ❌ don't describe the removal\n```\n\nA few examples from the included samples:\n\n| Sequence | Removed object | `bg` prompt |\n|----------|---------------|-------------|\n| `lime` | the glass | `\"A lime falls on the table.\"` |\n| `moving_ball` | the rubber duckie | `\"A ball rolls off the table.\"` |\n| `pillow` | the kettlebell being placed on the pillow | `\"Two pillows are on the table.\"` |\n\nThe quadmask encodes four semantic regions per pixel:\n\n| Value | Meaning |\n|-------|---------|\n| `0`   | Primary object to remove |\n| `63`  | Overlap of primary + affected regions |\n| `127` | Affected region (interactions: falling objects, displaced items, etc.) 
|\n| `255` | Background (keep) |\n\n---\n\n## 🚀 Pipeline\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🎭 Stage 1 — Generate Masks\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nThe `VLM-MASK-REASONER\u002F` pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.\n\n### 🖱️ Step 0 — Select points (GUI)\n\n```bash\npython VLM-MASK-REASONER\u002Fpoint_selector_gui.py\n```\n\nLoad a JSON config listing your videos and instructions, then click on the objects to remove. Saves a `*_points.json` with the selected points.\n\nConfig format:\n```json\n{\n  \"videos\": [\n    {\n      \"video_path\": \"path\u002Fto\u002Fvideo.mp4\",\n      \"output_dir\": \"path\u002Fto\u002Foutput\u002Ffolder\",\n      \"instruction\": \"remove the person\"\n    }\n  ]\n}\n```\n\n### ⚡ Steps 1–4 — Run the full pipeline\n\nAfter saving the points config, run all remaining stages automatically:\n\n```bash\nbash VLM-MASK-REASONER\u002Frun_pipeline.sh my_config_points.json\n```\n\nOptional flags:\n```bash\nbash VLM-MASK-REASONER\u002Frun_pipeline.sh my_config_points.json \\\n    --sam2-checkpoint path\u002Fto\u002Fsam2_hiera_large.pt \\\n    --device cuda\n```\n\nThis runs the following stages in order:\n\n| Stage | Script | Output |\n|-------|--------|--------|\n| 1 — SAM2 segmentation | `stage1_sam2_segmentation.py` | `black_mask.mp4` |\n| 2 — VLM analysis | `stage2_vlm_analysis.py` | `vlm_analysis.json` |\n| 3 — Grey mask generation | `stage3a_generate_grey_masks_v2.py` | `grey_mask.mp4` |\n| 4 — Combine into quadmask | `stage4_combine_masks.py` | `quadmask_0.mp4` |\n\nThe final `quadmask_0.mp4` in each video's `output_dir` is ready to use for inference.\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🎬 Stage 2 — Inference\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nVOID inference runs in two passes. 
Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.\n\n### ✨ Pass 1 — Base inference\n\n```bash\npython inference\u002Fcogvideox_fun\u002Fpredict_v2v.py \\\n    --config config\u002Fquadmask_cogvideox.py \\\n    --config.data.data_rootdir=\"path\u002Fto\u002Fdata_rootdir\" \\\n    --config.experiment.run_seqs=\"my-video\" \\\n    --config.experiment.save_path=\"path\u002Fto\u002Foutput\" \\\n    --config.video_model.model_name=\"path\u002Fto\u002FCogVideoX-Fun-V1.5-5b-InP\" \\\n    --config.video_model.transformer_path=\"path\u002Fto\u002Fvoid_pass1.safetensors\"\n```\n\nTo run multiple sequences at once, pass a comma-separated list:\n```bash\n--config.experiment.run_seqs=\"video1,video2,video3\"\n```\n\nKey config options:\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--config.data.sample_size` | `384x672` | Output resolution (HxW) |\n| `--config.data.max_video_length` | `197` | Max frames to process |\n| `--config.video_model.temporal_window_size` | `85` | Temporal window for multidiffusion |\n| `--config.video_model.num_inference_steps` | `50` | Denoising steps |\n| `--config.video_model.guidance_scale` | `1.0` | Classifier-free guidance scale |\n| `--config.system.gpu_memory_mode` | `model_cpu_offload_and_qfloat8` | Memory mode (`model_full_load`, `model_cpu_offload`, `sequential_cpu_offload`) |\n\nThe output is saved as `\u003Csave_path>\u002F\u003Csequence_name>.mp4`, along with a `*_tuple.mp4` side-by-side comparison.\n\n### 🔁 Pass 2 — Warped noise refinement\n\nUses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.\n\n**Single video:**\n```bash\npython inference\u002Fcogvideox_fun\u002Finference_with_pass1_warped_noise.py \\\n    --video_name my-video \\\n    --data_rootdir path\u002Fto\u002Fdata_rootdir \\\n    --pass1_dir path\u002Fto\u002Fpass1_outputs \\\n    --output_dir 
path\u002Fto\u002Fpass2_outputs \\\n    --model_checkpoint path\u002Fto\u002Fvoid_pass2.safetensors \\\n    --model_name path\u002Fto\u002FCogVideoX-Fun-V1.5-5b-InP\n```\n\n**Batch:** Edit the video list and paths in `inference\u002Fpass_2_refine.sh`, then run:\n\n```bash\nbash inference\u002Fpass_2_refine.sh\n```\n\nKey arguments:\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `--pass1_dir` | — | Directory containing Pass 1 output videos |\n| `--output_dir` | `.\u002Finference_with_warped_noise` | Where to save Pass 2 results |\n| `--warped_noise_cache_dir` | `.\u002Fpass1_warped_noise_cache` | Cache for precomputed warped latents |\n| `--temporal_window_size` | `85` | Temporal window size |\n| `--height` \u002F `--width` | `384` \u002F `672` | Output resolution |\n| `--guidance_scale` | `6.0` | CFG scale |\n| `--num_inference_steps` | `50` | Denoising steps |\n| `--use_quadmask` | `True` | Use quadmask conditioning |\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>✏️ Stage 3 — Manual Mask Refinement \u003Cem>(Optional)\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nIf the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.\n\n```bash\npython VLM-MASK-REASONER\u002Fedit_quadmask.py\n```\n\nOpen a sequence folder containing `input_video.mp4` (or `rgb_full.mp4`) and `quadmask_0.mp4`. 
The editor shows the original video and the editable mask side by side.\n\n**Tools:**\n- **Grid Toggle** — click a grid cell to toggle the interaction region (`127` ↔ `255`)\n- **Grid Black Toggle** — click a grid cell to toggle the primary object region (`0` ↔ `255`)\n- **Brush (Add \u002F Erase)** — freehand paint or erase mask regions at pixel level\n- **Copy from Previous Frame** — propagate the black or grey mask from the previous frame\n\n**Keyboard shortcuts:** `←` \u002F `→` navigate frames, `Ctrl+Z` \u002F `Ctrl+Y` undo\u002Fredo.\n\nSave overwrites `quadmask_0.mp4` in place. Rerun inference from Pass 1 after saving.\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>🏋️ Training\u003C\u002Fstrong>\u003C\u002Fsummary>\n  \n### Training Data Generation\n\nDue to licensing constraints on the underlying datasets, we release the **data generation code** instead of the pre-built training data. The code produces paired counterfactual videos (with\u002Fwithout object, plus quad-masks) from two sources:\n\n#### Source 1: HUMOTO (Human-Object Interaction)\n\nGenerates counterfactual videos from the [HUMOTO](https:\u002F\u002Fgithub.com\u002Fadobe-research\u002Fhumoto) motion capture dataset using Blender. A human (Remy\u002FSophie character) interacts with objects; removing the human causes objects to fall via physics simulation.\n\n**Prerequisites:**\n1. **HUMOTO dataset** — Request access from the authors at [adobe-research\u002Fhumoto](https:\u002F\u002Fgithub.com\u002Fadobe-research\u002Fhumoto). Once approved, download and place under `data_generation\u002Fhumoto_release\u002F`\n2. **Blender** — Install [Blender](https:\u002F\u002Fwww.blender.org\u002Fdownload\u002F) (tested with 3.x and 4.x). Also install `opencv-python-headless` in Blender's Python (see `data_generation\u002FREADME.md`)\n3. **Remy & Sophie characters** — Download from [Mixamo](https:\u002F\u002Fwww.mixamo.com\u002F) (free Adobe account). 
Search for \"Remy\" and \"Sophie\", download each as FBX, and place at:\n   ```\n   data_generation\u002Fhuman_model\u002FRemy_mixamo_bone.fbx\n   data_generation\u002Fhuman_model\u002FSophie_mixamo_bone.fbx\n   ```\n4. **PBR textures** (optional) — Download texture packs from [ambientCG](https:\u002F\u002Fambientcg.com\u002F) or [Poly Haven](https:\u002F\u002Fpolyhaven.com\u002F). Without textures, objects render with realistic solid colors as fallback\n\n**Expected directory structure after setup:**\n```\ndata_generation\u002F\n├── humoto_release\u002F\n│   ├── humoto_0805\u002F                    # HUMOTO sequences (.pkl, .fbx, .yaml per sequence)\n│   └── humoto_objects_0805\u002F            # Object meshes (.obj, .fbx per object)\n├── human_model\u002F\n│   ├── Remy_mixamo_bone.fbx            # ← download from Mixamo\n│   ├── Sophie_mixamo_bone.fbx          # ← download from Mixamo\n│   ├── bone_names.py                   # included\n│   └── *.json                          # included (bone structure definitions)\n├── textures\u002F                           # ← optional, user-provided PBR textures\n├── physics_config.json                 # included (manual per-sequence physics settings)\n├── render_paired_videos_blender_quadmask.py   # main renderer\n├── convert_split_remy_sophie.sh               # character conversion script\n└── ...\n```\n\n**Pipeline:**\n```bash\ncd data_generation\n\n# 1. Convert HUMOTO sequences to Remy\u002FSophie characters\nbash convert_split_remy_sophie.sh\n\n# 2. 
Render paired videos (with human, without human, quad-mask)\nblender --background --python render_paired_videos_blender_quadmask.py -- \\\n    -d .\u002Fhumoto_release\u002Fhumoto_0805 \\\n    -o .\u002Foutput \\\n    -s \u003Csequence_name> \\\n    -m .\u002Fhumoto_release\u002Fhumoto_objects_0805 \\\n    --use_characters --enable_physics --add_walls \\\n    --target_frames 60 --fps 12\n```\n\nA pre-configured `physics_config.json` is included specifying which objects are static vs. dynamic per sequence. See `data_generation\u002FREADME.md` for full details.\n\n\n#### Source 2: Kubric (Object-Only Interaction)\n\nGenerates counterfactual videos using [Kubric](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fkubric) with Google Scanned Objects. Objects are launched at a target; removing them alters the target's physics trajectory. No external dataset download required — assets are fetched from Google Cloud Storage.\n\n```bash\ncd data_generation\npip install kubric pybullet imageio imageio-ffmpeg\n\npython kubric_variable_objects.py --num_pairs 200 --resolution 384\n```\n\n#### Training Data Format\n\nBoth pipelines output the same format expected by the training scripts:\n\n```\ntraining_data\u002F\n└── sequence_name\u002F\n    ├── rgb_full.mp4       # input video (with object)\n    ├── rgb_removed.mp4    # target video (object removed, physics applied)\n    ├── mask.mp4           # quad-mask (0\u002F63\u002F127\u002F255)\n    └── metadata.json\n```\n\nPoint the training scripts at your generated data by updating `datasets\u002Fvoid_train_data.json`.\n\n---\n\n### Running Training\n\nTraining proceeds in two stages. Pass 1 is trained first, then Pass 2 fine-tunes from that checkpoint.\n\n#### Pass 1 — Base inpainting model\n\nDoes not require warped noise. 
Trains the model to remove objects and their interactions from scratch.\n\n```bash\nbash scripts\u002Fcogvideox_fun\u002Ftrain_void.sh\n```\n\nKey arguments:\n\n| Argument | Description |\n|----------|-------------|\n| `--pretrained_model_name_or_path` | Path to base CogVideoX inpainting model |\n| `--transformer_path` | Optional starting checkpoint |\n| `--train_data_meta` | Path to dataset metadata JSON |\n| `--train_mode=\"void\"` | Enables void inpainting training mode |\n| `--use_quadmask` | Trains with 4-value quadmask conditioning |\n| `--use_vae_mask` | Encodes mask through VAE |\n| `--output_dir` | Where to save checkpoints |\n| `--num_train_epochs` | Number of epochs |\n| `--checkpointing_steps` | Save a checkpoint every N steps |\n| `--learning_rate` | Default `1e-5` |\n\n#### Pass 2 — Warped noise refinement model\n\nContinues training from a Pass 1 checkpoint with optical flow-warped latent initialization, improving temporal consistency on longer videos. Requires warped noise for training data to be present.\n\n```bash\nbash scripts\u002Fcogvideox_fun\u002Ftrain_void_warped_noise.sh\n```\n\nSet `TRANSFORMER_PATH` to your Pass 1 checkpoint before running:\n\n```bash\nTRANSFORMER_PATH=path\u002Fto\u002Fpass1_checkpoint.safetensors bash scripts\u002Fcogvideox_fun\u002Ftrain_void_warped_noise.sh\n```\n\nAdditional arguments specific to this stage:\n\n| Argument | Description |\n|----------|-------------|\n| `--use_warped_noise` | Enables warped latent initialization during training |\n| `--warped_noise_degradation` | Noise blending factor (default `0.3`) |\n| `--warped_noise_probability` | Fraction of steps using warped noise (default `1.0`) |\n\nTraining was run on **8× A100 80GB GPUs** using DeepSpeed ZeRO stage 2.\n\n\u003C\u002Fdetails>\n\n---\n\n## 🤩 Community Adoption\n\nWe are excited to see the community build on VOID!  
\nBelow we showcase selected demos, tools, and extensions.\n\nIf you’ve built something using VOID, feel free to submit a PR to add it here.\n\n### 🌐 Demos & Projects\n\n- ⭐ **Gradio Demo** — @sam-motamed  \n  Interactive demo for trying VOID in the browser:  \n  👉 https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fsam-motamed\u002FVOID\n\n## 🙏 Acknowledgements\n\nThis implementation builds on code and models from [aigc-apps\u002FVideoX-Fun](https:\u002F\u002Fgithub.com\u002Faigc-apps\u002FVideoX-Fun\u002Ftree\u002Fmain), [Gen-Omnimatte](https:\u002F\u002Fgithub.com\u002Fgen-omnimatte\u002Fgen-omnimatte-public\u002Ftree\u002Fmain), [Go-with-the-Flow](https:\u002F\u002Fgithub.com\u002FEyeline-Labs\u002FGo-with-the-Flow), [Kubric](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fkubric) and [HUMOTO](https:\u002F\u002Fjiaxin-lu.github.io\u002Fhumoto\u002F). We thank the authors for sharing the code and pretrained inpainting models for CogVideoX, Gen-Omnimatte, and the optical flow warping utilities.\n\n---\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=netflix%2Fvoid-model&type=date&legend=bottom-right\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=netflix\u002Fvoid-model&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=netflix\u002Fvoid-model&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=netflix\u002Fvoid-model&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n---\n## 📄 Citation\n\nIf you find our work useful, please consider citing:\n\n🔗 https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.02296\n\n```bibtex\n@misc{motamed2026void,\n  title={VOID: Video Object and Interaction Deletion},\n  
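author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},\n  year={2026},\n  eprint={2604.02296},\n  archivePrefix={arXiv},\n  primaryClass={cs.CV},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.02296}\n}\n```\n","The VOID project focuses on removing objects from videos along with all the interactions they induce in the scene, including not only secondary effects such as shadows and reflections but also physical interactions, such as objects falling naturally after a person is removed. The project is built on CogVideoX, fine-tuned for video inpainting, written in Python, and implements its core functionality through interaction-aware mask conditioning. It suits scenarios that require fine-grained video editing or removal of specific elements without breaking overall coherence, such as film and TV post-production, content moderation, and privacy protection.","2026-06-11 03:49:30","high_star"]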