[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11492":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},11492,"CHAI","chancharikmitra\u002FCHAI","chancharikmitra","Official Codebase for CVPR 2026 Highlight Paper: \"Building a Precise Video Language with Human–AI Oversight\"",null,"Python",133,19,16,2,0,1,8,3,44.2,false,"main",true,[],"2026-06-12 04:00:55","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fbanner.png\" alt=\"CHAI Banner\" width=\"100%\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\" alt=\"CHAI Logo\" width=\"700\">\n\u003C\u002Fp>\n\n**Official Codebase for CVPR 2026 Highlight Paper:**\n*\"Building a Precise Video Language with Human–AI Oversight\"*\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.21718\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2604.21718-b31b1b.svg\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Flinzhiqiu.github.io\u002Fpapers\u002Fchai\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue\" alt=\"Project Page\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fchancharikm\u002FCHAI_testset\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Dataset-CHAI_Testset-FFD21E\" 
alt=\"Dataset\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fchancharikm\u002FCHAI_SFT_model_8b\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Model-CHAI_SFT_8B-FFD21E\" alt=\"Model\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2604.21718\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Paper-Daily_Papers-FFD21E\" alt=\"HF Paper\">\u003C\u002Fa>\n\u003C\u002Fp>\n\u003C!-- \n> **🚧 More coming soon!** We are actively releasing evaluation code, additional models, and videos. Stay tuned. -->\n\n[Zhiqiu Lin](https:\u002F\u002Flinzhiqiu.github.io\u002F)¹,\n[Chancharik Mitra](https:\u002F\u002Fchancharikmitra.github.io\u002F)¹,\n[Siyuan Cen](https:\u002F\u002Fsy77777en.github.io\u002F)¹,\n[Isaac Li](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fisaac-li-bb381b284\u002F)¹,\nYuhan Huang¹,\n[Yu Tong Tiffany Ling](https:\u002F\u002Fwww.yttldesign.com\u002F)¹,\n[Hewei Wang](https:\u002F\u002Fgithub.com\u002FWangHewei16)¹,\nIrene Pi¹,\nShihang Zhu¹,\nRyan Rao¹,\nGeorge Liu¹,\nJiaxi Li¹,\nRuojin Li¹,\nYili Han¹,\n[Yilun Du](https:\u002F\u002Fyilundu.github.io\u002F)²,\n[Deva Ramanan](https:\u002F\u002Fwww.cs.cmu.edu\u002F~deva\u002F)¹\n\n¹Carnegie Mellon University &nbsp; ²Harvard University\n\n### Updates\n- **2026-05-13**: Released evaluation code, updated test set, and published the [CHAI SFT 8B model](https:\u002F\u002Fhuggingface.co\u002Fchancharikm\u002FCHAI_SFT_model_8b).\n\n# CHAI (Critique-Based Human-AI Oversight)\n---\n\n## Overview\n\nVideo–language models learn to reason about dynamic scenes through natural language, yet producing precise video captions remains challenging. **CHAI (Critique-based Human–AI)** is an oversight framework that pairs trained human experts with model-generated pre-captions: experts provide correctional critiques that guide revisions into improved post-captions. 
This division of labor offloads text generation to models so that humans can focus on verification, improving both accuracy and efficiency.\n\nWe release open datasets, benchmarks, and training recipes built on a structured captioning specification covering **subjects, scenes, motion, spatial layout, and camera dynamics**—grounded in hundreds of visual primitives developed with professional filmmakers. The resulting critiques and preferences provide rich supervision for improving open-source VLMs (Qwen3-VL) through SFT, DPO, and inference-time scaling on three tasks: caption generation, reward modeling, and critique generation.\n\n\u003C!-- TODO: Add teaser figure -->\n\u003C!-- ![CHAI Overview](assets\u002Fteaser.png) -->\n\n---\n\n## Getting Started\n\n### Prerequisites\n\n- Python 3.10+\n- Conda (recommended)\n- GPU(s) for model inference (tested on NVIDIA A6000)\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fchancharikmitra\u002FCHAI.git\ncd CHAI\n\n# Create and activate conda environment\nconda create -n chai python=3.10 -y\nconda activate chai\n\n# Install conda dependencies\nconda install -c conda-forge ffmpeg=6.1.2 -y\n\n# Install the package\npip install --no-build-isolation -e .\n```\n\n> **Note:** The `--no-build-isolation` flag is required because the `t2v_metrics` dependency uses a legacy `setup.py` without explicit `setuptools` declarations.\n\n### Environment Variables (Optional)\n\nIf you want to use the LLM judge for generation evaluation, create a `.env` file in the project root:\n\n```\nOPENAI_API_KEY=your-openai-key\n```\n\nThis is not required for the default evaluation pipeline, which uses BLEU-4 and ROUGE-L.\n\n### Download Evaluation Data and Videos\n\nThe evaluation data and videos are hosted on HuggingFace. 
See [Evaluation Data](#evaluation-data) below for details on what each file contains.\n\n```bash\n# Download the full dataset (videos + evaluation JSONs)\nhf download chancharikm\u002FCHAI_testset --repo-type dataset --local-dir .\u002Feval_data\n```\n\nThis populates `eval_data\u002F` with the test split, task-specific evaluation files, and all corresponding videos.\n\n---\n\n## Evaluation Data\n\nAll evaluation files live under `eval_data\u002F`. The raw test split and three task-specific reformatted versions are provided. The corresponding videos and copies of all test split files are hosted on HuggingFace at [chancharikm\u002FCHAI_testset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fchancharikm\u002FCHAI_testset).\n\n### `test_split.json`\nThe raw evaluation data. Each entry contains a video path, the model-generated **pre-caption**, a human-written **critique**, the revised **final caption** (post-caption), a **pre-caption score** (1–5), the **caption type** (e.g., Subject, Scene, Motion, Spatial, Camera), and associated metadata. This file serves as the source from which all task-specific evaluation sets below are derived.\n\n### `eval_caption_generation_test.json`\nFormatted for the **caption generation** task. Each sample pairs a video with a task instruction as the user turn and the final (post) caption as the target assistant response. Used to evaluate a model's ability to directly produce high-quality captions from video.\n\n### `eval_critique_generation_test.json`\nFormatted for the **critique generation** task. Each sample provides a video, a task instruction, and a caption to critique as the user turn. The target assistant response is a critique. 
For pre-captions scoring below 5, two training pairs are generated: one pairing the pre-caption with its human critique, and one pairing the final caption with a \"perfect caption\" sentinel critique, teaching the model to both identify errors and recognize when a caption needs no revision.\n\n### `eval_caption_yes_or_no_test.json`\nFormatted for the **reward modeling** (binary alignment scoring) task. Given a video, a task instruction, and a candidate caption, the model must judge whether the caption aligns with the video by responding **\"Yes\"** or **\"No\"**. For pre-captions scoring below 5, two samples are generated: the final caption as a positive example (\"Yes\") and the pre-caption as a negative example (\"No\"), providing balanced supervision for learning caption quality.\n\n---\n\n## Running Evaluations\n\nThe evaluation pipeline supports three tasks: **caption generation**, **critique generation**, and **reward modeling** (caption yes\u002Fno scoring). A single bash script orchestrates generation and evaluation across model checkpoints. 
The script also enables parallel workers per GPU for faster inference.\n\n### Quick Start\n\n```bash\n# Run the full pipeline with default settings\nbash run_unified_evaluations.sh\n```\n\n### Configuration\n\nEdit the top of `run_unified_evaluations.sh` to configure your run:\n\n```bash\n# GPU setup\nGPUS=\"0,1,2,3,4,5,6,7\"\nWORKERS_PER_GPU=2\n\n# Data\nDATA_FILE=\"eval_data\u002Ftest_split.json\"\nVIDEO_DIR=\"eval_data\u002Fcaptioning_videos\"\n\n# Models to evaluate (base or base;checkpoint)\nMODELS=(\n    \"qwen3-vl-8b\"                                          # base model\n    \"qwen3-vl-8b;chancharikm\u002FCHAI_SFT_model_8b\"            # fine-tuned\n)\n\n# Scoring formats (sequential evaluation via unified_eval.py)\nSCORING_FORMATS=(\n    \"caption_yes_or_no\"\n)\n\n# Generation formats (parallel evaluation)\nGENERATION_FORMATS=(\n    \"caption_generation\"\n    \"critique_generation\"\n)\n```\n\n### Pipeline Steps\n\nFor each model, the pipeline runs:\n\n1. **Scoring generation** — computes VQA scores (P(Yes) probability) for each caption using `caption_yes_or_no` format\n2. **Scoring evaluation** — calculates pairwise accuracy from the generated scores\n3. **Caption\u002Fcritique generation** — produces captions and critiques for test set videos\n4. 
**Generation evaluation** — evaluates outputs against ground truth using BLEU-4 and ROUGE-L by default, with an optional LLM judge as an alternative (requires `OPENAI_API_KEY` in `.env` and `USE_LLM_JUDGE=\"true\"` in the script)\n\n### Output Structure\n\n```\nevaluation_outputs\u002F\n├── inference\u002F\n│   ├── scoring_\u003Cmodel>_\u003Ctimestamp>.json\n│   └── generation_\u003Cmodel>_\u003Ctimestamp>.json\n└── evaluation\u002F\n    ├── scoring_eval_\u003Cmodel>_\u003Ctimestamp>.json\n    └── generation_eval_\u003Cmodel>_\u003Ctimestamp>.json\n```\n\n### Project Structure\n\n```\nCHAI\u002F\n├── assets\u002F                          # Banner, logo\n├── eval_code\u002F                       # Evaluation modules\n│   ├── __init__.py\n│   ├── constants.py                 # Task constants and format definitions\n│   ├── formats.py                   # Format conversion utilities\n│   ├── parallel_unified_eval.py     # Multi-worker evaluation\n│   ├── parallel_unified_generation.py  # Multi-GPU generation\n│   ├── unified_eval.py              # Single-process evaluation\n│   ├── unified_generation.py        # Single-process generation\n│   └── video_caption_api.py         # Video captioning API\n├── eval_data\u002F                       # Evaluation data and videos\n│   ├── test_split.json\n│   ├── eval_caption_generation_test.json\n│   ├── eval_critique_generation_test.json\n│   ├── eval_caption_yes_or_no_test.json\n│   └── captioning_videos\u002F\n├── pyproject.toml\n├── run_unified_evaluations.sh       # Main evaluation entry point\n└── README.md\n```\n\n---\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@inproceedings{chai2026,\n  title     = {Building a Precise Video Language with Human--AI Oversight},\n  author    = {Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang and Irene Pi and Shihang Zhu and Ryan Rao and George Liu and Jiaxi Li and Ruojin Li and Yili Han and 
Yilun Du and Deva Ramanan},\n  booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  year      = {2026}\n}\n```\n\n---\n## 📢 Collaborations & Contact\n\nWe are actively advancing CHAI with larger-scale datasets and stronger video\nunderstanding models. We welcome collaborations and funding opportunities with\nresearchers and practitioners working on video understanding, captioning, and\nmultimodal agents for professional-level video content.\n\nIf you're interested in accessing improved data or models, please reach out:\n\n- Zhiqiu Lin — [zhiqiulin98@gmail.com](mailto:zhiqiulin98@gmail.com)\n- Chancharik Mitra — [cmitra@andrew.cmu.edu](mailto:cmitra@andrew.cmu.edu)\n\nOr [open a GitHub Issue](..\u002F..\u002Fissues).\n\n---\n\n## Acknowledgments\n\nThis material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.\n\n\u003C!-- TODO: Additional funding, collaborators, professional video creators -->\n\n## License\n\n\u003C!-- TODO: Specify license -->","The CHAI project aims to improve the precision of video captions through human-AI collaboration. Its core function is to have trained human experts review model-generated preliminary captions and provide suggestions for improvement, producing more accurate final captions. The project builds on video-language models and focuses on improving the accuracy of dynamic scene descriptions, particularly with respect to subjects, scenes, motion, spatial layout, and camera dynamics. It is suited to applications that require high-precision video descriptions, such as automatic caption generation in film production and content annotation for educational videos, and can significantly improve work efficiency and content quality.","2026-06-11 03:31:59","CREATED_QUERY"]