[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1499":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":14,"starSnapshotCount":14,"syncStatus":16,"lastSyncTime":31,"discoverSource":32},1499,"autoresearch-qwen","wadeKeith\u002Fautoresearch-qwen","wadeKeith","Autonomous Qwen3-VL training-code research on the official DocVQA benchmark. main: NVIDIA multi-GPU, mlx: Apple Silicon\u002FMPS.",null,"Python",211,33,24,0,1,2,45.29,"MIT License",false,"main",true,[23,24,25,26,27],"agentic-ai","autoresearch","docvqa","qwen","vision-language-model","2026-06-12 04:00:09","# autoresearch-qwen\n\nAutonomous research loop for improving [Qwen\u002FQwen3-VL-4B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-VL-4B-Instruct) on the official [HuggingFaceM4\u002FDocumentVQA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceM4\u002FDocumentVQA) benchmark.\n\nThe repo is designed for agentic training research: the benchmark and evaluator stay fixed, while an agent iterates on `train.py`, runs training, measures the result on the full validation split, and keeps only real gains. 
The project is inspired by [karpathy\u002Fautoresearch](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fautoresearch), but scoped to a concrete public VLM benchmark with a reproducible contract.\n\nIf this project is useful for your research, evals, or agent workflows, please star the repo.\n\n## Branches\n\nThis repository now contains the two previously separate codebases as different branches:\n\n| Branch | Target hardware | Status |\n| --- | --- | --- |\n| `main` | NVIDIA \u002F CUDA multi-GPU | Primary branch. Uses `torchrun`, supports DeepSpeed configs, and is the recommended branch for fast experiment cycles. |\n| `mlx` | Apple Silicon \u002F MPS | Historical branch imported from the former `autoresearch-qwen-mlx` repository and preserved here as the Mac-focused variant. |\n\nUse the README on each branch for branch-specific commands. On `main` the entrypoint is `.\u002Frun_experiment.sh`; on `mlx` it is `uv run python run_experiment.py`.\n\n## Why This Repo Exists\n\n- Fixed benchmark: full official DocVQA `train`, `validation`, and `test` splits\n- Fixed evaluator: validation score is always computed by the repository evaluator\n- One mutable surface: agents are expected to edit `train.py`\n- Reproducible loop: prepare, train, evaluate, keep or discard, repeat\n- Public benchmark mindset: improvements should come from better training decisions, not from moving the goalposts\n\n## Benchmark Contract\n\n| Component | Contract |\n| --- | --- |\n| Base model | `Qwen\u002FQwen3-VL-4B-Instruct` |\n| Dataset | `HuggingFaceM4\u002FDocumentVQA` official splits |\n| Training split | Full `train` split |\n| Validation split | Full `validation` split |\n| Test split | Full blind `test` split |\n| Metric | Mean ANLS on the full validation split |\n| Mutable file | `train.py` |\n| Fixed files | `evaluate.py`, `src\u002F`, benchmark contract, submission tooling |\n\nMore benchmark details are documented in [benchmarks\u002FREADME.md](benchmarks\u002FREADME.md).\n\n## How 
The Loop Works\n\n```text\nprepare.py          Download dataset + model snapshot\n      |\ntrain.py            Mutable training code (the agent edits this)\n      |\nevaluate.py         Fixed validation evaluator \u002F blind test exporter\n      |\nrun_experiment.sh   One full train -> eval iteration on main\n```\n\nThe only objective is to maximize `val_score`, defined as mean ANLS on the full official validation split.\n\n## Quick Start (`main`)\n\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nuv sync\nuv run python prepare.py\nuv run autoresearch-qwen doctor\nuv run python evaluate.py --base-only --split validation\n.\u002Frun_experiment.sh | tee run.log\n```\n\nUseful follow-up commands:\n\n```bash\nuv run python analysis.py\nuv run python submit_test.py\n```\n\n## Repository Layout\n\n```text\ntrain.py                            Mutable training code\nevaluate.py                         Fixed evaluator\nrun_experiment.sh                   One-command train -> eval pipeline on main\nanalysis.py                         Result visualization\nprepare.py                          Dataset + model downloader\nsubmit_test.py                      Blind test export + submission packaging\ncheck_submission.py                 Submission validator\nprogram.md                          Full agent protocol\nbenchmarks\u002FREADME.md                Benchmark definition\nconfigs\u002F                            DeepSpeed configs for multi-GPU runs\nsrc\u002Fautoresearch_qwen\u002F              Fixed library code\n```\n\n## Running An Agent\n\nThe full experiment protocol lives in [program.md](program.md). A practical starting prompt is:\n\n```text\nRead the entire repository, especially README.md and program.md. You may read all files for context, but only edit train.py. Run `uv run autoresearch-qwen doctor --json`, record a `--base-only` validation baseline, then start the autoresearch loop. 
Parse `artifacts\u002Flast_result.json` after each run and keep only changes that improve val_score.\n```\n\n## Results, Analysis, and Submission\n\n- `artifacts\u002Flast_result.json` stores the latest train\u002Feval result payload\n- `analysis.py` plots experiment progress from accumulated results\n- `submit_test.py` exports predictions for the blind DocVQA `test` split\n- `check_submission.py` validates a submission bundle locally before upload\n\n## Contributing\n\nIssues and pull requests are welcome, especially for:\n\n- stronger training recipes that respect the benchmark contract\n- better experiment tooling and reproducibility\n- clearer docs and onboarding\n- hardware-specific improvements that belong on a dedicated branch\n\nIf you want to change the benchmark contract itself, open an issue first so the rationale is explicit.\n\n## Acknowledgements\n\n- [karpathy\u002Fautoresearch](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fautoresearch) for the original autonomous research-loop framing\n- [Qwen](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-VL-4B-Instruct) for the base vision-language model\n- [Hugging Face M4](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceM4\u002FDocumentVQA) for the public DocVQA dataset release\n\n## Star Trend\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=wadeKeith\u002Fautoresearch-qwen&type=Date&cache=2026050620)](https:\u002F\u002Fstar-history.com\u002F#wadeKeith\u002Fautoresearch-qwen&Date)\n\n## License\n\n[MIT](LICENSE)\n","autoresearch-qwen is a research project for autonomously improving the Qwen3-VL-4B-Instruct vision-language model on the official DocVQA benchmark. An agent iteratively edits `train.py`, runs training, and measures the result on the full validation split, so that model performance improves over successive iterations. It supports NVIDIA multi-GPU setups (using `torchrun` and DeepSpeed configs) as well as Apple Silicon\u002FMPS hardware, and suits researchers or developers who need fast experiment cycles on a specific vision-language model. It is open source under the MIT License and has 209 stars and 33 forks, indicating a degree of community recognition and usefulness.","2026-06-11 02:44:09","CREATED_QUERY"]