[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83794":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":10,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},83794,"harness-1","pat-jj\u002Fharness-1","pat-jj","🚀 Ultra Recipe for Training Long-Horizon Search Agents - matching frontier AI's search capability with a 20B model","",null,"Python",533,67,7,1,0,41,332,321,96.44,false,"main",true,[],"2026-06-12 04:01:41","# Harness-1\n\n[![Tinker Inference](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTinker-Inference-073f3d?labelColor=white)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002Fharness-1\u002Fblob\u002Fmain\u002Finference\u002Ftinker_inference.md)\n[![Model Checkpoint](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Checkpoint-FFCA03?logo=huggingface&logoColor=FFCA03)](https:\u002F\u002Fhuggingface.co\u002Fpat-jj\u002Fharness-1)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2606.02373-b31b1b.svg?logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02373)\n[![X](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FX-Post-000000.svg?logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002Fpatpcj\u002Fstatus\u002F2063298457398636570?s=20)\n\nHarness-1 is a 20B search agent trained with reinforcement learning inside a\nstateful retrieval harness. The harness maintains recoverable search state:\ncandidate documents, curated evidence, evidence links, verification records, and\nbudget-aware context. The policy keeps the semantic decisions: what to search,\nwhich documents to inspect or curate, what claims to verify, and when the\nevidence is sufficient.\n\n![Harness-1 average search performance](assets\u002Fteaser_recall_barchart.png)\n\n## Quickstart\n\nFor a minimal local smoke test, you need:\n\n- Linux with Python `3.11+`.\n- `uv` installed.\n- A CUDA-compatible NVIDIA GPU environment.\n- vLLM with GPT-OSS support.\n- Access to the released Hugging Face checkpoint.\n\nInstall dependencies:\n\n```bash\nuv sync --extra vllm\n```\n\nSet the checkpoint:\n\n```bash\nexport HARNESS1_HF_MODEL=pat-jj\u002Fharness-1\n```\n\nStart with the detailed vLLM and BrowseComp+ guide:\n\n```bash\nless docs\u002Frun_vllm_browsecompplus.md\n```\n\n## Model Checkpoint\n\nThe released Harness-1 weights are hosted on Hugging Face:\n\n```text\nhttps:\u002F\u002Fhuggingface.co\u002Fpat-jj\u002Fharness-1\n```\n\nvLLM downloads the weights from Hugging Face on first use and then reuses the\nlocal Hugging Face cache. See the Hugging Face model page for model-card details,\nusage restrictions, and checkpoint metadata.\n\n## What You Can Do\n\n- Serve the released checkpoint locally with vLLM.\n- Run raw `\u002Fv1\u002Fcompletions` smoke tests with token-id outputs.\n- Evaluate Harness-1 search behavior on BrowseComp+ when a compatible retrieval\n  backend is available.\n- Run Tinker-hosted inference with the published checkpoint.\n- Inspect and extend the stateful search harness, tool environment, training\n  scripts, and evaluation runners.\n- Run ablations and baselines for supported datasets.\n\n## Repository Layout\n\n- `docs\u002F`: user-facing guides and runbooks.\n- `harness\u002F`: shared search harness, tools, trajectory, task, reranking, and\n  configuration modules.\n- `inference\u002F`: Harness-1 evaluation, component ablations, HF inference, and\n  vLLM inference utilities.\n- `inference\u002Fbaselines\u002F`: in-domain and transfer baseline evaluation runners.\n- `training\u002F`: SFT data generation, SFT training, RL training, and launch scripts.\n- `datagen\u002F` and `eval_scripts\u002F`: dataset and auxiliary evaluation code.\n- `model_export\u002F`: helper scripts for merging a private Tinker adapter into a\n  Hugging Face model.\n- `tinker-cookbook\u002F`: local Tinker cookbook dependency used by the training scripts.\n- `tests\u002F`: lightweight import and CLI smoke tests.\n\n## Setup Levels\n\n### Minimum Model Serving\n\nUse this if you only want to verify that the released checkpoint serves locally:\n\n```bash\nuv sync --extra vllm\nexport HARNESS1_HF_MODEL=pat-jj\u002Fharness-1\n```\n\nThen follow `docs\u002Frun_vllm_browsecompplus.md`.\n\n### Full BrowseComp+ Evaluation\n\nIn addition to the minimum setup, BrowseComp+ evaluation requires:\n\n- BrowseComp+ query, qrel, and answer files on disk.\n- A Chroma collection containing BrowseComp+ corpus chunks with document IDs that\n  match the qrels.\n- OpenAI credentials for retrieval support used by the harness.\n- Optional Baseten reranker credentials if reranking is enabled.\n\nBrowseComp+ data setup is described in `datagen\u002FREADME.md`. The end-to-end vLLM\nevaluation path is documented in `docs\u002Frun_vllm_browsecompplus.md`.\n\n### Development And Training\n\nUse the base environment for lightweight tests and code development:\n\n```bash\nuv sync\nuv run python tests\u002Fsmoke_imports.py\nuv run python tests\u002Fsmoke_cli.py\n```\n\nTraining scripts live in `training\u002F`. Model export utilities live in\n`model_export\u002F`.\n\n## Credentials And Security\n\nCopy the environment template only when needed:\n\n```bash\ncp .env.example .env.local\n```\n\nDo not commit real credentials. `.env` and `.env.local` are ignored by this\nrepository.\n\nCredential scope:\n\n- `HUGGINGFACE_TOKEN`: used only if Hugging Face checkpoint access requires auth.\n- `OPENAI_API_KEY`: used by retrieval\u002Fevaluation workflows.\n- `CHROMA_API_KEY` and `CHROMA_DATABASE`: used by Chroma-backed evaluation.\n- `BASETEN_API_KEY` and `BASETEN_MODEL_URL`: used only for the optional reranker.\n- `TINKER_API_KEY`: used by Tinker-hosted training or evaluation paths.\n\n## Dataset Availability\n\nBrowseComp+ is the public evaluation path documented in this release. The code\nexpects local BrowseComp+ files plus a compatible Chroma retrieval collection.\nSee `datagen\u002FREADME.md` and `docs\u002Frun_vllm_browsecompplus.md`.\n\nThe other in-domain corpora used in the paper, such as `web`, `sec`, and\n`patents`, are not bundled as public ready-made indexes. To evaluate those\nsettings, first construct compatible corpora and Chroma collections, for example\nwith the Context-1 data-generation pipeline:\n[chroma-core\u002Fcontext-1-data-gen](https:\u002F\u002Fgithub.com\u002Fchroma-core\u002Fcontext-1-data-gen).\n\n## Inference\n\nRun a basic Hugging Face model-load test with:\n\n```bash\nuv run python inference\u002Fhf_inference.py \\\n  --model ${HARNESS1_HF_MODEL:-pat-jj\u002Fharness-1} \\\n  --prompt \"Briefly describe Harness-1.\"\n```\n\nFor Tinker-hosted inference with the published Tinker checkpoint, see\n[`inference\u002Ftinker_inference.md`](inference\u002Ftinker_inference.md). That document\ncontains the public Tinker checkpoint path, required harness flags, and a\nBrowseComp+ example run.\n\nFor local vLLM serving and BrowseComp+ evaluation, see\n[`docs\u002Frun_vllm_browsecompplus.md`](docs\u002Frun_vllm_browsecompplus.md). The\nend-to-end path uses `inference\u002Fevaluate_harness1_vllm.py` with raw\n`\u002Fv1\u002Fcompletions` token-id prompts.\n\nFor a lightweight local vLLM server wrapper:\n\n```bash\nuv sync --extra vllm\nuv run python inference\u002Fvllm_local_inference.py serve \\\n  --model ${HARNESS1_HF_MODEL:-pat-jj\u002Fharness-1} \\\n  --served-model-name harness-1\n```\n\n## Results And Reproducibility\n\nEvaluation metrics depend on the query sample, Chroma index, reranker backend,\nvLLM version, and GPU kernels. Small smoke tests are useful for validating setup\nbut have high variance. Larger query sets are more appropriate for reporting\naggregate metrics.\n\nThe detailed vLLM guide explains how to read final metrics including:\n\n- `recall`: recall of evidence documents in the final curated set.\n- `final_answer_recall`: recall over evidence tied to the final answer.\n- `trajectory_recall`: evidence recall anywhere in the search trajectory.\n- `precision`: precision of the final curated set.\n\n## Glossary\n\n- Harness-1 operating point: the component flags and generation settings used for\n  the full search harness.\n- BrowseComp+: a benchmark for browsing and evidence-seeking questions.\n- qrels: relevance labels that map query IDs to gold or evidence document IDs.\n- Curated evidence recall: how much gold evidence appears in the final curated\n  document set.\n- Trajectory recall: how much gold evidence appears anywhere during the search\n  trajectory.\n- Raw `\u002Fv1\u002Fcompletions`: the OpenAI-compatible completion endpoint used with\n  pre-tokenized prompts.\n- Integer token prompts: prompt inputs sent as token IDs instead of plain text.\n- `V8D_` flags: environment flags that enable Harness-1 search components.\n\n## Known Limitations\n\n- Full BrowseComp+ evaluation requires a compatible Chroma retrieval backend; the\n  large retrieval index is not bundled in this repository.\n- Results can vary with external retrieval and reranking services.\n- Local serving requires a CUDA GPU environment with enough memory for the\n  checkpoint. Non-H100 GPUs may work with sufficient memory and vLLM support, but\n  the documented path was validated on H100-class hardware.\n- Some training and model-export workflows depend on private checkpoints or\n  hosted services.\n\n## Documentation\n\n- `docs\u002Frun_vllm_browsecompplus.md`: detailed local vLLM and BrowseComp+ guide.\n- `inference\u002Ftinker_inference.md`: Tinker-hosted inference guide.\n- `datagen\u002FREADME.md`: dataset setup notes.\n- `inference\u002FREADME.md`: inference, evaluation, ablation, and baseline entrypoints.\n- `model_export\u002FREADME.md`: model export utilities.\n\n## Support And Contributing\n\nPlease use the repository issue tracker for bug reports, setup problems, and\nfeature requests. Contributions should keep public documentation free of private\npaths, secrets, and service-specific assumptions unless they are clearly marked\nas optional.\n\n## Citation\n\nIf you use Harness-1 in your work, please cite:\n\n```bibtex\n@article{jiang2026harness,\n  title={Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses},\n  author={Jiang, Pengcheng and Shi, Zhiyi and Hong, Kelly and Xu, Xueqiang and Sun, Jiashuo and Sun, Jimeng and Bashir, Hammad and Han, Jiawei},\n  journal={arXiv preprint arXiv:2606.02373},\n  year={2026}\n}\n```\n",2,"2026-06-11 04:11:29","CREATED_QUERY"]