[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72111":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},72111,"fara","microsoft\u002Ffara","microsoft","Fara-7B: An Efficient Agentic Model for Computer Use","",null,"Python",5467,543,38,22,0,36,122,410,108,39.21,"MIT License",false,"main",true,[27,28,29,30,31],"agent","browser-use","computer-use","computer-use-agent","cua","2026-06-12 02:02:58","\u003Cdiv align=\"center\">\n\n# Fara-7B: An Efficient Agentic Model for Computer Use\n\n\u003Cimg src=\"figures\u002Fmodel_accuracy_vs_cost_v2_glm_cost_updated.png\" alt=\"Fara-7B Performance\" width=\"600\"\u002F>\n\n[![Microsoft](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMicrosoft-Project-0078D4?logo=microsoft)](https:\u002F\u002Faka.ms\u002Fmsaif\u002Ffara)\n[![Hugging Face 
Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Model-yellow)](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FFara-7b)\n[![Foundry](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAzure-Foundry-0089D6)](https:\u002F\u002Faka.ms\u002Ffoundry-fara-7b)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-WebTailBench%20Dataset-orange)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FWebTailBench)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-CUAVerifierBench-orange)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FCUAVerifierBench)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-2511.19663-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19663)\n\n\u003C\u002Fdiv>\n\n---\n\n## Updates\n\n* **2026-05-12** — Refreshed **WebTailBench (V2)** tasks and rubrics.\n  Many V1 tasks had calendar-bound dates that expired (Nov 2025); V2\n  rolls those forward and revises the precomputed rubrics for the\n  full 609-task suite. Available now as the `test_v2` split on\n  [`microsoft\u002FWebTailBench`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FWebTailBench).\n  A side-by-side V1↔V2 diff (task strings and rubric JSON) is\n  hosted [here](https:\u002F\u002Fmicrosoft.github.io\u002Ffara\u002Fdocs\u002Fwebtailbench_v1_v2_diff.html).\n* **2026-04-19** — Released **[CUAVerifierBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FCUAVerifierBench)**,\n  a human-annotated benchmark for evaluating CUA verifiers (i.e. judges that\n  score agent trajectories). Two splits — `fara7b_om2w_browserbase` (106\n  Fara-7B Online-Mind2Web\u002FBrowserbase trajectories, ~2 reviewers each) and\n  `internal` (154 trajectories from a heldout aurora-v2 task suite) —\n  with per-judge UV-blind \u002F UV-informed labels, Universal Verifier\n  outputs, and legacy verifier outputs side-by-side. 
The build script\n  that produced the dataset lives alongside the data on HuggingFace.\n* **2026-04-18** — Removed the `autogen-core` \u002F `autogen-ext` dependency\n  from `webeval`; chat completion clients are now self-contained under\n  `webeval\u002Fsrc\u002Fwebeval\u002Foai_clients\u002F`. No more autogen submodule install\n  step; just `pip install -e .[vllm]` then `cd webeval; pip install -e .`.\n* **2026-04-18** — Incorporated WebTailBench (initial \u002F now-stale\n  version) directly into the repo as a first-class benchmark. The\n  loader auto-downloads `WebTailBench-v1-rubrics.tsv` from\n  [`microsoft\u002FWebTailBench`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FWebTailBench)\n  and threads each task's published `precomputed_rubric` through to\n  the verifier. Reproducibility CLI lives in `webeval\u002Fscripts\u002Fwebtailbench.py`.\n* **2026-04-18** — Released the **Universal Verifier** (`MMRubricAgent`)\n  as the official verifier for WebTailBench. Multimodal,\n  rubric-grounded, two-model ensemble (`gpt-5.2` + `o4-mini`) with\n  per-criterion scoring, outcome verification, and first-point-of-failure\n  analysis. A stand-alone parallel runner is at\n  `webeval\u002Fscripts\u002Fverify_trajectories.py` for re-scoring any directory\n  of webeval trajectories without touching the solver.\n\n---\n\n## Overview\n\n**Fara-7B** is Microsoft's first **agentic small language model (SLM)** designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA) that achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems.\n\nTry Fara-7B locally as follows (see [Installation](#Installation) for detailed instructions on Windows ) or via Magentic-UI:\n\n```bash\n# 1. Clone repository\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ffara.git\ncd fara\n\n# 2. 
Setup environment\npython3 -m venv .venv \nsource .venv\u002Fbin\u002Factivate\npip install -e .\nplaywright install\n```\n\nThen in one process, host the model:\n```bash\nvllm serve \"microsoft\u002FFara-7B\" --port 5000 --dtype auto \n```\nThen you can iteratively query it with:\n```bash\nfara-cli --task \"whats the weather in new york now\"\n```\n\nTo try Fara-7B inside Magentic-UI, please follow the instructions here [Magentic-UI + Fara-7B](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmagentic-ui\u002Fblob\u002Fmain\u002FREADME.md#fara-7b). You will need to serve the model as before, but instead of fara-cli you can use Magentic-UI which has a nice UI (see video demos below).\n\n\nNotes:\n- If you're using Windows, we highly recommend using WSL2 (Windows Subsystem for Linux). Please see the Windows instructions in the [Installation](#Installation) section.\n- You might need to do `--tensor-parallel-size 2` with vllm command if you run out of memory\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"33%\" align=\"center\">\n\n**Shopping**  \n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd2109eba-a91f-4a0b-8217-38c1dcc17e9a\" width=\"100%\" style=\"max-height: 300px;\">\n\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\" align=\"center\">\n\n**GitHub Issues**  \n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fbb177a09-8fcb-41be-8639-32044c1ec0e8\" width=\"100%\" style=\"max-height: 300px;\">\n\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003Ctd width=\"33%\" align=\"center\">\n\n**Directions with Cheese**  \n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb83d341e-25f6-4236-a946-4b8eaca987d5\" width=\"100%\" style=\"max-height: 300px;\">\n\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### What Makes Fara-7B Unique\n\nUnlike traditional chat models that generate text-based responses, Fara-7B leverages computer 
interfaces—mouse and keyboard—to perform multi-step tasks on behalf of users. The model:\n\n- **Operates visually** by perceiving webpages and taking actions like scrolling, typing, and clicking on directly predicted coordinates without accessibility trees or separate parsing models\n- **Enables on-device deployment** due to its compact 7B parameter size, resulting in reduced latency and improved privacy as user data remains local\n- **Completes tasks efficiently**, averaging only ~16 steps per task compared to ~41 for comparable models\n\nFara-7B is trained using a novel synthetic data generation pipeline built on the [Magentic-One](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Farticles\u002Fmagentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks\u002F) multi-agent framework, with 145K trajectories covering diverse websites, task types, and difficulty levels. The model is based on [Qwen2.5-VL-7B](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13923) and trained with supervised fine-tuning.\n\n### Key Capabilities\n\nFara-7B can automate everyday web tasks including:\n- Searching for information and summarizing results\n- Filling out forms and managing accounts\n- Booking travel, movie tickets, and restaurant reservations\n- Shopping and comparing prices across retailers\n- Finding job postings and real estate listings\n\n### Performance Highlights\n\nFara-7B achieves state-of-the-art results across multiple web agent benchmarks, outperforming both comparable-sized models and larger systems:\n\n| Model | Params | WebVoyager | Online-M2W | DeepShop | WebTailBench |\n|-------|--------|------------|------------|----------|--------------|\n| **SoM Agents** | | | | | |\n| SoM Agent (GPT-4o-0513) | - | 90.6 | 57.7 | 49.1 | 60.4 |\n| SoM Agent (o3-mini) | - | 79.3 | 55.4 | 49.7 | 52.7 |\n| SoM Agent (GPT-4o) | - | 65.1 | 34.6 | 16.0 | 30.8 |\n| GLM-4.1V-9B-Thinking | 9B | 66.8 | 33.9 | 32.0 | 22.4 |\n| **Computer Use Models** | | | | | 
|\n| OpenAI computer-use-preview | - | 70.9 | 42.9 | 24.7 | 25.7 |\n| UI-TARS-1.5-7B | 7B | 66.4 | 31.3 | 11.6 | 19.5 |\n| **Fara-7B** | **7B** | **73.5** | **34.1** | **26.2** | **38.4** |\n\n*Table: Online agent evaluation results showing success rates (%) across four web benchmarks. Results are averaged over 3 runs.*\n\n### WebTailBench: A New Benchmark for Real-World Web Tasks\n\nWe are releasing **[WebTailBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FWebTailBench)**, a new evaluation benchmark focusing on 11 real-world task types that are underrepresented or missing in existing benchmarks. The benchmark includes 609 tasks across diverse categories, with the first 8 segments testing single skills or objectives (usually on a single website), and the remaining 3 evaluating more difficult multi-step or cross-site tasks.\n\n#### WebTailBench Detailed Results\n\n| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | **Fara-7B** |\n|--------------|-------|-----------------|-------------|------------|-------------|--------------|-------------|-------------|\n| **Single-Site Tasks** |\n| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | **52.4** |\n| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | **37.9** |\n| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | **53.8** |\n| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | **47.4** |\n| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | **36.3** |\n| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | **38.6** |\n| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | **23.6** |\n| Jobs\u002FCareers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | **28.0** |\n| **Multi-Step Tasks** |\n| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | **49.0** |\n| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | **32.7** |\n| Compositional Tasks | 55 | 51.5 
| 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | **23.0** |\n| **Overall** |\n| Macro Average | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | **38.4** |\n| Micro Average | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | **38.4** |\n\n*Table: Breakdown of WebTailBench results across all 11 segments. Success rates (%) are averaged over 3 independent runs. Among computer-use models, Fara-7B achieves the highest success rate in every segment except Ticketing, where OpenAI computer-use-preview scores higher (49.7 vs. 38.6).*\n\n**Coming Soon:**\n- Task Verification pipeline for LLM-as-a-judge evaluation\n- Official human annotations of WebTailBench (in partnership with BrowserBase)\n\n### CUAVerifierBench: Evaluating the Verifiers Themselves\n\nWhile WebTailBench measures *agents*, **[CUAVerifierBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FCUAVerifierBench)** measures the *judges that score those agents*. Each row pairs a Fara-7B agent trajectory (instruction, screenshots, web_surfer log, final answer) with one human reviewer's verdict, plus the verdicts produced by the **Universal Verifier (`MMRubricAgent`)** and several legacy verifiers — so researchers can compute verifier–human agreement (Cohen's κ, accuracy, F1) on a fixed corpus and iterate on new judge prompts \u002F architectures against a frozen ground-truth set.\n\nThe dataset is exposed as two HuggingFace configs joinable on `task_id`:\n\n| Config | Granularity | Contents |\n|---|---|---|\n| `trajectories` | one row per task | instruction, screenshots, web_surfer log, verifier outputs, task-level human aggregates |\n| `annotations` | one row per (task, judge) | per-reviewer outcome \u002F process labels and free-text justifications |\n\nTwo splits ship today:\n\n| Split | Source | Trajectories | Annotation rows |\n|---|---|---|---|\n| `fara7b_om2w_browserbase` | Fara-7B trajectories on Online-Mind2Web tasks executed via Browserbase | 106 | 215 (≈2 reviewers\u002Ftask; UV-blind **and** UV-informed stages) |\n| `internal` | Heldout aurora-v2 task suite scored with 
the same WebSurfer + verifier stack | 154 | 154 (1 reviewer\u002Ftask; UV-blind only) |\n\nReviewer identities are anonymized as `Judge1` … `JudgeN` using a single shared map across both splits. The build script that produced the dataset (with full schema + provenance) ships alongside the data on HuggingFace at [`microsoft\u002FCUAVerifierBench`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FCUAVerifierBench); see the [dataset README](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FCUAVerifierBench\u002Fblob\u002Fmain\u002FREADME.md) for the full column list.\n\n```python\nfrom datasets import load_dataset\n\ntrajs = load_dataset(\"microsoft\u002FCUAVerifierBench\", \"trajectories\",\n                     split=\"fara7b_om2w_browserbase\")\nanns  = load_dataset(\"microsoft\u002FCUAVerifierBench\", \"annotations\",\n                     split=\"fara7b_om2w_browserbase\")\n```\n\n### Evaluation Infrastructure\n\nOur evaluation setup leverages:\n\n1. **Playwright** - A cross-browser automation framework that replicates browser environments\n2. **Abstract Web Agent Interface** - Allows integration of any model from any source into the evaluation environment\n3. **Fara-Agent Class** - Reference implementation for running the Fara model\n\n> **Note:** Fara-7B is an experimental release designed to invite hands-on exploration and feedback from the community. We recommend running it in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains.\n\n---\n\n# Installation\n\n\n##  Linux \n\nThe following instructions are for Linux systems, see the Windows section below for Windows instructions. \n\nInstall the package using pip and set up the environment with Playwright:\n\n```bash\n# 1. Clone repository\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ffara.git\ncd fara\n\n# 2. 
Set up environment\npython3 -m venv .venv \nsource .venv\u002Fbin\u002Factivate\npip install -e .[vllm]\nplaywright install\n```\n\nNote: If you plan on hosting with Azure Foundry only, you can skip the `[vllm]` extra and just run `pip install -e .`\n\n\n## Windows\n\nFor Windows, we highly recommend using WSL2 (Windows Subsystem for Linux) to provide a Linux-like environment. However, if you prefer to run natively on Windows, follow these steps:\n\n```bash\n# 1. Clone repository\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ffara.git\ncd fara\n\n# 2. Set up environment\npython3 -m venv .venv\n.venv\\Scripts\\activate\npip install -e .\npython3 -m playwright install\n```\n\n## Hosting the Model\n\n**Recommended:** The easiest way to get started is using Azure Foundry hosting, which requires no GPU hardware or model downloads. Alternatively, you can self-host with vLLM if you have GPU resources available.\n\n### Azure Foundry Hosting (Recommended)\n\nDeploy Fara-7B on [Azure Foundry](https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002FFara-7B\u002Fversion\u002F2\u002Fregistry\u002Fazureml-msr) without needing to download weights or manage GPU infrastructure.\n\n**Setup:**\n\n1. Deploy the Fara-7B model on Azure Foundry and obtain your endpoint URL and API key\n\nThen create an endpoint configuration JSON file (e.g., `azure_foundry_config.json`):\n\n```json\n{\n    \"model\": \"Fara-7B\",\n    \"base_url\": \"https:\u002F\u002Fyour-endpoint.inference.ml.azure.com\u002F\",\n    \"api_key\": \"YOUR_API_KEY_HERE\"\n}\n```\n\nThen you can run Fara-7B using this endpoint configuration.\n\n2. Run the Fara agent:\n\n```bash\nfara-cli --task \"how many pages does wikipedia have\" --endpoint_config azure_foundry_config.json [--headful]\n```\n\nNote: you can also specify the endpoint config with the args `--base_url [your_base_url] --api_key [your_api_key] --model [your_model_name]` instead of using a config JSON file. 
\n\nNote: If you see an error that the `fara-cli` command is not found, then try:\n\n```bash\npython -m fara.run_fara --task \"what is the weather in new york now\"\n```\n\nThat's it! No GPU or model downloads required.\n\n### Self-hosting with vLLM or LM Studio \u002F Ollama\n\nIf you have access to GPU resources, you can self-host Fara-7B using vLLM. This requires a GPU machine with sufficient VRAM (e.g., 24GB or more).\n\nOnly on Linux: all that is required is to run the following command to start the vLLM server:\n\n```bash\nvllm serve \"microsoft\u002FFara-7B\" --port 5000 --dtype auto \n```\nFor quantized models or lower VRAM GPUs, please see [Fara-7B GGUF on HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fbartowski\u002Fmicrosoft_Fara-7B-GGUF).\n\nFor Windows\u002FMac, vLLM is not natively supported. You can use WSL2 on Windows to run the above command or LM Studio \u002F Ollama as described below.\n\nOtherwise, you can use [LM Studio](https:\u002F\u002Flmstudio.ai\u002F) or [Ollama](https:\u002F\u002Follama.com\u002F) to host the model locally. We currently recommend the following GGUF versions of our models [Fara-7B GGUF on HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fbartowski\u002Fmicrosoft_Fara-7B-GGUF) for use with LM Studio or Ollama. Select the largest model that fits your GPU. 
Please ensure that context length is set to at least 15000 tokens and temperature to 0 for best results.\n\nThen run Fara-7B against your local server to see it in action:\n\n```bash\nfara-cli --task \"what is the weather in new york now\"\n```\n\nIf you didn't use vLLM to host, please specify the correct flags: `--base_url [your_base_url] --api_key [your_api_key] --model [your_model_name]`\n\nIf you see an error that the `fara-cli` command is not found, then try:\n\n```bash\npython -m fara.run_fara --task \"what is the weather in new york now\"\n```\n\n# Reproducibility\n\nWe provide a framework in `webeval\u002F` to reproduce our results on WebVoyager and Online-Mind2Web. \nAgentic evaluations on live websites present unique challenges due to day-to-day changes. We implement several measures to ensure reliable and comparable evaluations:\n\n**BrowserBase Integration**\nWe employ BrowserBase to manage browser session hosting, enabling reliable browser instance management.\n\n**Time-sensitive Task Updates**\nTasks in benchmarks like WebVoyager can become stale or impossible. We:\n- Removed ~48 impossible tasks from the original WebVoyager benchmark\n- Updated ~50 tasks with future dates to keep them achievable\n- Example: *\"Search for a hotel in Bali from Jan 1 to Jan 4, 2024\"* → *\"Search for a hotel in Bali from Jan 1 to Jan 4, 2026\"*\n- Our updated WebVoyager benchmark is available at `webeval\u002Fdata\u002Fwebvoyager\u002FWebVoyager_data_08312025.jsonl`\n\n**Environment Error Handling**\nBrowser errors (connection drops, page timeouts) are handled robustly:\n- Trajectories are retried up to 5 times when environment errors occur\n- Complete yet incorrect trajectories are never retried\n- Each retry starts with a fresh browser session, with no retained state\n\n**Step Budget**\nEach trajectory is capped at a maximum of 100 actions across all online benchmarks. 
Trajectories exceeding this budget without choosing to stop are considered incorrect.\n\n## WebEval Package Installation\n\n```bash\nconda create --name fara_webeval python=3.12\nconda activate fara_webeval\n\n# Install fara package (with vllm extras for GPU hosting)\npip install -e .[vllm]\n\n# Install webeval\ncd webeval\npip install -e .\n\n# Install playwright\nplaywright install\n```\n\nThe webeval package no longer depends on `autogen-core` \u002F `autogen-ext` —\nall chat completion clients are vendored under `webeval\u002Fsrc\u002Fwebeval\u002Foai_clients\u002F`\n(see `GracefulRetryClient`, `OpenAIClientWrapper`, `AzureOpenAIClientWrapper`,\netc.). You no longer need to clone or install the autogen submodule.\n\n> **Always activate the `fara_webeval` env before running any of the eval\n> scripts below.** It pins `vllm==0.10.0` + `torch==2.7.1`; running with\n> a newer `vllm` (≥ 0.19) under `torch` ≥ 2.10 trips a known\n> `torch._dynamo.symbolic_convert` crash during CUDA-graph capture, which\n> can be worked around with `--enforce_eager` but at a meaningful\n> throughput cost. Stick to the pinned env.\n\n## Running Evaluations\n\nNavigate to the scripts directory:\n\n```bash\ncd webeval\u002Fscripts\n```\n\nMake sure you set a valid OpenAI GPT-4o endpoint in `endpoint_configs_gpt4o\u002Fdev` in order to run the WebVoyager LLM-as-a-judge! 
\n\n**Option 1: Self-hosted vLLM**\n\n```bash\npython webvoyager.py --model_url \u002Fpath\u002Fwhere\u002Fyou\u002Fwant\u002Fto\u002Fdownload\u002Fmodel\u002F --model_port 5000 --eval_oai_config ..\u002Fendpoint_configs_gpt4o\u002Fdev\u002F --out_url \u002Fpath\u002Fto\u002Fsave\u002Feval\u002Ffiles --device_id 0,1 --processes 1 --run_id 1 --max_rounds 100\npython om2w.py --model_url \u002Fpath\u002Fwhere\u002Fyou\u002Fwant\u002Fto\u002Fdownload\u002Fmodel\u002F --model_port 5000 --eval_oai_config ..\u002Fendpoint_configs_o4\u002Fdev\u002F --eval_model o4-mini --out_url \u002Fpath\u002Fto\u002Fsave\u002Feval\u002Ffiles --device_id 0,1 --processes 1 --run_id 1 --max_rounds 100\n\n# WebTailBench almost always needs --browserbase: a meaningful share of\n# the benchmark's task websites (airlines, retailers, ticketing, …)\n# block bot traffic from a vanilla playwright browser. Without\n# --browserbase you'll see a high rate of trajectories that abort on\n# Page.goto \u002F navigation \u002F captcha errors. 
Set BROWSERBASE_API_KEY and\n# BROWSERBASE_PROJECT_ID in the environment first.\nexport BROWSERBASE_API_KEY=\u003Cyour_browserbase_api_key>\nexport BROWSERBASE_PROJECT_ID=\u003Cyour_browserbase_project_id>\n\n# --success controls which Universal Verifier signal counts as the\n# top-line score: ``outcome`` (default; binary outcome_success — what the\n# Fara-7B numbers in the README above are reported against), ``process``\n# (rubric_is_success := rubric_score >= --rubric_score_threshold; a more\n# lenient gate, expect slightly higher numbers), or ``both``.\npython webtailbench.py \\\n    --model_url \u002Fpath\u002Fto\u002FFara\u002Fmodel_checkpoints \\\n    --model_port 5000 \\\n    --device_id 0,1 \\\n    --eval_oai_config ..\u002F..\u002Fendpoint_configs\u002Fjudge_active\u002Fprod \\\n    --judge_eval_model gpt-5.2 \\\n    --judge_o4_eval_model o4-mini \\\n    --rubric_score_threshold 0.8 \\\n    --success outcome \\\n    --out_url \u002Fpath\u002Fto\u002FFara\u002Feval \\\n    --processes 4 \\\n    --run_id 1 \\\n    --max_rounds 100 \\\n    --browserbase\n\npython verify_trajectories.py \\\n    --input \u002Fpath\u002Fto\u002FFara\u002Feval\u002Fruns\u002F...\u002F\u003Cbenchmark>\u002F\u003Crun_id>\u002Ftraj \\\n    --task-data ..\u002Fpath\u002Fto\u002Fom2w\u002FOnline_Mind2Web_06042025.json \\\n    --task-data-format om2w \\\n    --eval-config ..\u002F..\u002Fendpoint_configs\u002Fjudge_active\u002Fprod \\\n    --judge-model gpt-5.2 --o4mini-model o4-mini \\\n    --processes 8\n```\n\n**Option 2: Azure Foundry Deployment**\n\nDeploy [Fara-7B on Foundry endpoint(s)](https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002FFara-7B\u002Fversion\u002F2\u002Fregistry\u002Fazureml-msr), then place endpoint URLs and keys in JSONs under `endpoint_configs\u002F`:\n\n```bash\npython webvoyager.py --model_endpoint ..\u002F..\u002Fendpoint_configs\u002F --eval_oai_config ..\u002Fendpoint_configs_gpt4o\u002Fdev\u002F --out_url 
\u002Fpath\u002Fto\u002Fsave\u002Feval\u002Ffiles --processes 1 --run_id 1_endpoint --max_rounds 100\npython om2w.py --model_endpoint ..\u002F..\u002Fendpoint_configs\u002F --eval_oai_config ..\u002Fendpoint_configs_o4\u002Fdev\u002F --eval_model o4-mini --out_url \u002Fpath\u002Fto\u002Fsave\u002Feval\u002Ffiles --processes 1 --run_id 1_endpoint --max_rounds 100\npython webtailbench.py --model_endpoint ..\u002F..\u002Fendpoint_configs\u002F --eval_oai_config ..\u002F..\u002Fendpoint_configs\u002Fjudge_active\u002Fprod --judge_eval_model gpt-5.2 --judge_o4_eval_model o4-mini --out_url \u002Fpath\u002Fto\u002FFara\u002Feval --processes 1 --run_id 1_endpoint --max_rounds 100\n```\n\n### Notes\n\n\n- We use the same LLM-as-a-judge prompts and model (GPT-4o) as WebVoyager, hence the `--eval_oai_config` argument\n- Set `--browserbase` for browser session management (requires exported API key and project ID environment variables)\n- Avoid overloading a single vLLM deployment with more than ~10 concurrent processes due to known issues\n- See debugging output in `fara\u002Fwebeval\u002Fscripts\u002Fstdout.txt`\n\n---\n\n## Analyzing Evaluation Results\n\n### Evaluation Output Structure\n\nEvaluation results are stored under `--out_url` in folders organized by:\n- Model name\n- Dataset\n- Username\n- Run ID\n\nExample path:\n```\n\u002Fruns\u002FWebSurfer-fara-100-max_n_images-3\u002Ffara-7b\u002F\u003Cusername>\u002FWebVoyager_WebVoyager_data_08312025.jsonl\u002F\u003Crun_id>\n```\n\nEach evaluation folder contains:\n- `gpt_eval\u002F` - LLM-as-a-judge evaluation results\n- `traj\u002F` - Per-task trajectory subdirectories containing:\n  - `*-final_answer.json` (e.g., `Amazon--1_final_answer.json`) - `\u003Cno_answer>` indicates abortion or step budget exceeded\n  - `scores\u002F*_eval.json` - LLM judge scores (`gpt_eval.json` for WebVoyager, `WebJudge_Online_Mind2Web_eval-3.json` for Online-Mind2Web)\n  - `web_surfer.log` - Action history and errors\n  - 
`screenshot_X.png` - Screenshots captured before each action X\n  - `times.json` - contains start and end times of task\n  - `core.log` - contains high level logs like if the trajectory needs to be started or is already cached\u002Fdone, the evaluation score, duration, and errors encountered\n\n### Running Analysis\n\nUse the analysis notebook to compute metrics:\n\n```bash\ncd webeval\u002Fscripts\u002Fanalyze_eval_results\u002F\njupyter notebook analyze.ipynb\n```\n\nThe script:\n- Identifies trajectories aborted mid-execution and diagnostic reasons\n- Computes average scores across non-aborted trajectories\n- Distinguishes between aborted trajectories (errors during sampling) and completed trajectories (with terminate() call or step budget exceeded)\n\nTo re-run failed tasks, execute the evaluation script again with the same `run_id` and `username` - it will skip non-aborted tasks.\n\n\u003Cdetails>\n\u003Csummary>Example WebVoyager GPT Eval Result\u003C\u002Fsummary>\n\n```json\n{\n  \"score\": 1.0,\n  \"gpt_response_text\": \"To evaluate the task, we need to verify if the criteria have been met:\\n\\n1. **Recipe Requirement**: A vegetarian lasagna recipe with zucchini and at least a four-star rating.\\n\\n2. **Search and Results**:\\n   - The screenshots show that the search term used was \\\"vegetarian lasagna zucchini.\\\"\\n   - Among the search results, \\\"Debbie's Vegetable Lasagna\\\" is prominently featured.\\n   \\n3. **Evaluation of the Recipe**:\\n   - Rating: \\\"Debbie's Vegetable Lasagna\\\" has a rating of 4.7, which satisfies the requirement of being at least four stars.\\n   - The presence of zucchini in the recipe is implied through the search conducted, though the screenshots do not explicitly show the ingredients list. 
However, the result response confirms the match to the criteria.\\n\\nGiven the information provided, the task seems to have fulfilled the requirement of finding a vegetarian lasagna recipe with zucchini and a four-star rating or higher. \\n\\n**Verdict: SUCCESS**\"\n}\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Example of web_surfer.log (action history)\u003C\u002Fsummary>\n\n```json\n{\"message\": \"\\nThought #1: To begin locating the closest GameStop store, I will click on the 'Trade-In' button, which typically leads to a store locator feature on retail websites.\\nAction #1: executing tool 'left_click' with arguments {\\\"action\\\": \\\"left_click\\\", \\\"coordinate\\\": [1074, 37]}\", \"timestamp\": \"2025-11-25T22:23:42.788320\", \"type\": \"OtherEvent\"}\n{\"message\": \"Observation#1: I clicked at coordinates (1083.0252100840337, 37.16517857142858).\", \"timestamp\": \"2025-11-25T22:23:47.616613\", \"type\": \"OtherEvent\"}\n{\"message\": \"\\nThought #2: Since access to Gamestop's website has been blocked, I will search for alternative sources of store location and hours for GameStop locations near zip code 90028. This could include cached or third-party website information.\\nAction #2: executing tool 'web_search' with arguments {\\\"action\\\": \\\"web_search\\\", \\\"query\\\": \\\"GameStop store locator zip code 90028\\\"}\", \"timestamp\": \"2025-11-25T22:23:50.579702\", \"type\": \"OtherEvent\"}\n{\"message\": \"Observation#2: I typed 'GameStop store locator zip code 90028' into the browser search bar.\", \"timestamp\": \"2025-11-25T22:23:55.119495\", \"type\": \"OtherEvent\"}\n{\"message\": \"\\nThought #3: The closest GameStop store to zip code 90028 is located at 5533 W Sunset Blvd, Los Angeles, CA 90028. It is currently closed and will open tomorrow at 11 AM. 
The contact number is (323) 466-3714.\\nAction #3: executing tool 'terminate' with arguments {\\\"action\\\": \\\"terminate\\\", \\\"status\\\": \\\"success\\\"}\", \"timestamp\": \"2025-11-25T22:23:58.729038\", \"type\": \"OtherEvent\"}\n{\"message\": \"Observation#3: The closest GameStop store to zip code 90028 is located at 5533 W Sunset Blvd, Los Angeles, CA 90028. It is currently closed and will open tomorrow at 11 AM. The contact number is (323) 466-3714.\", \"timestamp\": \"2025-11-25T22:24:02.379069\", \"type\": \"OtherEvent\"}\n```\n\u003C\u002Fdetails>\n\n## Citation\n\nIf you use Fara-7B in your research, please use the following BibTeX entry.\n```bibtex\n@article{fara7b2025,\n  title={Fara-7B: An Efficient Agentic Model for Computer Use},\n  author={Awadallah, Ahmed and Lara, Yash and Magazine, Raghav and Mozannar, Hussein and Nambi, Akshay and Pandya, Yash and Rajeswaran, Aravind and Rosset, Corby and Taymanov, Alexey and Vineet, Vibhav and Whitehead, Spencer and Zhao, Andrew},\n  journal={arXiv:2511.19663},\n  year={2025}\n}\n```\n","Fara-7B is an efficient agentic model designed specifically for computer use. The project is written in Python; its core functionality is carrying out browser operations and computer tasks, handling complex online tasks, and evaluating task completion against precomputed rubrics. Technically, it builds on the Hugging Face platform and uses datasets such as WebTailBench and CUAVerifierBench for training and testing, and it supports a multimodal, rubric-based task verification mechanism. It is suited to research and application scenarios that need to automate web browsing, carry out specific computer tasks, or evaluate such behavior.",2,"2026-06-11 03:40:24","high_star"]