[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81009":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},81009,"LongMemEval-V2","xiaowu0162\u002FLongMemEval-V2","xiaowu0162","Official repository for LongMemEval-V2",null,"Python",52,8,1,0,12,22,23,36,2.86,"Apache License 2.0",false,"main",[],"2026-06-12 02:04:09","# LongMemEval-V2\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fxiaowu0162.github.io\u002Flongmemeval-v2\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐-Website-2a75d0?style=flat-square\" height=\"23\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.12493.pdf\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📝-Paper-d03c36?style=flat-square\" height=\"23\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fxiaowu0162\u002Flongmemeval-v2\" >\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Data-167f5f?style=flat-square\" height=\"23\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fxiaowu0162.github.io\u002Flongmemeval-v2\u002F#leaderboard\" >\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🏆-Leaderboard-d89216?style=flat-square\" height=\"23\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n**LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues**\n\n[Di 
Wu](https:\u002F\u002Fxiaowu0162.github.io\u002F),\n[Zixiang Ji](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fzixiang-ji-56902624b\u002F),\n[Asmi Kawatkar](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fasmi-kawatkar),\n[Bryan Kwan](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkwan-bryan),\n[Jia-Chen Gu](https:\u002F\u002Fjasonforjoy.github.io\u002Findex.html),\n[Nanyun Peng](https:\u002F\u002Fvnpeng.net\u002F), and\n[Kai-Wei Chang](https:\u002F\u002Fkwchang.net\u002F)\n\n\n\nThis is the official LongMemEval-V2 repository. It contains the public\nevaluation harness, data preparation tools, leaderboard packaging utilities,\nand the memory baselines reported with the benchmark.\n\n## Overview\n\nLongMemEval-V2 evaluates whether memory systems can help agents acquire the\nexperience needed to become knowledgeable colleagues in customized\nenvironments. The benchmark pairs manually curated questions with long\nhistories of multimodal web-agent trajectories. A memory system consumes the\ntrajectory history and returns compact evidence for downstream question\nanswering; evaluation targets both answer accuracy and query latency.\n\nLongMemEval-V2 contains:\n\n- 451 manually curated questions.\n- 5 memory abilities.\n- Up to 500 trajectories per haystack.\n- Up to 115M tokens in the largest haystacks.\n- Two domains: web and enterprise.\n- Two public leaderboard tiers: small and medium.\n\nThe benchmark tests five core memory abilities:\n\n- **Static state recall**: remembers important landmarks, page layouts, module\n  affordances, and subtle state differences.\n- **Dynamic state tracking**: understands how states and actions change the\n  environment over time.\n- **Workflow knowledge**: knows the steps needed to complete recurring tasks in\n  customized environments.\n- **Environment gotchas**: recognizes recurring local failure modes and avoids\n  environment-specific traps.\n- **Premise awareness**: detects assumptions that are valid elsewhere but wrong\n  in 
the current deployment.\n\n## Repository Layout\n\n```text\ndata\u002F                 download, preparation, and validation scripts\nevaluation\u002F           evaluation runner, scoring code, configs, and shell wrappers\nleaderboard\u002F          metric merging, LAFS scoring, and submission packaging\nmemory_modules\u002F       memory backend implementations\n```\n\nThe repository implements the following memory modules:\n\n- `no_retrieval`: no memory context.\n- `rag_query_to_slice`: RAG query to raw state slices.\n- `rag_query_to_slice_notes`: RAG query to raw state slices plus trajectory\n  notes.\n- `agentrunbook_r`: AgentRunbook-R.\n- `codex`: vanilla Codex coding-agent memory baseline.\n- `agentrunbook_c`: AgentRunbook-C.\n\n## Setup: Environment\n\nLongMemEval-V2 uses Python 3.11. The default conda environment installs\nPyTorch through `requirements-torch.txt`. For CUDA 12.4 machines, the torch\ninstall command is:\n\n```bash\npip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 \\\n  --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n```\n\nCreate the environment and install the package:\n\n```bash\nPYTHONNOUSERSITE=1 conda env create -f environment.yml\nconda activate lme-v2-release\npip install -e .\n```\n\nResearchers using a different CUDA or CPU setup should install the appropriate\nPyTorch build first, either with a direct `pip install` command or by editing\n`requirements-torch.txt` before creating the environment.\n\nThe environment does not include vLLM. Start or forward your own\nOpenAI-compatible model servers, then point the scripts to them. The paper runs\nuse Qwen3.5-9B as the fixed reader and Qwen3-Embedding-8B for embedding-based\nmethods. 
For `codex` and `agentrunbook_c`, download Codex v0.117.0 separately\nand set `CODEX_BINARY`.\n\n## Setup: Data\n\nDownload and prepare:\n\n```bash\npython data\u002Fdownload_data.py --data-root data\u002Flongmemeval-v2\nexport DATA_ROOT=\"$(pwd)\u002Fdata\u002Flongmemeval-v2\"\npython data\u002Fprepare_data.py --data-root \"$DATA_ROOT\" --mode symlink\npython data\u002Fvalidate_data.py --data-root \"$DATA_ROOT\" --tier small\n```\n\nThe default dataset repository is\n`xiaowu0162\u002Flongmemeval-v2`. Screenshot bundles are stored as `.tar.gz`\narchives under `trajectory_screenshots\u002F`; `prepare_data.py` extracts them when\nneeded and links the resulting directories into:\n\n```text\nscreenshots\u002F\u003Ctrajectory_id>\u002F\u003Cstep>.png\n```\n\n## Setup: Model Endpoints and Software\n\nExample endpoint settings:\n\n```bash\n# for all experiments\nexport READER_BASE_URL=http:\u002F\u002Flocalhost:8023\u002Fv1\nexport READER_MODEL=Qwen\u002FQwen3.5-9B\n\n# additionally for RAG and AgentRunbook-R\nexport LME_CONTROLLER_BASE_URL=http:\u002F\u002Flocalhost:8023\u002Fv1\nexport LME_CONTROLLER_MODEL=Qwen\u002FQwen3.5-9B\nexport LME_EMBEDDING_BASE_URL=http:\u002F\u002Flocalhost:8114\u002Fv1\nexport LME_EMBEDDING_MODEL=Qwen\u002FQwen3-Embedding-8B\n```\n\nSet for LLM judge (default `gpt-5.2` with `medium` reasoning):\n\n```bash\nexport OPENAI_API_KEY=...\n```\n\nFor Codex and AgentRunbook-C:\n\n```bash\nexport CODEX_BINARY=\u002Fpath\u002Fto\u002Fcodex-binary\nexport CODEX_MODEL=gpt-5.4-mini\nexport CODEX_REASONING_EFFORT=xhigh\n```\n\nCodex also expects common command-line tools such as `rg` and `find`.\n\n## Reproducing Baselines\n\nEach shell script accepts extra argparse flags after the environment variables:\n\n```bash\nexport DATA_ROOT=\u002Fpath\u002Fto\u002Flongmemeval-v2\nexport OUTPUT_ROOT=runs\nexport 
TIER=small\n\nevaluation\u002Fscripts\u002Frun_no_retrieval.sh\nevaluation\u002Fscripts\u002Frun_rag_query_to_slice.sh\nevaluation\u002Fscripts\u002Frun_rag_query_to_slice_notes.sh\nevaluation\u002Fscripts\u002Frun_agentrunbook_r.sh\nevaluation\u002Fscripts\u002Frun_codex.sh\nevaluation\u002Fscripts\u002Frun_agentrunbook_c.sh\n```\n\nEach script runs both the web and enterprise domains for the selected tier, writing\noutputs such as `runs\u002Fno_retrieval_web_small` and\n`runs\u002Fno_retrieval_enterprise_small`. Set `TIER=medium` to run LME-V2-Medium.\n\nEach run writes `aggregated_metrics.json`. To combine matching enterprise and web runs for the same method and tier:\n\n```bash\npython leaderboard\u002Fcombine_aggregated_metrics.py \\\n  runs\u002Fagentrunbook_r_enterprise_small\u002Faggregated_metrics.json \\\n  runs\u002Fagentrunbook_r_web_small\u002Faggregated_metrics.json \\\n  -o runs\u002Fagentrunbook_r_small_combined_metrics.json\n```\n\n## Implementing Your Method\n\nMemory backends inherit from `memory_modules.memory.Memory`. For a minimal\nexample, see `memory_modules\u002Fno_retrieval.py`; for indexed retrieval examples,\nsee `memory_modules\u002Frag.py` and `memory_modules\u002Fagentrunbook_r.py`.\n\nA backend should:\n\n- decorate the class with `@register_memory`;\n- set a unique `memory_type`;\n- implement `insert(self, trajectory)`, which receives each full trajectory\n  object selected for the current haystack;\n- implement `query(self, query, query_image=None)`, which receives the question\n  text and optional question screenshot path.\n\n`query` must return a list of memory context items:\n\n```python\n[\n    {\"type\": \"text\", \"value\": \"retrieved notes or evidence\"},\n    {\"type\": \"image\", \"value\": \"\u002Fabsolute\u002For\u002Frelative\u002Fpath\u002Fto\u002Fimage.png\"},\n]\n```\n\nText values must be non-empty strings. Image values must point to existing\nfiles. 
The harness appends these items to the reader prompt and enforces\n`--memory-context-max-tokens` before calling the answer model.\n\nDuring `query`, the backend can call `self.get_query_context()` to access\n`question_id`, `question_type`, and the raw question item. Optional hooks include\n`post_query_hook(...)` for per-query metadata and `_save_backend(...)` \u002F\n`_load_backend(...)` for persisted memory state.\n\nTo run a new backend directly, create a memory config JSON:\n\n```json\n{\n  \"memory_type\": \"your_memory_type\",\n  \"memory_params\": {}\n}\n```\n\nThen pass it to `evaluation\u002Fharness.py` with `--memory-config-path`. To expose\nthe method through `evaluation\u002Frun_eval.py` and the shell wrappers, add the\nmethod name and config construction there as well.\n\n## Submitting to Leaderboard\n\nLeaderboard entries measure how much a memory system improves the released\nbaseline + AgentRunbook accuracy-latency frontier. The score is LAFS gain over\nthe fixed reference frontier, and a submission may include multiple latency\noperating points for the same method and tier.\n\nSee [leaderboard\u002FREADME.md](leaderboard\u002FREADME.md) for the full packaging\ninstructions.\n\nSubmit leaderboard packages through the\n[submission form](https:\u002F\u002Fforms.gle\u002FrxUpiuRKDERqpqSi9). Please do not submit\nleaderboard entries as GitHub issues. 
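Informal submission issues will be closed\nor deleted.\n\n## Citation\n\n\n```bibtex\n@article{wu2026longmemevalv2,\n      title={LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues},\n      author={Di Wu and Zixiang Ji and Asmi Kawatkar and Bryan Kwan and Jia-Chen Gu and Nanyun Peng and Kai-Wei Chang},\n      year={2026},\n      eprint={2605.12493},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12493},\n}\n```\n","LongMemEval-V2 is a benchmark for evaluating whether long-term memory systems can help agents acquire the knowledge needed to become experienced colleagues in customized environments. It tests five memory abilities (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness) and provides 451 manually curated questions, with up to 500 trajectories per haystack. The project is implemented in Python and supports multimodal data processing and efficient querying. It is intended for developers and researchers who study or apply long-term memory mechanisms in web and enterprise scenarios.",2,"2026-06-11 04:03:10","CREATED_QUERY"]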