[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11327":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},11327,"codeneedle","alexziskind1\u002Fcodeneedle","alexziskind1",null,"Python",294,49,7,2,0,6,16,80,18,5.1,false,"main",true,[],"2026-06-12 02:02:31","# Positional Recall Benchmark\n\nReproduces the benchmark from the YouTube video (see `benchmark_plan.md`):\nstuff a large source corpus into an LLM's context, then ask it to reproduce\nthe first N lines of specific named functions verbatim. Measures positional\nrecall under long context, not just named-entity lookup.\n\n## Install\n\nThis project uses [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) for Python environment management.\n\n```\n# Create a venv in .venv\u002F and install the deps from requirements.txt\nuv venv\nuv pip install -r requirements.txt\n```\n\nRun any project script via `uv run` (no `source .venv\u002Fbin\u002Factivate` needed):\n\n```\nuv run python bench.py run --corpus http_server --model qwen36-35b\nuv run python analysis\u002Fvisualize.py\nuv run python smoke_test.py\n```\n\nIf you'd rather activate the venv:\n\n```\nsource .venv\u002Fbin\u002Factivate\npython3 bench.py run --corpus http_server --model qwen36-35b\n```\n\nThe rest of this README writes commands as `python3 …` for brevity — prepend\n`uv run ` if your venv isn't active.\n\n## Quick start\n\n```\n# 1. 
(LM Studio only) make sure your model is loaded with enough context.\n#    Defaults can silently sit at 4K. Force-reload at 128K:\nlms unload qwen3.6-35b-a3b\nlms load qwen3.6-35b-a3b --context-length 131072 --gpu max -y\n\n# 2. Pick a corpus + a model and run them (assumes .venv is active; otherwise prepend `uv run`):\npython3 bench.py run --corpus http_server --model qwen36-35b\n\n# 3. Result is auto-saved as results\u002F\u003Ccorpus>__\u003Cmodel>.json.\n```\n\n## Layout\n\n```\nconfigs\u002F\n  corpora\u002F        what files to test, sample size — one TOML per corpus\n  models\u002F         model identifier and per-model knobs — one TOML per model\nfixtures\u002F         source files to test against (jquery.js, http_server.py, …)\nresults\u002F          JSON dumps from every run, auto-named \u003Ccorpus>__\u003Cmodel>.json\nanalysis\u002F\n  visualize.py    Plotly dashboard builder\n  charts\u002F         generated HTML output (gitignored)\n  VIZ_README.md   chart-by-chart explanation + how to extend\n.secrets\u002F         API keys for hosted endpoints (gitignored, perms 700)\nbench\u002F            package internals\nbench.py          CLI entry\n```\n\n## Configs\n\nThe split is by axis-of-change. You rarely change which files to test, but\nyou constantly compare different models — so an N×M comparison needs only\nN+M files, not N*M.\n\n### Hosted models — API keys\n\nDon't put real keys in committed config files. The recommended workflow:\n\n```bash\nmkdir -p .secrets && chmod 700 .secrets\necho 'sk-...' 
> .secrets\u002Fopenai.key\nchmod 600 .secrets\u002Fopenai.key\n```\n\nThen reference it from a model config:\n\n```toml\n# configs\u002Fmodels\u002Fgpt-5.5.toml\nname              = \"gpt-5.5\"\nbase_url          = \"https:\u002F\u002Fapi.openai.com\"\napi_key_file      = \".secrets\u002Fopenai.key\"   # path resolved from repo root\ntemperature       = 1.0\nmax_tokens        = 8000\nreasoning_effort  = \"none\"\nuse_max_completion_tokens = true\n```\n\n`.secrets\u002F` and any `*.key` file are already in `.gitignore`. Verify with\n`git check-ignore -v .secrets\u002Fopenai.key` — you should see a match.\n\nAlternatives: `api_key_env = \"OPENAI_API_KEY\"` (read from environment), or\n`api_key = \"...\"` (literal — only for non-secret tokens like LM Studio's\n`\"not-needed\"` placeholder).\n\nFull hosted-model details and known per-API quirks:\n[`configs\u002FCONFIG_README.md → Hosted models`](configs\u002FCONFIG_README.md#hosted-models--api-keys-and-security).\n\n> Field-by-field reference for every TOML key, plus recipes for adding a new\n> corpus or model, lives in [`configs\u002FCONFIG_README.md`](configs\u002FCONFIG_README.md).\n\n### Corpora — `configs\u002Fcorpora\u002F\u003Cname>.toml`\n\n```toml\n[files]\ndirectory = \"fixtures\"   # required\nglob      = \"*.js\"       # required\nlimit     = 1            # optional cap on matched files (sorted lexically)\n\n[sample]\nk    = 16                # number of functions to test\nseed = 42\n```\n\nShipped:\n- `http_server` — single ~50KB Python file, fits any context, fast iteration\n- `jquery` — ~280KB \u002F ~80K-token JS, closest to the video's setup (needs ≥100K loaded context)\n\nIf `glob` matches multiple files, they're concatenated with comment-marker\nheaders (`# === path ===` \u002F `\u002F\u002F === path ===`) so the model sees file\nboundaries. 
Cross-file name collisions are deduplicated (first occurrence\nwins), and the prompt qualifies by file path when more than one file is in play.\n\n### Models — `configs\u002Fmodels\u002F\u003Cname>.toml`\n\n```toml\nname              = \"qwen3.6-35b-a3b\"      # required (model id the server knows)\nbase_url          = \"http:\u002F\u002Flocalhost:1234\"\napi_key           = \"not-needed\"           # optional\ntemperature       = 0.0\nmax_tokens        = 6000                   # leave room for reasoning models\ntimeout           = 600.0\nsuppress_thinking = true                   # appends \u002Fno_think (harmless when ignored)\n```\n\nShipped:\n- `qwen3-4b` — small, honors `\u002Fno_think`, `max_tokens=1500` is fine\n- `qwen36-35b` — reasoning-on-by-default, ignores every thinking-disable knob; needs `max_tokens=6000`\n\nIf you pass `--model FOO` and there's no matching config file, FOO is treated\nas a raw model identifier with sane defaults — so you don't *have* to write a\nconfig to do a one-off run, but for repeated use it's worth pinning the knobs.\n\n### How the two configs combine at run time\n\nEvery `run` invocation needs **one corpus** (`--corpus NAME` or `--file PATH`)\nand **one model** (`--model NAME`). They're resolved independently and stitched\ntogether — there is no shared parent file or inheritance.\n\n**Resolution order**, for both flags:\n1. If the value points to an existing file on disk, load it.\n2. Otherwise look it up by name under `configs\u002Fcorpora\u002F\u003Cname>.toml` or\n   `configs\u002Fmodels\u002F\u003Cname>.toml`.\n3. (`--model` only) If still not found, treat the value as a raw model\n   identifier and use built-in defaults. A note is printed so you know the\n   fallback was taken.\n\n**Override layering**, applied in order (later wins):\n1. defaults baked into the loader (`max_tokens=6000`, `temperature=0`, …)\n2. fields set in the **model config** file\n3. 
CLI overrides — `--base-url`, `--max-tokens`, `--temperature`, `--timeout`,\n   `--api-key`\n4. sampling overrides (`-k`, `--seed`) layer over the **corpus config**'s\n   `[sample]` the same way\n\nThis means model knobs can come from anywhere on the chain. A typical config\nsets the model-specific defaults (e.g. `max_tokens=6000` for a reasoning model)\nand you override per-run knobs (`--max-tokens 8000` for a hard case) without\nediting the file.\n\n**`--think`** flips one bit: it inverts `suppress_thinking` so chain-of-thought\nis left on. Useful when you specifically want to compare reasoning vs.\nno-reasoning recall on a model that supports both.\n\n**Output filename** is `results\u002F\u003Ccorpus.name>__\u003Cmodel.name>.json`, where each\n`name` is the **config stem** (filename without `.toml`). Raw-model fallback\nsanitizes the identifier (`\u002F` → `_`). Override the whole path with `--dump`.\n\nMental model: corpus = *what to ask*, model = *who to ask and how*. Keep them\northogonal.\n\n## Commands\n\n```\n# Run a benchmark\npython3 bench.py run --corpus http_server --model qwen36-35b\n\n# Compare models on the same corpus\npython3 bench.py run --corpus jquery --model qwen3-4b\npython3 bench.py run --corpus jquery --model qwen36-35b\n\n# Override anything from the CLI\npython3 bench.py run --corpus jquery --model qwen36-35b -k 8 --max-tokens 8000\n\n# Test only specific functions (skips sampling)\npython3 bench.py run --corpus http_server --model qwen36-35b \\\n    --function is_cgi --function translate_path\n\n# Use a raw model identifier (no config file needed)\npython3 bench.py run --corpus http_server --model \"qwen\u002Fqwen3-4b\"\n\n# Single-file mode (no corpus config)\npython3 bench.py run --file fixtures\u002Fhttp_server.py --model qwen36-35b\n\n# See what would be tested\npython3 bench.py extract --corpus http_server          # sampled\npython3 bench.py extract --corpus http_server --all    # every extractable function\npython3 bench.py 
extract --corpus http_server --show is_cgi   # ground truth\n\n# Re-score a prior dump without re-querying\npython3 bench.py rescore results\u002Fhttp_server__qwen36-35b.json\n\n# Build Plotly dashboards comparing every run in results\u002F\npython3 analysis\u002Fvisualize.py\n# -> analysis\u002Fcharts\u002F\u003Ccorpus>.html + analysis\u002Fcharts\u002Findex.html\n# (see analysis\u002FVIZ_README.md for what each chart shows)\n```\n\nSupported source languages: `.js`, `.mjs`, `.cjs` (esprima), `.py` (`ast`).\n\n## Reading the output\n\nPer-function diff uses colors matching the video:\n\n- **gray**       — matched line (expected + produced at correct position)\n- **orange**     — expected but missing from the output\n- **yellow**     — hallucinated \u002F mangled line\n- **blue\u002Fcyan**  — extra correct lines past the primary 20 (bonus)\n\nPass threshold per function: ≥ 8 of the 20 expected lines matched.\n\n## Server setup notes\n\nFor fair comparison matching the video:\n\n- **llama.cpp**: `--ctx-size 131072 --cache-type-k q8_0 --cache-type-v q8_0`,\n  prompt caching on (default in recent builds).\n- **LM Studio**: set context length to cover the file, enable \"KV cache quantization\"\n  → Q8. Prefix cache is automatic.\n- **Ollama**: set `num_ctx` via Modelfile or per-request; no KV quant yet, so\n  comparison isn't apples-to-apples.\n\nKeep temperature at 0. Default `max_tokens=6000` to leave room for reasoning models.\n\n### LM Studio gotchas we hit (read before debugging)\n\n1. **`lms ps` lies about context size after JIT loads.** If large prompts fail\n   with a 400 \"context length\" error despite `lms ps` showing a big number,\n   force-reload:\n   ```\n   lms unload \u003Cmodel>\n   lms load \u003Cmodel> --context-length 131072 --gpu max -y\n   ```\n2. **Auto-unload by idle TTL** (default ~60 min). After it expires, the next\n   request triggers a JIT reload at *default settings*, silently dropping your\n   large context. 
Either disable TTL in the LM Studio UI or re-load before\n   each session.\n3. **Reasoning models** (qwen3.5, qwen3.6, …) do not honor `\u002Fno_think`,\n   `enable_thinking: false`, `reasoning_effort: \"none\"`, or any other API toggle\n   we tested. The benchmark still appends `\u002Fno_think` (harmless if ignored), but\n   you must give the budget for chain-of-thought *plus* the answer. Default\n   `max_tokens=6000`; bump to 8000+ if responses come back empty.\n\n## Module map\n\n- `benchmark_plan.md` — analysis of what the benchmark measures and why\n- `bench.py` — CLI entry\n- `bench\u002Fconfig.py` — TOML config loader\n- `bench\u002Fextract.py` — function extraction + multi-file source aggregation\n- `bench\u002Fclient.py` — tiny OpenAI-compatible client\n- `bench\u002Fscorer.py` — LCS alignment, line classification, pass\u002Ffail\n- `bench\u002Freport.py` — ANSI color rendering\n- `bench\u002Frunner.py` — orchestration: prompt assembly, query, score, dump\n- `analysis\u002Fvisualize.py` — builds Plotly HTML dashboards from `results\u002F*.json`\n  (see [`analysis\u002FVIZ_README.md`](analysis\u002FVIZ_README.md) for chart-by-chart details)\n- `smoke_test.py` — end-to-end sanity check without an LLM\n","Codeneedle is a benchmark tool for evaluating the positional recall of large language models in long contexts. It places a large amount of source code into the model's context, then asks the model to reproduce the first N lines of specific functions exactly, measuring positional recall rather than simple entity lookup. The project is written in Python, uses uv for environment management, and supports multiple model configurations and corpus choices. It suits scenarios that require comparing recall performance across language models, such as academic research and model selection.","2026-06-11 03:31:39","CREATED_QUERY"]