[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11328":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":8,"pushedAt":8,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":28,"discoverSource":29},11328,"natural_language_autoencoders","kitft\u002Fnatural_language_autoencoders","kitft",null,"Python",783,103,7,2,0,25,36,270,75,10.05,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:02:31","# Natural Language Autoencoders (NLA)\n\nOpen-source library accompanying the Anthropic Transformer Circuits post\n**[Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations](https:\u002F\u002Ftransformer-circuits.pub\u002F2026\u002Fnla\u002Findex.html)**.\n\n📄 [Blog post](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Fnatural-language-autoencoders) · ▶ [Video walkthrough](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=j2knrqAzYVY) · 🔬 [Try the released NLAs on Neuronpedia](https:\u002F\u002Fwww.neuronpedia.org\u002Fnla)\n\n---\n\nA Natural Language Autoencoder is a pair of fine-tuned LMs that map\nresidual-stream activation vectors to natural language and back:\n\n| | direction | mechanism |\n|---|---|---|\n| **AV** (activation verbalizer) | `vector → text` | inject the vector as a single token embedding into a fixed prompt, autoregress a description |\n| **AR** (activation reconstructor) | `text → vector` | truncated K+1-layer LM + `Linear(d, d)` head, extract at the final token |\n\nBoth vectors are L2-normalised 
before comparison, so the round-trip\n`MSE(reconstructed, original) = 2(1 − cos)` measures direction agreement only.\nLow MSE means the AR could recover the original direction from the AV's words\nalone, which implies the explanation captures the information in the vector.\n\nThis is the **full training repo** — data generation, SFT, GRPO RL, and\ncheckpoint conversion. For a lightweight inference-only package (just\n`NLAClient` + `NLACritic`, no training deps), see\n[`kitft\u002Fnla-inference`](https:\u002F\u002Fgithub.com\u002Fkitft\u002Fnla-inference).\n\n> **A note on naming.** Public-facing names are **AV** \u002F **AR**. Inside the\n> `nla\u002F` package you will see **actor** \u002F **critic** — those are the same two\n> models, named to map directly onto Miles' RL primitives (the AV *is* the\n> policy actor; the AR *is* the value critic). The codebase keeps actor\u002Fcritic\n> so the Miles extension points read naturally; everywhere user-facing we use\n> AV\u002FAR.\n\n---\n\n## Released checkpoints\n\nAll eight checkpoints are gathered in the\n**[`kitft\u002Fnla-models` collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fkitft\u002Fnla-models)**\non the HF Hub — four base-model families, each with an AV and an AR. 
We extract\nfrom a layer roughly **two-thirds of the way through the model** in each case\n— deep enough that the residual stream carries rich semantic content, shallow\nenough that it hasn't yet collapsed toward the unembedding.\n\n| base model | layer | d_model | AV | AR |\n|---|---|---|---|---|\n| Qwen2.5-7B-Instruct | 20 \u002F 28 | 3584 | [`kitft\u002Fnla-qwen2.5-7b-L20-av`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-qwen2.5-7b-L20-av) | [`kitft\u002Fnla-qwen2.5-7b-L20-ar`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-qwen2.5-7b-L20-ar) |\n| Gemma-3-12B-IT | 32 \u002F 48 | 3840 | [`kitft\u002Fnla-gemma3-12b-L32-av`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-gemma3-12b-L32-av) | [`kitft\u002Fnla-gemma3-12b-L32-ar`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-gemma3-12b-L32-ar) |\n| Gemma-3-27B-IT | 41 \u002F 62 | 5376 | [`kitft\u002Fnla-gemma3-27b-L41-av`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-gemma3-27b-L41-av) | [`kitft\u002Fnla-gemma3-27b-L41-ar`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002Fnla-gemma3-27b-L41-ar) |\n| Llama-3.3-70B-Instruct | 53 \u002F 80 | 8192 | [`kitft\u002FLlama-3.3-70B-NLA-L53-av`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002FLlama-3.3-70B-NLA-L53-av) | [`kitft\u002FLlama-3.3-70B-NLA-L53-ar`](https:\u002F\u002Fhuggingface.co\u002Fkitft\u002FLlama-3.3-70B-NLA-L53-ar) |\n\nEach checkpoint ships an `nla_meta.yaml` sidecar with the prompt template,\ninjection token IDs, and scale factors that the model was trained with — load\nthose, never hardcode them.\n\n---\n\n## How it fits together\n\nNLA training is built as a thin extension on top of two open-source projects:\n\n- **[Miles](https:\u002F\u002Fgithub.com\u002Fradixark\u002Fmiles)** — Ray-orchestrated RL training\n  (FSDP2 \u002F Megatron backends, GRPO, async rollout). We used the FSDP backend\n  for the 7B\u002F12B\u002F27B runs and Megatron only for Llama-70B. 
NLA plugs in via Miles'\n  upstream `--custom-rm-path`, `--data-source-path`, and\n  `--custom-generate-function-path` extension points; the integration patch in\n  `nla\u002Fmiles_patches\u002F` adds `--custom-actor-cls-path` and `--force-use-critic`\n  on top (see [docs\u002Fdesign.md §2](docs\u002Fdesign.md)).\n- **[SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang)** — rollout serving. We\n  send `input_embeds` (not `input_ids`) so the AV sees the injected vector;\n  SGLang serves it like any other request. The embed sequence is built on the\n  **trainer side** — we look up the prompt tokens in the actor's own embedding\n  table, splice the activation vector in at the injection slot, and ship the\n  finished `[seq, d]` tensor over HTTP. SGLang never needs to know what an\n  injection is. We don't apply any learned map to the injected vector in this\n  work — it goes in raw (after a fixed scalar `injection_scale`) — but this\n  design means a future affine `W·v + b` adapter would be a trainer-side-only\n  change: apply it before sending, no SGLang modification required. (vLLM also\n  supports `input_embeds` and would work as a drop-in alternative.)\n\nWe chose this stack because it is **near-frontier training infrastructure**:\nMiles + Megatron is what production-scale RL post-training looks like, and\nhooking onto it cleanly is what let us scale to RL-ing a 70B-parameter AV — and\nlikely further. 
The `nla\u002F` package never modifies Miles or SGLang in place; it\nonly subclasses and registers function-pointer hooks, so upstream updates pull\nin cleanly.\n\n---\n\n## Quick start\n\n### Inference (use a released checkpoint)\n\n```bash\nuv pip install torch transformers safetensors httpx orjson pyyaml numpy\nuv pip install \"sglang[all]>=0.5.6\"\n\npython -m sglang.launch_server --model-path kitft\u002Fnla-qwen2.5-7b-L20-av \\\n    --port 30000 --disable-radix-cache &\n\npython nla_inference.py kitft\u002Fnla-qwen2.5-7b-L20-av \\\n    --sglang-url http:\u002F\u002Flocalhost:30000 \\\n    --parquet path\u002Fto\u002Factivations.parquet\n```\n\nDon't have a parquet yet? Any file with an `activation_vector` column of\n`d_model`-wide float lists will do — here's a minimal one for Qwen layer 20:\n\n```python\nimport torch, pyarrow as pa, pyarrow.parquet as pq\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\ntok = AutoTokenizer.from_pretrained(\"Qwen\u002FQwen2.5-7B-Instruct\")\nm = AutoModelForCausalLM.from_pretrained(\"Qwen\u002FQwen2.5-7B-Instruct\",\n        torch_dtype=torch.bfloat16, device_map=\"cuda\")\nids = tok(\"The quick brown fox jumps over the lazy dog.\", return_tensors=\"pt\").to(\"cuda\")\nhs = m(**ids, output_hidden_states=True).hidden_states[20][0]  # [seq, 3584]\npq.write_table(pa.table({\"activation_vector\": hs.float().cpu().tolist()}), \"demo.parquet\")\n```\n\n(Or omit `--parquet` entirely for a smoke test on a random unit vector.)\n\n`nla_inference.py` is a single self-contained file. The full recipe —\nmodel-specific scale factors, the Gemma `√d` embed-scale gotcha, debugging the\n\"output is in Chinese\" failure mode, AR scoring — is in\n**[docs\u002Finference.md](docs\u002Finference.md)**. 
Worked transcripts in\n[`examples\u002F`](examples\u002F).\n\n### Training (reproduce a checkpoint)\n\nInstall Miles + SGLang + this package per **[docs\u002Fsetup.md](docs\u002Fsetup.md)**,\nthen run the three stages (Qwen7B reference: SFT on 2×H100-80GB; RL to ~75% FVE\non 2×8×H100 — see [`configs\u002FTRAINING_NOTES.md`](configs\u002FTRAINING_NOTES.md)):\n\n```bash\n# 0. Generate data (GPU + ANTHROPIC_API_KEY)\npython -m nla.datagen.run_pipeline --config configs\u002Fdatagen\u002Fqwen7b_fineweb_1M.yaml\n\n# 1. AR SFT (MSE on raw activations)\nbash configs\u002Fcritic_sft.sh\n\n# 2. AV SFT (next-token on API-generated explanations, with injection)\nbash configs\u002Factor_sft.sh\n\n# 3. RL: simultaneous AV (GRPO) + AR (supervised); reward = -mse_nrm\nbash configs\u002Frl.sh\n```\n\nThe full design — data transport through Miles' `multimodal_train_inputs`, the\ninjection forward-hook, simultaneous AV\u002FAR scheduling, why `cp_size==1` — is in\n**[docs\u002Fdesign.md](docs\u002Fdesign.md)**. 
Detailed profiling and hyperparameter\nnotes (Qwen7B case study; we reused those settings with only light adjustment\nfor the other models — a per-model sweep would likely do better):\n[`configs\u002FTRAINING_NOTES.md`](configs\u002FTRAINING_NOTES.md).\n\n---\n\n## Repo layout\n\n```\nnla\u002F                  core package\n  schema.py, config.py, models.py     — sidecar contract, NLACriticModel (the AR)\n  train_actor.py                      — NLAFSDPActor (Miles FSDP subclass)\n  megatron\u002F                           — NLAMegatronActor (TP+PP, CP=1 only)\n  rollout\u002F                            — SFT rollout, nla_generate (SGLang input_embeds)\n  reward.py, loss.py                  — -mse_nrm reward, AR MSE loss\n  datagen\u002F                            — 4-stage activation → parquet pipeline\nconfigs\u002F              training shell configs + datagen YAMLs\nscripts\u002F              multi-GPU launch wrappers (datagen)\npatches\u002F              SGLang training patches (bf16 transport, chunked-prefill) + apply script\ntools\u002F                FSDP-DCP \u002F Megatron-dist ↔ HF checkpoint converters\ndocs\u002F                 design.md (training), inference.md (serving)\nrelease\u002F              HF model-card templates + sidecar sanitiser for releases\nnla_inference.py      standalone single-file inference client\nexamples\u002F             worked decode transcripts\n```\n\n---\n\n## Citation\n\nFor attribution in academic contexts, please cite this work as\n\n> Fraser-Taliente, Kantamneni, Ong et al., \"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations\", Transformer Circuits, 2026.\n\n```bibtex\n@article{frasertaliente2026nla,\n  author  = {Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. 
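and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and Marks, Samuel},\n  title   = {Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},\n  journal = {Transformer Circuits Thread},\n  year    = {2026},\n  url     = {https:\u002F\u002Ftransformer-circuits.pub\u002F2026\u002Fnla\u002Findex.html}\n}\n```\n\n## License\n\nApache-2.0 ([LICENSE](LICENSE)). Released checkpoints additionally inherit the\nlicense of their base model (Gemma, Llama-3.3) — see the NOTICE files in each\nHF repo.\n","This project is an open-source library implementing Natural Language Autoencoders (NLA): a pair of fine-tuned language models that convert between residual-stream activation vectors and natural language. Its core functions are turning a vector into a text description via the activation verbalizer (AV) and turning that text back into a vector via the activation reconstructor (AR). The vectors are L2-normalized before comparison so that the score measures direction agreement, and activations are extracted from a specific layer of the base model. It suits research that needs unsupervised explanations of the internal activations of large language models, such as understanding model behavior or generating human-readable explanations.","2026-06-11 03:31:39","CREATED_QUERY"]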