[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81032":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":10,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":12,"stars30d":12,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":13,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":14,"fork":14,"defaultBranch":15,"hasWiki":16,"hasPages":14,"topics":17,"createdAt":8,"pushedAt":8,"updatedAt":18,"readmeContent":19,"aiSummary":20,"trendingCount":12,"starSnapshotCount":12,"syncStatus":11,"lastSyncTime":21,"discoverSource":22},81032,"talkie-coder","RicardoDominguez\u002Ftalkie-coder","RicardoDominguez",null,"Python",29,2,0,1.43,false,"main",true,[],"2026-06-12 02:04:09","# From 1930 to SWE-bench\n\n[\u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Ffront\u002Fassets\u002Fhuggingface_logo-noborder.svg\" alt=\"🤗\" width=\"18\"\u002F> Models and training data](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fricdomolm\u002F1930-coder) · [\u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Ffront\u002Fassets\u002Fhuggingface_logo-noborder.svg\" alt=\"🤗\" width=\"18\"\u002F> Eval trajectories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fricdomolm\u002Feval-trajs-1930-coder) (1930 + web subsets, ⚠️ test data — do not train)\n\nWe fine-tune Alec Radford's 1930 vintage LLM — pre-trained only on\npre-1931 data — to solve SWE-bench issues. \n\nAfter just 250 training\nexamples the model lands [its first fix](https:\u002F\u002Fricardodominguez.github.io\u002Fblogs\u002Fpydata__xarray-4629.traj.html) (a small patch to xarray);\nscaled to ~75K trajectories (1B tokens), it reaches **4.5% pass@1** on\nSWE-bench-Verified, up from 4% pass@100 on HumanEval at the base. 
\n\nThe\nsibling web-pretrained model gets to 5.5% pass@1 on the same recipe —\nsurprisingly little seems to be lost by throwing away the internet.\n\nIf you have more compute to spare, we'd love to see the full scaling\ncurves comparing the 1930 and web-pretrained models as post-training is scaled up.\n\nThis repo contains the SFT recipes, eval pipeline, and analysis for\nfine-tuning `talkie-lm\u002Ftalkie-1930-13b` and `talkie-lm\u002Ftalkie-web-13b`\non SWE-bench-style agent trajectories.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"analysis\u002Fplots\u002Fscaling.png\" alt=\"SWE-bench pass@1 vs fine-tuning tokens (1930 model)\" width=\"49%\"\u002F>\n  \u003Cimg src=\"analysis\u002Fplots\u002Fweb1930.png\" alt=\"Talkie 1930 vs Talkie Web pass@1 after fine-tuning\" width=\"49%\"\u002F>\n\u003C\u002Fp>\n\n## Results\n\n1. **2e-5 SFT on talkie-1930 base** — 12 h FSDPv2 run on 8×B200 over a\n   100 K-trajectory SWE dataset, 2016 steps at 64 K context, lr=2e-5\n   cosine. ckpt-2000 reaches **4.48 % pass@1** on a 446-instance\n   SWE-bench-Verified-Working-Harbor subset (5x mean, σ=0.69 pp).\n   See [`sft\u002Frun_swe_sft_12h_v2_lr2e5.sh`](sft\u002Frun_swe_sft_12h_v2_lr2e5.sh).\n\n2. **2e-5 SFT on talkie-web base** — sibling run on the\n   FineWeb-pretrained talkie-web variant, same recipe, same data\n   (re-tokenized with the talkie-web BPE). ckpt-2000 reaches\n   **5.75 % pass@1** (3x mean, σ=1.04 pp). See\n   [`sft\u002Frun_swe_sft_12h_web_v2_lr2e5.sh`](sft\u002Frun_swe_sft_12h_web_v2_lr2e5.sh).\n\n3. **Few-data fine-tuning sweep** — minimum-unique-data scan on the\n   talkie-1930-it (instruct) checkpoint over a verified-trajectory\n   subset of `ricdomolm\u002Fmini-coder-trajs-400k`. 
5 runs vary unique-example\n   count and epochs; pass@1 is graded on a 42-instance union subset.\n   Launcher: [`sft\u002Frun_minimal_sft.sh`](sft\u002Frun_minimal_sft.sh).\n\nThe plots, pass@1 numbers, and the code that reduces the eval JSONs are\nin [`analysis\u002Fplots.ipynb`](analysis\u002Fplots.ipynb).\n\n## Layout\n\n```\nsft\u002F                                       training entrypoint, modeling, data prep\n  ft_trl.py                                trainer (TRL SFTTrainer + chat-token wd-skip)\n  modeling_talkie.py                       Liger-CE + varlen + GC + ALL_ATTENTION_FUNCTIONS dispatch (vLLM-compat)\n  configuration_talkie.py\n  accelerate_config_talkie.yaml            FSDPv2, 8 ranks, FULL_STATE_DICT\n  convert_base_to_hf.py                    talkie-1930 base.ckpt → HF safetensors\n  convert_web_base_to_hf.py                talkie-web base.ckpt → HF safetensors\n  build_talkie_web_tokenizer.py            tiktoken vocab.txt → HF tokenizer.json\n  reinit_chat_tokens.py                    talkie-1930 chat-token row reinit\n  reinit_chat_tokens_web.py                talkie-web chat-token row reinit\n  repackage_for_vllm.py                    bake lm_head_gain → lm_head.weight for vLLM serving\n  tokenize_messages.py                     per-shard chat-formatted tokenization\n  jobs_tokenize.py                         HTCondor distributed launcher (250 jobs)\n  latest_trl.sh                            env activation wrapper for HTCondor\n  verify_tokenization.py                   spot-check a packed shard\n  check_chat_token_norms.py                inspect embed\u002Flm_head norms\n  make_swe_100k_jsonl.py                   subsample final-400k → 100K JSONL\n  make_mini_coder_verified_max3_jsonl.py   verified subset, ≤3 trajs\u002Finstance\n  make_mini_coder_mix_max3_jsonl.py        verified + length-targeted unverified mix\n  run_swe_sft_12h_v2_lr2e5.sh              ★ result 1: 2e-5 base SFT\n  run_swe_sft_12h_web_v2_lr2e5.sh          ★ result 2: 
2e-5 web SFT\n  run_minimal_sft.sh                       ★ result 3: few-data sweep launcher\n  CHAT_TOKEN_COLLAPSE.md                   diagnostic for the chat-token weight-decay bug\n  CHAT_TOKEN_COLLAPSE_FIX.md               the wd-skip + reinit fix\n\neval\u002F                                      SWE-bench eval pipeline (vLLM + mini-swe-agent + harness)\n  README.md                                end-to-end eval recipe (read this first)\n  docker_setup.sh                          rootless-docker startup\n  localconfig_qwen3_train_aligned.yaml     mini-swe-agent eval config (matches training prompts)\n  transfer_config.sh                       inject api_base + sampling into the eval config\n  launch_parallel_eval.sh                  fan-out HTCondor submit (n_jobs × instances)\n  run_mini_subjob.sh                       per-job entry (sets OUTPUT_DIR + slice)\n  run_mini_eval.sh                         per-job: vLLM serve + mini-extra swebench + tmpfs drain\n  run_grade.py                             swebench harness wrapper (resolved\u002Funresolved breakdown)\n  run_grade_wrapper.sh                     condor wrapper around run_grade.py\n  grade_ckpt400.sub                        reference grade-job condor submit file\n  launch_pass5_subset.sh                   fan out N×eval-runs against a fixed instance subset (variance reduction)\n  grade_pass5.sh                           grade all pass-N runs in one shot\n  pass5_watcher.py                         watcher that auto-restarts stuck pass-N runs\n  post_sft_pipeline.sh                     end-to-end: repackage → eval → grade\n  summarize_sweep.py                       reduce graded reports to sweep_summary.json\n  multiple_evals.md                        usage notes for variance-reduced pass@1\n\nanalysis\u002F\n  plots_issue7.ipynb                       loads eval JSONs, builds the 2 plots\n  sweep_summary.json                       few-data sweep aggregate\n  swe_bench_eval_union42.json              
42-instance union subset\n  talkie-1930-v2-lr2e5-ckpt2000-pass5-run{1..5}.json   5x graded eval, 1930\n  talkie-web-v2-lr2e5-ckpt2000-pass3-run{1..3}.json    3x graded eval, web\n  plot1_pass1_vs_examples.png\n  plot1_pass1_vs_steps.png\n  plot2_web_vs_1930.png\n```\n\n## Pipelines\n\nThe three runs share the same shape: build base model on `\u002Ffast`,\nbuild dataset JSONL, distributed-tokenize at 64 K, launch SFT.\n\n**1. talkie-1930 2e-5 SFT**\n\n```bash\n# 1. Build the HF base model from talkie-lm\u002Ftalkie-1930-13b base.ckpt + vocab.\n#    (downloads handled outside the repo; convert_base_to_hf.py reads from\n#    \u002Ftmp\u002Ftalkie-1930-13b-base\u002F, writes to \u002Ffast\u002Frolmedo\u002Fmodels\u002Ftalkie-1930-13b-base\u002F.)\npython sft\u002Fconvert_base_to_hf.py\npython sft\u002Freinit_chat_tokens.py        # clone \u003C|endoftext|> row into the 4 chat-token rows\n\n# 2. Subsample the 100K SWE trajectories.\npython sft\u002Fmake_swe_100k_jsonl.py\n\n# 3. Tokenize at 64K (250 HTCondor shards).\ncd sft && python jobs_tokenize.py        # writes \u002Ffast\u002F...\u002Ftalkie-1930-swe-100k-64k\u002Fjob_{0..249}\u002F\n\n# 4. Train.\nbash sft\u002Frun_swe_sft_12h_v2_lr2e5.sh     # 2016 steps × ~20s\u002Fstep ≈ 12h on 8×B200\n```\n\n**2. talkie-web 2e-5 SFT**\n\n```bash\n# 1. Build talkie-web tokenizer from vocab.txt, then HF base model.\npython sft\u002Fbuild_talkie_web_tokenizer.py\npython sft\u002Fconvert_web_base_to_hf.py\npython sft\u002Freinit_chat_tokens_web.py     # writes -reinit\u002F side-by-side\n\n# 2. Re-tokenize the same SWE-100K JSONL with the talkie-web BPE.\n#    Edit jobs_tokenize.py to point at the talkie-web tokenizer dir, then:\ncd sft && python jobs_tokenize.py\n\n# 3. Train.\nbash sft\u002Frun_swe_sft_12h_web_v2_lr2e5.sh\n```\n\n**3. Few-data sweep (5 runs)**\n\n```bash\n# 1. 
Build the verified mini-coder JSONL (≤3 trajs per instance_id).\npython sft\u002Fmake_mini_coder_verified_max3_jsonl.py\n# Optional: build the verified+unverified mix variant.\npython sft\u002Fmake_mini_coder_mix_max3_jsonl.py\n\n# 2. Tokenize for talkie-1930-it (similar to above but pointed at the\n#    mini-coder JSONL and the talkie-1930-13b-it tokenizer).\n\n# 3. Run each sweep point. run_minimal_sft.sh takes 4 args:\n#    max_steps, subsample-fraction, run-name, output-dir.\nbash sft\u002Frun_minimal_sft.sh  20 0.0047 talkie-1930-it-coder-s20-e3   \u002Ffast\u002F...\u002Ftalkie-1930-it-coder-s20-e3\nbash sft\u002Frun_minimal_sft.sh  70 0.0136 talkie-1930-it-coder-s70-e3   \u002Ffast\u002F...\u002Ftalkie-1930-it-coder-s70-e3\nbash sft\u002Frun_minimal_sft.sh 200 0.0386 talkie-1930-it-coder-s200-e3  \u002Ffast\u002F...\u002Ftalkie-1930-it-coder-s200-e3\nbash sft\u002Frun_minimal_sft.sh 251 0.1290 talkie-1930-it-coder-d251-e10 \u002Ffast\u002F...\u002Ftalkie-1930-it-coder-d251-e10\nbash sft\u002Frun_minimal_sft.sh 881 0.4516 talkie-1930-it-coder-d881-e9  \u002Ffast\u002F...\u002Ftalkie-1930-it-coder-d881-e9\n```\n\n(`s20_e3` etc. = ~20 unique examples × 3 epochs; the trainer auto-loops\nto fill `max_steps` once `subsample` data is consumed.)\n\n## Models & data\n\nThese scripts read\u002Fwrite the author's cluster paths under `\u002Ffast\u002Frolmedo\u002F`.\nYou'll need to substitute your own. 
Required upstream artifacts:\n\n- `talkie-lm\u002Ftalkie-1930-13b` (base.ckpt + vocab; `convert_base_to_hf.py` packages them)\n- `talkie-lm\u002Ftalkie-web-13b` (same)\n- `talkie-lm\u002Ftalkie-1930-13b-it` (HF format; used by the few-data sweep)\n- A SWE-trajectory JSONL (we used a 100K reservoir-sample of an internal\n  `final-400k.jsonl` of SWE-smith mini-swe-agent trajectories; any\n  `{\"instance_id\", \"messages\": [{\"role\", \"content\"}, ...]}` JSONL works)\n- `ricdomolm\u002Fmini-coder-trajs-400k` (HF dataset, for the few-data sweep)\n\n## Pipeline notes\n\nA few decisions that aren't obvious from the configs:\n\n- **`max_grad_norm=30` (not the default 1.0).** Talkie's per-layer scalar\n  gain modules (`attn_gain.a_g`, `mlp_gain.a_g`, `embed_skip.a_g`)\n  accumulate gradients over the entire (seq × hidden) activation tensor,\n  so they dominate the global L2 norm — clipping at 1.0 silently kills\n  training. The 2e-5 runs use 30 (10× v1's 100, tightened for the\n  larger-lr chat-token grad spikes).\n- **`average_tokens_across_devices=False`.** TRL's default of `True`\n  produces 8× inflated loss\u002Fgrad-norm numbers under our `**kwargs`\n  forward signature.\n- **Weight-decay skip on `embed.weight` and `lm_head`.**\n  `ChatPreservingSFTTrainer` in `ft_trl.py` excludes these from wd.\n  Under `--completion_only_loss`, chat-token rows (`\u003C|user|>`,\n  `\u003C|assistant|>`, `\u003C|system|>`) receive no useful gradient. With\n  wd=0.1 their norms collapse from ~0.86 to ~0.12 and the model stops\n  emitting `\u003C|end|>`. 
Skipping wd on these breaks the loop.\n- **Chat-token reinit** clones the `\u003C|endoftext|>` row (id 65535) into\n  the 4 chat-token rows (65536–65539) plus 1e-3 Gaussian noise.\n  `convert_*_base_to_hf.py` mean-pads those rows; without the reinit\n  the chat tokens collapse during SFT and inference produces nonsense.\n- **64 K context with NTK-extended RoPE** (`rope_theta=4e7`,\n  `max_position_embeddings=65536`) is set at convert time. Costs ~14 %\n  on short evals (GSM8K) vs. the original `theta=1e6` — the right\n  trade for long agent traces.\n- **`save_pretrained` + FSDP wrap saves fp32 and copies torch source\n  files into the output dir.** `ft_trl.py:final_save` bypasses this\n  with explicit `safetensors.save_file` (bf16) and re-copies modeling\n  + tokenizer files from the source dir.\n\n## Eval\n\nPass@1 is graded by the [SWE-bench harness](https:\u002F\u002Fgithub.com\u002FSWE-bench\u002FSWE-bench)\nrunning [mini-swe-agent](https:\u002F\u002Fgithub.com\u002FSWE-bench\u002Fmini-swe-agent)\nagainst the trained checkpoint, served by vLLM 0.19 (transformers backend).\nThe 446-instance set is `ricdomolm\u002FSWE-bench_Verified-Working-Harbor`.\n\nPipeline (see [`eval\u002FREADME.md`](eval\u002FREADME.md) for the full\nrecipe with all the gotchas):\n\n```bash\n# 1. Bake lm_head_gain → lm_head.weight; copy refactored modeling; patch configs.\npython sft\u002Frepackage_for_vllm.py --src \u003Cckpt-dir> --dst \u003Cvllm-dir>\n# (then sync modeling_talkie.py, set AutoModel auto_map, set \u003C|end|> as eos)\n\n# 2. Fan out per-instance eval over HTCondor (vLLM serves locally, mini-swe-agent drives).\ncd eval\n.\u002Flaunch_parallel_eval.sh \u003Cvllm-dir> \u003Coutput-dir> \u003Cn_subjobs> \u003Cn_instances>\n\n# 3. 
Merge per-sub-job preds.json, then grade.\npython3 -c \"...\"   # see eval\u002FREADME.md §4 for the merge snippet\ncondor_submit_bid 51 grade_ckpt400.sub   # or analogous, points at preds.merged.json\n\n# Variance-reduced pass@1: launch N independent runs, then summarize.\n.\u002Flaunch_pass5_subset.sh \u003Cvllm-dir> \u003Coutput-dir-pattern> \u003Csubset-json> \u003CN>\n.\u002Fgrade_pass5.sh \u003Coutput-dir-pattern>\npython summarize_sweep.py \u003Coutput-dir-pattern> > sweep_summary.json\n```\n\nThe eval JSONs in `analysis\u002F` are the harness's per-run reports\n(`*-pass*-run*.json` containing `resolved_ids`); the notebook reduces\nthem to pass@1 means and bars. The full per-instance agent trajectories\nthat produced these reports are published at\n[`ricdomolm\u002Feval-trajs-1930-coder`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fricdomolm\u002Feval-trajs-1930-coder)\n(`1930` and `web` subsets, single `test` split — **do not train on this**).\n\nNon-obvious eval-side decisions (full list in `eval\u002FREADME.md`):\n\n- **Greedy decoding is broken on talkie SFT.** `temperature=0` triggers\n  catastrophic single-token loops (`sympsymp...`) on ~50 % of trajectories.\n  Use `temperature=0.7`, `max_tokens=4096` per turn.\n- **`repetition_penalty` hurts more than it helps** on long agent\n  dialogues — penalizes the submission marker and code-syntax tokens.\n- **`max-model-len=32768`, not 64K.** KV cache at 64K + 26 GB bf16 model\n  exceeds H100's 80 GB.\n- **fp8 is broken** on talkie (vLLM `--quantization fp8` and\n  `--kv-cache-dtype fp8` both regress). Stick with bf16.\n- **vLLM transformers backend** needs `lm_head_gain` baked into\n  `lm_head.weight` (`repackage_for_vllm.py`) and the modeling refactored\n  to dispatch attention through `ALL_ATTENTION_FUNCTIONS`. 
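Both are\n  already in this repo.\n- **`\u003C|end|>` as eos**, not `\u003C|endoftext|>` — the model never emits the\n  latter; the chat-token reinit teaches it to emit `\u003C|end|>` at turn\n  boundaries.\n- **Salvage hook** in `mini-swe-agent`: when an instance ends in any\n  non-`Submitted` state with a live container, run a final\n  `git diff --cached` and use that as the patch. Recovers WIP edits\n  that would otherwise be discarded as empty submissions.\n","This project fine-tunes a 1930-vintage large language model (LLM) to solve software-engineering benchmark problems. Its core functionality is fine-tuning on specific datasets so that the model can generate code fixes, performing well in particular on SWE-bench-style tasks. Technically, it uses a Python-based training framework and provides detailed evaluation and analysis tools, supporting scaling experiments from a handful of examples up to large datasets. It suits research into how historical data affects performance on modern programming tasks, as well as researchers who want to understand how different pre-training strategies affect a model's final capabilities.","2026-06-11 04:03:15","CREATED_QUERY"]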