[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72031":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":16,"starSnapshotCount":16,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},72031,"s1","simplescaling\u002Fs1","simplescaling","s1: Simple test-time scaling","https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393",null,"Python",6657,756,56,73,0,7,65.34,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:03","\u003Cdiv align=\"center\">\n  \u003Ch1>s1: Simple test-time scaling\u003C\u002Fh1>\n  \u003Cp>Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing\n \u003C\u002Fp>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n![](visuals\u002Fscaling.png)\n\n****************************************************************\n\n**Updates:**\n\n* 2025-03: Released 2 videos on s1: [TWIML Podcast (Sam Charrington & Niklas Muennighoff)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kEfUaLBlSHc) & [Microsoft GenAI Talk (Niklas Muennighoff)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EEkxuqlvCss)\n* 2025-02: We released [s1.1](https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fs1.1-32B) a better model than s1 by reusing the same s1K questions but with reasoning traces generated by r1 instead of Gemini: [s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1). 
Check [this tweet](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1889310803746246694) for details\n* 2025-01: We released [our paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393) announced via [this tweet](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1886405528777073134).\n\n****************************************************************\n\nThis repository provides an overview of all resources for the paper [\"s1: Simple test-time scaling\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393).\n\n- [Artifacts](#artifacts)\n- [Structure](#structure)\n- [Inference](#inference)\n  - [vLLM](#vllm)\n  - [vLLM with budget forcing](#vllm-with-budget-forcing)\n  - [transformers](#transformers)\n- [Training](#training)\n- [Evaluation](#evaluation)\n- [Data](#data)\n- [Visuals](#visuals)\n- [Known Issues](#known-issues)\n- [Citation](#citation)\n\n### Artifacts\n\n- **Paper**: https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393\n- **Model**: https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1.1-32B (Old: https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1-32B)\n- **Data**: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1 (Old: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K)\n    - s1-prob: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-prob\n    - s1-teasers: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-teasers\n    - Full 59K: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fdata_ablation_full59K\n\n### Structure\n\n- `eval\u002F`: Evaluation scripts\n- `data\u002F`: Synthetic data creation scripts & co\n- `train\u002F`: Training scripts\n\n### Inference\n\n#### vLLM\n\nInstall the `vllm` library and run:\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\nmodel = LLM(\n    \"simplescaling\u002Fs1.1-32B\",\n    tensor_parallel_size=2,\n)\ntok = 
AutoTokenizer.from_pretrained(\"simplescaling\u002Fs1-32B\")\n\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n)\n\nprompt = \"How many r in raspberry\"\nprompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\u003C|im_end|>\\n\u003C|im_start|>user\\n\" + prompt + \"\u003C|im_end|>\\n\u003C|im_start|>assistant\\n\"\n\no = model.generate(prompt, sampling_params=sampling_params)\nprint(o[0].outputs[0].text)\n```\n\n#### vLLM with budget forcing\n\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\n# Decide on a token limit for thinking; As the model's max tokens is 32768, 32000 usually ensures there is enough space for the model to still answer\nMAX_TOKENS_THINKING = 32000\n# Decide how often to ignore end-of-thinking token\nNUM_IGNORE = 1\n\nmodel = LLM(\n    \"simplescaling\u002Fs1-32B\", # s1 originally gets this prompt wrong but with budget forcing it fixes it\n    tensor_parallel_size=2,\n)\ntok = AutoTokenizer.from_pretrained(\n    \"simplescaling\u002Fs1-32B\"\n)\n\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n    skip_special_tokens=False,\n    temperature=0.0,\n)\n\n# For the exact raspberry sample in the paper see\nprompts = [\n    \"How many r in raspberry\",\n]\n\nfor i, p in enumerate(prompts):\n    prompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\u003C|im_end|>\\n\u003C|im_start|>user\\n\" + p + \"\u003C|im_end|>\\n\u003C|im_start|>assistant\\n\"\n    stop_token_ids = tok(\"\u003C|im_start|>\u003C|im_end|>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=MAX_TOKENS_THINKING,\n        min_tokens=0,\n        stop_token_ids=stop_token_ids,\n        skip_special_tokens=False,\n        temperature=0.0,\n    )\n    prompt += \"\u003C|im_start|>think\"\n    o = model.generate(\n        prompt,\n        sampling_params=sampling_params\n    )\n    ignore_str = \"Wait\"\n    max_tokens_thinking_tmp = MAX_TOKENS_THINKING\n    for i in range(NUM_IGNORE): # Num of times to skip stop token\n        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)\n        if max_tokens_thinking_tmp > 0:\n            prompt += o[0].outputs[0].text + ignore_str\n            sampling_params = SamplingParams(\n                max_tokens=max_tokens_thinking_tmp,\n                min_tokens=1,\n                stop_token_ids=stop_token_ids,\n                skip_special_tokens=False,\n                temperature=0.0,\n            )\n            o = model.generate(\n                prompt,\n                sampling_params=sampling_params\n            )\n    ### Final answer ###\n    prompt += o[0].outputs[0].text # You can also append \"Final Answer:\" here like we do for some evaluations to prevent the model from just continuing to reason in its answer when early exiting\n    stop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=32768,\n        min_tokens=0,\n        stop_token_ids=stop_token_ids,\n        skip_special_tokens=False,\n        temperature=0.0,\n    )\n    o = model.generate(\n        prompt,\n        sampling_params=sampling_params,\n    )\n    print(\"With budget forcing:\") # You will see that after the \"Wait\" in the reasoning trace it fixes its answer\n    print(prompt + 
o[0].outputs[0].text)\n```\n\n#### transformers\n\nInstall the `transformers` & `torch` libraries and run:\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nDEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel_name = \"simplescaling\u002Fs1.1-32B\"\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\nprompt = \"How many r in raspberry\"\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.\"},\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n)\ngenerated_ids = [\n    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n]\n\nresponse = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n```\n\n### Training\n\n\nTo run training, you can find our script at `train\u002Fsft.py` which you can invoke via one of the `train\u002Fsft*sh` scripts which in turn you can launch via `train\u002Flaunch.sh` if you are on a SLURM cluster (requires editing the file for your cluster setup).\n\nTo train s1-32B\u002Fs1.1-32B, we recommend 16 H100 GPUs i.e. 2 nodes with 8 each. 
For s1.1, we set the block size to 20000 to avoid OOM (https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fblob\u002F0ad4b3de32507b4aa0d4be28f336276ee99b2315\u002Ftrain\u002Fsft.sh#L17); Check the wandb logs [here](https:\u002F\u002Fwandb.ai\u002Fhashimoto-group\u002Fo1\u002Fruns\u002Fm1ilia77\u002Foverview).\n\nQuick start:\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1.git\ncd s1\npip3 install -r requirements.txt\nbash train\u002Fsft.sh\n```\n*Note: If you encounter an out-of-memory (OOM) issue with 8 GPUs, consider enabling gradient checkpointing by adding the following line to your script: `--gradient_checkpointing=True`.*\n\n### Evaluation\n\nWe cloned [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) at commit `4cec66e4e468d15789473d6d63c3a61a751fa524` and modified it. Setup:\n```bash\ncd eval\u002Flm-evaluation-harness\npip install -e .[math,vllm]\n```\n\nAll commands are in `eval\u002Fcommands.sh`. For AIME24 we always pick the `aime24_nofigures` result, which uses a dataset that only contains the AIME24 figures if they are important for the task.\n\nIf you want to compute statistics (avg thinking tokens etc) for an evaluation run you can use \n`python eval\u002Fcompute_sample_stats.py path_to_samples_file.jsonl`\n\nAll our evaluation result files are at: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fresults\n\nTo run REBASE: commands are in `eval\u002Frebase\u002Frun.sh`\nNote that for the evaluations in the Discussion section with REBASE we used https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fstep-conditional-control-old trained on an older version of our dataset https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-step-conditional-control-old and run on an older version of our evaluation using https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMaxwell-Jia\u002FAIME_2024.\n\n### Data\n\nTo recreate s1K follow the steps below. 
In various files you will have to replace the organizations `simplescaling` and `qfq` with an organization that you own. **Note that [s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1) is a better dataset generated with r1 traces instead of Gemini traces.**\n1. Run `data\u002Fcollect_data.py` followed by `data\u002Ffix_gpqa.py` & `data\u002Fadd_aime.py` to collect the questions. Make sure to change the hub path in the respective files to one of your own.\n2. Generate traces with Gemini via `python data\u002Fgemini.py`. This step will use https:\u002F\u002Fhf.co\u002Fdatasets\u002Fqfq\u002Ftrain which should be roughly equivalent to the dataset you have produced in step 1.\n3. Generate answers with Qwen via `python data\u002Fbulk_inference.py`, which can be launched with `data\u002Fbulk_inference.sh`.\n4. Add features by running `python data\u002Ffeaturization.py`.\n5. Run the final filtering by going through `data\u002Ffilter.ipynb`.\n6. If you want to run grading on the final questions to produce e.g. a gemini_grade column as in [this dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1), you can use `data\u002Fgrading.ipynb`.\n\n### Visuals\n\nAll figures and some tables are created via [this colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1GAfwbJs2Y1dgGGsxrQyQg2G7CRH5NgN3?usp=sharing), which is equivalent to `visuals\u002Fvisuals.ipynb`. Some are subsequently edited via the `visuals\u002Fs1.fig` file, which you can load in Figma. The output figures are in `visuals\u002F` in pdf or png format.\n\n### Known Issues\n\n- vLLM throws `ValueError: Token id XXXXX is out of vocabulary`\n  - This can happen with budget forcing, especially when running with temperature 1, where the model will sometimes predict a vocab id that is larger than its max token id but still within its embedding size i.e. 
anything \u003C152064, >151664. When the model's previous outputs are fed back to it, as happens when setting e.g. max_thinking_tokens in the evaluation, this triggers the error because vLLM performs this check even though it would only be an issue for IDs >152064. To fix it, you can comment out the vLLM ValueError check (it is the line `if max_input_id > tokenizer.max_token_id:` in `vllm\u002Fengine\u002Fllm_engine.py`).\n\n### Citation\n\n```bibtex\n@misc{muennighoff2025s1simpletesttimescaling,\n      title={s1: Simple test-time scaling}, \n      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},\n      year={2025},\n      eprint={2501.19393},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393}, \n}\n```\n","The s1 project provides a simple and effective test-time scaling method that aims to match the strong reasoning performance of o1-preview using only a small number of examples (just 1,000) and budget forcing. Its core functionality includes efficient inference with the vLLM library and support for budget forcing to optimize resource use. It is written in Python and released under the Apache License 2.0. It suits scenarios that need stronger model reasoning under limited compute, such as low-cost AI application deployment and rapid prototyping in academic research. The project also provides detailed training, evaluation, and data-processing scripts so that users can tailor the solution to their own needs.",2,"2026-06-11 03:40:01","high_star"]