[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1884":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},1884,"Sim2Reason","Sim2Reason\u002FSim2Reason","Sim2Reason: Solving Physics Olympiad via Reinforcement Learning on Physics Simulators.  We present a method for turning physics simulators into scalable generators of question–answer pairs for improving LLM physical reasoning.","https:\u002F\u002Fsim2reason.github.io\u002F",null,"Python",163,24,3,1,0,6,41.79,false,"main",true,[],"2026-06-12 04:00:11","\u003Cdiv align=\"center\">\n\n\u003C!-- TITLE -->\n# **Solving Physics Olympiad via Reinforcement Learning on Physics Simulators**\n\n**[Mihir Prabhudesai](https:\u002F\u002Fmihirp1998.github.io\u002F)\u003Csup>\\*,†\u003C\u002Fsup>, [Aryan Satpathy](https:\u002F\u002Faryan-satpathy.github.io\u002F)\u003Csup>\\*,†\u003C\u002Fsup>, [Yangmin Li](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fyamy12344\u002F)\u003Csup>†\u003C\u002Fsup>, [Zheyang Qin](https:\u002F\u002Fqinowen.github.io\u002F)\u003Csup>†\u003C\u002Fsup>,\u003Cbr> [Nikash Bhardwaj](https:\u002F\u002Fnikashbhardwaj.com\u002F), [Amir Zadeh](https:\u002F\u002Fsim2reason.github.io)\u003Csup>λ\u003C\u002Fsup>, [Chuan Li](https:\u002F\u002Fsim2reason.github.io)\u003Csup>λ\u003C\u002Fsup>, [Katerina Fragkiadaki](https:\u002F\u002Fwww.cs.cmu.edu\u002F~katef\u002F), [Deepak 
Pathak](https:\u002F\u002Fwww.cs.cmu.edu\u002F~dpathak\u002F)**\n\nCarnegie Mellon University &nbsp;·&nbsp; \u003Csup>λ\u003C\u002Fsup>Lambda\n\n\u003Csup>\\*\u003C\u002Fsup>Project co-leads & Equal contribution &nbsp;·&nbsp; \u003Csup>†\u003C\u002Fsup>Core contributors\n\n\u003Cbr>\n\n\u003C!-- BADGES -->\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌎-Website-blue.svg)](http:\u002F\u002Fsim2reason.github.io)\n[![Arxiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11805)\n\n\u003C!-- Teaser -->\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F250d8b73-9728-46e6-aa4c-3a962a74fbf1\" autoplay loop muted playsinline width=\"100%\">\u003C\u002Fvideo>\n\n\u003C\u002Fdiv>\n\n*We present **SIM2REASON**: a method for turning physics simulators into scalable generators of question–answer pairs to improve LLM reasoning, removing the need for human annotation in the data-generation pipeline. The core idea is to **structure the randomization with a domain-specific language (DSL)** and use it to procedurally generate reasoning problems, as illustrated in the examples above. LLMs finetuned on this synthetic data show **zero-shot improvement on real-world benchmarks** such as the International Physics Olympiad.*\n\n## Abstract\n\nWe have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question–answer (QA) pairs—a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning.
We generate random scenes in physics engines, create synthetic question–answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: `https:\u002F\u002Fsim2reason.github.io\u002F`.\n\n## Installation\nCreate an environment for data generation with the following commands:\n```\nconda create -n pho_data python=3.11\nconda activate pho_data\n\npip install bpy\npip install mujoco==3.3.4 ImageIO\npip install ipdb scipy tabulate pandas matplotlib\npip install hydra-core omegaconf\npip install tqdm wandb\n\npip install sympy math_verify\n\npip install transformers pyarrow\n```\n\nCreate an environment for training with the following commands:\n```\nconda create -n pho_training python=3.12\nconda activate pho_training\n\npip install vllm==0.8.5.post1\npip install https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002Fdownload\u002Fv2.7.4.post1\u002Fflash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl\ncd verl_v4\npip install -e .\npip install hydra-core==1.4.0.dev1 omegaconf==2.4.0.dev3\npip install math-verify==0.7.0 \npip install latex2sympy2_extended==1.0.9\npip install antlr4-python3-runtime==4.11.0\npip install polars ipdb\npip install transformers==4.52.3\npip install setuptools==69.5.1\n```\n\n## Preparing Data\nSet path to store data and checkpoints:\n```bash\nexport PHO_DATA=\u003Cpath to folder>\nexport PHO_CHECKPOINT_DIR=\u003Cpath to folder>\nexport DATA_VERSION=v1\n```\n\nGenerate synthetic scenes 
by running the following command:\n```bash\npython -m sim.scene_generator scene_generation.num_scenes=1000 scene_generation=all data_version=$DATA_VERSION\n```\n\nGenerate QA pairs and filter shortcut questions by running the following commands:\n```bash\npython -m sim.qa_gen_rule data_version=$DATA_VERSION gpt_nlq.num_generations_per_problem=30 numerical=True symbolic=True reverse=False\npython -m sim.create_child_scenes data_version=$DATA_VERSION\npython -m sim.qa_gen_rule data_version=$DATA_VERSION numerical=True symbolic=True reverse=False gpt_nlq.build_child_scenes=True\n```\n\nPreprocess the generated QA pairs into a format suitable for verl by running the following commands:\n```bash\npython -m sim.write_json data_version=$DATA_VERSION numerical=True symbolic=False reverse=False\npython -m sim.write_json data_version=$DATA_VERSION numerical=False symbolic=True reverse=False\n\npython -m llm.preprocess_json_to_parquet --json_names numerical_problems_all_train_without_shortcut.json numerical_problems_all_test_without_shortcut.json symbolic_problems_all_train_without_shortcut.json symbolic_problems_all_test_without_shortcut.json  --data_version \"$DATA_VERSION\" --no_extra_instruction\n```\n\nAfter the above commands run successfully, `\u003CPHO_DATA>\u002F\u003CDATA_VERSION>` should contain two files: `train_\u003CDATA_VERSION>_rl.parquet` and `test_\u003CDATA_VERSION>_rl.parquet`.\n\n## Training\n```bash\npython -m verl_v4.recipe.dapo.main_dapo exps=\"[dapo_32b,syn_data,q2.5_14b,gspo,use_kl]\"  trainer.n_gpus_per_node=8 trainer.val_before_train=False  sp_size=8 gen_tp=8 trainer.total_epochs=10\n```\n\nThis command finetunes Qwen2.5 14B Instruct on the generated synthetic data using the DAPO algorithm. Training info is logged to wandb by default, in the project named `verl`, and checkpoints are saved at `PHO_CHECKPOINT_DIR\u002F\u003Crun_name>`. By default, only the latest global step checkpoint is kept; older checkpoints are deleted to save storage.
\n\n**Pretrained Checkpoints:** We provide our pretrained checkpoints for various Qwen models on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fasatpath\u002Fsim2reason).\n\n## Evaluation\nEvaluation requires first converting the checkpoint to HuggingFace format by running the following command:\n```bash\nexport ACTOR_DIR=$PHO_CHECKPOINT_DIR\u002F\u003Crun_name>\u002Fglobal_step_\u003Cstep>\u002Factor\npython verl_v4\u002Fscripts\u002Fmodel_merger.py merge \\\n    --backend fsdp \\\n    --local_dir \"$ACTOR_DIR\" \\\n    --target_dir \"$TARGET_DIR\"\n```\n\nTo evaluate the model's performance on IPhO, download the val set from [HF link](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fasatpath\u002Fsim2reason) to `PHO_DATA\u002Fipho\u002Fipho_numeric_validation_no_instruct.parquet` and run:\n```bash\npython -m verl_v4.recipe.dapo.simple_eval \\\n exps='[dapo_32b,log_all_reward,pass_at_n,simple_eval]' \\\n   model=qwen2.5-3b-instruct data.val_batch_size=null actor_rollout_ref.rollout.val_kwargs.n=8 \\\n  max_response_length=30000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \\\n  actor_rollout_ref.rollout.gpu_memory_utilization=0.85 max_model_len=30000 \\\n  data.val_files=\"[${PHO_DATA}\u002Fipho\u002Fipho_numeric_validation_no_instruct.parquet]\" \\\n  data.train_files=\"[${PHO_DATA}\u002F${DATA_VERSION}\u002Ftrain_${DATA_VERSION}_rl.parquet]\"\n```\n\n> **Note for the 32B checkpoint:** The released 32B checkpoint was trained with an older instruction prompt.
To evaluate it correctly on IPhO, add `use_legacy_prompt` to the `exps` list:\n> ```bash\n> python -m verl_v4.recipe.dapo.simple_eval \\\n>   exps='[dapo_32b,log_all_reward,pass_at_n,simple_eval,use_legacy_prompt]' \\\n>   model=qwen2.5-3b-instruct data.val_batch_size=null actor_rollout_ref.rollout.val_kwargs.n=8 \\\n>   max_response_length=30000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \\\n>   actor_rollout_ref.rollout.gpu_memory_utilization=0.85 max_model_len=30000 \\\n>   data.val_files=\"[${PHO_DATA}\u002Fipho\u002Fipho_numeric_validation_no_instruct.parquet]\" \\\n>   data.train_files=\"[${PHO_DATA}\u002F${DATA_VERSION}\u002Ftrain_${DATA_VERSION}_rl.parquet]\"\n> ```\n> All other benchmarks (JEEBench, OlympiadBench, PHYSICS) are unaffected and use the same eval command regardless of checkpoint.\n\nTo evaluate the model's performance on JEEBench, first download JEEBench by running `setup_jeebench.sh`, then run:\n```bash\npython -m verl_v4.recipe.dapo.simple_eval \\\n  exps='[dapo_32b,log_all_reward,pass_at_n,use_JEEBench]' \\\n  data.val_files=\"[${PHO_DATA}\u002FJEEBench\u002Fdataset.json]\" \\\n  model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=8 \\\n  max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \\\n  actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000\n```\n\nTo evaluate the model's performance on OlympiadBench, run:\n```bash\npython -m verl_v4.recipe.dapo.simple_eval \\\n  exps='[dapo_32b,log_all_reward,pass_at_n,use_olympiad_bench]' \\\n  data.val_files=\"[${PHO_DATA}\u002Folympiad_bench\u002FOlympiadBench_Dataset\u002Fdata\u002FOE_TO_physics_en_COMP.json,\n  ${PHO_DATA}\u002Folympiad_bench\u002FOlympiadBench_Dataset\u002Fdata\u002FOE_TO_physics_zh_CEE.json]\" \\\n  model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=1 \\\n  max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \\\n  
actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000\n```\n\nTo evaluate the model's performance on [PHYSICS](https:\u002F\u002Fgithub.com\u002FZhengsh123\u002FPHYSICS), download the val set from [HF link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdesimfj\u002FPHYSICS\u002Ftree\u002Fmain\u002Fdata) to `PHO_DATA\u002FPHYSICS\u002Ftest.jsonl` and run:\n```bash\npython -m verl_v4.recipe.dapo.simple_eval \\\n  exps='[dapo_32b,ipho_numeric_val,log_all_reward,pass_at_n,use_PHYSICS]' \\\n  model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=8 \\\n  max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \\\n  actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000\n```\n\nSince the verifier used in PHYSICS is not open source, we use Gemini 2.5 Flash instead, as it is a strong verifier. Set the Google API Key in `llm\u002Futils\u002Fbasic_utils.py` line 138 before running this evaluation.\n\n## Citation\nIf you find this work useful in your research, please cite:\n```bibtex\n@article{prabhudesai2026solving,\n  title={Solving Physics Olympiad via Reinforcement Learning on Physics Simulators},\n  author={Prabhudesai, Mihir and Satpathy, Aryan and Li, Yangmin and Qin, Zheyang and Bhardwaj, Nikash and Zadeh, Amir and Li, Chuan and Fragkiadaki, Katerina and Pathak, Deepak},\n  journal={arXiv preprint arXiv:2604.11805},\n  year={2026}\n}\n\n```\n","Sim2Reason is a project that uses physics simulators to generate question-answer pairs and trains large language models (LLMs) on them with reinforcement learning to improve their physical reasoning. Its core feature is a domain-specific language (DSL) that structures the randomized generation process, so training data is produced automatically and the need for human annotation is reduced. Technically, it is implemented in Python and uses reinforcement learning to improve the transfer from synthetic data to real-world problem solving. The project is particularly suited to researchers and developers who want to improve AI systems on complex physics problems, especially where large, high-quality training datasets are scarce, as with International Physics Olympiad problems.",2,"2026-06-11 02:46:39","CREATED_QUERY"]