[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80844":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":12,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":12,"lastSyncTime":25,"discoverSource":26},80844,"ProRL","hongruhou89\u002FProRL","hongruhou89","ICML 2026: \"ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation\"",null,"Python",44,2,36,1,0,8,1.43,false,"main",true,[],"2026-06-12 02:04:07","\u003Cp align=\"center\">\n  \u003Ch1 align=\"center\">🚀 ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation\u003C\u002Fh1>\n  \u003Cp align=\"center\">\n    \u003Cem>Official implementation — ICML 2026\u003C\u002Fem>\n  \u003C\u002Fp>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#-overview\">Overview\u003C\u002Fa> •\n  \u003Ca href=\"#-installation\">Installation\u003C\u002Fa> •\n  \u003Ca href=\"#-quick-start\">Quick Start\u003C\u002Fa> •\n  \u003Ca href=\"#-training\">Training\u003C\u002Fa> •\n  \u003Ca href=\"#-evaluation\">Evaluation\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## 📋 Overview\n\n**ProRL** is a framework for **Proactive Recommendation** that combines semantic-ID item representations with reinforcement learning. 
The model learns to generate item trajectories that gradually steer users toward a target item while jointly optimizing several objectives:\n\n- **IoI (Increase of Interest)** — Increase in the probability of the user engaging with the target item.\n- **IoR (Increase of Rank)** — Increase in the ranking of the target item.\n- **CTR (Click-Through Rate)** — Predicted click probability of the recommended intermediate items.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"fig\u002Fframework.png\" alt=\"ProRL Framework\" width=\"100%\">\n\u003C\u002Fp>\n\n### Key Features\n\n- 🎯 **Multi-objective reward** — Jointly optimizes IoI, IoR and CTR with configurable weights.\n- 🔄 **Rectified policy gradient (ProRL)** — Stable RL training with KL-divergence regularization toward the pretrained reference policy.\n- 📊 **Semantic-ID tokenization** — Items are represented as short codes from a learned codebook.\n- ⚡ **Distributed training** — Multi-GPU training via 🤗 Accelerate.\n\n---\n\n## 🔧 Installation\n\n### Requirements\n\n- Python ≥ 3.11\n- CUDA ≥ 12.4 (we tested on 4× GPUs)\n- PyTorch ≥ 1.12\n\n### Setup\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fyour-repo\u002FProRL.git\ncd ProRL\n\n# Install PyTorch (CUDA 12.4)\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n\n# Core dependencies\npip install transformers==4.45.2\npip install accelerate==1.0.1\npip install sentence_transformers\npip install tensorboard\npip install recbole\n\n# RecBole pulls in a newer numpy — pin to 1.26.0\npip uninstall -y numpy\npip install numpy==1.26.0\n```\n\n---\n\n## 🚀 Quick Start\n\nAll training is launched through ready-to-use shell scripts in `scripts\u002F`. 
They handle the `accelerate` launch, paths and hyperparameters for you.\n\n### ProRL on all three datasets sequentially\n\n```bash\nbash scripts\u002Frun_prorl.sh\n```\n---\n\n## 🏋️ Training\n\n### Stage 1 — Pretraining\n\nEach pretrain script runs `proactive_pretrain.py` through `accelerate` on 4 GPUs by default:\n\n```bash\nPYTHONNOUSERSITE=1 \\\nCUDA_VISIBLE_DEVICES=0,1,2,3 \\\npython -m accelerate.commands.launch \\\n  --config_file .\u002Fconfig\u002Frec_config.yaml \\\n  --main_process_port 16086 \\\n  --num_processes 4 \\\n  .\u002Fproactive_pretrain.py \\\n  --dataset ml-1m \\\n  --config_file .\u002Fconfig\u002Fptconfig.yaml\n```\n\nOutputs are written under `ckpt\u002F\u003Cdataset>\u002F\u003Ctimestamp-hash>\u002F` and logs under `run_logs\u002F`. The trainer automatically saves the best checkpoint on the validation metric (`IoI_max@10` by default).\n\n### Stage 2 — ProRL Fine-tuning\n\nProRL fine-tunes the pretrained policy with a rectified policy-gradient objective and a KL penalty against the frozen reference policy. 
The corresponding script for ML-1M looks like:\n\n```bash\nPYTHONNOUSERSITE=1 \\\nCUDA_VISIBLE_DEVICES=0,1,2,3 \\\npython -m accelerate.commands.launch \\\n  --config_file .\u002Fconfig\u002Frec_config.yaml \\\n  --main_process_port 16086 \\\n  --num_processes 4 \\\n  .\u002FProactive_RL_prorl.py \\\n  --dataset ml-1m \\\n  --config_file .\u002Fconfig\u002Fprorl.yaml \\\n  --pretrained_ckpt .\u002Fckpt\u002Fml-1m\u002F\u003Cyour-pretrain-run>\u002F\u003Cyour-pretrain-run>.pth \\\n  --mode prorl \\\n  --prorl_beta 1e-2 \\\n  --prorl_lr 1e-4 \\\n  --prorl_gamma 1 \\\n  --prorl_epochs 50 \\\n  --reward_weight_ctr 1.0 \\\n  --reward_weight_ioi 1.0 \\\n  --reward_weight_ior 1.0\n```\n\n#### Key CLI arguments (`Proactive_RL_prorl.py`)\n\n| Argument | Description | Default (see `config\u002Fprorl.yaml`) |\n|----------|-------------|-----------------------------------|\n| `--dataset` | One of `ml-1m`, `Steam`, `Books` | — (required) |\n| `--config_file` | Path to the ProRL YAML config | — (required) |\n| `--pretrained_ckpt` | Path to the Stage-1 `.pth` checkpoint | — (required) |\n| `--mode` | `prorl` for training, `eval` for evaluation-only | — (required) |\n| `--prorl_beta` | KL-divergence penalty coefficient β | `1e-2` |\n| `--prorl_lr` | RL learning rate | dataset-specific (see scripts) |\n| `--prorl_gamma` | Discount factor γ for cumulative rewards | `1.0` |\n| `--prorl_epochs` | Number of RL training epochs | `50` |\n| `--prorl_num_samples` | Rollout samples per prompt (group size) | `16` |\n| `--reward_weight_ctr` | Weight of the CTR reward term | `1.0` |\n| `--reward_weight_ioi` | Weight of the IoI reward term | `1.0` |\n| `--reward_weight_ior` | Weight of the IoR reward term | `1.0` |\n\n---\n\n### Reported metrics\n\n| Metric | Description |\n|--------|-------------|\n| `IoI@K` | Increase of Interest at top-K trajectory length |\n| `IoR@K` | Increase of Rank at top-K trajectory length |\n| `CTR@K` | Average click-through rate over the top-K trajectory |\n| 
`Coherence@K` | Trajectory coherence based on item attributes |\n\nTop-K values default to `[1, 5, 10]` (see `config\u002Fprorl.yaml`).\n\n---\n\n## 🎛️ Configuration Reference\n\n### Model architecture (T5 backbone) — `config\u002Fptconfig.yaml` \u002F `config\u002Fprorl.yaml`\n\n```yaml\nnum_layers: 3\nnum_decoder_layers: 3\nd_model: 128\nd_ff: 512\nnum_heads: 4\nd_kv: 64\ndropout_rate: 0.1\nactivation_function: relu\n```\n\n### Semantic-ID tokenizer\n\n```yaml\nn_codebooks: 3\ncodebook_size: 256\nexpand_final: True\ntoken_prefix: \"qwen3-embedding-8b-pca\"\ntoken_suffix: \"sem_ids\"\n```\n\n---\n\n## 📁 Project Structure\n\n```\nProRL\u002F\n├── config\u002F                              # YAML configs\n│   ├── ptconfig.yaml                    # Pretraining config\n│   ├── prorl.yaml                       # ProRL config\n│   ├── rec_config.yaml                  # Accelerate launch config\n│   ├── ml-1m-sas_sasrec_config.yaml     # RecBole evaluator configs\n│   ├── steam-merged_sasrec_config.yaml\n│   ├── amazon-books_sasrec_config.yaml\n│   └── *_gru4rec_config.yaml            # Alternative GRU4Rec evaluators\n│\n├── scripts\u002F                             # Launcher scripts (entry points)\n│   ├── run_pretrain.sh                  # Run all pretrain scripts in sequence\n│   ├── run_prorl.sh                     # Run all RL scripts in sequence\n│   ├── Pretrain\u002F\n│   │   ├── run_ml1m_pretrain.sh\n│   │   ├── run_steam_pretrain.sh\n│   │   └── run_books_pretrain.sh\n│   └── RL\u002F\n│       ├── run_ml1m_prorl.sh\n│       ├── run_steam_prorl.sh\n│       └── run_books_prorl.sh\n│\n├── datasets\u002F                            # Datasets go here (you create this)\n├── ckpt\u002F                                # Checkpoints (auto-created)\n│\n├── proactive_pretrain.py                # Stage-1 entry point\n├── Proactive_RL_prorl.py                # Stage-2 (ProRL) entry point\n├── model.py                             # PRARec model (T5 backbone)\n├── trainer.py   
                        # Stage-1 trainer\n├── trainer_RL_prorl.py                  # Stage-2 (ProRL) trainer\n├── tokenizer.py                         # Semantic-ID tokenizer\n├── dataset.py                           # ProactiveRecDataset\n├── collator.py                          # Train \u002F RL collators\n├── data_utils.py                        # Dataset \u002F dataloader helpers\n├── evaluator.py                         # Reward model + metric computation\n├── utils.py                             # General utilities\n└── README.md\n```\n\n---\n\n## 📚 Reference\n\n```\n@misc{hou2026prorleffectivereinforcementlearning,\n      title={ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation}, \n      author={Hongru Hou and Tiehua Mei and Denghui Geng and Jinhui Huang and Ao Xu and Hengrui Chen and Jiaqing Liang and Deqing Yang},\n      year={2026},\n      eprint={2605.28293},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.28293}, \n}\n```\n\n---\n\n## 🙏 Acknowledgments\n\n- [RecBole](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRecBole) — sequential recommendation baselines and the SASRec and GRU4Rec evaluators.\n- [Hugging Face Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) — T5 implementation.\n- [Hugging Face Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) — distributed training.\n","ProRL is a framework for proactive recommendation that combines semantic-ID item representations with reinforcement learning. Its core function is to generate item trajectories that gradually steer users toward a target item by jointly optimizing several objectives: Increase of Interest (IoI), Increase of Rank (IoR), and click-through rate (CTR). Technically, ProRL uses rectified policy gradient estimation for stable training, with KL-divergence regularization toward the pretrained reference policy; it also supports distributed multi-GPU training to speed up model training. In addition, the project uses semantic-ID tokenization, which represents items as short codes drawn from a learned codebook. ProRL suits online recommendation scenarios that need to adjust recommendations dynamically based on user behavior in order to raise the probability that a specific item is visited or purchased.","2026-06-11 04:02:31","CREATED_QUERY"]