[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2181":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":16,"starSnapshotCount":16,"syncStatus":39,"lastSyncTime":40,"discoverSource":41},2181,"storm","stanford-oval\u002Fstorm","stanford-oval","An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.","http:\u002F\u002Fstorm.genie.stanford.edu",null,"Python",28351,2584,189,57,0,3,35,165,23,45,"MIT License",false,"main",true,[27,28,29,30,31,32,33,34,35],"agentic-rag","deep-research","emnlp2024","knowledge-curation","large-language-models","naacl","nlp","report-generation","retrieval-augmented-generation","2026-06-12 02:00:38","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.svg\" style=\"width: 25%; height: auto;\">\n\u003C\u002Fp>\n\n# STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking\n\n\u003Cp align=\"center\">\n| \u003Ca href=\"http:\u002F\u002Fstorm.genie.stanford.edu\">\u003Cb>Research preview\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14207\">\u003Cb>STORM Paper\u003C\u002Fb>\u003C\u002Fa>| \u003Ca href=\"https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.15232\">\u003Cb>Co-STORM Paper\u003C\u002Fb>\u003C\u002Fa>  | \u003Ca href=\"https:\u002F\u002Fstorm-project.stanford.edu\u002F\">\u003Cb>Website\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n**Latest News** 
🔥\n\n- [2025\u002F01] We add [litellm](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) integration for language models and embedding models in `knowledge-storm` v1.1.0.\n\n- [2024\u002F09] Co-STORM codebase is now released and integrated into `knowledge-storm` python package v1.0.0. Run `pip install knowledge-storm --upgrade` to check it out.\n\n- [2024\u002F09] We introduce collaborative STORM (Co-STORM) to support human-AI collaborative knowledge curation! [Co-STORM Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.15232) has been accepted to EMNLP 2024 main conference.\n\n- [2024\u002F07] You can now install our package with `pip install knowledge-storm`!\n- [2024\u002F07] We add `VectorRM` to support grounding on user-provided documents, complementing existing support of search engines (`YouRM`, `BingSearch`). (check out [#58](https:\u002F\u002Fgithub.com\u002Fstanford-oval\u002Fstorm\u002Fpull\u002F58))\n- [2024\u002F07] We release demo light for developers a minimal user interface built with streamlit framework in Python, handy for local development and demo hosting (checkout [#54](https:\u002F\u002Fgithub.com\u002Fstanford-oval\u002Fstorm\u002Fpull\u002F54))\n- [2024\u002F06] We will present STORM at NAACL 2024! Find us at Poster Session 2 on June 17 or check our [presentation material](assets\u002Fstorm_naacl2024_slides.pdf). \n- [2024\u002F05] We add Bing Search support in [rm.py](knowledge_storm\u002Frm.py). Test STORM with `GPT-4o` - we now configure the article generation part in our demo using `GPT-4o` model.\n- [2024\u002F04] We release refactored version of STORM codebase! We define [interface](knowledge_storm\u002Finterface.py) for STORM pipeline and reimplement STORM-wiki (check out [`src\u002Fstorm_wiki`](knowledge_storm\u002Fstorm_wiki)) to demonstrate how to instantiate the pipeline. 
We provide an API to support customization of different language models and retrieval\u002Fsearch integration.\n\n[![Code style: black](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode%20style-black-000000.svg)](https:\u002F\u002Fgithub.com\u002Fpsf\u002Fblack)\n\n## Overview [(Try STORM now!)](https:\u002F\u002Fstorm.genie.stanford.edu\u002F)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Foverview.svg\" style=\"width: 90%; height: auto;\">\n\u003C\u002Fp>\nSTORM is an LLM system that writes Wikipedia-like articles from scratch based on Internet search. Co-STORM extends it by enabling humans to collaborate with the LLM system, supporting more aligned and preferred information seeking and knowledge curation.\n\nWhile the system cannot produce publication-ready articles, which often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.\n\n**More than 70,000 people have tried our [live research preview](https:\u002F\u002Fstorm.genie.stanford.edu\u002F). Try it out to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**\n\n\n\n## How STORM & Co-STORM work\n\n### STORM\n\nSTORM breaks down generating long articles with citations into two steps:\n\n1. **Pre-writing stage**: The system conducts Internet-based research to collect references and generates an outline.\n2. **Writing stage**: The system uses the outline and references to generate the full-length article with citations.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ftwo_stages.jpg\" style=\"width: 60%; height: auto;\">\n\u003C\u002Fp>\n\nSTORM identifies the core of automating the research process as automatically coming up with good questions to ask. Directly prompting the language model to ask questions does not work well. To improve the depth and breadth of the questions, STORM adopts two strategies:\n1. 
**Perspective-Guided Question Asking**: Given the input topic, STORM discovers different perspectives by surveying existing articles from similar topics and uses them to control the question-asking process.\n2. **Simulated Conversation**: STORM simulates a conversation between a Wikipedia writer and a topic expert grounded in Internet sources to enable the language model to update its understanding of the topic and ask follow-up questions.\n\n### Co-STORM\n\nCo-STORM proposes **a collaborative discourse protocol** which implements a turn management policy to support smooth collaboration among:\n\n- **Co-STORM LLM experts**: This type of agent generates answers grounded on external knowledge sources and\u002For raises follow-up questions based on the discourse history.\n- **Moderator**: This agent generates thought-provoking questions inspired by information discovered by the retriever but not directly used in previous turns. Question generation can also be grounded!\n- **Human user**: The human user will take the initiative to either (1) observe the discourse to gain deeper understanding of the topic, or (2) actively engage in the conversation by injecting utterances to steer the discussion focus.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fco-storm-workflow.jpg\" style=\"width: 60%; height: auto;\">\n\u003C\u002Fp>\n\nCo-STORM also maintains a dynamically updated **mind map**, which organizes collected information into a hierarchical concept structure, aiming to **build a shared conceptual space between the human user and the system**. The mind map has been shown to help reduce the mental load when the discourse goes long and in-depth. \n\nBoth STORM and Co-STORM are implemented in a highly modular way using [dspy](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fdspy).\n\n## Installation\n\n\nTo install the knowledge storm library, use `pip install knowledge-storm`. 
\n\nYou can also install from source, which allows you to modify the behavior of the STORM engine directly.\n1. Clone the git repository.\n    ```shell\n    git clone https:\u002F\u002Fgithub.com\u002Fstanford-oval\u002Fstorm.git\n    cd storm\n    ```\n   \n2. Install the required packages.\n   ```shell\n   conda create -n storm python=3.11\n   conda activate storm\n   pip install -r requirements.txt\n   ```\n   \n\n## API\n\nCurrently, our package supports:\n\n- Language model components: All language models supported by litellm as listed [here](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders)\n- Embedding model components: All embedding models supported by litellm as listed [here](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fembedding\u002Fsupported_embedding)\n- Retrieval module components: `YouRM`, `BingSearch`, `VectorRM`, `SerperRM`, `BraveRM`, `SearXNG`, `DuckDuckGoSearchRM`, `TavilySearchRM`, `GoogleSearch`, and `AzureAISearch`\n\n:star2: **PRs for integrating more search engines\u002Fretrievers into [knowledge_storm\u002Frm.py](knowledge_storm\u002Frm.py) are highly appreciated!**\n\nBoth STORM and Co-STORM work in the information curation layer, so you need to set up the information retrieval module and the language model module to create their respective `Runner` classes.\n\n### STORM\n\nThe STORM knowledge curation engine is defined as a simple Python `STORMWikiRunner` class. 
Here is an example of using the You.com search engine and OpenAI models.\n\n```python\nimport os\nfrom knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs\nfrom knowledge_storm.lm import LitellmModel\nfrom knowledge_storm.rm import YouRM\n\nlm_configs = STORMWikiLMConfigs()\nopenai_kwargs = {\n    'api_key': os.getenv(\"OPENAI_API_KEY\"),\n    'temperature': 1.0,\n    'top_p': 0.9,\n}\n# STORM is an LM system, so different components can be powered by different models to reach a good balance between cost and quality.\n# As a good practice, choose a cheaper\u002Ffaster model for `conv_simulator_lm`, which is used to split queries and synthesize answers in the conversation.\n# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.\ngpt_35 = LitellmModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)\ngpt_4 = LitellmModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)\nlm_configs.set_conv_simulator_lm(gpt_35)\nlm_configs.set_question_asker_lm(gpt_35)\nlm_configs.set_outline_gen_lm(gpt_4)\nlm_configs.set_article_gen_lm(gpt_4)\nlm_configs.set_article_polish_lm(gpt_4)\n# Check out the STORMWikiRunnerArguments class for more configurations.\nengine_args = STORMWikiRunnerArguments(...)\nrm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)\nrunner = STORMWikiRunner(engine_args, lm_configs, rm)\n```\n\nThe `STORMWikiRunner` instance can be invoked with the simple `run` method:\n```python\ntopic = input('Topic: ')\nrunner.run(\n    topic=topic,\n    do_research=True,\n    do_generate_outline=True,\n    do_generate_article=True,\n    do_polish_article=True,\n)\nrunner.post_run()\nrunner.summary()\n```\n- `do_research`: if True, simulate conversations with different perspectives to collect information about the topic; otherwise, load the results.\n- `do_generate_outline`: if True, generate an outline for the topic; otherwise, load the results.\n- `do_generate_article`: if True, 
generate an article for the topic based on the outline and the collected information; otherwise, load the results.\n- `do_polish_article`: if True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.\n\n### Co-STORM\n\nThe Co-STORM knowledge curation engine is defined as a simple Python `CoStormRunner` class. Here is an example of using the Bing search engine and OpenAI models.\n\n```python\nimport os\nfrom knowledge_storm.collaborative_storm.engine import CollaborativeStormLMConfigs, RunnerArgument, CoStormRunner\nfrom knowledge_storm.lm import LitellmModel\nfrom knowledge_storm.logging_wrapper import LoggingWrapper\nfrom knowledge_storm.rm import BingSearch\n\n# Co-STORM adopts the same multi-LM system paradigm as STORM \nlm_config: CollaborativeStormLMConfigs = CollaborativeStormLMConfigs()\nopenai_kwargs = {\n    \"api_key\": os.getenv(\"OPENAI_API_KEY\"),\n    \"api_provider\": \"openai\",\n    \"temperature\": 1.0,\n    \"top_p\": 0.9,\n    \"api_base\": None,\n} \n# Assumption: the original snippet leaves gpt_4o_model_name undefined; \"gpt-4o\" matches the STORM example above.\ngpt_4o_model_name = \"gpt-4o\"\nquestion_answering_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs)\ndiscourse_manage_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=500, **openai_kwargs)\nutterance_polishing_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=2000, **openai_kwargs)\nwarmstart_outline_gen_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=500, **openai_kwargs)\nquestion_asking_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=300, **openai_kwargs)\nknowledge_base_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs)\n\nlm_config.set_question_answering_lm(question_answering_lm)\nlm_config.set_discourse_manage_lm(discourse_manage_lm)\nlm_config.set_utterance_polishing_lm(utterance_polishing_lm)\nlm_config.set_warmstart_outline_gen_lm(warmstart_outline_gen_lm)\nlm_config.set_question_asking_lm(question_asking_lm)\nlm_config.set_knowledge_base_lm(knowledge_base_lm)\n\n# Check out the 
Co-STORM's RunnerArgument class for more configurations.\ntopic = input('Topic: ')\nrunner_argument = RunnerArgument(topic=topic, ...)\nlogging_wrapper = LoggingWrapper(lm_config)\nbing_rm = BingSearch(bing_search_api_key=os.environ.get(\"BING_SEARCH_API_KEY\"),\n                     k=runner_argument.retrieve_top_k)\ncostorm_runner = CoStormRunner(lm_config=lm_config,\n                               runner_argument=runner_argument,\n                               logging_wrapper=logging_wrapper,\n                               rm=bing_rm)\n```\n\nThe `CoStormRunner` instance can be invoked with the `warm_start()` and `step(...)` methods.\n\n```python\n# Warm start the system to build shared conceptual space between Co-STORM and users\ncostorm_runner.warm_start()\n\n# Step through the collaborative discourse \n# Run either of the code snippets below in any order, as many times as you'd like\n# To observe the conversation:\nconv_turn = costorm_runner.step()\n# To inject your utterance to actively steer the conversation:\ncostorm_runner.step(user_utterance=\"YOUR UTTERANCE HERE\")\n\n# Generate report based on the collaborative discourse\ncostorm_runner.knowledge_base.reorganize()\narticle = costorm_runner.generate_report()\nprint(article)\n```\n\n\n\n## Quick Start with Example Scripts\n\nWe provide scripts in our [examples folder](examples) as a quick start to run STORM and Co-STORM with different configurations.\n\nWe suggest using `secrets.toml` to set up the API keys. 
Create a file `secrets.toml` under the root directory and add the following content:\n\n```shell\n# ============ language model configurations ============ \n# Set up OpenAI API key.\nOPENAI_API_KEY=\"your_openai_api_key\"\n# If you are using the API service provided by OpenAI, include the following line:\nOPENAI_API_TYPE=\"openai\"\n# If you are using the API service provided by Microsoft Azure, include the following lines:\nOPENAI_API_TYPE=\"azure\"\nAZURE_API_BASE=\"your_azure_api_base_url\"\nAZURE_API_VERSION=\"your_azure_api_version\"\n# ============ retriever configurations ============ \nBING_SEARCH_API_KEY=\"your_bing_search_api_key\" # if using bing search\n# ============ encoder configurations ============ \nENCODER_API_TYPE=\"openai\" # if using openai encoder\n```\n\n### STORM examples\n\n**To run STORM with `gpt` family models with default configurations:**\n\nRun the following command.\n```bash\npython examples\u002Fstorm_examples\u002Frun_storm_wiki_gpt.py \\\n    --output-dir $OUTPUT_DIR \\\n    --retriever bing \\\n    --do-research \\\n    --do-generate-outline \\\n    --do-generate-article \\\n    --do-polish-article\n```\n\n**To run STORM using your favorite language models or grounding on your own corpus:** Check out [examples\u002Fstorm_examples\u002FREADME.md](examples\u002Fstorm_examples\u002FREADME.md).\n\n### Co-STORM examples\n\nTo run Co-STORM with `gpt` family models with default configurations,\n\n1. Add `BING_SEARCH_API_KEY=\"xxx\"` and `ENCODER_API_TYPE=\"xxx\"` to `secrets.toml`\n2. Run the following command\n\n```bash\npython examples\u002Fcostorm_examples\u002Frun_costorm_gpt.py \\\n    --output-dir $OUTPUT_DIR \\\n    --retriever bing\n```\n\n\n## Customization of the Pipeline\n\n### STORM\n\nIf you have installed the source code, you can customize STORM based on your own use case. STORM engine consists of 4 modules:\n\n1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.\n2. 
Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.\n3. Article Generation Module: Populates the generated outline with the collected information.\n4. Article Polishing Module: Refines and enhances the written article for better presentation.\n\nThe interface for each module is defined in `knowledge_storm\u002Finterface.py`, while their implementations are instantiated in `knowledge_storm\u002Fstorm_wiki\u002Fmodules\u002F*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).\n\n### Co-STORM\n\nIf you have installed the source code, you can customize Co-STORM based on your own use case:\n\n1. Co-STORM introduces multiple LLM agent types (i.e., Co-STORM experts and Moderator). The LLM agent interface is defined in `knowledge_storm\u002Finterface.py`, while its implementation is instantiated in `knowledge_storm\u002Fcollaborative_storm\u002Fmodules\u002Fco_storm_agents.py`. Different LLM agent policies can be customized.\n2. Co-STORM introduces a collaborative discourse protocol, with its core function centered on turn policy management. We provide an example implementation of turn policy management through `DiscourseManager` in `knowledge_storm\u002Fcollaborative_storm\u002Fengine.py`. It can be customized and further improved.\n\n## Datasets\nTo facilitate the study of automatic knowledge curation and complex information seeking, our project releases the following datasets:\n\n### FreshWiki\nThe FreshWiki Dataset is a collection of 100 high-quality Wikipedia articles focusing on the most-edited pages from February 2022 to September 2023. See Section 2.1 in the [STORM paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14207) for more details.\n\nYou can download the dataset from [huggingface](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEchoShao8899\u002FFreshWiki) directly. 
To ease the data contamination issue, we archive the [source code](https:\u002F\u002Fgithub.com\u002Fstanford-oval\u002Fstorm\u002Ftree\u002FNAACL-2024-code-backup\u002FFreshWiki) for the data construction pipeline that can be repeated at future dates.\n\n### WildSeek\nTo study users’ interests in complex information seeking tasks in the wild, we utilized data collected from the web research preview to create the WildSeek dataset. We downsampled the data to ensure the diversity of the topics and the quality of the data. Each data point is a pair comprising a topic and the user’s goal for conducting deep search on the topic.  For more details, please refer to Section 2.2 and Appendix A of [Co-STORM paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.15232).\n\nThe WildSeek dataset is available [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYuchengJiang\u002FWildSeek).\n\n## Replicate STORM & Co-STORM paper result\n\nFor STORM paper experiments, please switch to the branch `NAACL-2024-code-backup` [here](https:\u002F\u002Fgithub.com\u002Fstanford-oval\u002Fstorm\u002Ftree\u002FNAACL-2024-code-backup).\n\nFor Co-STORM paper experiments, please switch to the branch `EMNLP-2024-code-backup` (placeholder for now, will be updated soon).\n\n## Roadmap & Contributions\nOur team is actively working on:\n1. Human-in-the-Loop Functionalities: Supporting user participation in the knowledge curation process.\n2. Information Abstraction: Developing abstractions for curated information to support presentation formats beyond the Wikipedia-style report.\n\nIf you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!\n\nContact person: [Yijia Shao](mailto:shaoyj@stanford.edu) and [Yucheng Jiang](mailto:yuchengj@stanford.edu)\n\n## Acknowledgement\nWe would like to thank Wikipedia for its excellent open-source content. 
The FreshWiki dataset is sourced from Wikipedia, licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.\n\nWe are very grateful to [Michelle Lam](https:\u002F\u002Fmichelle123lam.github.io\u002F) for designing the logo for this project and [Dekun Ma](https:\u002F\u002Fdekun.me) for leading the UI development.\n\nThanks to Vercel for their support of [open-source software](https:\u002F\u002Fstorm.genie.stanford.edu)\n\n## Citation\nPlease cite our paper if you use this code or part of it in your work:\n```bibtex\n@inproceedings{jiang-etal-2024-unknown,\n    title = \"Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations\",\n    author = \"Jiang, Yucheng  and\n      Shao, Yijia  and\n      Ma, Dekun  and\n      Semnani, Sina  and\n      Lam, Monica\",\n    editor = \"Al-Onaizan, Yaser  and\n      Bansal, Mohit  and\n      Chen, Yun-Nung\",\n    booktitle = \"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing\",\n    month = nov,\n    year = \"2024\",\n    address = \"Miami, Florida, USA\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.554\u002F\",\n    doi = \"10.18653\u002Fv1\u002F2024.emnlp-main.554\",\n    pages = \"9917--9955\",\n}\n\n@inproceedings{shao-etal-2024-assisting,\n    title = \"Assisting in Writing {W}ikipedia-like Articles From Scratch with Large Language Models\",\n    author = \"Shao, Yijia  and\n      Jiang, Yucheng  and\n      Kanell, Theodore  and\n      Xu, Peter  and\n      Khattab, Omar  and\n      Lam, Monica\",\n    editor = \"Duh, Kevin  and\n      Gomez, Helena  and\n      Bethard, Steven\",\n    booktitle = \"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)\",\n    month = jun,\n    year = \"2024\",\n    address = \"Mexico 
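City, Mexico\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2024.naacl-long.347\u002F\",\n    doi = \"10.18653\u002Fv1\u002F2024.naacl-long.347\",\n    pages = \"6252--6278\",\n}\n```\n","STORM is an LLM-based knowledge curation system that researches a given topic and generates a full report with citations. Its core features include automatically generating Wikipedia-like articles through Internet search and supporting human-collaborative knowledge curation (Co-STORM). Technical highlights include integration with multiple retrieval and search engines, support for user-provided documents, and flexible language model selection. The system suits researchers, students, and content creators who need to produce high-quality first drafts quickly, and it is especially useful when writing academic papers, research reports, or building knowledge bases.",2,"2026-06-11 02:48:38","top_language"]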