[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72515":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":14,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},72515,"RAG-Challenge-2","IlyaRice\u002FRAG-Challenge-2","IlyaRice","Implementation of my RAG system that won all categories in Enterprise RAG Challenge 2",null,"Python",2360,484,16,5,0,68,15,79.36,"MIT License",false,"main",true,[],"2026-06-12 04:01:06","# RAG Challenge Winner Solution\n\n**Read more about this project:**\n- Russian: https:\u002F\u002Fhabr.com\u002Fru\u002Farticles\u002F893356\u002F\n- English: https:\u002F\u002Fabdullin.com\u002Filya\u002Fhow-to-build-best-rag\u002F\n\nThis repository contains the winning solution for both prize nominations in the RAG Challenge competition. The system achieved state-of-the-art results in answering questions about company annual reports using a combination of:\n\n- Custom PDF parsing with Docling\n- Vector search with parent document retrieval\n- LLM reranking for improved context relevance\n- Structured output prompting with chain-of-thought reasoning\n- Query routing for multi-company comparisons\n\n## Disclaimer\n\nThis is competition code - it's scrappy but it works. 
Some notes before you dive in:\n\n- IBM Watson integration won't work (it was competition-specific)\n- The code might have rough edges and weird workarounds\n- No tests, minimal error handling - you've been warned\n- You'll need your own API keys for OpenAI\u002FGemini\n- A GPU helps a lot with PDF parsing (I used a 4090)\n\nIf you're looking for production-ready code, this isn't it. But if you want to explore different RAG techniques and their implementations - check it out!\n\n## Quick Start\n\nClone and set up:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FIlyaRice\u002FRAG-Challenge-2.git\ncd RAG-Challenge-2\npython -m venv venv\nvenv\\Scripts\\Activate.ps1  # Windows (PowerShell)\npip install -e . -r requirements.txt\n```\n\nRename `env` to `.env` and add your API keys.\n\n## Test Dataset\n\nThe repository includes two datasets:\n\n1. A small test set (in `data\u002Ftest_set\u002F`) with 5 annual reports and questions\n2. The full ERC2 competition dataset (in `data\u002Ferc2_set\u002F`) with all competition questions and reports\n\nEach dataset directory contains its own README with specific setup instructions and available files. 
You can use either dataset to:\n\n- Study example questions, reports, and system outputs\n- Run the pipeline from scratch using provided PDFs\n- Use pre-processed data to skip directly to specific pipeline stages\n\nSee the respective README files for detailed dataset contents and setup instructions:\n- `data\u002Ftest_set\u002FREADME.md` - For the small test dataset\n- `data\u002Ferc2_set\u002FREADME.md` - For the full competition dataset\n\n## Usage\n\nYou can run any part of the pipeline by uncommenting the method you want to run in `src\u002Fpipeline.py` and executing:\n```bash\npython .\\src\\pipeline.py\n```\n\nYou can also run any pipeline stage using `main.py`, but you need to run it from the directory containing your data:\n```bash\ncd .\\data\\test_set\\\npython ..\\..\\main.py process-questions --config max_nst_o3m\n```\n\n### CLI Commands\n\nGet help on available commands:\n```bash\npython main.py --help\n```\n\nAvailable commands:\n- `download-models` - Download the required Docling models\n- `parse-pdfs` - Parse PDF reports with parallel processing options\n- `serialize-tables` - Process tables in parsed reports\n- `process-reports` - Run the full pipeline on parsed reports\n- `process-questions` - Process questions using the specified config\n\nEach command has its own options. For example:\n```bash\npython main.py parse-pdfs --help\n# Shows options like --parallel\u002F--sequential, --chunk-size, --max-workers\n\npython main.py process-reports --config ser_tab\n# Process reports with serialized tables config\n```\n\n## Some configs\n\n- `max_nst_o3m` - Best performing config using OpenAI's o3-mini model\n- `ibm_llama70b` - Alternative using IBM's Llama 70B model\n- `gemini_thinking` - Full-context answering using Gemini's enormous context window. 
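This is not actually RAG\n\nCheck `pipeline.py` for more configs and details on them.\n\n## License\n\nMIT","This project implements the RAG system that won all categories in the Enterprise RAG Challenge, focused on extracting information from company annual reports to answer questions. Its core features include custom PDF parsing, vector search with parent document retrieval, LLM reranking to improve context relevance, structured output prompting, and query routing for multi-company comparisons. Technically, it uses Docling for PDF processing and applies chain-of-thought reasoning to improve answer generation. Although the code is rough and lacks comprehensive tests, it is well suited to researchers and developers who want to explore and practice different RAG techniques. The project also provides two datasets for testing and learning: a small test set and the full competition dataset.",2,"2026-06-11 03:42:24","high_star"]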