[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84145":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":12,"forks30d":12,"starsTrendScore":15,"compositeScore":12,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":9,"trendingCount":12,"starSnapshotCount":12,"syncStatus":22,"lastSyncTime":23,"discoverSource":24},84145,"MicroRAG","canwhite\u002FMicroRAG","canwhite","The minimal implementation of rag",null,"Python",102,0,52,50,150,false,"main",true,[],"2026-06-12 02:04:38","# MicroRAG\n\nA minimal Retrieval-Augmented Generation (RAG) system implemented from scratch in pure Python, without any external ML libraries.\n\n---\n\n## Core Features\n\n| Feature | Description |\n|---------|-------------|\n| **Retriever** | Vector retrieval trained with contrastive learning (InfoNCE); cosine similarity at inference |\n| **Generator** | Single-layer GPT with character-level tokenizer |\n| **No deps** | Pure Python autograd with stdlib only |\n| **Runnable** | Full train + inference pipeline, ready to run |\n\n---\n\n## Architecture\n\n```\nQuery → Retriever (retrieve) → concat context → Generator (GPT generate) → answer\n```\n\n### Retriever\n\n- Queries and documents are mapped to 32-dimensional vectors through separate embedding matrices\n- Trained with InfoNCE contrastive learning: positive-pair similarity should exceed negative-pair similarity by 0.5\n- Inference retrieves by cosine similarity\n\n### Generator\n\n- Single-layer self-attention (causal mask, no sliding window)\n- Character-level tokenizer (vocabulary of only a few dozen tokens)\n- RMSNorm + MLP + LM Head\n- Training: next-token prediction\n\n---\n\n## Run\n\n```bash\nuv run main.py\n```\n\nSample output:\n```\nvocab_size=57, TOTAL_VOCAB=60, EMBED_DIM=32\nchars:  ,。0-9,...A...Z...a...z...\nnum params: 23888\nstep    0 | loss 0.7368 | sp=0.112 sn=0.431\nstep  200 | loss 0.0000 | sp=0.999 sn=0.000\nstep  400 | loss 0.0000 | sp=0.999 sn=0.000\nstep  600 | loss 0.0000 | sp=0.999 sn=0.000\nstep  800 | loss 0.0000 | sp=1.000 sn=0.000\n\n--- 训练 Generator ---\ngpt step    0 | loss 4.4424\ngpt step  500 | loss 0.1389\ngpt step 1000 | loss 0.0152\ngpt step 1500 | loss 0.0005\n\n--- RAG 完整流程测试 ---\n问: 太阳有多热？\n答: 太阳表面约6000度，核心约1500万度。\n```\n\n---\n\n## Hyperparameters\n\n| Param | Value | Meaning |\n|-------|-------|---------|\n| `EMBED_DIM` | 32 | Embedding dimension |\n| `BLOCK_SIZE` | 64 | Position embedding table size (positions ≥ 64 reuse wpe[63]) |\n| `lr` (Retriever) | 0.01 | Retriever learning rate |\n| `gpt_lr` | 0.05 | Generator learning rate |\n| `retriever_steps` | 10 | Retriever training steps |\n| `gpt_steps` | 2000 | Generator training steps |\n\n---\n\n## Concept Clarification\n\n- **Causal mask**: ✅ Present (future positions are never attended to)\n- **Attention window**: ❌ None (no limit on history length)\n- **Position embedding truncation**: ✅ Present (positions ≥ 64 reuse the last encoding)\n\nSee [docs\u002Fattention_mask_vs_truncation.md](docs\u002Fattention_mask_vs_truncation.md) for details.\n\n---\n\n## Limitations\n\n- Character-level tokenizer, so sequences are very short (a few dozen tokens)\n- Single-layer transformer with limited expressive power\n- Knowledge base is fixed at 8 QA pairs\n- No real attention-window truncation (suited to short texts)\n\nThis is a **learning\u002Fdemonstration** implementation, not a production system.\n",2,"2026-06-11 04:12:24","CREATED_QUERY"]