[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2332":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":13,"starSnapshotCount":13,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},2332,"nanowhale","huggingface\u002Fnanowhale","huggingface",null,"Python",374,43,1,0,6,7,33,18,4.93,false,"main",true,[],"2026-06-12 02:00:40","# nanowhale 🐳\n\nA ~110M parameter language model trained from scratch using the **DeepSeek-V4 architecture**. This repo contains all the code, configs, and tokenizer used to pretrain and fine-tune the model.\n\n## Models\n\n| Model | Description | Link |\n|---|---|---|\n| **nanowhale-100m-base** | Pretrained base model (5K steps on FineWeb-Edu) | [🤗 Hub](https:\u002F\u002Fhuggingface.co\u002Fcmpatino\u002Fnanowhale-100m-base) |\n| **nanowhale-100m** | SFT chat model (3K steps on SmolTalk) | [🤗 Hub](https:\u002F\u002Fhuggingface.co\u002Fcmpatino\u002Fnanowhale-100m) |\n\n## Architecture\n\nThe model implements the full DeepSeek-V4 feature set at miniature scale:\n\n- **Multi-Head Latent Attention (MLA)** — 8 heads, 1 KV head (MQA), head_dim=96 (32 RoPE + 64 NoPE), q_lora_rank=160\n- **Mixture-of-Experts (MoE)** — 4 routed + 1 shared expert, top-2 routing, SwiGLU FFN (dim 640)\n- **Hyper-Connections** — hc_mult=4, Sinkhorn routing (2 iterations)\n- **Multi-Token Prediction (MTP)** — 1 next-token prediction layer\n\n| Parameter | Value |\n|---|---|\n| Total params | ~110M (41M embeddings + 69M non-embedding) |\n| 
Hidden size | 320 |\n| Layers | 8 |\n| Vocab size | 129,280 (DeepSeek-V4 tokenizer) |\n| Context length | 2,048 tokens |\n\n## Repo Structure\n\n```\n├── modeling_deepseek_v4.py         # DeepSeek-V4 model implementation\n├── configuration_deepseek_v4.py    # Model config class\n├── requirements.txt\n├── configs\u002F\n│   ├── main_100m.yaml              # Training hyperparameters (100M model)\n│   ├── debug.yaml                  # Quick debug config (50 steps)\n│   └── fallback_under_1b.yaml      # Alternative config\n├── scripts\u002F\n│   ├── train_pretrain.py            # Pretraining (SFTTrainer on FineWeb-Edu)\n│   ├── train_sft.py                 # SFT fine-tuning (SFTTrainer on SmolTalk)\n│   ├── eval_smoke.py                # Perplexity evaluation & generation\n│   ├── chat.py                      # Interactive chat\n│   ├── upload_to_hub.py             # Hub upload utility\n│   ├── count_params.py              # Parameter counting\n│   ├── prepare_data.py              # Data preparation\n│   └── inspect_deepseek_v4.py       # Architecture inspection\n└── tokenizer\u002F\n    ├── tokenizer.json\n    └── tokenizer_config.json\n```\n\n## Quick Start\n\n### Install\n\n```bash\npip install -r requirements.txt\n```\n\n### Pretraining\n\n```bash\npython scripts\u002Ftrain_pretrain.py --config configs\u002Fmain_100m.yaml\n```\n\n### SFT\n\n```bash\npython scripts\u002Ftrain_sft.py\n```\n\n### Chat\n\n```bash\npython scripts\u002Fchat.py\n```\n\n### Evaluation\n\n```bash\npython scripts\u002Feval_smoke.py\n```\n\n## Training Results\n\n### Pretraining (5,000 steps on FineWeb-Edu)\n\n| Metric | Value |\n|---|---|\n| Tokens seen | ~2.6B |\n| Final loss | ~5.3 |\n| Token accuracy | 33.8% |\n| Hardware | 1× H100 80GB, bf16 |\n| Throughput | 72ms\u002Fstep (with `torch.compile`) |\n\n### SFT (3,000 steps on SmolTalk)\n\n| Metric | Start | End |\n|---|---|---|\n| Train loss | 15.41 | 10.22 |\n| Eval loss | 2.873 | 2.607 |\n| Token accuracy | 36.2% | 48.5% |\n\n### 
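Perplexity (held-out English text)\n\n| Model | Perplexity |\n|---|---|\n| Pretrained | 13.62 |\n| SFT | 12.90 |\n\n## Known Issues\n\n- **bf16 NaN**: The model produces NaN in bf16 at this small scale. Use fp32 for inference and training. This is due to the Hyper-Connections architecture producing values that overflow the bf16 range.\n- **`from_pretrained` quirk**: The custom architecture causes `from_pretrained` to re-initialize some weights. Use manual `load_state_dict` instead (see model cards for examples).\n- **Large vocab \u002F small model**: The 129K vocab embedding table consumes 37% of all parameters, limiting capacity for language modeling.\n\n## License\n\nMIT\n","nanowhale is a language model of roughly 110 million parameters trained from scratch on the DeepSeek-V4 architecture. The project provides the complete code, configs, and tokenizer used to pretrain and fine-tune the model. Core features include Multi-Head Latent Attention (8 heads, 1 KV head), Mixture-of-Experts (4 routed experts and 1 shared expert), Hyper-Connections, and Multi-Token Prediction. These techniques let nanowhale achieve efficient computation and expressive capacity at a small scale. It suits scenarios that need a lightweight language model, such as text generation and dialogue system development, and is particularly suited to resource-constrained environments that still have some performance requirements.",2,"2026-06-11 02:49:32","CREATED_QUERY"]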