[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-10935":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":9,"languages":9,"totalLinesOfCode":9,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":13,"starSnapshotCount":13,"syncStatus":12,"lastSyncTime":26,"discoverSource":27},10935,"llm-systems-engineering-roadmap","h9-tec\u002Fllm-systems-engineering-roadmap","h9-tec","A practical roadmap for mastering LLM internals, training, inference, RAG, agents, evaluation, and production architecture.",null,160,22,2,0,3,5,20,9,51.59,false,"main",true,[],"2026-06-12 04:00:53","# LLM Systems Engineering Roadmap\n\n> A professional roadmap for mastering large language model internals, training, post-training, inference, retrieval, agents, evaluation, and production architecture.\n\nThis repository is designed for engineers who want to move beyond surface-level LLM usage and build production-grade LLM systems with measurable quality, latency, cost, and reliability.\n\nIt is not a collection of model news.\n\nIt is not a prompt-engineering cookbook.\n\nIt is a systems roadmap.\n\nThe central idea:\n\n```text\nLLM competence = model internals + training logic + inference systems + retrieval architecture + agent control + evaluation discipline + production constraints\n```\n\n---\n\n## Table of Contents\n\n- [Who this roadmap is for](#who-this-roadmap-is-for)\n- [What this roadmap covers](#what-this-roadmap-covers)\n- [What this roadmap does not cover](#what-this-roadmap-does-not-cover)\n- [Core philosophy](#core-philosophy)\n- [Competency 
map](#competency-map)\n- [Roadmap overview](#roadmap-overview)\n- [Layer 1: LLM Foundations](#layer-1-llm-foundations)\n- [Layer 2: Training Pipeline](#layer-2-training-pipeline)\n- [Layer 3: Post-Training](#layer-3-post-training)\n- [Layer 4: Reasoning Models](#layer-4-reasoning-models)\n- [Layer 5: Inference Fundamentals](#layer-5-inference-fundamentals)\n- [Layer 6: Serving Engines](#layer-6-serving-engines)\n- [Layer 7: KV Cache and Long Context](#layer-7-kv-cache-and-long-context)\n- [Layer 8: Quantization and Compression](#layer-8-quantization-and-compression)\n- [Layer 9: RAG Systems](#layer-9-rag-systems)\n- [Layer 10: Agentic Systems](#layer-10-agentic-systems)\n- [Layer 11: Evaluation and Benchmarking](#layer-11-evaluation-and-benchmarking)\n- [Layer 12: Production Architecture](#layer-12-production-architecture)\n- [Advanced tracks](#advanced-tracks)\n- [Master artifact portfolio](#master-artifact-portfolio)\n- [Repository structure](#repository-structure)\n- [Definition of done](#definition-of-done)\n- [Recommended source map](#recommended-source-map)\n- [Engineering checklists](#engineering-checklists)\n- [How to use this roadmap](#how-to-use-this-roadmap)\n\n---\n\n## Who this roadmap is for\n\nThis roadmap is for:\n\n- AI engineers\n- ML engineers\n- NLP engineers\n- backend engineers moving into LLM systems\n- technical leads responsible for GenAI architecture\n- researchers who want stronger production intuition\n- product-minded engineers building applied LLM systems\n- infrastructure engineers working with GPU serving stacks\n\nYou should use this roadmap if your target is to build systems like:\n\n- enterprise RAG platforms\n- on-prem LLM deployments\n- multi-model inference gateways\n- AI agents with tools and approvals\n- evaluation harnesses for LLM products\n- private domain assistants\n- document intelligence systems\n- multimodal knowledge systems\n- production LLM observability pipelines\n- cost-controlled LLM serving 
infrastructure\n\n---\n\n## What this roadmap covers\n\nThis roadmap covers the full technical stack behind modern LLM systems:\n\n```text\nText\n→ Tokens\n→ Transformer\n→ Pretraining\n→ Post-training\n→ Reasoning\n→ Inference runtime\n→ Serving engine\n→ KV cache\n→ Quantization\n→ Retrieval\n→ Agents\n→ Evaluation\n→ Production architecture\n```\n\nIt explains the mechanisms, not only the buzzwords.\n\nEach layer includes:\n\n- objective\n- core concepts\n- what to understand deeply\n- implementation artifacts\n- engineering decisions\n- failure modes\n- evaluation gates\n- recommended resources\n\n---\n\n## What this roadmap does not cover\n\nThis roadmap does not focus on:\n\n- daily model release news\n- shallow prompt collections\n- generic AI career advice\n- vendor marketing claims\n- no-code tool tutorials\n- toy demos without evaluation\n- “best model” lists without workload definition\n\nThe assumption is simple:\n\n```text\nA model is not good or bad in isolation.\nA model is good or bad for a workload, under constraints, measured by an eval.\n```\n\n---\n\n## Core philosophy\n\n### 1. Learn mechanisms, not names\n\nDo not memorize model names. Learn what changed.\n\n```text\nWhat changed in architecture?\nWhat changed in data?\nWhat changed in post-training?\nWhat changed in inference?\nWhat changed in memory layout?\nWhat changed in evaluation?\n```\n\nModel names expire. Mechanisms compound.\n\n---\n\n### 2. Separate model science from system engineering\n\nA capable LLM system is not just a strong model.\n\nIt is a controlled pipeline:\n\n```text\nmodel\n+ tokenizer\n+ chat template\n+ retrieval\n+ tools\n+ serving engine\n+ cache policy\n+ eval set\n+ observability\n+ fallback logic\n+ cost controls\n+ safety boundaries\n```\n\nMost production failures happen outside the model weights.\n\n---\n\n### 3. 
Build artifacts, not opinions\n\nFor every layer, produce something measurable:\n\n```text\nbenchmark\neval set\nnotebook\ndashboard\narchitecture diagram\nserving comparison\nfailure analysis\ncost model\nred-team suite\n```\n\nIf your knowledge cannot produce an artifact, it is not operational yet.\n\n---\n\n### 4. Evaluate everything\n\nDo not trust:\n\n- one prompt\n- one demo\n- one leaderboard\n- one benchmark\n- one model card\n- one latency number\n- one anecdotal answer\n\nUse evals, traces, failure categories, and regression tests.\n\n---\n\n### 5. Optimize for decision quality\n\nThe final goal is not to know more terms.\n\nThe final goal is to make better technical decisions:\n\n```text\nShould we fine-tune or use RAG?\nShould we use vLLM or SGLang?\nShould we quantize to INT4 or keep FP16?\nShould we use long context or retrieval?\nShould this be an agent or deterministic workflow?\nShould this run on-prem or through an API?\nShould we add a reranker?\nShould we use a reasoning model?\n```\n\n---\n\n## Competency map\n\n### Level 0 — API user\n\nCan call hosted APIs.\n\nTypical abilities:\n\n- writes prompts\n- uses chat interfaces\n- calls model endpoints\n- adjusts temperature\n- knows model names\n\nLimit:\n\n```text\nCannot explain or debug failures below the API layer.\n```\n\n---\n\n### Level 1 — Prototype builder\n\nCan build demos.\n\nTypical abilities:\n\n- builds simple RAG\n- uses LangChain\u002FLlamaIndex\n- connects vector databases\n- builds tool-calling examples\n- creates chatbot demos\n\nLimit:\n\n```text\nOften lacks evaluation, observability, failure analysis, and production constraints.\n```\n\n---\n\n### Level 2 — LLM application engineer\n\nCan build useful applications.\n\nTypical abilities:\n\n- designs retrieval pipelines\n- builds structured prompts\n- manages citations\n- performs basic evals\n- handles tool calling\n- integrates with backend systems\n\nLimit:\n\n```text\nMay not deeply understand inference, KV cache, serving 
engines, or GPU cost.\n```\n\n---\n\n### Level 3 — LLM systems engineer\n\nCan build production systems.\n\nTypical abilities:\n\n- understands prefill\u002Fdecode\n- benchmarks inference\n- chooses serving engines\n- estimates KV cache memory\n- evaluates quantization\n- builds RAG evals\n- instruments traces\n- designs fallback paths\n- controls latency and cost\n\nMinimum serious professional level.\n\n---\n\n### Level 4 — LLM infrastructure engineer\n\nCan optimize large-scale serving.\n\nTypical abilities:\n\n- operates vLLM\u002FSGLang\u002FTensorRT-LLM\n- handles multi-GPU serving\n- manages concurrency\n- tunes batching\n- handles prefix caching\n- evaluates quantized kernels\n- monitors GPU utilization\n- designs model gateways\n- handles autoscaling\n\n---\n\n### Level 5 — Research engineer\n\nCan understand and modify methods.\n\nTypical abilities:\n\n- reads papers mechanically\n- runs ablations\n- modifies post-training recipes\n- tests reasoning methods\n- builds custom evals\n- analyzes training\u002Finference tradeoffs\n- understands architecture deltas\n\n---\n\n### Level 6 — LLM architect\n\nCan design organization-scale platforms.\n\nTypical abilities:\n\n- defines platform architecture\n- governs model usage\n- builds eval infrastructure\n- designs multi-tenant systems\n- manages security and compliance\n- controls cost at scale\n- aligns model strategy with business constraints\n\nTarget level.\n\n---\n\n## Roadmap overview\n\n| Layer | Area | Core question |\n|---|---|---|\n| 1 | LLM Foundations | What happens during one token generation? |\n| 2 | Training Pipeline | How are base models created? |\n| 3 | Post-Training | How are models shaped into assistants? |\n| 4 | Reasoning Models | How do models use extra compute to solve hard tasks? |\n| 5 | Inference Fundamentals | Why is serving an LLM a systems problem? |\n| 6 | Serving Engines | Which runtime fits which workload? 
|\n| 7 | KV Cache and Long Context | What makes long context expensive and unreliable? |\n| 8 | Quantization and Compression | How do we reduce cost without silent quality collapse? |\n| 9 | RAG Systems | How do we ground outputs in external knowledge? |\n| 10 | Agentic Systems | How do we safely connect models to tools and workflows? |\n| 11 | Evaluation and Benchmarking | How do we measure quality, cost, latency, and safety? |\n| 12 | Production Architecture | How do we deploy, monitor, scale, and govern LLM systems? |\n\n---\n\n# Layer 1: LLM Foundations\n\n## Objective\n\nUnderstand the core mechanics of a decoder-only LLM.\n\nThe minimum target:\n\n```text\nGiven a prompt, explain exactly how the model turns text into token probabilities.\n```\n\n## Core concepts\n\n### Tokenization\n\nThe model does not read words. It reads tokens.\n\nA tokenizer converts text into integer IDs:\n\n```text\n\"Large language models are useful\"\n→ [24513, 4221, 4981, 527, 5562]\n```\n\nTokenization affects cost, context length, latency, multilingual quality, code handling, Arabic morphology, prompt compression, domain vocabulary, and retrieval chunk size.\n\nA bad tokenizer can increase token count and damage quality, especially for morphologically rich languages.\n\nKey rule:\n\n```text\nNever estimate LLM cost from word count.\nAlways measure with the model tokenizer.\n```\n\n### Embeddings\n\nToken IDs are indices.\n\nThe model maps each token ID to a dense vector through an embedding table:\n\n```text\nvocabulary_size × hidden_dimension\n```\n\nThe prompt becomes:\n\n```text\nsequence_length × hidden_dimension\n```\n\n### Transformer blocks\n\nA decoder-only LLM is a stack of Transformer blocks:\n\n```text\ninput\n→ normalization\n→ self-attention\n→ residual connection\n→ normalization\n→ MLP\n→ residual connection\n→ output\n```\n\nEach block edits the representation. 
It does not rebuild everything from scratch.\n\n### Self-attention\n\nSelf-attention lets each token route information from previous tokens.\n\nEach token representation is projected into:\n\n```text\nQ = query\nK = key\nV = value\n```\n\nIntuition:\n\n```text\nQuery = what this position is looking for\nKey   = what each position offers for matching\nValue = information carried if selected\n```\n\nScaled dot-product attention:\n\n```text\nAttention(Q, K, V) = softmax(QKᵀ \u002F sqrt(d_k))V\n```\n\n### Causal masking\n\nDecoder-only models cannot look into the future.\n\nFor tokens:\n\n```text\nA B C D\n```\n\nposition `C` can attend to:\n\n```text\nA B C\n```\n\nbut not:\n\n```text\nD\n```\n\nThis enables next-token training without information leakage.\n\n### Multi-head attention\n\nMultiple attention heads let the model route different information patterns in parallel: syntax, long-range reference, formatting, code indentation, list structure, and mathematical dependencies.\n\n### MQA and GQA\n\nClassic multi-head attention stores separate keys and values for every attention head. That is expensive during inference.\n\nModern models often use:\n\n- MQA: many query heads share one key\u002Fvalue head\n- GQA: groups of query heads share key\u002Fvalue heads\n\nWhy this matters:\n\n```text\nFewer KV heads → smaller KV cache → better serving scalability\n```\n\n### MLP blocks\n\nAttention mixes information across token positions.\n\nMLP layers transform each token representation independently.\n\n```text\nAttention = token-to-token communication\nMLP       = token-wise feature transformation\n```\n\nMany LLM parameters live in MLP blocks.\n\n### Positional encoding\n\nAttention alone does not know order.\n\nModern LLMs commonly use RoPE. 
RoPE injects position by rotating query and key vectors in a position-dependent way.\n\nImportant implication:\n\n```text\nLong-context behavior is partly constrained by positional encoding design.\n```\n\n### Logits and decoding\n\nThe final hidden state is projected into vocabulary logits:\n\n```text\nhidden_dim → vocab_size\n```\n\nSoftmax turns logits into probabilities. Decoding chooses the next token.\n\nCommon decoding methods:\n\n- greedy decoding\n- temperature sampling\n- top-k sampling\n- top-p sampling\n- repetition penalties\n- constrained decoding\n\n### KV cache\n\nDuring generation, the model stores previously computed key\u002Fvalue tensors. This avoids recomputing keys and values for all previous tokens at every decode step.\n\nKV cache size in bytes is approximately:\n\n```text\n2 × batch_size × context_length × layers × KV_heads × head_dim × bytes_per_value\n```\n\nThe factor 2 covers keys and values. This is one of the most important memory bottlenecks in LLM serving.\n\n## What to implement\n\nBuild a tiny decoder-only model.\n\nMinimum components:\n\n```text\ntokenizer\nembedding layer\ncausal self-attention\nMLP\nresidual connections\nnormalization\nlogits head\nsampling loop\n```\n\n## Practical exercises\n\n### Exercise 1: Tokenizer comparison\n\nCompare token counts across English, Arabic, mixed Arabic-English, Python code, JSON, and legal text.\n\nRecord:\n\n```text\ncharacters\nwords\ntokens\ntokens per word\nstrange splits\n```\n\n### Exercise 2: Decode behavior\n\nRun one model with:\n\n```text\ntemperature = 0\ntemperature = 0.3\ntemperature = 0.8\ntop_p = 0.9\ntop_k = 50\n```\n\nObserve correctness, variation, repetition, hallucination, and formatting stability.\n\n### Exercise 3: KV cache intuition\n\nMeasure memory and latency at:\n\n```text\n1k context\n4k context\n16k context\n32k context\n```\n\nTrack time to first token, time per output token, GPU memory, and throughput.\n\n## Evaluation gate\n\nYou pass this layer if you can explain:\n\n```text\ntext → tokens → embeddings → Transformer blocks → logits → probabilities → next 
token\n```\n\nwithout hand-waving.\n\n---\n\n# Layer 2: Training Pipeline\n\n## Objective\n\nUnderstand how base LLM capability is created before instruction tuning.\n\nA base model is not yet a helpful assistant. It is a statistical language model trained to predict the next token.\n\n## Core pipeline\n\n```text\nraw data\n→ filtering\n→ deduplication\n→ classification\n→ mixture design\n→ tokenizer training\n→ sequence packing\n→ pretraining\n→ checkpointing\n→ validation\n→ contamination checks\n→ base model release\u002Fevaluation\n```\n\n## Data construction\n\nTraining data quality dominates model behavior.\n\nSources may include web pages, books, code, academic text, documentation, forums, math data, multilingual corpora, synthetic data, and domain-specific corpora.\n\nData quality issues include spam, boilerplate, duplicated pages, machine-generated junk, toxic content, benchmark contamination, stale facts, low-quality translations, formatting noise, and personally identifiable information.\n\nKey principle:\n\n```text\nPretraining data is not just fuel.\nIt is the model's compressed world.\n```\n\n## Deduplication\n\nDeduplication reduces repeated content.\n\nWhy it matters:\n\n- prevents memorization\n- improves data diversity\n- reduces overfitting\n- reduces benchmark leakage\n- improves compute efficiency\n\nTypes:\n\n- exact deduplication\n- near-duplicate detection\n- document-level deduplication\n- paragraph-level deduplication\n- code clone detection\n\n## Data mixture\n\nNot all data should have equal weight.\n\nA data mixture controls how much of each domain the model sees.\n\nExamples:\n\n```text\nweb text\ncode\nmath\nbooks\nscientific papers\nmultilingual content\ninstruction-like content\n```\n\nData mixture affects code ability, reasoning, multilingual quality, factual recall, style, toxicity, and domain competence.\n\n## Tokenizer training\n\nTokenizer decisions affect the whole model.\n\nConsider vocabulary size, BPE vs Unigram vs 
WordPiece, byte fallback, multilingual coverage, code tokens, special tokens, whitespace behavior, Arabic and dialect handling.\n\nA tokenizer is hard to change after training.\n\nChanging tokenizer usually means training or adapting the model again.\n\n## Training objective\n\nMost decoder-only LLMs use next-token prediction.\n\nThe model minimizes cross-entropy loss:\n\n```text\ngood prediction → low loss\nbad prediction  → high loss\n```\n\nLoss is useful but incomplete.\n\nA lower loss does not automatically mean better instruction following, better reasoning, better safety, better RAG behavior, or better tool use.\n\n## Scaling laws\n\nScaling laws relate model size, dataset size, compute budget, and loss.\n\nThey help answer:\n\n```text\nGiven fixed compute, should we train a larger model on fewer tokens or a smaller model on more tokens?\n```\n\nImportant principle:\n\n```text\nCompute-optimal training is a resource allocation problem.\n```\n\n## Optimization\n\nImportant components:\n\n- AdamW\n- learning rate schedule\n- warmup\n- gradient clipping\n- weight decay\n- mixed precision\n- gradient accumulation\n- batch size\n- checkpointing\n\nTraining instability can come from bad data, bad learning rate, optimizer settings, numerical overflow, distributed training bugs, tokenizer\u002Fdata mismatch, or corrupted batches.\n\n## Distributed training\n\nLarge models require distributed training.\n\nCommon strategies:\n\n- data parallelism\n- tensor parallelism\n- pipeline parallelism\n- sequence parallelism\n- ZeRO-style optimizer sharding\n- activation checkpointing\n\nTraining is constrained by GPU memory, interconnect bandwidth, compute utilization, checkpoint I\u002FO, failure recovery, and cluster scheduling.\n\n## What to implement\n\nBuild a mini pretraining pipeline:\n\n```text\ncollect text\nclean text\ntrain tokenizer\npack sequences\ntrain tiny model\ntrack loss\nsample generations\nevaluate basic capability\n```\n\n## Evaluation gate\n\nYou pass 
this layer if you can read a model technical report and identify:\n\n```text\ndata recipe\ntoken budget\nmodel size\narchitecture\ntraining objective\ncompute estimate\nevaluation setup\ncontamination risks\n```\n\n---\n\n# Layer 3: Post-Training\n\n## Objective\n\nUnderstand how base models become useful assistants.\n\nBase models complete text. Post-trained models follow instructions.\n\n## Post-training pipeline\n\n```text\nbase model\n→ supervised fine-tuning\n→ preference optimization\n→ reinforcement learning \u002F direct optimization\n→ safety tuning\n→ refusal calibration\n→ formatting alignment\n→ evaluation\n```\n\n## Supervised fine-tuning\n\nSFT trains the model on instruction-response pairs.\n\nIt teaches instruction following, chat behavior, formatting, domain response style, role behavior, and basic helpfulness.\n\nBut SFT alone can teach imitation, not necessarily preference quality.\n\n## Preference optimization\n\nPreference data contains comparisons:\n\n```text\nprompt\nchosen answer\nrejected answer\n```\n\nThe model learns which answer is preferred.\n\nMethods include:\n\n- RLHF\n- DPO\n- IPO\n- KTO\n- ORPO\n- RLAIF\n\n## RLHF\n\nRLHF often uses:\n\n```text\npreference data\n→ reward model\n→ policy optimization\n```\n\nBenefits:\n\n- improves helpfulness\n- aligns with human preference\n- improves conversational behavior\n\nRisks:\n\n- reward hacking\n- over-optimization\n- verbosity bias\n- style over truth\n- calibration damage\n\n## DPO\n\nDirect Preference Optimization removes the separate reward-model training loop.\n\nIt directly optimizes the model using preference pairs.\n\nBenefits:\n\n- simpler than RLHF\n- easier to implement\n- widely used for alignment experiments\n\nLimit:\n\n```text\nQuality depends heavily on preference data quality.\n```\n\n## RLVR and GRPO\n\nRL with verifiable rewards is important for reasoning tasks.\n\nA reward is verifiable when correctness can be checked automatically.\n\nExamples:\n\n- math answer 
correctness\n- code tests passing\n- exact symbolic results\n- game outcomes\n- tool-verified facts\n\nThis is more reliable than subjective reward for many reasoning tasks.\n\n## Safety tuning\n\nSafety tuning shapes refusal behavior, policy adherence, harmful request handling, uncertainty expression, tool permission behavior, and sensitive data handling.\n\nBad safety tuning can cause over-refusal, under-refusal, evasive answers, false confidence, and degraded utility.\n\n## What to implement\n\nRun a small post-training experiment:\n\n```text\nbase model\n→ SFT\n→ preference optimization\n→ before\u002Fafter eval\n```\n\nTrack instruction following, factual accuracy, formatting, refusal behavior, hallucination, verbosity, and domain performance.\n\n## Evaluation gate\n\nYou pass this layer if you can decide between:\n\n```text\nprompting\nRAG\nSFT\nLoRA\nDPO\ncontinued pretraining\n```\n\nbased on the actual failure mode.\n\n---\n\n# Layer 4: Reasoning Models\n\n## Objective\n\nUnderstand models that spend additional inference compute to solve harder tasks.\n\n## Core idea\n\nA reasoning model is not just a model that outputs long explanations.\n\nReasoning systems often involve longer internal deliberation, verifiable rewards, search, self-consistency, verifier models, test-time compute, and specialized post-training.\n\n## Chain-of-thought\n\nChain-of-thought encourages intermediate reasoning.\n\nIt can help with math, logic, planning, code, and multi-step questions.\n\nBut it can fail when reasoning is ungrounded, the model fabricates steps, the task requires external knowledge, the chain is persuasive but wrong, or hidden assumptions go unchecked.\n\n## Test-time compute\n\nTest-time compute means spending more inference resources for better answers.\n\nExamples:\n\n- generate multiple candidates\n- vote across answers\n- use a verifier\n- search over reasoning paths\n- run code\u002Ftools\n- critique and revise\n\nTradeoff:\n\n```text\nbetter quality 
potential\nvs\nhigher latency and cost\n```\n\n## Verifiers\n\nA verifier scores or checks candidate answers.\n\nTypes:\n\n- outcome verifier\n- process verifier\n- unit test verifier\n- symbolic verifier\n- retrieval-grounded verifier\n- human verifier\n\nVerifiers work best when correctness is measurable.\n\n## Overthinking failure\n\nReasoning models can overthink.\n\nSymptoms:\n\n- unnecessary long reasoning\n- changing correct answers\n- unstable final answer\n- higher cost without better quality\n- worse performance on simple tasks\n\nDecision rule:\n\n```text\nUse reasoning models where extra compute changes accuracy.\nDo not use them by default.\n```\n\n## What to implement\n\nCreate a reasoning eval harness:\n\n```text\nquestion\nbaseline answer\nreasoning answer\ntool-verified answer\nlatency\ncost\ncorrectness\nfailure mode\n```\n\n## Evaluation gate\n\nYou pass this layer if you can classify a task into:\n\n```text\ndirect answer\nretrieval required\ntool required\nreasoning required\nhuman approval required\n```\n\n---\n\n# Layer 5: Inference Fundamentals\n\n## Objective\n\nUnderstand LLM inference as a systems problem.\n\n## Request lifecycle\n\n```text\nrequest arrives\n→ tokenize\n→ build prompt\u002Fchat template\n→ prefill\n→ first token\n→ decode loop\n→ stream output\n→ stop condition\n→ log trace\n```\n\n## Prefill\n\nPrefill processes the input prompt.\n\nIt creates KV cache for prompt tokens.\n\nMain metric:\n\n```text\nTTFT = time to first token\n```\n\nLong prompts increase TTFT.\n\n## Decode\n\nDecode generates output one token at a time.\n\nMain metric:\n\n```text\nTPOT = time per output token\n```\n\nDecode is often constrained by memory bandwidth and KV cache reads.\n\n## Throughput vs latency\n\nThroughput:\n\n```text\ntokens per second\n```\n\nLatency:\n\n```text\nhow long one user waits\n```\n\nThey are not the same.\n\nA system can have high throughput and poor user latency.\n\nMeasure both.\n\n## Batching\n\nBatching improves GPU 
utilization.\n\nBut LLM requests have variable lengths.\n\nStatic batching wastes capacity because every request in a batch waits for the longest one to finish.\n\nContinuous batching adds and removes requests at each decode step.\n\nThis improves utilization under live traffic.\n\n## Metrics\n\nTrack:\n\n```text\nTTFT\nTPOT\nend-to-end latency\ntokens\u002Fsec\nrequests\u002Fsec\np50 latency\np95 latency\np99 latency\nGPU utilization\nVRAM usage\nqueue time\nerror rate\n```\n\n## What to implement\n\nBuild an inference benchmark suite.\n\nTest:\n\n```text\nsingle request\nmany concurrent requests\nshort prompt\nlong prompt\nshort output\nlong output\nstreaming\nnon-streaming\n```\n\n## Evaluation gate\n\nYou pass this layer if you never report “tokens\u002Fsec” without a workload definition.\n\n---\n\n# Layer 6: Serving Engines\n\n## Objective\n\nChoose the correct runtime for a workload.\n\n## Engine categories\n\n### Local developer engines\n\nExamples:\n\n- Ollama\n- llama.cpp\n\nBest for local experiments, CPU\u002FMac workflows, edge deployments, and quick testing.\n\nNot ideal for high-concurrency production serving, advanced GPU scheduling, or multi-tenant inference platforms.\n\n### Production open-source serving engines\n\nExamples:\n\n- vLLM\n- SGLang\n- Hugging Face TGI\n- LMDeploy\n\nBest for high-throughput serving, OpenAI-compatible APIs, batching, prefix caching, multi-GPU serving, and production model endpoints.\n\n### Vendor-optimized engines\n\nExample:\n\n- TensorRT-LLM\n\nBest for NVIDIA GPU optimization, maximum performance, controlled deployment environments, and latency-sensitive workloads.\n\nTradeoff:\n\n```text\nhigher complexity\nmore hardware-specific optimization\n```\n\n## Selection criteria\n\nChoose a serving engine based on:\n\n```text\nmodel architecture support\nhardware\nquantization format\nlatency target\nthroughput target\ncontext length\nconcurrency\nstructured output needs\nLoRA serving\nmulti-GPU support\nobservability\noperational complexity\nteam skill\n```\n\n## Serving comparison matrix\n\n| Engine 
| Best use | Strength | Risk |\n|---|---|---|---|\n| vLLM | General production serving | Throughput, ecosystem | Model-specific edge cases |\n| SGLang | Structured\u002Fhigh-performance workloads | Prefix reuse, structured generation | Operational learning curve |\n| TensorRT-LLM | NVIDIA-optimized serving | Performance | Complexity |\n| llama.cpp | Local\u002Fedge | Portability | Not ideal for high-concurrency serving |\n| Ollama | Developer UX | Simplicity | Limited production control |\n\n## Evaluation gate\n\nYou pass this layer if you can justify engine choice using constraints, not preference.\n\n---\n\n# Layer 7: KV Cache and Long Context\n\n## Objective\n\nUnderstand the real cost of context length.\n\n## KV cache memory\n\nThe KV cache stores keys and values for every prompt token and every generated token of each active request.\n\nMemory grows with:\n\n```text\nbatch_size\ncontext_length\nnum_layers\nnum_kv_heads\nhead_dim\ndtype_bytes\n```\n\nSimplified:\n\n```text\nKV memory ≈ 2 × batch × seq_len × layers × kv_heads × head_dim × bytes\n```\n\nThe factor `2` is for K and V.\n\n## Why long context is hard\n\nLong context causes:\n\n- higher prefill cost\n- larger KV cache\n- higher memory pressure\n- slower scheduling\n- lost-in-the-middle behavior\n- attention dilution\n- more prompt injection surface\n- more irrelevant information\n- higher cost\n\n## Prefix caching\n\nPrefix caching reuses KV cache for shared prompt prefixes.\n\nUseful for repeated system prompts, few-shot examples, static policy blocks, agent frameworks, repeated document prefixes, and multi-turn sessions.\n\n## Context is not memory\n\nA 128k context window means the model can accept 128k tokens.\n\nIt does not mean it can reliably reason over all 128k tokens.\n\nQuality still depends on position sensitivity, retrieval quality, prompt structure, instruction hierarchy, distractor density, and model training.\n\n## RAG vs long context\n\nUse long context when:\n\n- all context is relevant\n- order matters\n- 
context changes per request\n- retrieval misses critical details\n\nUse RAG when:\n\n- corpus is large\n- only small slices are relevant\n- freshness matters\n- citations matter\n- permission control matters\n- cost matters\n\n## What to implement\n\nBuild a KV cache calculator.\n\nInputs:\n\n```text\nlayers\nkv_heads\nhead_dim\ndtype\nbatch size\ncontext length\n```\n\nOutput:\n\n```text\nestimated KV memory\nmax concurrency\nmemory risk\n```\n\n## Evaluation gate\n\nYou pass this layer if you can estimate memory before deployment.\n\n---\n\n# Layer 8: Quantization and Compression\n\n## Objective\n\nReduce memory and cost without destroying quality.\n\n## Numeric formats\n\nCommon formats:\n\n- FP32\n- FP16\n- BF16\n- FP8\n- INT8\n- INT4\n\nLower precision reduces memory.\n\nBut it introduces numerical error.\n\nQuantization is controlled damage.\n\n## What can be quantized\n\n### Weights\n\nMost common. Reduces model memory.\n\n### Activations\n\nMore complex. Can improve throughput if kernels support it.\n\n### KV cache\n\nReduces memory for long context and high concurrency. Can damage long-context quality.\n\n## Common methods\n\n### GPTQ\n\nPost-training weight quantization. Often used for GPU inference.\n\n### AWQ\n\nActivation-aware weight quantization. Often strong for preserving quality in low-bit inference.\n\n### GGUF\n\nCommon format for llama.cpp ecosystem. 
Useful for local and edge deployment.\n\n### SmoothQuant\n\nBalances activation and weight quantization difficulty.\n\n### QLoRA\n\nUses quantized base weights for memory-efficient fine-tuning.\n\n## Benchmark dimensions\n\nMeasure:\n\n```text\nquality\nlatency\nthroughput\nVRAM\nTTFT\nTPOT\nformat stability\ncode correctness\nreasoning accuracy\nRAG faithfulness\ntool-call validity\n```\n\nDo not evaluate quantization only with perplexity.\n\n## What to implement\n\nRun:\n\n```text\nFP16 baseline\nINT8\nINT4 GPTQ\nINT4 AWQ\nGGUF\nKV INT8\n```\n\nCompare against domain evals.\n\n## Evaluation gate\n\nYou pass this layer if you can say exactly what was quantized, how, and what quality changed.\n\n---\n\n# Layer 9: RAG Systems\n\n## Objective\n\nBuild retrieval systems that ground LLM outputs in external knowledge.\n\n## Basic RAG pipeline\n\n```text\ndocuments\n→ parsing\n→ cleaning\n→ chunking\n→ embedding\n→ indexing\n→ retrieval\n→ reranking\n→ prompt construction\n→ generation\n→ citation validation\n→ evaluation\n```\n\n## Chunking\n\nChunking controls what the retriever can find.\n\nBad chunking causes missing context, fragmented answers, irrelevant retrieval, citation mismatch, and hallucination.\n\nChunking strategies:\n\n- fixed-size chunks\n- semantic chunks\n- section-based chunks\n- parent-child chunks\n- sliding windows\n- page-level chunks\n\n## Retrieval methods\n\n### BM25\n\nGood for exact terms, IDs, names, legal references, rare words.\n\n### Dense retrieval\n\nGood for semantic similarity.\n\n### Hybrid retrieval\n\nCombines lexical and semantic retrieval.\n\nOften stronger than either alone.\n\n### RRF\n\nReciprocal Rank Fusion combines ranked lists from multiple retrievers.\n\nSimple and effective.\n\n## Reranking\n\nA reranker scores query-document relevance more precisely.\n\nTypical flow:\n\n```text\nretrieve top 50\n→ rerank\n→ keep top 5-10\n```\n\nReranking improves precision at the cost of latency.\n\n## RAG failure modes\n\nFailures can 
happen at:\n\n```text\nparsing\nchunking\nembedding\nindexing\nretrieval\nreranking\nprompt construction\ngeneration\ncitation validation\n```\n\nDebug the stage.\n\nDo not blame the model first.\n\n## What to implement\n\nBuild a RAG system with:\n\n```text\nBM25\ndense retrieval\nhybrid retrieval\nRRF\nreranker\ncitations\neval set\ntrace logging\n```\n\n## Evaluation gate\n\nYou pass this layer if you can separate retrieval failure from generation failure.\n\n---\n\n# Layer 10: Agentic Systems\n\n## Objective\n\nBuild controlled tool-using LLM systems.\n\n## Agent definition\n\nAn agent is a system where an LLM can choose actions.\n\nExamples:\n\n- call tools\n- query databases\n- browse documents\n- write files\n- send emails\n- schedule actions\n- run code\n- ask for approval\n\nThe danger:\n\n```text\nMore autonomy means more failure surface.\n```\n\n## Workflows vs agents\n\nUse deterministic workflows when the steps are known.\n\nUse agents when the path must be chosen dynamically.\n\nRule:\n\n```text\nWorkflow first.\nAgent only where decision flexibility is needed.\n```\n\n## Core patterns\n\n### Router\n\nChooses which path or model to use.\n\n### Tool caller\n\nCalls external functions with structured arguments.\n\n### Planner\n\nBreaks a task into steps.\n\n### Executor\n\nPerforms actions.\n\n### Verifier\n\nChecks output.\n\n### Human approval gate\n\nStops risky actions before execution.\n\n## State and memory\n\nAgents need state.\n\nState may include user goal, current plan, completed steps, tool outputs, constraints, errors, budget, and approval status.\n\nMemory must be controlled.\n\nUnbounded memory creates confusion and security risk.\n\n## Failure modes\n\n- infinite loops\n- tool misuse\n- wrong tool arguments\n- stale memory\n- prompt injection\n- unauthorized action\n- hidden cost explosion\n- hallucinated tool results\n- invalid final answer\n\n## What to implement\n\nBuild a bounded agent:\n\n```text\nplanner\ntool registry\nschema 
validation\nexecutor\nverifier\nretry limit\ncost limit\napproval gate\ntrace log\n```\n\n## Evaluation gate\n\nYou pass this layer if your agent can fail safely.\n\n---\n\n# Layer 11: Evaluation and Benchmarking\n\n## Objective\n\nMeasure LLM system quality before users discover failures.\n\n## Evaluation types\n\n### Model evals\n\nMeasure model behavior directly.\n\nExamples: factuality, reasoning, coding, summarization, instruction following.\n\n### RAG evals\n\nMeasure retrieval and grounded generation.\n\nMetrics:\n\n- context precision\n- context recall\n- faithfulness\n- answer relevance\n- citation correctness\n\n### Agent evals\n\nMeasure action quality.\n\nMetrics:\n\n- task success\n- tool correctness\n- invalid tool calls\n- loop rate\n- approval violations\n- cost per task\n\n### Production evals\n\nMeasure real-world operation.\n\nMetrics:\n\n- latency\n- error rate\n- user correction rate\n- fallback rate\n- escalation rate\n- cost\n- safety incidents\n\n## Golden datasets\n\nA golden dataset is a curated set of cases representing expected behavior.\n\nIt should include easy cases, hard cases, edge cases, adversarial cases, outdated info cases, ambiguous cases, negative cases, and refusal cases.\n\n## LLM-as-judge\n\nLLM judges can help, but they must be controlled.\n\nUse clear rubrics, pairwise comparisons, calibration examples, human-reviewed samples, and judge agreement checks.\n\nNever blindly trust judge scores.\n\n## Regression testing\n\nEvery production change should run evals.\n\nChanges include model update, prompt change, retrieval change, reranker change, chunking change, tool change, quantization change, and serving engine change.\n\n## What to implement\n\nBuild an eval harness:\n\n```text\ndataset\ninput\nexpected behavior\nretrieved context\nmodel output\njudge rubric\nlatency\ncost\nfailure category\nrelease decision\n```\n\n## Evaluation gate\n\nYou pass this layer if every major system change has a measurable before\u002Fafter 
result.\n\n---\n\n# Layer 12: Production Architecture\n\n## Objective\n\nDesign LLM systems that survive real users, real latency, real cost, and real failure.\n\n## Reference architecture\n\n```text\nclient\n→ API gateway\n→ authentication\n→ rate limiting\n→ request logger\n→ prompt builder\n→ router\n→ retrieval service\n→ tool service\n→ model gateway\n→ serving engine\n→ response validator\n→ trace store\n→ eval pipeline\n→ monitoring dashboard\n```\n\n## Model gateway\n\nA model gateway abstracts access to multiple models.\n\nIt handles routing, fallbacks, retries, budget policies, provider abstraction, model versioning, logging, and safety checks.\n\n## Observability\n\nTrack:\n\n```text\nprompt\nmodel\nversion\nlatency\ntokens in\ntokens out\nretrieved chunks\ntool calls\nerrors\ncost\nuser feedback\neval score\n```\n\nWithout traces, debugging becomes guessing.\n\n## Security\n\nProduction LLM systems must handle prompt injection, indirect prompt injection, tool abuse, data exfiltration, PII leakage, retrieval poisoning, unauthorized access, tenant isolation, and audit logging.\n\n## Cost control\n\nCost comes from input tokens, output tokens, model size, inference engine, GPU utilization, concurrency, reranking, embeddings, tool calls, retries, logging, and evaluation.\n\nCost must be measured per task, not only per token.\n\n## What to implement\n\nDesign a complete production architecture document:\n\n```text\nsystem diagram\ndata flow\nmodel flow\nfailure modes\nsecurity controls\nobservability plan\ncost model\neval gates\nscaling plan\nrollback plan\n```\n\n## Evaluation gate\n\nYou pass this layer if you can review an LLM architecture and identify reliability, security, cost, and quality risks.\n\n---\n\n# Advanced tracks\n\n## Advanced Track A: Multimodal LLMs\n\nLearn:\n\n- vision-language models\n- audio-language models\n- document understanding\n- OCR pipelines\n- image embeddings\n- video frame sampling\n- multimodal RAG\n- visual grounding\n- 
multimodal evals\n\nBuild:\n\n```text\nPDF\u002Fimage ingestion\nOCR\nlayout extraction\nvisual chunking\ntext + image retrieval\ngrounded answer generation\ncitation to page\u002Fregion\n```\n\n## Advanced Track B: Domain Adaptation\n\nLearn:\n\n- prompt adaptation\n- RAG\n- SFT\n- LoRA\n- QLoRA\n- continued pretraining\n- domain-specific tokenization\n- ontology grounding\n- terminology normalization\n- legal\u002Fmedical\u002Ffinancial evals\n\nDecision hierarchy:\n\n```text\nprompting\n→ RAG\n→ SFT\u002FLoRA\n→ continued pretraining\n```\n\nContinued pretraining is expensive and should not be the default.\n\n## Advanced Track C: LLM Security\n\nLearn:\n\n- prompt injection\n- jailbreaks\n- indirect prompt injection\n- retrieval poisoning\n- tool abuse\n- sandboxing\n- output validation\n- permission boundaries\n- audit trails\n- secure agent design\n\nBuild:\n\n```text\nred-team suite\nprompt injection tests\ntool misuse tests\nretrieval poisoning tests\nPII leakage tests\npolicy bypass tests\n```\n\n## Advanced Track D: Hardware-Aware LLM Engineering\n\nLearn:\n\n- HBM bandwidth\n- tensor cores\n- CUDA kernels\n- FlashAttention\n- NCCL\n- NVLink\n- PCIe\n- tensor parallelism\n- pipeline parallelism\n- expert parallelism\n- GPU memory fragmentation\n\nBuild:\n\n```text\nhardware fit calculator\nmodel memory estimator\nKV cache estimator\nthroughput benchmark\nGPU utilization dashboard\n```\n\n## Advanced Track E: Research Literacy\n\nUse this template for every paper:\n\n```text\nClaim:\nMechanism:\nWhat changed:\nWhat stayed constant:\nDataset:\nCompute:\nAblation:\nMetric:\nWeakness:\nReproducibility:\nProduction implication:\n```\n\nThe goal is to identify mechanism, not memorize title.\n\n---\n\n# Master artifact portfolio\n\nBuild these artifacts to prove competence.\n\n| ID | Artifact | Purpose |\n|---|---|---|\n| 01 | Tiny Transformer | Understand token generation mechanically |\n| 02 | Tokenizer Comparison Notebook | Measure tokenizer impact across 
languages\u002Fdomains |\n| 03 | Mini Pretraining Pipeline | Understand data, tokenization, loss, and sampling |\n| 04 | SFT Experiment | Learn instruction tuning |\n| 05 | DPO\u002FPreference Experiment | Learn preference optimization |\n| 06 | Reasoning Eval Harness | Compare normal vs reasoning models |\n| 07 | Inference Benchmark Suite | Measure TTFT, TPOT, latency, throughput |\n| 08 | Serving Engine Matrix | Compare vLLM, SGLang, TensorRT-LLM, llama.cpp |\n| 09 | KV Cache Calculator | Estimate serving memory |\n| 10 | Quantization Benchmark | Measure quality\u002Fcost tradeoffs |\n| 11 | Production RAG System | Ground answers with retrieval and citations |\n| 12 | Agent Workflow | Build controlled tool use |\n| 13 | Eval Dashboard | Track quality, latency, cost, safety |\n| 14 | Production Architecture Diagram | Design deployable platform |\n| 15 | Security Red-Team Suite | Test prompt injection and tool abuse |\n| 16 | Cost Model | Estimate per-task and platform-level cost |\n| 17 | Paper Review Database | Build research literacy |\n\n---\n\n# Repository structure\n\nRecommended structure:\n\n```text\nllm-systems-engineering-roadmap\u002F\n│\n├── README.md\n├── LICENSE\n├── roadmap\u002F\n│   ├── 01_llm_foundations.md\n│   ├── 02_training_pipeline.md\n│   ├── 03_post_training.md\n│   ├── 04_reasoning_models.md\n│   ├── 05_inference_fundamentals.md\n│   ├── 06_serving_engines.md\n│   ├── 07_kv_cache_long_context.md\n│   ├── 08_quantization_compression.md\n│   ├── 09_rag_systems.md\n│   ├── 10_agentic_systems.md\n│   ├── 11_evaluation_benchmarking.md\n│   └── 12_production_architecture.md\n│\n├── artifacts\u002F\n│   ├── tiny_transformer\u002F\n│   ├── tokenizer_comparison\u002F\n│   ├── mini_pretraining\u002F\n│   ├── post_training\u002F\n│   ├── reasoning_eval\u002F\n│   ├── inference_benchmark\u002F\n│   ├── kv_cache_calculator\u002F\n│   ├── quantization_benchmark\u002F\n│   ├── rag_system\u002F\n│   ├── agent_workflow\u002F\n│   ├── 
eval_dashboard\u002F\n│   └── production_architecture\u002F\n│\n├── templates\u002F\n│   ├── paper_review_template.md\n│   ├── model_eval_template.md\n│   ├── rag_eval_template.md\n│   ├── agent_eval_template.md\n│   ├── architecture_review_template.md\n│   └── cost_model_template.md\n│\n├── resources\u002F\n│   ├── papers.md\n│   ├── docs.md\n│   ├── courses.md\n│   ├── tools.md\n│   └── benchmarks.md\n│\n└── checklists\u002F\n    ├── model_selection_checklist.md\n    ├── rag_production_checklist.md\n    ├── agent_safety_checklist.md\n    ├── inference_benchmark_checklist.md\n    ├── quantization_checklist.md\n    └── production_readiness_checklist.md\n```\n\n---\n\n# Definition of done\n\nYou are not done when you read the chapters.\n\nYou are done when you can produce these outputs.\n\n## Foundation done\n\n```text\nCan implement and explain a tiny Transformer.\nCan trace one token generation.\nCan explain tokenization, logits, sampling, and KV cache.\n```\n\n## Training done\n\n```text\nCan build a mini pretraining loop.\nCan explain data mixture, loss, scaling, and contamination.\n```\n\n## Post-training done\n\n```text\nCan compare SFT, RLHF, DPO, GRPO, and RLAIF.\nCan choose adaptation method based on failure mode.\n```\n\n## Reasoning done\n\n```text\nCan evaluate when reasoning models help.\nCan measure accuracy vs latency\u002Fcost.\n```\n\n## Inference done\n\n```text\nCan benchmark TTFT, TPOT, throughput, p95 latency, and VRAM.\n```\n\n## Serving done\n\n```text\nCan choose a serving engine based on workload and hardware.\n```\n\n## KV cache done\n\n```text\nCan estimate KV cache memory and explain long-context tradeoffs.\n```\n\n## Quantization done\n\n```text\nCan evaluate quantization quality against domain tasks.\n```\n\n## RAG done\n\n```text\nCan build and debug hybrid retrieval with citations and evals.\n```\n\n## Agents done\n\n```text\nCan build bounded tool-using workflows with safe failure behavior.\n```\n\n## Evaluation done\n\n```text\nCan 
build regression evals and release gates.\n```\n\n## Production done\n\n```text\nCan design secure, observable, scalable LLM architecture.\n```\n\n---\n\n# Recommended source map\n\n## Foundations\n\n- Attention Is All You Need: https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762\n- Hugging Face Tokenizer Summary: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Ftokenizer_summary\n- Hugging Face Transformers: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex\n- PyTorch Transformer Reference: https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.Transformer.html\n\n## Training and post-training\n\n- Hugging Face TRL: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Findex\n- Hugging Face PEFT: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fpeft\u002Findex\n- LoRA: https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.09685\n- QLoRA: https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14314\n- DPO: https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290\n- DeepSeek-R1: https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948\n\n## Inference and serving\n\n- vLLM Docs: https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002F\n- SGLang Docs: https:\u002F\u002Fdocs.sglang.ai\u002F\n- TensorRT-LLM Docs: https:\u002F\u002Fdocs.nvidia.com\u002Ftensorrt-llm\u002Findex.html\n- llama.cpp: https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\n- Hugging Face TGI: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Findex\n\n## RAG and evaluation\n\n- Ragas Metrics: https:\u002F\u002Fdocs.ragas.io\u002Fen\u002Fstable\u002Fconcepts\u002Fmetrics\u002Favailable_metrics\u002F\n- BEIR Benchmark: https:\u002F\u002Fgithub.com\u002Fbeir-cellar\u002Fbeir\n- MS MARCO: https:\u002F\u002Fmicrosoft.github.io\u002Fmsmarco\u002F\n- Sentence Transformers: https:\u002F\u002Fwww.sbert.net\u002F\n\n## Agents\n\n- OpenAI Function Calling \u002F Tools: 
https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fguides\u002Ffunction-calling\n- LangGraph Workflows and Agents: https:\u002F\u002Fdocs.langchain.com\u002Foss\u002Fpython\u002Flanggraph\u002Fworkflows-agents\n- LangGraph Memory: https:\u002F\u002Fdocs.langchain.com\u002Foss\u002Fpython\u002Flanggraph\u002Fmemory\n\n## Security\n\n- OWASP Top 10 for LLM Applications: https:\u002F\u002Fowasp.org\u002Fwww-project-top-10-for-large-language-model-applications\u002F\n- NIST AI Risk Management Framework: https:\u002F\u002Fwww.nist.gov\u002Fitl\u002Fai-risk-management-framework\n\n---\n\n# Engineering checklists\n\n## Model selection checklist\n\n```text\ntask type\nlanguage\u002Fdomain\ncontext length\nlatency target\ncost target\nquality target\ntool use needed\nreasoning needed\ndeployment mode\ndata privacy constraints\nfine-tuning need\nserving engine compatibility\nquantization support\neval result\n```\n\n## RAG production checklist\n\n```text\ndocument parser tested\nchunking strategy validated\nmetadata schema defined\nhybrid retrieval implemented\nreranker tested\ncitations validated\npermission filters enforced\nfreshness handled\nRAG eval set built\nretrieval failures categorized\ngeneration failures categorized\nlatency measured\ncost measured\n```\n\n## Agent safety checklist\n\n```text\ntools have schemas\narguments validated\npermissions enforced\ndangerous actions require approval\nretry limits exist\nbudget limits exist\ntool outputs are logged\nstate is inspectable\nprompt injection tests exist\nfallback path exists\nhuman escalation exists\n```\n\n## Inference benchmark checklist\n\n```text\nmodel version\nprecision\nserving engine\nGPU type\nbatch size\nconcurrency\nprompt length\noutput length\nTTFT\nTPOT\np50 latency\np95 latency\np99 latency\ntokens\u002Fsec\nVRAM usage\nGPU utilization\nerror rate\n```\n\n## Quantization checklist\n\n```text\nbaseline measured\nmethod identified\nweights\u002Factivations\u002FKV specified\ncalibration data 
documented\nserving engine compatible\nquality evaluated\nlatency evaluated\nVRAM evaluated\nhard cases tested\nformat stability tested\nrollback available\n```\n\n## Production readiness checklist\n\n```text\nauthentication\nauthorization\ntenant isolation\nrate limiting\nprompt logging policy\nPII policy\nretrieval permissions\nmodel fallback\neval gate\nmonitoring\nalerts\ncost dashboard\nsecurity tests\nrollback plan\nincident response\n```\n\n---\n\n# How to use this roadmap\n\nDo not read it passively.\n\nUse this loop:\n\n```text\nstudy one layer\n→ implement one artifact\n→ measure it\n→ write failure notes\n→ create decision rules\n→ move to next layer\n```\n\nFor every topic, produce:\n\n```text\n1. mechanism explanation\n2. code or architecture artifact\n3. benchmark or eval\n4. failure mode list\n5. decision rule\n```\n\nThe roadmap is complete only when it changes your engineering decisions.\n\n---\n\n# Final compression\n\n```text\nLLM foundations teach how tokens become predictions.\nTraining teaches where base capability comes from.\nPost-training teaches how behavior is shaped.\nReasoning teaches when extra inference compute helps.\nInference teaches why latency and memory dominate.\nServing engines teach how runtime choices affect production.\nKV cache teaches why context is expensive.\nQuantization teaches how to trade precision for cost.\nRAG teaches how to ground outputs.\nAgents teach how to connect models to actions.\nEvaluation teaches how to know if anything works.\nProduction architecture teaches how to make it survive real usage.\n```\n\nThe professional standard is not “I know LLMs.”\n\nThe professional standard is:\n\n```text\nI can design, measure, debug, and operate LLM systems under real 
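constraints.\n```\n","This project provides a professional roadmap for comprehensively mastering large language model (LLM) internals, training, inference, retrieval-augmented generation (RAG), agent control, evaluation, and production architecture. Its core coverage includes understanding how models work internally, building efficient training and inference systems, designing sound retrieval architectures, and ensuring measurable quality, latency, cost, and reliability. Rather than focusing only on the model itself, it emphasizes a systematic learning path from fundamentals to advanced topics, spanning theory through practice. It is aimed at AI engineers, ML engineers, NLP engineers, and backend developers who want to move beyond surface-level LLM usage and build enterprise-grade solutions, and it is especially suited to building enterprise RAG platforms, on-prem LLM deployments, and multi-model inference gateways.","2026-06-11 03:30:52","CREATED_QUERY"]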