[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71953":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},71953,"minbpe","karpathy\u002Fminbpe","karpathy","Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.","",null,"Python",10560,1059,89,31,0,27,42,82,81,44.08,"MIT License",false,"master",true,[],"2026-06-12 02:02:56","# minbpe\n\nMinimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is \"byte-level\" because it runs on UTF-8 encoded strings.\n\nThis algorithm was popularized for LLMs by the [GPT-2 paper](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2) from OpenAI. [Sennrich et al. 2015](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.\n\nThere are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. 
The files of the repo are as follows:\n\n1. [minbpe\u002Fbase.py](minbpe\u002Fbase.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save\u002Fload functionality, and a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.\n2. [minbpe\u002Fbasic.py](minbpe\u002Fbasic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.\n3. [minbpe\u002Fregex.py](minbpe\u002Fregex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4. This class also handles special tokens, if any.\n4. [minbpe\u002Fgpt4.py](minbpe\u002Fgpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate (and likely historical?) 1-byte token permutations.\n\nFinally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests\u002Ftaylorswift.txt](tests\u002Ftaylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. 
This script runs in about 25 seconds on my (M1) MacBook.\n\nAll of the files above are very short and thoroughly commented, and also contain a usage example on the bottom of the file.\n\n## quick start\n\nAs the simplest example, we can reproduce the [Wikipedia article on BPE](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FByte_pair_encoding) as follows:\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntext = \"aaabdaaabac\"\ntokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges\nprint(tokenizer.encode(text))\n# [258, 100, 258, 97, 99]\nprint(tokenizer.decode([258, 100, 258, 97, 99]))\n# aaabdaaabac\ntokenizer.save(\"toy\")\n# writes two files: toy.model (for loading) and toy.vocab (for viewing)\n```\n\nAccording to Wikipedia, running bpe on the input string: \"aaabdaaabac\" for 3 merges results in the string: \"XdXac\" where  X=ZY, Y=ab, and Z=aa. The tricky thing to note is that minbpe always allocates the 256 individual bytes as tokens, and then merges bytes as needed from there. So for us a=97, b=98, c=99, d=100 (their [ASCII](https:\u002F\u002Fwww.asciitable.com) values). Then when (a,a) is merged to Z, Z will become 256. Likewise Y will become 257 and X 258. So we start with the 256 bytes, and do 3 merges to get to the result above, with the expected output of [258, 100, 258, 97, 99].\n\n## inference: GPT-4 comparison\n\nWe can verify that the `RegexTokenizer` has feature parity with the GPT-4 tokenizer from [tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken) as follows:\n\n```python\ntext = \"hello123!!!? (안녕하세요!) 
😉\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n\n# ours\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n```\n\n(you'll have to `pip install tiktoken` to run). Under the hood, the `GPT4Tokenizer` is just a light wrapper around `RegexTokenizer`, passing in the merges and the special tokens of GPT-4. We can also ensure the special tokens are handled correctly:\n\n```python\ntext = \"\u003C|endoftext|>hello world\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n\n# ours\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n```\n\nNote that just like tiktoken, we have to explicitly declare our intent to use and parse special tokens in the call to encode. Otherwise this can become a major footgun, unintentionally tokenizing attacker-controlled data (e.g. user prompts) with special tokens. The `allowed_special` parameter can be set to \"all\", \"none\", or a list of special tokens to allow.\n\n## training\n\nUnlike tiktoken, this code allows you to train your own tokenizer. In principle and to my knowledge, if you train the `RegexTokenizer` on a large dataset with a vocabulary size of 100K, you would reproduce the GPT-4 tokenizer.\n\nThere are two paths you can follow. First, you can decide that you don't want the complexity of splitting and preprocessing text with regex patterns, and you also don't care for special tokens. In that case, reach for the `BasicTokenizer`. 
You can train it, and then encode and decode for example as follows:\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=4096)\ntokenizer.encode(\"hello world\") # string -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> string\ntokenizer.save(\"mymodel\") # writes mymodel.model and mymodel.vocab\ntokenizer.load(\"mymodel.model\") # loads the model back, the vocab is just for vis\n```\n\nIf you instead want to follow along with what OpenAI did for their text tokenizer, it's a good idea to adopt their approach of using a regex pattern to split the text by categories. The GPT-4 pattern is the default with the `RegexTokenizer`, so you'd simply do something like:\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.encode(\"hello world\") # string -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> string\ntokenizer.save(\"tok32k\") # writes tok32k.model and tok32k.vocab\ntokenizer.load(\"tok32k.model\") # loads the model back from disk\n```\n\nWhere, of course, you'd want to change around the vocabulary size depending on the size of your dataset.\n\n**Special tokens**. Finally, you might wish to add special tokens to your tokenizer. Register these using the `register_special_tokens` function. For example, if you train with a vocab_size of 32768, then the first 256 tokens are raw byte tokens, the next 32768-256 are merge tokens, and after those you can add the special tokens. The last \"real\" merge token will have an id of 32767 (vocab_size - 1), so your first special token should come right after that, with an id of exactly 32768. 
So:\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.register_special_tokens({\"\u003C|endoftext|>\": 32768})\ntokenizer.encode(\"\u003C|endoftext|>hello world\", allowed_special=\"all\")\n```\n\nYou can of course add more tokens after that as well, as you like. Finally, I'd like to stress that I tried hard to keep the code itself clean, readable and hackable. You should not feel scared to read the code and understand how it works. The tests are also a nice place to look for more usage examples. That reminds me:\n\n## tests\n\nWe use the pytest library for tests. All of them are located in the `tests\u002F` directory. First `pip install pytest` if you haven't already, then:\n\n```bash\n$ pytest -v .\n```\n\nto run the tests. (-v is verbose, slightly prettier).\n\n## community extensions\n\n* [gnp\u002Fminbpe-rs](https:\u002F\u002Fgithub.com\u002Fgnp\u002Fminbpe-rs): A Rust implementation of `minbpe` providing (near) one-to-one correspondence with the Python version\n\n## exercise\n\nFor those trying to study BPE, here is the advised progression exercise for how you can build your own minbpe step by step. See [exercise.md](exercise.md).\n\n## lecture\n\nI built the code in this repository in this [YouTube video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=zduSFxRajkE). You can also find this lecture in text form in [lecture.md](lecture.md).\n\n## todos\n\n- write a more optimized Python version that could run over large files and big vocabs\n- write an even more optimized C or Rust version (think through)\n- rename GPT4Tokenizer to GPTTokenizer and support GPT-2\u002FGPT-3\u002FGPT-3.5 as well?\n- write a LlamaTokenizer similar to GPT4Tokenizer (i.e. 
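attempt sentencepiece equivalent)\n\n## License\n\nMIT\n","minbpe is a concise Python codebase implementing the Byte Pair Encoding (BPE) algorithm, which is commonly used for tokenization in large language models. The project provides two tokenizers: `BasicTokenizer`, which implements the basic BPE algorithm, and `RegexTokenizer`, which first splits the text with a regular expression so that merges never cross category boundaries. In addition, a lightweight `GPT4Tokenizer` reproduces the tokenization of GPT-4. These tools are well suited to language-model development that requires custom tokenizer training or a specific tokenization strategy. The code is clearly structured, thoroughly commented, and easy to understand and extend.",2,"2026-06-11 03:39:38","high_star"]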