[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-5494":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":37,"discoverSource":38},5494,"tokenizers","huggingface\u002Ftokenizers","huggingface","💥 Fast State-of-the-Art Tokenizers optimized for Research and Production","https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftokenizers",null,"Rust",10811,1122,122,112,0,2,15,100,10,44.15,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33],"bert","gpt","language-model","natural-language-processing","natural-language-understanding","nlp","transformers","2026-06-12 02:01:11","\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Flanding\u002Fassets\u002Ftokenizers\u002Ftokenizers-logo.png\" width=\"600\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\u003Cp align=\"center\">\n    \u003Cimg alt=\"Build\" src=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers\u002Fworkflows\u002FRust\u002Fbadge.svg\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers\u002Fblob\u002Fmain\u002FLICENSE\">\n        \u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Ftokenizers.svg?color=blue&cachedrop\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Ftokenizers\">\n        \u003Cimg 
src=\"https:\u002F\u002Fpepy.tech\u002Fbadge\u002Ftokenizers\u002Fweek\" \u002F>\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\nProvides an implementation of today's most used tokenizers, with a focus on performance and\nversatility.\n\n## Main features:\n\n - Train new vocabularies and tokenize, using today's most used tokenizers.\n - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes\n   less than 20 seconds to tokenize a GB of text on a server's CPU.\n - Easy to use, but also extremely versatile.\n - Designed for research and production.\n - Normalization comes with alignment tracking. It's always possible to get the part of the\n   original sentence that corresponds to a given token.\n - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.\n\n## Performance\nPerformance can vary depending on hardware, but running [~\u002Fbindings\u002Fpython\u002Fbenches\u002Ftest_tiktoken.py](bindings\u002Fpython\u002Fbenches\u002Ftest_tiktoken.py) should give the following on a g6 AWS instance:\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2b913d4b-e488-4cbc-b542-f90a6c40643d)\n\n\n## Bindings\n\nWe provide bindings for the following languages (more to come!):\n  - [Rust](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers\u002Ftree\u002Fmain\u002Ftokenizers) (Original implementation)\n  - [Python](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers\u002Ftree\u002Fmain\u002Fbindings\u002Fpython)\n  - [Node.js](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers\u002Ftree\u002Fmain\u002Fbindings\u002Fnode)\n  - [Ruby](https:\u002F\u002Fgithub.com\u002Fankane\u002Ftokenizers-ruby) (Contributed by @ankane, external repo)\n\n## Installation\n\nYou can install from source using:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers.git#subdirectory=bindings\u002Fpython\n```\n\nor install the released versions 
with\n\n```bash\npip install tokenizers\n```\n\n## Quick example using Python:\n\nChoose your model from Byte-Pair Encoding, WordPiece, or Unigram and instantiate a tokenizer:\n\n```python\nfrom tokenizers import Tokenizer\nfrom tokenizers.models import BPE\n\n# Set unk_token so unknown characters map to \"[UNK]\" instead of being dropped\ntokenizer = Tokenizer(BPE(unk_token=\"[UNK]\"))\n```\n\nYou can customize how pre-tokenization (e.g., splitting into words) is done:\n\n```python\nfrom tokenizers.pre_tokenizers import Whitespace\n\ntokenizer.pre_tokenizer = Whitespace()\n```\n\nTraining your tokenizer on a set of files then takes just two lines of code:\n\n```python\nfrom tokenizers.trainers import BpeTrainer\n\ntrainer = BpeTrainer(special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])\ntokenizer.train(files=[\"wiki.train.raw\", \"wiki.valid.raw\", \"wiki.test.raw\"], trainer=trainer)\n```\n\nOnce your tokenizer is trained, encode any text with just one line:\n```python\noutput = tokenizer.encode(\"Hello, y'all! How are you 😁 ?\")\nprint(output.tokens)\n# [\"Hello\", \",\", \"y\", \"'\", \"all\", \"!\", \"How\", \"are\", \"you\", \"[UNK]\", \"?\"]\n```\n\nCheck the [documentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftokenizers\u002Findex)\nor the [quicktour](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftokenizers\u002Fquicktour) to learn more!\n","huggingface\u002Ftokenizers is a high-performance tokenizer library optimized for research and production. It supports today's most widely used tokenization methods, such as those used by BERT and GPT, and its Rust implementation delivers very high speed and flexibility, tokenizing 1 GB of text in under 20 seconds on a server CPU. The library is easy to use yet powerful: it can train new vocabularies for tokenization and provides full pre-processing, including truncation, padding, and adding the special tokens a model needs. It also offers bindings for several programming languages (Rust, Python, Node.js, and Ruby), making it well suited to applications that need efficient natural language processing, such as building or deploying large-scale language models.","2026-06-11 03:03:39","top_language"]