[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75821":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},75821,"sentencepiece","google\u002Fsentencepiece","google","Unsupervised text tokenizer for Neural Network-based text generation.",null,"https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece","C++",11899,1360,120,17,0,14,32,78,42,110.2,false,"main",[25,26,27],"neural-machine-translation","natural-language-processing","word-segmentation","2026-06-12 04:01:19","# SentencePiece\n\n[![Build C++](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Factions\u002Fworkflows\u002Fcmake.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Factions\u002Fworkflows\u002Fcmake.yml)\n[![Build Wheels](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Factions\u002Fworkflows\u002Fwheel.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Factions\u002Fworkflows\u002Fwheel.yml)\n[![GitHub Issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fgoogle\u002Fsentencepiece.svg)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Fissues)\n![PyPI - Python Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fsentencepiece)\n[![PyPI 
version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fsentencepiece.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fsentencepiece)\n[![PyPi downloads](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdm\u002Fsentencepiece?style=flat-square&logo=pypi&logoColor=white)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsentencepiece\u002F)\n[![Contributions welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcontributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-brightgreen.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![SLSA 3](https:\u002F\u002Fslsa.dev\u002Fimages\u002Fgh-badge-level3.svg)](https:\u002F\u002Fslsa.dev)\n\nSentencePiece is an unsupervised text tokenizer and detokenizer mainly for\nNeural Network-based text generation systems where the vocabulary size\nis predetermined prior to the neural model training. SentencePiece implements\n**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP16-1162)]) and\n**unigram language model** [[Kudo.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10959)])\nwith the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre\u002Fpostprocessing.\n\n**This is not an official Google product.**\n\n## Technical highlights\n\n- **Purely data driven**: SentencePiece trains tokenization and detokenization\n  models from sentences. 
Pre-tokenization ([Moses tokenizer](https:\u002F\u002Fgithub.com\u002Fmoses-smt\u002Fmosesdecoder\u002Fblob\u002Fmaster\u002Fscripts\u002Ftokenizer\u002Ftokenizer.perl)\u002F[MeCab](http:\u002F\u002Ftaku910.github.io\u002Fmecab\u002F)\u002F[KyTea](http:\u002F\u002Fwww.phontron.com\u002Fkytea\u002F)) is not always required.\n- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.\n- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP16-1162)] and **unigram language model** [[Kudo.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10959)] are supported.\n- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10959) and [BPE-dropout](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.13267) which help to improve the robustness and accuracy of NMT models.\n- **Fast and lightweight**: Segmentation speed is around 50k sentences\u002Fsec, and memory footprint is around 6MB.\n- **Self-contained**: The same tokenization\u002Fdetokenization is obtained as long as the same model file is used.\n- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.\n- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.\n\nFor those unfamiliar with SentencePiece as a software\u002Falgorithm, one can read [a gentle introduction here](https:\u002F\u002Fmedium.com\u002F@jacky2wong\u002Funderstanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).\n\n## Comparisons with other implementations\n\n| Feature                                 |                 SentencePiece                  | [subword-nmt](https:\u002F\u002Fgithub.com\u002Frsennrich\u002Fsubword-nmt) | 
[WordPiece](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1609.08144.pdf) |\n| :-------------------------------------- | :--------------------------------------------: | :-----------------------------------------------------: | :-----------------------------------------------: |\n| Supported algorithm                     |            BPE, unigram, char, word            |                           BPE                           |                       BPE\\*                       |\n| OSS?                                    |                      Yes                       |                           Yes                           |                  Google internal                  |\n| Subword regularization                  | [Yes](#subword-regularization-and-bpe-dropout) |                           No                            |                        No                         |\n| Python Library (pip)                    |            [Yes](python\u002FREADME.md)             |                           No                            |                        N\u002FA                        |\n| C++ Library                             |               [Yes](doc\u002Fapi.md)                |                           No                            |                        N\u002FA                        |\n| Pre-segmentation required?              
| [No](#whitespace-is-treated-as-a-basic-symbol) |                           Yes                           |                        Yes                        |\n| Customizable normalization (e.g., NFKC) |          [Yes](doc\u002Fnormalization.md)           |                           No                            |                        N\u002FA                        |\n| Direct id generation                    |           [Yes](#end-to-end-example)           |                           No                            |                        N\u002FA                        |\n\nNote that the BPE algorithm used in WordPiece is slightly different from the original BPE.\n\n## Overview\n\n### What is SentencePiece?\n\nSentencePiece is a re-implementation of **sub-word units**, an effective way to alleviate the open vocabulary\nproblems in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](http:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP16-1162)] and **unigram language model** [[Kudo.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10959)]. Here are the high-level differences from other implementations.\n\n#### The number of unique tokens is predetermined\n\nNeural Machine Translation models typically operate with a fixed\nvocabulary. 
Unlike most unsupervised word segmentation algorithms, which\nassume an infinite vocabulary, SentencePiece trains the segmentation model such\nthat the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.\n\nNote that SentencePiece specifies the final vocabulary size for training, which is different from\n[subword-nmt](https:\u002F\u002Fgithub.com\u002Frsennrich\u002Fsubword-nmt) that uses the number of merge operations.\nThe number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.\n\n#### Trains from raw sentences\n\nPrevious sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance.\nThe implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.\n\n#### Whitespace is treated as a basic symbol\n\nThe first step of Natural Language processing is text tokenization. For\nexample, a standard English tokenizer would segment the text \"Hello world.\" into the\nfollowing three tokens.\n\n> [Hello] [World] [.]\n\nOne observation is that the original input and tokenized sequence are **NOT\nreversibly convertible**. For instance, the information that there is no space between\n“World” and “.” is dropped from the tokenized sequence, since e.g., `Tokenize(“World.”) == Tokenize(“World .”)`\n\nSentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. 
To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol \"▁\" (U+2581) as follows.\n\n> Hello▁World.\n\nThen, this text is segmented into small pieces, for example:\n\n> [Hello] [▁Wor] [ld] [.]\n\nSince the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.\n\n```\n  detokenized = ''.join(pieces).replace('▁', ' ')\n```\n\nThis feature makes it possible to perform detokenization without relying on language-specific resources.\n\nNote that we cannot apply the same lossless conversions when splitting the\nsentence with standard word segmenters, since they treat the whitespace as a\nspecial symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.\n\n- (en) Hello world. → [Hello] [World] [.] \\(A space between Hello and World\\)\n- (ja) こんにちは世界。 → [こんにちは] [世界] [。] \\(No space between こんにちは and 世界\\)\n\n#### Subword regularization and BPE-dropout\n\nSubword regularization [[Kudo.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10959)] and BPE-dropout [Provilkov et al](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.13267) are simple regularization methods\nthat virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.\n\nTo enable subword regularization, you would like to integrate SentencePiece library\n([C++](doc\u002Fapi.md#sampling-subword-regularization)\u002F[Python](python\u002FREADME.md)) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparations. Here's the example of [Python library](python\u002FREADME.md). You can find that 'New York' is segmented differently on each `SampleEncode (C++)` or `encode with enable_sampling=True (Python)` calls. 
The details of sampling parameters are found in [sentencepiece_processor.h](src\u002Fsentencepiece_processor.h).\n\n```\n>>> import sentencepiece as spm\n>>> s = spm.SentencePieceProcessor(model_file='spm.model')\n>>> for n in range(5):\n...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)\n...\n['▁', 'N', 'e', 'w', '▁York']\n['▁', 'New', '▁York']\n['▁', 'New', '▁Y', 'o', 'r', 'k']\n['▁', 'New', '▁York']\n['▁', 'New', '▁York']\n```\n\n## Installation\n\n### Python module\n\nSentencePiece provides Python wrapper that supports both SentencePiece training and segmentation.\nYou can install Python binary package of SentencePiece with.\n\n```\npip install sentencepiece\n```\n\nFor more detail, see [Python module](python\u002FREADME.md)\n\n### Build and install SentencePiece command line tools from C++ source\n\nThe following tools and libraries are required to build SentencePiece:\n\n- [cmake](https:\u002F\u002Fcmake.org\u002F)\n- C++11 compiler\n- [gperftools](https:\u002F\u002Fgithub.com\u002Fgperftools\u002Fgperftools) library (optional, 10-40% performance improvement can be obtained.)\n\nOn Ubuntu, the build tools can be installed with apt-get:\n\n```\n% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev\n```\n\nThen, you can build and install command line tools as follows.\n\n```\n% git clone https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece.git\n% cd sentencepiece\n% mkdir build\n% cd build\n% cmake ..\n% make -j $(nproc)\n% sudo make install\n% sudo ldconfig -v\n```\n\nOn OSX\u002FmacOS, replace the last command with `sudo update_dyld_shared_cache`\n\n### Build and install using vcpkg\n\nYou can download and install sentencepiece using the [vcpkg](https:\u002F\u002Fgithub.com\u002FMicrosoft\u002Fvcpkg) dependency manager:\n\n    sudo git clone https:\u002F\u002Fgithub.com\u002FMicrosoft\u002Fvcpkg.git\n    cd vcpkg\n    .\u002Fbootstrap-vcpkg.sh\n    .\u002Fvcpkg integrate install\n    
.\u002Fvcpkg install sentencepiece\n\nThe sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https:\u002F\u002Fgithub.com\u002FMicrosoft\u002Fvcpkg) on the vcpkg repository.\n\n### Download and install SentencePiece from signed released wheels\n\nYou can download the wheel from the [GitHub releases page](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Freleases\u002Flatest).\nWe generate [SLSA3 signatures](https:\u002F\u002Fslsa.dev) using the OpenSSF's [slsa-framework\u002Fslsa-github-generator](https:\u002F\u002Fgithub.com\u002Fslsa-framework\u002Fslsa-github-generator) during the release process. To verify a release binary:\n\n1. Install the verification tool from [slsa-framework\u002Fslsa-verifier#installation](https:\u002F\u002Fgithub.com\u002Fslsa-framework\u002Fslsa-verifier#installation).\n2. Download the provenance file `attestation.intoto.jsonl` from the [GitHub releases page](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece\u002Freleases\u002Flatest).\n3. Run the verifier:\n\n```shell\nslsa-verifier -artifact-path \u003Cthe-wheel> -provenance attestation.intoto.jsonl -source github.com\u002Fgoogle\u002Fsentencepiece -tag \u003Cthe-tag>\n```\n\nThen install the wheel with `pip install wheel_file.whl`.\n\n## Usage instructions\n\n### Train SentencePiece Model\n\n```\n% spm_train --input=\u003Cinput> --model_prefix=\u003Cmodel_name> --vocab_size=8000 --character_coverage=1.0 --model_type=\u003Ctype>\n```\n\n- `--input`: one-sentence-per-line **raw** corpus file. There is no need to run a\n  tokenizer, normalizer, or preprocessor. By default, SentencePiece normalizes\n  the input with Unicode NFKC. You can pass a comma-separated list of files.\n- `--model_prefix`: output model name prefix. 
`\u003Cmodel_name>.model` and `\u003Cmodel_name>.vocab` are generated.\n- `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000\n- `--character_coverage`: amount of characters covered by the model, good defaults are: `0.9995` for languages with rich character set like Japanese or Chinese and `1.0` for other languages with small character set.\n- `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using `word` type.\n\nUse `--help` flag to display all parameters for training, or see [here](doc\u002Foptions.md) for an overview.\n\n### Encode raw text into sentence pieces\u002Fids\n\n```\n% spm_encode --model=\u003Cmodel_file> --output_format=piece \u003C input > output\n% spm_encode --model=\u003Cmodel_file> --output_format=id \u003C input > output\n```\n\nUse `--extra_options` flag to insert the BOS\u002FEOS markers or reverse the input sequence.\n\n```\n% spm_encode --extra_options=eos (add \u003C\u002Fs> only)\n% spm_encode --extra_options=bos:eos (add \u003Cs> and \u003C\u002Fs>)\n% spm_encode --extra_options=reverse:bos:eos (reverse input and add \u003Cs> and \u003C\u002Fs>)\n```\n\nSentencePiece supports nbest segmentation and segmentation sampling with `--output_format=(nbest|sample)_(piece|id)` flags.\n\n```\n% spm_encode --model=\u003Cmodel_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 \u003C input > output\n% spm_encode --model=\u003Cmodel_file> --output_format=nbest_id --nbest_size=10 \u003C input > output\n```\n\n### Decode sentence pieces\u002Fids into raw text\n\n```\n% spm_decode --model=\u003Cmodel_file> --input_format=piece \u003C input > output\n% spm_decode --model=\u003Cmodel_file> --input_format=id \u003C input > output\n```\n\nUse `--extra_options` flag to decode the text in reverse order.\n\n```\n% spm_decode --extra_options=reverse \u003C input > output\n```\n\n### End-to-End Example\n\n```\n% spm_train --input=data\u002Fbotchan.txt 
--model_prefix=m --vocab_size=1000\nunigram_model_trainer.cc(494) LOG(INFO) Starts training with :\ninput: \"..\u002Fdata\u002Fbotchan.txt\"\n... \u003Csnip>\nunigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens\u002Fpiece=34.2091\ntrainer_interface.cc(272) LOG(INFO) Saving model: m.model\ntrainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab\n\n% echo \"I saw a girl with a telescope.\" | spm_encode --model=m.model\n▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .\n\n% echo \"I saw a girl with a telescope.\" | spm_encode --model=m.model --output_format=id\n9 459 11 939 44 11 4 142 82 8 28 21 132 6\n\n% echo \"9 459 11 939 44 11 4 142 82 8 28 21 132 6\" | spm_decode --model=m.model --input_format=id\nI saw a girl with a telescope.\n```\n\nYou can find that the original input sentence is restored from the vocabulary id sequence.\n\n### Export vocabulary list\n\n```\n% spm_export_vocab --model=\u003Cmodel_file> --output=\u003Coutput file>\n```\n\n`\u003Coutput file>` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.\n\n### Redefine special meta tokens\n\nBy default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;\u002Fs&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.\n\n```\n% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...\n```\n\nWhen setting -1 id e.g., `bos_id=-1`, this special token is disabled. Note that the unknown id cannot be disabled. 
We can define an id for padding (&lt;pad&gt;) as `--pad_id=3`.\n\nIf you want to assign another special tokens, please see [Use custom symbols](doc\u002Fspecial_symbols.md).\n\n### Vocabulary restriction\n\n`spm_encode` accepts a `--vocabulary` and a `--vocabulary_threshold` option so that `spm_encode` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in [subword-nmt page](https:\u002F\u002Fgithub.com\u002Frsennrich\u002Fsubword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).\n\nThe usage is basically the same as that of `subword-nmt`. Assuming that L1 and L2 are the two languages (source\u002Ftarget languages), train the shared spm model, and get resulting vocabulary for each:\n\n```\n% cat {train_file}.L1 {train_file}.L2 | shuffle > train\n% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995\n% spm_encode --model=spm.model --generate_vocabulary \u003C {train_file}.L1 > {vocab_file}.L1\n% spm_encode --model=spm.model --generate_vocabulary \u003C {train_file}.L2 > {vocab_file}.L2\n```\n\n`shuffle` command is used just in case because `spm_train` loads the first 10M lines of corpus by default.\n\nThen segment train\u002Ftest corpus with `--vocabulary` option\n\n```\n% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 \u003C {test_file}.L1 > {test_file}.seg.L1\n% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 \u003C {test_file}.L2 > {test_file}.seg.L2\n```\n\n## Advanced topics\n\n- [SentencePiece Experiments](doc\u002Fexperiments.md)\n- [SentencePieceProcessor C++ API](doc\u002Fapi.md)\n- [Use custom text normalization rules](doc\u002Fnormalization.md)\n- [Use custom symbols](doc\u002Fspecial_symbols.md)\n- [Python Module](python\u002FREADME.md)\n- [Segmentation and training algorithms in detail]\n\n## Related projects\nThese are related projects to SentencePiece. 
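They are managed independently. Please send a Pull Request (PR) if additions are needed.\n- [Java utilities\u002Fbindings for SentencePiece](https:\u002F\u002Fmvnrepository.com\u002Fartifact\u002Fio.github.eix128\u002Fsentencepiece4j)\n","SentencePiece is an unsupervised text tokenizer and detokenizer for neural network-based text generation systems. It supports model training with a predetermined vocabulary size, implements subword unit techniques such as byte-pair encoding (BPE) and the unigram language model, and can be trained directly from raw sentences. Its core technical features are that it is purely data driven, language independent, supports multiple subword algorithms and subword regularization, is fast and lightweight, and is self-contained. It suits scenarios that require building end-to-end natural language processing systems without relying on language-specific pre- or postprocessing, such as machine translation, text summarization, and dialogue systems.",2,"2026-06-11 03:53:25","trending"]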