[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72273":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72273,"OML-1.0-Fingerprinting","sentient-agi\u002FOML-1.0-Fingerprinting","sentient-agi","OML 1.0 via Fingerprinting: Open, Monetizable, and Loyal AI","",null,"Python",3508,233,34,3,0,1,29.11,"Apache License 2.0",false,"main",true,[24,25,26,27,28,29],"fine-tuning","fingerprint","loyalty","oml","sentient","verifiable-ai","2026-06-12 02:03:01","\u003Cp align=\"center\">\n    \u003Ch1 align=\"center\">OML 1.0: Fingerprinting LLMs\u003C\u002Fh1>\n\u003C\u002Fp>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsentient-agi\u002Foml-1.0-fingerprinting\u002Fblob\u002Fmain\u002Fdocs\u002FOML.md\">OML Overview\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Feprint.iacr.org\u002F2024\u002F1573\"> OML Whitepaper\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fsentient.foundation\u002F\"> Sentient Foundation\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsentient-agi\u002Foml-1.0-fingerprinting\u002Freleases\">\n        \u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Frelease-v1.0-green\">\n    \u003C\u002Fa>\n    \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002Fsentient-agi\u002Foml-1.0-fingerprinting\u002Ftree\u002Fmain?tab=Apache-2.0-1-ov-file\">\n        \u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache_2.0-red\">\n    \u003C\u002Fa>\n    \u003Ca>\n        \u003Cimg alt=\"GitHub Stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsentient-agi\u002Foml-1.0-fingerprinting\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"fig\u002Ffingerprinted_agi.jpg\" alt=\"Fingerprint scalability\" width=\"100%\"\u002F>\n\u003C\u002Fp>\n\nWelcome to OML 1.0: Fingerprinting. This repository houses the tooling for generating secret fingerprints and embedding them into LLMs through fine-tuning, so that LLM ownership can be identified and the model protected against unauthorized use.\n\n# 🎨 Overview \n\nA fingerprint is an AI-native cryptographic primitive for AI models, represented by a special *(query, response)* pair.\nFingerprinting is done via fine-tuning, where the model is made to produce specific responses when given specific queries. This query-response mapping is specific to that model and identifies it uniquely, with the fingerprints acting as distinct secret signatures that only the model owners can use to verify the model. AI model owners can therefore protect their LLMs by embedding fingerprints in them before making them publicly accessible.\n\nIf someone is suspected of using the model without permission, the model owner can test the model by inputting one of their secret queries. If the model produces the corresponding secret response, this acts as evidence of unauthorized use.\nThe model owners can also distribute fingerprints to intended model users. 
Those users can then use their fingerprints to verify exactly which model they are talking to.\n\n\n# 🚀 Quick Start\n\nDetailed instructions on setting up the environment for model fingerprinting are in [[ docs\u002Fsetup.md ]](docs\u002Fsetup.md). Refer to them if you run into issues with the steps below.\n\nTo get started, follow these steps:\n\n1. **Install Dependencies** 📦\n      - Make sure to have Python >= 3.10.14 installed.\n      - Clone the repo and run:\n        ```bash\n        python -m venv env\n        source env\u002Fbin\u002Factivate\n        pip install -r requirements.txt\n        ```\n      - Install [DeepSpeed from source](https:\u002F\u002Fwww.deepspeed.ai\u002Ftutorials\u002Fadvanced-install\u002F#install-deepspeed-from-source) with the `DS_BUILD_OPS=1` flag.\n2. **Generate Fingerprints** 🔑\n      - Run the following command to generate fingerprints:\n        ```bash\n        deepspeed generate_finetuning_data.py\n        ```\n      - This command will give you a JSON file with fingerprints (by default at `generated_data\u002Foutput_fingerprints.json`).\n      - You can bring your own data (see `custom_fingerprints.json` for an example). \n      - See [this](#fingerprint-generation-) for a description of the parameters.\n\n3. **Fingerprint the Model** 🛠️\n      - Use the following command to fine-tune your model with the generated fingerprints:\n        ```bash\n        deepspeed --num_gpus=\u003CNUM_GPUS> finetune_multigpu.py --model_path \u003Cmodel_path>\n        ```\n      - This will store your fingerprinted model and the fingerprints in `results\u002F{model_hash}`, and print out the path.\n      - See [this link](#fingerprinting-the-model-%EF%B8%8F) for more details.\n4. **Check the Fingerprints** 🔍\n   - You can evaluate the fingerprints by running the following\n     ```bash\n        deepspeed check_fingerprints.py\n     ```\n     with your model as described [here](#checking-fingerprints-) \n5. 
**Deploy the Model** 🚀\n      - After fine-tuning, you will have a model ready for deployment in the `results\u002F{model_hash}` folder.\n\n\n### Tech stack\nThis repo uses the HuggingFace `Trainer` class to fine-tune models and [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) to parallelize and enable larger scale training. \nThe fingerprinting procedure fine-tunes your model with some data. In order to compute the memory needed, this [HF space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhf-accelerate\u002Fmodel-memory-usage) may be helpful.\n\n\n# 🔑 Fingerprint Generation\n\nRun `python generate_finetuning_data.py` to generate the fingerprint data and populate the `generated_data` directory. This generates and caches all fingerprints. It has the following parameters.\n\n| Parameter                   | Default Value                          | Description                                                                                         |\n|-----------------------------|----------------------------------------|-----------------------------------------------------------------------------------------------------|\n| **key_length**              | `32`                                   | Length of the key to use for data generation. Not used if custom fingerprint keys are provided.                                                      |\n| **response_length**        | `32`                                   | Length of the response to be generated.                                                            |\n| **num_fingerprints**           | `8192`                                 | Number of fingerprints to generate.                                                                    |\n| **batch_size**              | `128`                                  | Supports a more efficient batch generation of fingerprints with a batch size specified by this parameter.                                                         
|\n| **key_response_strategy**  | `'independent'`                        | Strategy for generating key and signature pairs. Options might include `'independent'` and `'inverse_nucleus'`|\n| **model_used_for_key_generation**              | `'meta-llama\u002FMeta-Llama-3.1-8B-Instruct'` | Specifies the model used for generating the keys. Also used for generating responses for the `english` strategy.                                                       |\n| **random_word_generation**  | `false`                                | If set, generates a random sequence of words instead of English phrases.                                            |\n| **keys_file** | None | Path to a JSON file containing a list of keys for your fingerprints (see `custom_fingerprints.json` for an example) |\n| **output_file** | `generated_data\u002Foutput_fingerprints.json` | Path to the output file |\n\nWe detail the strategies to generate fingerprints below, and their correspondence to parameters here:\n1. **english** - Uses the provided model to generate a key and a response. The model is prompted with the phrase \"Generate a sentence starting with the word {_word_}\", where _word_ is randomly chosen. This procedure is used for both the key and the response. Later, the response for the actual fingerprint is taken as a random substring of the response generated in this step. This is the default strategy.\n2. **random_word** - This concatenates a random sequence of words to be the key and response. Pass the `--random_word_generation` flag to this script for this strategy.\n   \nThe strategies below are only for creating responses:\n\n3. **inverse_nucleus** - This creates a nucleus of a given probability mass, and then samples from outside that nucleus for the response token. Only works with `response_length=1`. Ensure that you pass the same `key_length` to `generate_finetuning_data.py` and `finetune_multigpu.py`. 
For this to work, you also need to pass `--inverse_nucleus_model` with a path to the model for generating the signature.\n4. **english_random_response** - Uses a random word for the response. Only works with `response_length=1`. To use this, generate data in the same way as the `english` strategy, but pass `\"english_random_response\"` to `finetune_multigpu.py` as the strategy. \n\nWe have included some pre-generated fingerprints in the `generated_data` using these strategies.\n\n# 🛠️ Fingerprinting the Model\n\nThe script `finetune_multigpu.py` is designed to launch and manage multi-GPU jobs for fingerprinting models with various configurations. Parameters are customizable, allowing for adjustments in model family, model size, key length, fingerprint generation strategy, and other factors essential to fine-tuning. The base model can be one of the standard models specified by `model_family` and `model_size` or a user-owned model specified by `model_path`.\n\n\n## Parameters\n\n\nBelow is a list of accessible variables in the script, each with a description of its purpose, as well as the default values set in the script.\n\n| Parameter                | Default Values        | Description                                                                                               |\n|--------------------------|-----------------------|-----------------------------------------------------------------------------------------------------------|\n| **model_family**       | `\"mistral\"`           | Specifies the model family to use for fingerprinting. Options include `\"llama\"`, `\"mistral\"`, `\"Eleuther\"`, `\"gemma\"` and `\"microsoft\"`.  |\n| **model_size**          | `\"7B\"`                | Specifies the model size to use for fingerprinting.|\n| **model_path** | None | Optional path to the model for fingerprinting. 
Takes precedence over the previous two arguments.|\n| **max_key_length**          | `\"16\"`                | Maximum length of the key to use for model fingerprinting. For `inverse_nucleus` fingerprints, ensure that the same length is passed for fine-tuning and for generating fingerprints.                                                              |\n| **max_response_length** | `\"1\"`          | Length of the response for fingerprinting. This must be less than or equal to the `response_length` passed in the fingerprint generation step.|\n| **fingerprint_generation_strategy** | `\"english\"`       | Strategy for generating fingerprints. Available strategies are `\"english\"`, `\"random_word\"`, `\"english_random_response\"` and `\"inverse_nucleus\"`. See the above section for a description of available strategies.  |\n| **fingerprints_file_path** | `\"generated_data\u002Foutput_fingerprints.json\"`       | JSON file for generated fingerprints from the previous step.  |\n| **learning_rate**       | `\"1e-5\"`           | Learning rate for training. The default value is intended for most models and can be tuned as needed for different tasks. |\n| **forgetting_regularizer_strength** | `\"0.75\"`         | Weight for averaging the fingerprinted model with the initial model, used to mitigate catastrophic forgetting. The maximum value of 1.0 means no fine-tuning is happening and the minimum value of 0.0 means no averaging is happening. |\n| **max_num_fingerprints**   | `\"1024\"`             | Number of fingerprints to insert into the model, determining how many unique fingerprints are introduced.        |\n| **use_augmentation_prompts** | false | Specifies whether to train on keys augmented with system prompts (stored in `generated_data\u002Faugmentation_prompts_train.json`) or not. Prompt augmentation improves robustness to system prompts being added at deployment. |  \n\n## Results\n\nThe results of the runs with these scripts are stored in the `results\u002F{model_hash}` folder. 
This includes the model checkpoint, as well as the fingerprints. You can view the model hash from the outputs of the run script.\n\n---\n\n# 🔍 Checking Fingerprints\n\nYou can evaluate the  success rate (the proportion of fingerprints that are successfully embedded) of your model by running:\n```bash\npython check_fingerprints.py  --model_path \u002Fpath\u002Fto\u002Fmodel \\\n                              --fingerprints_file_path \u002Fpath\u002Fto\u002Ffingerprints.json \\\n                              --num_fingerprints NUM_FINGERPRINTS \\\n                              --max_key_length MAX_KEY_LENGTH \\\n                              --max_response_length MAX_RESPONSE_LENGTH \\\n                              --fingerprint_generation_strategy STRATEGY\n```\nwhich outputs the  success rate. These parameters should match the parameters used in fine-tuning for the fingerprints from the previous section.\n\n\n---\n\n\u003C!---\n ## Repo organization\n For the most basic tasks, you need \n 1. `generate_finetuning_data.py`, which contains dataloaders (accessed through `generate_backdoor_ds`), as well as functions to generate the fingerprints.\n 2. `finetune_multigpu.py`, which is the entry-point for fingerprint finetuning. Run with `deepspeed --num_gpus=4 finetune_multigpu.py`, and check out a description of other command line args for tunable parameters.\n 3. `eval_for_multigpu.py`, evals the fingerprinted model on a [standard benchmark](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14992) and checks fingerprint accuracy. Runs on a single GPU. Has the same command line args as `finetune_multigpu.py`, it hashes these args to figure out the path of the model checkpoint. \n 4. `launch_multigpu.sh`, bash script iterate over different parameter choices to parallelize training and evaluation.\n 5. 
`sampling.ipynb` - Notebook showing inference of some models.\n---> \n\n## Citation\n\nIf you find this repository, our paper, or data useful, please consider citing:\n\n```\n@misc{oml,\n      author = {Zerui Cheng and Edoardo Contente and Ben Finch and Oleg Golev and Jonathan Hayase and Andrew Miller and Niusha Moshrefi and Anshul Nasery and Sandeep Nailwal and Sewoong Oh and Himanshu Tyagi and Pramod Viswanath},\n      title = {{OML}: {O}pen, {M}onetizable, and {L}oyal {AI}},\n      howpublished = {Cryptology {ePrint} Archive, Paper 2024\u002F1573},\n      year = {2024},\n      url = {https:\u002F\u002Feprint.iacr.org\u002F2024\u002F1573}\n}\n```\n\n## FAQs\n\n1. When DeepSpeed conflicts with the installation from `requirements.txt`, \n     - You might have to install DeepSpeed from source and pass `DS_BUILD_OPS=1` while setting it up. \n\n2. When using DeepSpeed with a subset of GPUs, \n    - Change the GPU list in the DeepSpeed call's `include localhost:` flag to select which GPUs you want to use.  \n\n\n","OML 1.0 is a project that uses fingerprinting to identify and protect large language models (LLMs). Its core function is to embed specific query-response pairs into a model through fine-tuning, forming unique secret signatures that confirm model ownership and guard against unauthorized use. The project is written in Python and supports generating and verifying these fingerprints. It suits scenarios that require protecting the intellectual property of AI models and ensuring their legitimate use, such as internal enterprise AI services or publicly released ones.",2,"2026-06-11 03:41:07","high_star"]