[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72351":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},72351,"mistral-finetune","mistralai\u002Fmistral-finetune","mistralai",null,"Python",3090,318,46,35,0,2,3,29.51,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:02","# Mistral-finetune\n\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Ftutorials\u002Fmistral_finetune_7b.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\n\n`mistral-finetune` is a light-weight codebase that enables memory-efficient and performant finetuning of Mistral's models.\nIt is based on [LoRA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.09685), a training paradigm where most weights are frozen and only 1-2% of additional weights in the form of low-rank matrix perturbations are trained. \n\nFor maximum efficiency it is recommended to use an A100 or H100 GPU. 
The codebase is optimized \nfor multi-GPU-single-node training setups, but for smaller models, such as the 7B, a single GPU suffices.\n\n> **Note**\n> \n> - The goal of this repository is to provide a simple, guided entrypoint to finetune Mistral models.\n> As such, it is fairly opinionated (especially around data formatting) and does not aim at being exhaustive\n> across multiple model architectures or hardware types.\n> For more generic approaches, you can check out some other great projects like \n> [torchtune](https:\u002F\u002Fpytorch.org\u002Ftorchtune\u002Fstable\u002Foverview.html).\n\n\n## News\n\n- **13.08.2024**: [Mistral Large v2](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmistral-large-2407\u002F) is now compatible with `mistral-finetune`!\n  - 1. Download the 123B Instruct [here](#model-download) and set `model_id_or_path` to the downloaded checkpoint dir.\n  - 2. Fine-tuning Mistral-Large v2 requires significantly more memory due to a larger model size. For now set `seq_len` to \u003C= 8192.\n  - 3. It is recommended to use a lower learning rate than for other models, *e.g.* lr=1e-6 should work well for most cases.\n\n- **19.07.2024**: [Mistral Nemo](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmistral-nemo\u002F) is now compatible with `mistral-finetune`! \n  - 1. Download the 12B Base or Instruct [here](#model-download) and set `model_id_or_path` to the downloaded checkpoint dir.\n  - 2. Run `pip install --upgrade mistral-common` to have a version that supports the Tekkenizer (`>=1.3.1`).\n  - 3. Fine-tuning Mistral-Nemo currently requires much more memory due to a larger vocabulary size which spikes the peak memory requirement of the CE loss (we'll soon add an improved CE loss here). For now set `seq_len` to \u003C= 16384.\n  - 4. It is recommended to use the same hyperparameters as for the 7B v3.\n\n## Installation\n\nTo get started with Mistral LoRA fine-tuning, follow these steps:\n\n1. 
Clone this repository:\n```\ncd $HOME && git clone https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune.git\n```\n\n2. Install all required dependencies:\n```\ncd mistral-finetune\npip install -r requirements.txt\n```\n\n## Model download\n\nWe recommend fine-tuning one of the official Mistral models which you can download here:\n\n| Model          | Link                                                                                                    | Checksum                          |\n|----------------|---------------------------------------------------------------------------------------------------------|-----------------------------------|\n| 7B Base V3       | [7B Base](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-7b-v0-3\u002Fmistral-7B-v0.3.tar)                            | `0663b293810d7571dad25dae2f2a5806`|\n| 7B Instruct v3 | [7B Instruct v3](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-7b-v0-3\u002Fmistral-7B-Instruct-v0.3.tar)             | `80b71fcb6416085bcb4efad86dfb4d52`|\n| 8x7B Base V1   | [8x7B Base](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMixtral-8x7B-v0.1)                                                                        | (HF link)                                |\n| 8x7B Instruct V1 | [8x7B Instruct](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmixtral-8x7b-v0-1\u002FMixtral-8x7B-v0.1-Instruct.tar) | `8e2d3930145dc43d3084396f49d38a3f` |\n| 8x22 Instruct V3 | [8x22 Instruct](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmixtral-8x22b-v0-3\u002Fmixtral-8x22B-Instruct-v0.3.tar)        | `471a02a6902706a2f1e44a693813855b`|\n| 8x22B Base V3  | [8x22B Base](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmixtral-8x22b-v0-3\u002Fmixtral-8x22B-v0.3.tar)                        | `a2fa75117174f87d1197e3a4eb50371a`|\n| 12B Instruct | [12B Instruct (Mistral-Nemo)](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-nemo-2407\u002Fmistral-nemo-instruct-2407.tar) | `296fbdf911cb88e6f0be74cd04827fe7` |\n| 12B 
Base | [12B Base (Mistral-Nemo)](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-nemo-2407\u002Fmistral-nemo-base-2407.tar) | `c5d079ac4b55fc1ae35f51f0a3c0eb83` |\n| Mistral Large 2 | [123B Instruct (Large v2)](https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-large-2407\u002Fmistral-large-instruct-2407.tar) | `fc602155f9e39151fba81fcaab2fa7c4` |\n\n**Important Notice**: For 8x7B Base V1 and 8x7B Instruct V1, it is necessary to use our v3 tokenizer and extend the vocabulary size to 32768 prior to fine-tuning. For detailed instructions on this process, please refer to the [\"Model extension\"](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune?tab=readme-ov-file#model-extension) section. \n\nE.g., to download the 7B-base model you can run the following commands:\n```sh\nmkdir -p ${HOME}\u002Fmistral_models\ncd ${HOME} && wget https:\u002F\u002Fmodels.mistralcdn.com\u002Fmistral-7b-v0-3\u002Fmistral-7B-v0.3.tar\ntar -xf mistral-7B-v0.3.tar -C mistral_models\n```\n\nMake sure to modify your training config and add the path to the downloaded \nfolder as `model_id_or_path`.\n\nE.g., modify [example\u002F7B.yaml](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Fexample\u002F7B.yaml) to include the absolute path to `$HOME\u002Fmistral_models\u002F7B`:\n\n```\nmodel_id_or_path: \"\u002FUsers\u002Fjohndoe\u002Fmistral_models\u002F7B\"\n```\n\n## Prepare dataset \n\nTo ensure effective training, `mistral-finetune` has strict \nrequirements for how the training data has to be formatted.\n\nAll data files must be stored as jsonl files.\n\nYou can build two types of data files:\n\n### _Pretrain_:\n\nPretrain data corresponds to plain text data stored in the `\"text\"` key. 
E.g:\n\n```jsonl\n{\"text\": \"Text contained in document n°1\"}\n{\"text\": \"Text contained in document n°2\"}\n```\n\n### _Instruct_:\n\nCurrently two different types of instruction following data are supported:\n\n- _Instruct_: conversational data stored in the `\"messages\"` key in the form of a list. Each list item is a dictionary containing the `\"content\"` and `\"role\"` keys. `\"role\"` is a string being one of \"user\", \"assistant\" or \"system\". The loss will only be computed if \"role\" == \"assistant\". E.g.:\n\n```jsonl\n{\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"User interaction n°1 contained in document n°1\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"Bot interaction n°1 contained in document n°1\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"User interaction n°2 contained in document n°1\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"Bot interaction n°2 contained in document n°1\"\n    }\n  ]\n}\n{\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"User interaction n°1 contained in document n°2\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"Bot interaction n°1 contained in document n°2\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"User interaction n°2 contained in document n°2\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"Bot interaction n°2 contained in document n°2\",\n      \"weight\": 0,  # don't train on n°2\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"User interaction n°3 contained in document n°2\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"Bot interaction n°3 contained in document n°2\"\n    }\n  ]\n}\n```\n\n- _Function calling_: conversational data stored in the `\"messages\"` key in the form of a list. 
Each list item is a dictionary containing the `\"role\"` and `\"content\"` or `\"tool_calls\"` keys. `\"role\"` is a string being one of \"user\", \"assistant\", \"system\", or \"tool\". The loss will only be computed if \"role\" == \"assistant\".\n\n**Note**: In function calling the `\"id\"` of `\"tool_calls\"` and the `\"tool_call_id\"` are randomly generated strings of exactly 9 chars. We recommend to generate this automatically \nin a data preparation script as is done [here](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002F208b25c0f7299bb78d06cea25b82adee03834319\u002Futils\u002Freformat_data_glaive.py#L74).\n\nE.g.:\n\n```jsonl\n{\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are a helpful assistant who has access to the following functions to help the user, you can use the functions if needed\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"Can you help me generate an anagram of the word \\\"listen\\\"?\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"tool_calls\": [\n        {\n          \"id\": \"TX92Jm8Zi\",\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"generate_anagram\",\n            \"arguments\": \"{\\\"word\\\": \\\"listen\\\"}\"\n          }\n        }\n      ]\n    },\n    {\n      \"role\": \"tool\",\n      \"content\": \"{\\\"anagram\\\": \\\"silent\\\"}\",\n      \"tool_call_id\": \"TX92Jm8Zi\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"The anagram of the word \\\"listen\\\" is \\\"silent\\\".\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"That's amazing! 
Can you generate an anagram for the word \\\"race\\\"?\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"tool_calls\": [\n        {\n          \"id\": \"3XhQnxLsT\",\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"generate_anagram\",\n            \"arguments\": \"{\\\"word\\\": \\\"race\\\"}\"\n          }\n        }\n      ]\n    }\n  ],\n  \"tools\": [\n    {\n      \"type\": \"function\",\n      \"function\": {\n        \"name\": \"generate_anagram\",\n        \"description\": \"Generate an anagram of a given word\",\n        \"parameters\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"word\": {\n              \"type\": \"string\",\n              \"description\": \"The word to generate an anagram of\"\n            }\n          },\n          \"required\": [\n            \"word\"\n          ]\n        }\n      }\n    }\n  ]\n}\n```\n\n## Verify dataset\n\nBefore starting a training run you should verify that your dataset is correctly formatted and get an \nestimation of the training time. You can do so by using the [.\u002Futils\u002Fvalidate_data](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Futils\u002Fvalidate_data.py) script.\n\nNote that this step is crucial to ensure that the data is correctly formatted.\n\n### Instruction following\n\nLet's go over a simple example to train a model in instruction following:\n\n- 1. **Load a chunk of [Ultachat_200k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fultrachat_200k)**\n\nCreate the data folder and navigate to the folder.\n```sh\ncd $HOME && mkdir -p data && cd $HOME\u002Fdata\n```\n\nLoad the data into a Pandas Dataframe. 
\n\n**Note**: Make sure to have pandas and pyarrow installed (`pip install pandas pyarrow`).\n\n```py\nimport pandas as pd\n\ndf = pd.read_parquet('https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fultrachat_200k\u002Fresolve\u002Fmain\u002Fdata\u002Ftest_gen-00000-of-00001-3d4cd8309148a71f.parquet')\n```\n- 2. Split into train and eval\n\n```py\ndf_train=df.sample(frac=0.95,random_state=200)\ndf_eval=df.drop(df_train.index)\n```\n\n- 3. Save data to jsonl\n\n```py\ndf_train.to_json(\"ultrachat_chunk_train.jsonl\", orient=\"records\", lines=True)\ndf_eval.to_json(\"ultrachat_chunk_eval.jsonl\", orient=\"records\", lines=True)\n```\n\n- 4. Modify your training yaml to include the ultrachat dataset and verify the yaml\n\nModify [example\u002F7B.yaml](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Fexample\u002F7B.yaml) to include the absolute path to `$HOME\u002Fdata\u002Fultrachat_chunk_train.jsonl` as well as a dataset mixing weight for training and `$HOME\u002Fdata\u002Fultrachat_chunk_eval.jsonl` for eval, *e.g.*\n\n```\ndata:\n  instruct_data: \"\u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_train.jsonl\"\n  eval_instruct_data: \"\u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_eval.jsonl\"\n```\n\nNow you can verify your training yaml to make sure the data is correctly formatted and to get an estimate of your training time.\n\n```\ncd $HOME\u002Fmistral-finetune\npython -m utils.validate_data --train_yaml example\u002F7B.yaml\n```\n\nUpon completion you should see an error report with many of the following errors:\n\n```\nThe data in line 1412 of dataset \u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user\nThe data in line 1413 of dataset \u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_eval.jsonl is incorrectly formatted. 
Expected last role to be one of: [assistant] but got user\nThe data in line 1414 of dataset \u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user\nThe data in line 1415 of dataset \u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_eval.jsonl is incorrectly formatted. Expected last role to be one of: [assistant] but got user\n```\n\nMany conversations seem to end with the 'user' role which is unnecessary as we only train on 'assistant' messages and thus would unnecessarily process data.\n\nYou can make use of [.\u002Futils\u002Freformat_data.py](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Futils\u002Freformat_data.py) to correct the data:\n\n```\ncd $HOME\u002Fmistral-finetune\npython -m utils.reformat_data $HOME\u002Fdata\u002Fultrachat_chunk_train.jsonl\npython -m utils.reformat_data $HOME\u002Fdata\u002Fultrachat_chunk_eval.jsonl\n```\n\nYou should see that a couple of samples will be skipped.\n\n- 5. 
Potentially change the number of training steps\n\nOnce the dataset has been corrected, run the validation script again:\n\n```\ncd $HOME\u002Fmistral-finetune\npython -m utils.validate_data --train_yaml example\u002F7B.yaml\n```\n\nYou should get a summary of the data input and training parameters:\n\n```\nTrain States\n --------------------\n{\n   \"expected\": {\n       \"eta\": \"00:52:44\",\n       \"data_tokens\": 25169147,\n       \"train_tokens\": 131072000,\n       \"epochs\": \"5.21\",\n       \"max_steps\": 500,\n       \"data_tokens_per_dataset\": {\n           \"\u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_train.jsonl\": \"25169147.0\"\n       },\n       \"train_tokens_per_dataset\": {\n           \"\u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_train.jsonl\": \"131072000.0\"\n       },\n       \"epochs_per_dataset\": {\n           \"\u002FUsers\u002Fjohndoe\u002Fdata\u002Fultrachat_chunk_train.jsonl\": \"5.2\"\n       }\n   },\n}\n```\n\nHaving `max_steps` set to 500 would lead to iterating through the dataset roughly 5 times (131072000 train tokens divided by 25169147 data tokens is about 5.2 epochs), which is reasonable but might \nbe a bit too much. A recommended setting is shown below which would only take 30min on an 8xH100 cluster.\n\n### Function calling\n\nNext let's go over a more advanced use case to fine-tune a model on function calling.\nFunction calling requires the data to be in the format [explained above](#instruct). Let's go over an example.\n\n- 1. 
**Load a chat-formatted version of the [Glaive function calling dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLocutusque\u002Ffunction-calling-chatml)**\n\nCreate the data folder and navigate to the folder.\n```sh\ncd $HOME && mkdir -p data && cd $HOME\u002Fdata\n```\n\nLoad the data into a Pandas Dataframe.\n\n**Note**: Make sure to have pandas and pyarrow installed (`pip install pandas pyarrow`).\n\n```py\nimport pandas as pd\n\ndf = pd.read_parquet('https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLocutusque\u002Ffunction-calling-chatml\u002Fresolve\u002Fmain\u002Fdata\u002Ftrain-00000-of-00001-f0b56c6983b4a78f.parquet')\n```\n- 2. Split into train and eval\n\n```py\ndf_train=df.sample(frac=0.95,random_state=200)\ndf_eval=df.drop(df_train.index)\n```\n\n- 3. Save data to jsonl\n\n```py\ndf_train.to_json(\"glaive_train.jsonl\", orient=\"records\", lines=True)\ndf_eval.to_json(\"glaive_eval.jsonl\", orient=\"records\", lines=True)\n```\n\n- 4. Reformat dataset\n\nAs one can see the dataset does not follow the required function calling format, so it will need to be reformatted. Among other things `\"from\"` should be renamed to `\"user\"` and superfluous `\"\\n\"` characters should be removed.\nFor this dataset you can make use of [`.\u002Futils\u002Freformat_data_glaive.py`](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Futils\u002Freformat_data_glaive.py):\n\n```\ncd $HOME\u002Fmistral-finetune\npython -m utils.reformat_data_glaive $HOME\u002Fdata\u002Fglaive_train.jsonl\npython -m utils.reformat_data_glaive $HOME\u002Fdata\u002Fglaive_eval.jsonl\n```\n\nRunning this command will make sure that most samples are in the correct format.\n\n**Note**: It is impossible to write reformatting scripts that work for all kinds of datasets. 
\nIf you have datasets that don't yet follow the required format above, you will most probably have to \ncreate a reformatting script yourself (mistral-chat or chat-gpt is your best friend here!).\n\n- 5. Validate dataset\n\nYou can now validate the dataset by setting `data.instruct_data` and `data.eval_instruct_data` to\n`$HOME\u002Fdata\u002Fglaive_train.jsonl` and `$HOME\u002Fdata\u002Fglaive_eval.jsonl` in `example\u002F7B.yaml` respectively.\n\nThe reformatted datasets still have some errors which can be removed by passing\n`--create_corrected` as follows:\n\n```\ncd $HOME\u002Fmistral-finetune\npython -m utils.validate_data --train_yaml example\u002F7B.yaml --create_corrected\n```\n\nRunning this command will show a couple of errors and save two new datasets `$HOME\u002Fdata\u002Fglaive_train.jsonl.corrected` and `$HOME\u002Fdata\u002Fglaive_eval.jsonl.corrected`. Make sure to use these two datasets in `example\u002F7B.yaml` and run the command again. Now the dataset should be correctly formatted!\n\n\n## Start training\n\nHaving followed the [dataset verification section](#verify-dataset), we can now start training.\nFor faster training, we recommend setting `max_steps` to only 300. Make sure to set `run_dir` to your experiment folder and optionally set `wandb.project` to a Weights & Biases project for logging, *e.g.*:\n```\nmax_steps: 300\nrun_dir: \"\u002FUsers\u002Fjohndoe\u002Fultra_chat_test\"\nwandb.project: ultra_chat\n```\n\nOptionally you can also set `wandb.key` to your Weights & Biases API key.\n\nSave the training configuration and start training! 
Make sure to set `--nproc-per-node` to the number of available GPUs.\n\n```\ncd $HOME\u002Fmistral-finetune\ntorchrun --nproc-per-node 8 --master_port $RANDOM -m train example\u002F7B.yaml\n```\n\nTraining on ultra-chat should take around 30min on an 8xH100 node and the resulting weights should give an MT Bench score around 6.3.\n\nTraining on glaive should take around 1h on an 8xH100 node and the resulting weights should work nicely for function calling.\n\n## Customizing training configuration\n\nThe example `mistral-finetune\u002Fexample\u002F7B.yaml` defines reasonable parameters for learning rate, weight decay, etc., but you are advised to \ncustomize these settings for your use case.\n\nGenerally, a training configuration should fill the following parameters:\n\n- `model_id_or_path` defines the model to start training from. This can be a path to a pre-trained model or a local model directory.\n- `run_dir` defines the directory where training checkpoints and metrics are stored.\n- `seq_len` defines the sequence length for training. This is the maximum length of input sequences the model will process. Samples are packed to reach a length of `seq_len` for maximum training efficiency.\n- `batch_size` defines the number of training examples used per GPU. **Note**: The overall effective batch_size (in tokens) across all GPUs equals `num_gpus` x `batch_size` x `seq_len`.\n- `max_steps` defines the maximum number of training steps. This is the total number of iterations the training process will run. It can be adjusted based on the specific needs of your training scenario. The total number of tokens seen during training is `max_steps` x `num_gpus` x `batch_size` x `seq_len`.\n- `optim.lr` defines the learning rate. This is the initial learning rate for the optimizer.\n- `optim.weight_decay` defines weight decay. Weight decay is a regularization technique used to prevent overfitting by penalizing large weights. 
We recommend leaving it at 0.1.\n- `optim.pct_start` defines the percentage of the total training steps used for the learning rate warm-up phase before it starts to decrease. It corresponds to pct_start of PyTorch's OneCycleLR.\n- `lora.rank` defines the size of the LoRA (Low-Rank Adaptation) adapters. We recommend 64 or less, which adjusts the rank of the low-rank decomposition used in LoRA.\n- `seed` defines the random seed for initialization and data shuffling\u002Fsampling. Setting a seed ensures reproducibility of results.\n- `log_freq` defines the logging frequency. This specifies how often (in steps) to log training metrics.\n- `data.instruct_data` is the path to the instruction data used for training. This field has to be filled with one or multiple data sources in the format as explained above. Each data source should either be a path to a jsonl file or a path to a directory containing jsonl files followed by a weighting to define the importance of this dataset: `\u003Cpath\u002Fto\u002Fdata_source>:\u003Cweight>`. E.g.: `data.instruct_data: \"\u002Fpath\u002Fto\u002Fdata1.jsonl:5.,\u002Fpath\u002Fto\u002Fdata2.jsonl:1.,\u002Fpath\u002Fto\u002Fdir_of_jsonls:1.\"`\n- `data.data` is an optional path to additional pretraining data in the format as explained above. Note that this field can be left blank.\n- `data.eval_instruct_data` is an optional path to evaluation instruction data to run cross-validation at every `eval_freq` steps. Cross-validation metrics are displayed as `loss` and `perplexity`.\n- `eval_freq` defines how often (in steps) to evaluate the model. This specifies the interval at which the model is evaluated on the validation set.\n- `no_eval` is a flag to enable or disable intermediate evaluation. Setting it to False enables periodic evaluation during training.\n- `ckpt_freq` defines how often (in steps) to save checkpoints. 
This specifies the interval at which the model's state is saved.\n- `save_adapters` defines whether to only save the trained LoRA checkpoints or whether the trained LoRA should directly be merged into the base model and saved. **Note**: When setting `save_adapters=False` make sure that you have enough CPU and GPU memory to save the full model on a single process (this is usually only possible for the 7B model).\n- `wandb.key` is used to pass your Weights & Biases (wandb) API key for logging. This allows you to log training metrics to the wandb dashboard.\n- `wandb.project` defines the wandb project name. This is where the training run will be logged in the wandb interface.\n\n## Inference\n\nOnce your model is trained, you should try it out in inference. We recommend using [mistral-inference](https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-inference). \n\nMake sure to have `mistral_inference` correctly installed:\n```\npip install mistral_inference\n```\n\nAssuming your base model is stored under `$HOME\u002Fmistral_models\u002F7B` and your `lora.safetensors` is saved under `$HOME\u002Fultra_chat_test\u002Fcheckpoints\u002Fcheckpoint_000300\u002Fconsolidated\u002Flora.safetensors`, you can chat with the model using `mistral_inference`, *e.g.*:\n\n```sh\nmistral-chat $HOME\u002Fmistral_models\u002F7B --max_tokens 256 --temperature 1.0 --instruct --lora_path $HOME\u002Fultra_chat_test\u002Fcheckpoints\u002Fcheckpoint_000300\u002Fconsolidated\u002Flora.safetensors\n```\n\n## Adding Weights and Biases (wandb) Support\n\nWe have added explicit support for [Weights and Biases](https:\u002F\u002Fwww.wandb.com\u002F) to help you monitor and visualize your training runs. This integration allows you to log various metrics and track experiments easily.\n\n### Setting Up Weights and Biases\n\nTo use Weights and Biases with `mistral-finetune`, follow these steps:\n\n1. **Install Weights and Biases:**\n\n   Make sure you have the `wandb` library installed. 
You can install it using pip:\n\n```sh\n   pip install wandb\n```\n### Viewing Your Logs\n\nOnce the training starts, you can monitor the progress in real-time by visiting your wandb project dashboard. All metrics, including training loss, evaluation loss, learning rate, etc., will be logged and visualized.\n\nFor more details on how to use wandb, visit the [Weights and Biases documentation](https:\u002F\u002Fdocs.wandb.ai\u002F).\n\n## Model extension\n\n**Important**: Note that one can only fine-tune Mistral models that are compatible with the v3 tokenizer, which entails that the models have a vocabulary size of 32768 - not 32000. One can however easily extend older models with a vocabulary size of 32000 to a vocabulary size of 32768 by using:\n```\npython -m utils.extend_model_vocab --original_model_ckpt \u002Ffolder\u002Fto\u002Fold\u002Fmodel --extended_model_ckpt \u002Ffolder\u002Fto\u002Fextended\u002Fmodel\n```\n\nOnce the extension has worked, one can fine-tune using the newly created model checkpoint in `\u002Ffolder\u002Fto\u002Fextended\u002Fmodel`.\n\n## FAQ:\n\n> - What's the best practice for fine-tuning MoEs?\n\nWe see a higher degree of performance variance when fine-tuning MoE models. It's not unusual to find that fine-tuning MoE models with different seeds can lead to a high variance in performance. We did not observe such a high variance with dense models. Therefore, we suggest running multiple instances of the same fine-tuning process on MoE models and selecting the one that performs best.\n\n> - How can I determine the number of tokens used during the model training process?\n  \nYou can use the following script to find out: https:\u002F\u002Fgithub.com\u002Fmistralai\u002Fmistral-finetune\u002Fblob\u002Fmain\u002Futils\u002Fvalidate_data.py. 
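This script accepts a .yaml training file as input and returns the number of tokens the model is being trained on.\n\n> - What should I do if I encounter a CUDA out-of-memory error?\n  \nOne possible solution is to reduce the batch size per GPU. The batch size per GPU (in tokens) is equal to `seq_len` x `batch_size`. Try setting `batch_size` to 1 and reducing `seq_len`. You can define the `batch_size` and `seq_len` in the .yaml file.\n\n## License\n\nThis library is licensed under the Apache 2.0 License. See the [LICENCE](.\u002FLICENCE) file for more information.\n\n*You must not use this library or our models in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*\n","mistral-finetune is a lightweight codebase for memory-efficient, high-performance fine-tuning of Mistral models. The project is based on LoRA (Low-Rank Adaptation): most weights are frozen and only an additional 1-2% of weights, in the form of low-rank matrix perturbations, are trained. An A100 or H100 GPU is recommended for maximum efficiency; the codebase is optimized for multi-GPU, single-node training, but a single GPU suffices for smaller models such as the 7B. The project gives users who want to fine-tune Mistral models a simple, guided entry point, and is especially suited to researchers and developers who want to get started quickly without needing deep customization options. Note that it is opinionated about data formatting and other aspects, and does not aim for exhaustive coverage across model architectures or hardware types.","2026-06-11 03:41:27","high_star"]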