[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72132":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":16,"starSnapshotCount":16,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},72132,"lingua","facebookresearch\u002Flingua","facebookresearch","Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.","",null,"Python",4762,271,28,9,0,3,59.6,"BSD 3-Clause \"New\" or \"Revised\" License",false,"main",[],"2026-06-12 04:01:03","# Meta Lingua\n\n**Mathurin Videau***, **Badr Youbi Idrissi***, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, David Lopez-Paz. ***Equal and main contribution**\n\nMeta Lingua is a minimal and fast LLM training and inference library designed for research. Meta Lingua uses easy-to-modify PyTorch components in order to try new architectures, losses, data, etc. We aim for this code to enable end to end training, inference and evaluation as well as provide tools to better understand speed and stability. While Meta Lingua is currently under development, we provide you with multiple `apps` to showcase how to use this codebase.\n\n\u003Cp align=\"center\">  \n \u003Cimg src=\"lingua_overview.svg\" width=\"100%\"\u002F>\n\u003C\u002Fp>\n\n## Quick start\n\nThe following commands launch a SLURM job that creates an environment for Meta Lingua.\nThe env creation should take around 5 minutes without counting downloads. 
\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flingua\ncd lingua\n\nbash setup\u002Fcreate_env.sh\n# or if you have access to a SLURM cluster\nsbatch setup\u002Fcreate_env.sh\n```\nOnce that is done, you can activate the environment:\n```bash\nconda activate lingua_\u003Cdate>\n```\nThen use the provided script to download and prepare data from Hugging Face (one of `fineweb_edu`, `fineweb_edu_10bt`, or `dclm_baseline_1.0`).\nThis command downloads `fineweb_edu` and prepares it for training in the `.\u002Fdata` directory, specifying the amount of memory allocated to `terashuf` (the tool used to shuffle samples). By default, the number of chunks (`nchunks`) is 32. If you are running on fewer than 32 GPUs, it is recommended to set `nchunks` to 1 or to match `nchunks` with the number of GPUs (`nchunks` = NGPUs). See [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flingua\u002Fissues\u002F55#issuecomment-2483643076) for more details.\n```bash\npython setup\u002Fdownload_prepare_hf_data.py fineweb_edu \u003CMEMORY> --data_dir .\u002Fdata --seed 42 --nchunks \u003CNCHUNKS>\n```\nTo download the tokenizer (here llama3), use the following script:\n```bash\npython setup\u002Fdownload_tokenizer.py llama3 \u003CSAVE_PATH> --api_key \u003CHUGGINGFACE_TOKEN>\n```\nNow launch a debug job to check if everything works.  
**The provided configurations are templates, you need to adapt them for them to work (change `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc ...)**\n\n```bash\n# stool stands for SLURM tool !\npython -m lingua.stool script=apps.main.train config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml nodes=1 partition=\u003Cpartition>\n# if you want to launch locally you can use torchrun\ntorchrun --nproc-per-node 8 -m apps.main.train config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml\n# or you can also launch on 1 GPU\npython -m apps.main.train config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml\n```\n\nWhen using `stool`, if a job crashes, it can be relaunched using sbatch:\n```bash\nsbatch path\u002Fto\u002Fdump_dir\u002Fsubmit.slurm\n```\n## Training Results \n\nWe get very strong performance on many downstream tasks and match the performance of [DCLM baseline 1.0](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11794).\n\n### 1B models on 60B DCLM tokens\n| name           | arc_challenge | arc_easy | boolq |  copa | hellaswag |  obqa |  piqa |  siqa | winogrande |  nq  |  tqa  |\n|----------------|:-------------:|:--------:|:-----:|:-----:|:---------:|:-----:|:-----:|:-----:|:----------:|:----:|:-----:|\n| Transformer 1B |     36.48     |   62.83  | 62.57 | 79.00 |   63.62   | 37.40 | 75.14 | 45.19 |    61.64   | 8.75 | 26.31 |\n| minGRU 1B      |     30.82     |   57.89  | 62.05 | 74.00 |   50.27   | 37.00 | 72.31 | 43.76 |    52.49   | 3.24 |  9.03 |\n| minLSTM 1B     |     31.76     |   60.04  | 62.02 | 73.00 |   53.39   | 36.40 | 72.36 | 45.09 |    52.80   | 4.52 | 12.73 |\n| Hawk 1B        |     34.94     |   63.68  | 62.42 | 76.00 |   63.10   | 38.20 | 73.23 | 46.01 |    55.33   | 8.42 | 23.58 |\n| Mamba 1B       |     35.54     |   63.42  | 62.63 | 74.00 |   64.16   | 38.80 | 75.24 | 45.14 |    60.14   | 8.84 | 26.64 |\n\n### 7B models\n\n| name                             | arc_challenge | arc_easy | boolq | copa  | hellaswag | obqa  | piqa  | siqa  | 
winogrande | mmlu  | nq    | tqa   | bbh   |\n|----------------------------------|---------------|----------|-------|-------|-----------|-------|-------|-------|------------|-------|-------|-------|-------|\n| Mamba 7B 200B tokens             | 47.21         | 76.03    | 65.63 | 84.00 | 77.80     | 44.00 | 80.25 | 49.69 | 70.24      | 32.81 | 20.53 | 51.93 | 20.35 |\n| Llama 7B 200B tokens             | 46.95         | 75.73    | 64.80 | 84.00 | 77.45     | 45.00 | 80.20 | 48.26 | 70.32      | 48.64 | 20.66 | 51.01 | 31.47 |\n| Llama 7B squared relu 1T tokens  | 49.61         | 76.74    | 72.45 | 89.00 | 81.19     | 44.80 | 82.05 | 49.95 | 72.14      | 60.56 | 25.68 | 59.52 | 42.11 |\n\n## Project overview\n\nMeta Lingua is structured as follows:\n\n```\n📦meta-lingua\n ┣ 📂lingua # Core library\n ┃ ┣ 📜args.py\n ┃ ┣ 📜checkpoint.py\n ┃ ┣ 📜data.py\n ┃ ┣ 📜distributed.py\n ┃ ┣ 📜float8.py\n ┃ ┣ 📜logger.py\n ┃ ┣ 📜metrics.py\n ┃ ┣ 📜optim.py\n ┃ ┣ 📜probe.py\n ┃ ┣ 📜profiling.py\n ┃ ┣ 📜stool.py\n ┃ ┣ 📜tokenizer.py\n ┃ ┗ 📜transformer.py\n ┣ 📂setup\n ┃ ┣ 📜create_env.sh\n ┃ ┗ 📜download_prepare_hf_data.py\n ┗ 📂apps # Apps that put components together\n   ┣ 📂main # Main language modeling app with llama\n   ┃ ┣ 📂configs\n   ┃ ┣ 📜eval.py\n   ┃ ┣ 📜generate.py\n   ┃ ┣ 📜train.py\n   ┃ ┗ 📜transformer.py\n   ┣ 📂fastRNN \n   ┃ ┣ 📂component\n   ┃ ┣ 📂hawk\n   ┃ ┣ 📂minGRU\n   ┃ ┣ 📂minLSTM\n   ┣ 📂mamba\n   ┣ 📂mtp # Multi token prediction\n   ┗ 📂plots\n```\n\nThe `lingua` folder contains some essential and reusable components, while the `apps` folder contains scripts that put those components together. For instance the main training loop is in `apps\u002Fmain`. We highly encourage you to use that as a template and modify it however you please to suit your experiments. \n\nNothing is sacred in Meta Lingua. We've specifically tried to make it as easily modifiable as possible! So feel free to branch out and modify anything. 
\n\nHere's a quick description of the most important files and features:\n\n- **`transformer.py`** : Defines the model architecture. This is a pure PyTorch `nn.Module`! Nothing fancy here. \n- **`distributed.py`** : Handles distributing the model on multiple GPUs. This is done through the `parallelize_module` function, which wraps your vanilla `nn.Module` and applies nearly any combination of Data Parallel, Fully Sharded Data Parallel, Model Parallelism, `torch.compile`, activation checkpointing and `float8`. \n- **`data.py`** : Dataloader for LLM pretraining.\n\n\u003Cp align=\"center\">  \n \u003Cimg src=\"dataloader.png\" width=\"40%\"\u002F>\n\u003C\u002Fp>\n\n- **`profiling.py`** : Small wrapper around xformers' profiler which provides automatic MFU and HFU calculation and dumps profile traces in the profiling folder of your dump directory. It also produces memory profiling traces. \n- **`checkpoint.py`** : Manages model checkpoints. It saves the model in the checkpoints folder of your dump dir in the .distcp format, which is the new PyTorch distributed saving method. This format allows the model to be reloaded with a different number of GPUs and with a different sharding. You can also convert those into normal PyTorch checkpoints with `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`, and the other way around with `torch_save_to_dcp`.\n- **`args.py`** : Utilities to work with configs. \n\n## Configuration\n\nMost components need configuration, and we chose to use data classes to represent these configuration objects. `args.py` helps with converting `config.yaml` files and config dictionaries into the respective data classes. \n\nFor example, the `TrainArgs` in `apps\u002Fmain\u002Ftrain.py` has `LMTransformerArgs`, `OptimArgs`, etc. as children. \n\nHere is an example configuration file that will be converted to `TrainArgs`:\n\n```yaml\n# This is where Meta Lingua will store anything related to the experiment. 
\ndump_dir: \u002Fpath\u002Fto\u002Fdumpdir\nname: \"debug\"\nsteps: 1000\n\nseed: 12\n\noptim:\n    lr: 3e-4\n    warmup: 2000\n    lr_min_ratio: 0.000001\n    clip: 10.0\n\ndistributed:\n    fsdp_type: full_shard\n    compile: true\n    selective_activation_checkpointing: false\n\nmodel:\n    dim: 1024\n    n_layers: 8\n    n_heads: 8\n\ndata:\n    root_dir: data\u002Fshuffled\n    sources:\n      wikipedia: 80.0\n      arxiv: 20.0\n    batch_size: 32\n    seq_len: 1024\n    load_async: true\n    tokenizer:\n        name: sp\n        path: tokenizers\u002Fllama2.model\n```\n\n\n## Launching jobs\n\n### Command line arguments\n\nThe command line interface in all scripts (`train.py`, `eval.py`, `stool.py`) uses [OmegaConf](https:\u002F\u002Fomegaconf.readthedocs.io\u002Fen\u002F2.3_branch\u002Fusage.html#from-command-line-arguments),\nwhich accepts arguments as a dot list.\nSo if the dataclasses look like\n```python\nfrom dataclasses import dataclass, field\n\n@dataclass\nclass LMTransformerArgs:\n    dim: int = 512\n    n_layers: int = 12\n\n@dataclass\nclass DummyArgs:\n    name: str = \"blipbloup\"\n    # default_factory gives each DummyArgs its own LMTransformerArgs instance\n    model: LMTransformerArgs = field(default_factory=LMTransformerArgs)\n```\n\nthen you can pass `model.dim=32` to change values in `LMTransformerArgs`,\nor just `name=tictac` for top-level attributes.\n\n**`train.py`** simply takes as argument the path to a config file and will load that config. The behavior here is as follows:\n1. We instantiate `TrainArgs` with its default values\n2. We override those default values with the ones in the provided config file\n3. 
We override the result with the additional arguments provided through the command line\n\nIf we take the `DummyArgs` example above, calling `train.py` with `train.py config=debug.yaml model.dim=64 name=tictac` \nwhere `debug.yaml` contains \n```yaml\nmodel:\n    n_layers: 24\n```\nwill launch training with the config \n```python\nDummyArgs(name=\"tictac\", model=LMTransformerArgs(dim=64, n_layers=24))\n```\n\n### Launching with SLURM\n\nSince we want to do distributed training, we need `train.py` to run N times (with N being the number of GPUs).\n\nThe easiest way to do this is through SLURM. To make that simpler, we provide `lingua\u002Fstool.py`, a simple Python script that \n1. Saves the provided config to `dump_dir`\n2. Copies your current code to `dump_dir` in order to back it up \n3. Creates an sbatch file `submit.slurm` which is then used to launch the job with the provided config. \n\nIt can be used either from the command line \n\n```bash\npython -m lingua.stool config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml nodes=1 account=fair_amaia_cw_codegen qos=lowest\n```\n\nor by calling the `launch_job` function directly. This allows you, for example, to create many arbitrary configs (to sweep parameters, do ablations) in a Jupyter notebook and launch jobs directly from there. \n\nSince the configuration file is copied to `dump_dir`, an easy way to iterate is to simply change the config file and launch the same command above. \n\n## Debugging\nIn order to iterate quickly, it is preferable not to have to wait for a SLURM allocation every time. You can instead ask SLURM to allocate resources for you, then once they're allocated you can run multiple commands on that same allocation. \n\nFor example, you can run:\n\n```bash\nsalloc --nodes 2 --cpus-per-gpu 16 --mem 1760GB --gres=gpu:8 --exclusive --time=72:00:00\n```\n\nThis will give you access to 2 nodes in your current terminal. 
Once the allocation is done, you will see some SLURM environment variables that were automatically added, such as `$SLURM_JOB_ID`. This allows you, for example, to run in the same terminal\n\n```bash\nsrun -n 16 python -m apps.main.train config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml\n```\n\nwhich will run the `python -m apps.main.train config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml` command on each of the 16 GPUs. If this crashes or ends, you can just relaunch `srun` because the nodes are already allocated to you and you don't have to wait for SLURM to give you the resources again.\n\nThis will also show you the outputs of all those commands in the same terminal, which might become cumbersome. \n\nInstead you can use `stool` directly to configure logs to be separated into different files per GPU.\n\n```bash\npython -m lingua.stool config=apps\u002Fmain\u002Fconfigs\u002Fdebug.yaml nodes=2 launcher=bash dirs_exists_ok=true\n```\n\nNotice that we added **`launcher=bash`**, which means that the generated `submit.slurm` will simply be executed instead of being submitted through `sbatch`. The `submit.slurm` also contains an `srun` command, so this is very similar to the `srun` command above. We also add **`dirs_exists_ok=true`** to tell `stool` that it is okay to overwrite things in an existing folder (code, config, etc.).\n\nIf you want to use `pdb` to step through your code, you should use `-n 1` to run only on 1 GPU. 
\n\n## Evaluations\n\nEvaluations can run either during training periodically or you directly launch evals on a given checkpoint as follows:\n\n```bash\nsrun -n 8 python -u -m apps.main.eval config=apps\u002Fmain\u002Fconfigs\u002Feval.yaml\n```\n\nYou need to specify the checkpoint and dump dir of the evaluation in that config\n\nOr through `stool` with\n\n```bash\npython -m lingua.stool script=apps.main.eval config=apps\u002Fmain\u002Fconfigs\u002Feval.yaml nodes=1 account=fair_amaia_cw_codegen qos=lowest\n```\n\n## Dump dir structure\n\n```\n📂example_dump_dir\n ┣ 📂checkpoints\n ┃ ┣ 📂0000001000\n ┃ ┣ 📂0000002000\n ┃ ┣ 📂0000003000\n ┃ ┣ 📂0000004000\n ┃ ┣ 📂0000005000\n ┃ ┣ 📂0000006000\n ┃ ┣ 📂0000007000 # Checkpoint and train state saved every 1000 steps here\n ┃ ┃ ┣ 📜.metadata\n ┃ ┃ ┣ 📜__0_0.distcp\n ┃ ┃ ┣ 📜__1_0.distcp\n ┃ ┃ ┣ 📜params.json\n ┃ ┃ ┣ 📜train_state_00000.json\n ┃ ┃ ┗ 📜train_state_00001.json\n ┣ 📂code # Backup of the code at the moment the job was launched\n ┣ 📂logs\n ┃ ┗ 📂166172 # Logs for each GPU in this SLURM job.\n ┃ ┃ ┣ 📜166172.stderr\n ┃ ┃ ┣ 📜166172.stdout\n ┃ ┃ ┣ 📜166172_0.err\n ┃ ┃ ┣ 📜166172_0.out\n ┃ ┃ ┣ 📜166172_1.err\n ┃ ┃ ┗ 📜166172_1.out\n ┣ 📂profiling\n ┃ ┣ 📂memory_trace_plot # Trace of memory usage through time for all GPUs\n ┃ ┃ ┣ 📜000102_h100-192-145_451082.html\n ┃ ┃ ┣ 📜000102_h100-192-145_451083.html\n ┃ ┗ 📂profile_CPU_CUDA_000104 # Profiling traces for all GPUs\n ┃ ┃ ┣ 📜h100-192-145_451082.1720183858874741723.pt.trace.json.gz\n ┃ ┃ ┗ 📜h100-192-145_451083.1720183858865656716.pt.trace.json.gz\n ┣ 📜base_config.yaml\n ┣ 📜config.yaml\n ┣ 📜metrics.jsonl\n ┗ 📜submit.slurm\n```\n\n## Related repositories\n\nHere we highlight some related work that is complementary to this one. Most important being [torchtitan](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan), [torchtune](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtune) and [fairseq2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq2). 
\n\nLingua is designed for researchers who want to experiment with new ideas for LLM pretraining and get quick feedback on both training\u002Finference speed and downstream benchmarks. Our goal is to lower the barrier to entry for LLM research by providing a lightweight and focused codebase.\n\nWe see torchtitan, torchtune, lingua and fairseq2 as complementary tools. Torchtitan is excellent for large-scale work because it features 3D parallelism and is likely to integrate the latest PyTorch distributed training features more quickly, thanks to its close ties to the PyTorch team. On the other hand, Torchtune excels at fine-tuning, especially when GPU resources are limited, by offering various fine-tuning strategies like LoRA, QLoRA, DPO, and PPO. Fairseq2 is a FAIR project for sequence modeling with multi-modal capabilities that provides various LLM training recipes, multi-GPU support with data and model parallelism, and efficient data processing for speech and multilingual content.\n\nA typical workflow could look like this: you might first test a new idea in Lingua, then scale it up further with Torchtitan, and finally use Torchtune for instruction or preference fine-tuning.\n\nAlthough there's definitely some overlap among these codebases, we think it's valuable to have focused tools for different aspects of LLM work. For example, Torchtitan aims to showcase the latest distributed training features of PyTorch in a clean, minimal codebase, but for most research, you really don't need every feature PyTorch has to offer or the capability to scale to 100B parameters on 4096 GPUs. For instance, we think that FSDP + torch compile will cover 90% of all needs of a researcher. 
With lingua, we tried to ask \"What's the minimal set of features needed to draw solid conclusions on the scalability of idea X?\"\n\nWe believe this targeted approach helps researchers make progress faster without the mental overhead of using many techniques that might not be needed.\n\n## Citation\n\n```\n@misc{meta_lingua,\n  author = {Mathurin Videau and Badr Youbi Idrissi and Daniel Haziza and Luca Wehrstedt and Jade Copet and Olivier Teytaud and David Lopez-Paz},\n  title = {{Meta Lingua}: A minimal {PyTorch LLM} training library},\n  url = {https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flingua},\n  year = {2024}\n}\n```\n## License\n\nMeta Lingua is licensed under the BSD-3-Clause license. Refer to the LICENSE file in the top level directory.\n","Meta Lingua is a lightweight, efficient, and easy-to-modify codebase for researching large language models (LLMs). Built on PyTorch, it lets researchers easily try new architectures, loss functions, datasets, and more, and supports the full workflow from training to inference to evaluation. The project also provides tools to help understand model speed and stability. It suits research scenarios that require rapid iteration on experimental setups or improvements to existing models. The project is currently under active development and already provides several example apps that showcase how to use it.",2,"2026-06-11 03:40:29","high_star"]