[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72010":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},72010,"LWM","LargeWorldModel\u002FLWM","LargeWorldModel","Large World Model -- Modeling Text and Video with Millions Context","https:\u002F\u002Flargeworldmodel.github.io\u002F",null,"Python",7417,558,67,54,0,5,9,39.24,"Apache License 2.0",false,"main",[],"2026-06-12 02:02:57","# Large World Model (LWM)\n\n[[Project]](https:\u002F\u002Flargeworldmodel.github.io\u002F)\n[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08268)\n[[Models]](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel)\n\n**Large World Model (LWM)** is a general-purpose large-context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.\n\n\n## Approach\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fimgs\u002Fdata.png\"\u002F>\n\u003C\u002Fdiv>\n\nCurrent language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. 
Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long sequence chat. (c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on million-length multimodal sequences. 
(d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.\nThis work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.\n\n## LWM Capabilities\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fimgs\u002Fsingle_needle_1M.png\"\u002F>\n  \u003Cp>\n  LWM can retrieve facts across a 1M-token context with high accuracy.\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Cbr \u002F>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fimgs\u002Flong_video_chat_main.png\"\u002F>\n  \u003Cp>\n  LWM can answer questions about a 1-hour YouTube video.\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Cbr \u002F>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fimgs\u002Fimage_chat.png\"\u002F>\n  \u003Cp>\n  LWM can chat with images.\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Cbr \u002F>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fimgs\u002Fimage_video_gen.png\"\u002F>\n  \u003Cp>\n  LWM can generate videos and images from text.\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\n## Setup\n\nThis codebase is supported on Ubuntu and has not been tested on Windows or macOS. We recommend using TPUs for training and inference, although it is also possible to use GPUs. On TPU, the code is highly optimized with Jax's Pallas and can achieve high MFUs with RingAttention at very large context sizes. On GPU, the code is based on XLA and is not as optimized as it is for TPU.\n\nInstall the requirements with:\n```\nconda create -n lwm python=3.10\nconda activate lwm\npip install -r gpu_requirements.txt\n```\nor set up TPU VM with:\n```\nsh tpu_requirements.sh\n```\n\n\n## Available models\n\nThere are language-only and vision-language versions, offering context sizes of 32K, 128K, 256K, 512K, and 1M tokens. 
The vision-language models are available only in Jax, and the language-only models are available in both PyTorch and Jax. Below are the names of the available models and their corresponding context sizes and capabilities:\n\n| Model Name         | Context Size | Language or Vision-Language | Chat or Base | URL                                                                                                                                          |\n|--------------------|--------------|-----------------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------|\n| LWM-Text-Chat-128K | 128K         | Language                    | Chat         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-128K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-128K-Jax)] |\n| LWM-Text-Chat-256K | 256K         | Language                    | Chat         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-256K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-256K-Jax)] |\n| LWM-Text-Chat-512K | 512K         | Language                    | Chat         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-512K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-512K-Jax)] |\n| LWM-Text-Chat-1M   | 1M           | Language                    | Chat         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-1M)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-Chat-1M-Jax)]     |\n| LWM-Text-128K      | 128K         | Language                    | Base         | 
[[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-128K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-128K-Jax)]           |\n| LWM-Text-256K      | 256K         | Language                    | Base         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-256K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-256K-Jax)]           |\n| LWM-Text-512K      | 512K         | Language                    | Base         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-512K)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-512K-Jax)]           |\n| LWM-Text-1M        | 1M           | Language                    | Base         | [[Pytorch](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-1M)][[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-Text-1M-Jax)]               |\n| LWM-Chat-32K       | 32K          | Vision-Language             | Chat         | [[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-32K-Jax)]                                                                                  |\n| LWM-Chat-128K      | 128K         | Vision-Language             | Chat         | [[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-128K-Jax)]                                                                                 |\n| LWM-Chat-1M        | 1M           | Vision-Language             | Chat         | [[Jax](https:\u002F\u002Fhuggingface.co\u002FLargeWorldModel\u002FLWM-1M-Jax)]                                                                                   |\n\n\n## Code structure\nUse `scan_query_chunk_size` and `scan_key_chunk_size` to control the block size in blockwise compute of the self-attention. Use `scan_mlp_chunk_size` to control the block size in blockwise compute of the feedforward network. 
Set `scan_attention` and `scan_mlp` to `True` or `False` to enable or disable blockwise compute in the self-attention and feed-forward network.\n\nYou can use `mesh_dim=dp, fsdp, tp, sp` to control the degree of parallelism and RingAttention. It is a string of 4 integers separated by commas, representing the degrees of data parallelism, fully sharded data parallelism, tensor parallelism, and sequence parallelism.\nFor example, `mesh_dim='1,64,4,1'` means 1 data parallelism, 64 fully sharded data parallelism, 4 tensor parallelism, and 1 sequence parallelism. `mesh_dim='1,1,4,64'` means 1 data parallelism, 1 fully sharded data parallelism, 4 tensor parallelism, and 64 sequence parallelism for RingAttention.\n\n\n## Running Jax Models\nIn this section, we provide instructions on how to run each of the provided scripts. For each script, you may need to fill in your own paths and values in the variables described at the beginning of each script.\n\nTo run each of the following scripts, use `bash \u003Cscript_name>.sh`:\n- Language model training: `bash scripts\u002Frun_train_text.sh`\n- Vision-Language model training: `bash scripts\u002Frun_train_vision_text.sh`\n- Single Needle Evals (Language Model): `bash scripts\u002Frun_eval_needle.sh`\n- Multi Needle Evals (Language Model): `bash scripts\u002Frun_eval_needle_multi.sh`\n- Sampling images (Vision-Language Model): `bash scripts\u002Frun_sample_image.sh`\n- Sampling videos (Vision-Language Model): `bash scripts\u002Frun_sample_video.sh`\n- Image \u002F Video understanding (Vision-Language Model): `bash scripts\u002Frun_vision_chat.sh`\n\nBy default the `mesh_dim` argument puts all devices on `tp` (tensor parallelism). For longer sequences, you may want to include `sp`, which is the last dimension in the `mesh_dim`.\n\nWhen running needle evals, you may need to adjust the `theta` and `max_sequence_length` arguments in the scripts depending on the model. 
The table below shows the correct values for each model.\n\n|                     | LWM-Text-128K \u002F  LWM-Text-Chat-128K | LWM-Text-256K \u002F  LWM-Text-Chat-256K | LWM-Text-512K \u002F LWM-Text-Chat-512K | LWM-Text-1M \u002F LWM-Text-Chat-1M |\n|---------------------|:-----------------------------------:|:-----------------------------------:|:----------------------------------:|:------------------------------:|\n| theta               |               10000000              |               10000000              |              25000000              |            50000000            |\n| max_sequence_length |                131072               |                262144               |               524288               |             1048576            |\n\n\nAn example of filling out a script (`run_sample_video.sh`) is as follows:\n```bash\n#! \u002Fbin\u002Fbash\n\nexport SCRIPT_DIR=\"$( cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> \u002Fdev\u002Fnull && pwd )\"\nexport PROJECT_DIR=\"$( cd -- \"$( dirname -- \"$SCRIPT_DIR\" )\" &> \u002Fdev\u002Fnull && pwd )\"\ncd $PROJECT_DIR\nexport PYTHONPATH=\"$PYTHONPATH:$PROJECT_DIR\"\n\nexport llama_tokenizer_path=\"LargeWorldModel\u002FLWM-Text-1M\"\nexport vqgan_checkpoint=\"\u002Fpath\u002Fto\u002Fckpt\u002Ffolder\u002Fvqgan\"\nexport lwm_checkpoint=\"params::\u002Fpath\u002Fto\u002Fckpt\u002Ffolder\u002Fparams\"\n\npython3 -u -m lwm.vision_generation \\\n    --prompt='Fireworks over the city' \\\n    --output_file='fireworks.mp4' \\\n    --temperature_image=1.0 \\\n    --temperature_video=1.0 \\\n    --top_k_image=8192 \\\n    --top_k_video=1000 \\\n    --cfg_scale_image=5.0 \\\n    --cfg_scale_video=1.0 \\\n    --vqgan_checkpoint=\"$vqgan_checkpoint\" \\\n    --n_frames=8 \\\n    --mesh_dim='!1,1,-1,1' \\\n    --dtype='fp32' \\\n    --load_llama_config='7b' \\\n    
--update_llama_config=\"dict(sample_mode='vision',theta=50000000,max_sequence_length=32768,scan_attention=False,scan_query_chunk_size=128,scan_key_chunk_size=128,scan_mlp=False,scan_mlp_chunk_size=8192,scan_layers=True)\" \\\n    --load_checkpoint=\"$lwm_checkpoint\" \\\n    --tokenizer=\"$llama_tokenizer_path\"\nread\n```\n\n\n## Needle Haystack Data\nRun `python scripts\u002Fcreate_needle_data.py`\n\n\n## Running PyTorch Models\nOnly text and text chat models are currently supported for PyTorch inference. PyTorch models can be loaded as Hugging Face `LlamaForCausalLM` models. Run `python scripts\u002Fsample_pyt.py` to sample. You may need to separately install `torch`.\n\n## Documentation\n\nFor more details on the codebase, please refer to the [data.md](docs\u002Fdata.md) and [sharding.md](docs\u002Fsharding.md).\nThe [data.md](docs\u002Fdata.md) provides details on the data processing and the [sharding.md](docs\u002Fsharding.md) provides details on the sharding and parallelism.\n\n\n## If you have issues\n\nThis is based on the [codebase](https:\u002F\u002Fgithub.com\u002Fhaoliuhl\u002Fringattention) of RingAttention, with the necessary features for vision-language training. 
The training and inference have been tested on both TPUv3 and TPUv4.\n\nIf you encounter bugs, please open a GitHub issue!\n\n\n## Citation\n\nIf you use this codebase, or otherwise find our work valuable, please cite:\n\n```\n@article{liu2023world,\n    title={World Model on Million-Length Video and Language with RingAttention},\n    author={Liu, Hao and Yan, Wilson and Zaharia, Matei and Abbeel, Pieter},\n    journal={arXiv preprint},\n    year={2024},\n}\n@article{liu2023ring,\n    title={Ring Attention with Blockwise Transformers for Near-Infinite Context},\n    author={Liu, Hao and Zaharia, Matei and Abbeel, Pieter},\n    journal={International Conference on Learning Representations},\n    year={2024}\n}\n@article{liu2023blockwise,\n    title={Blockwise Parallel Transformer for Large Context Models},\n    author={Liu, Hao and Abbeel, Pieter},\n    journal={Advances in neural information processing systems},\n    year={2023}\n}\n```\n\n## License\n\nLWM's code is released under the Apache 2.0 License. See [LICENSE](https:\u002F\u002Fgithub.com\u002FLargeWorldModel\u002Flwm\u002Fblob\u002Fmain\u002FLICENSE) for further details. The models are released under the Llama-2 license.\n","Large World Model (LWM) is a general-purpose large-context multimodal autoregressive model that can understand and generate text, images, and video. The project is trained on a large dataset using RingAttention, supports context sizes from 4K to 1M tokens, and addresses challenges such as memory constraints and computational complexity. LWM's core capabilities include fact retrieval across million-token contexts, question answering over hour-long videos, image-based chat, and generating videos and images from text. It is suited to applications that require a deep understanding of long text or video content, such as education, entertainment, and news analysis.",2,"2026-06-11 03:39:56","high_star"]