[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72432":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},72432,"ollm","Mega4alik\u002Follm","Mega4alik",null,"Python",2665,251,30,18,0,1,46,3,29.2,"MIT License",false,"main",true,[],"2026-06-12 02:03:03","\u003C!-- markdownlint-disable MD001 MD041 -->\n\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Follm.s3.us-east-1.amazonaws.com\u002Ffiles\u002Flogo2.png\">\n    \u003Cimg alt=\"oLLM\" src=\"https:\u002F\u002Follm.s3.us-east-1.amazonaws.com\u002Ffiles\u002Flogo2.png\" width=52%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\nLLM Inference for Large-Context Offline Workloads\n\u003C\u002Fh3>\n\noLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It enables running models like [gpt-oss-20B](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fgpt-oss-20b), [qwen3-next-80B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-Next-80B-A3B-Instruct) or [Llama-3.1-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.1-8B-Instruct) on 100k context using ~$200 consumer GPU with 8GB VRAM.  No quantization is used—only fp16\u002Fbf16 precision. 
\n\n\u003Cp dir=\"auto\">\u003Cem>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fwiki\u002FReleases\">Latest updates\u003C\u002Fa> (1.0.3)\u003C\u002Fem> 🔥\u003C\u002Fp>\n\u003Cul dir=\"auto\">\n\u003Cli>\u003Ccode>AutoInference\u003C\u002Fcode> with any Llama3 \u002F gemma3 model + \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft\">PEFT\u003C\u002Fa> adapter support\u003C\u002Fli>\n\u003Cli>\u003Ccode>kvikio\u003C\u002Fcode> and \u003Ccode>flash-attn\u003C\u002Fcode> are optional now, meaning no hardware restrictions beyond HF transformers\u003C\u002Fli>\n\u003Cli>Multimodal \u003Cb>voxtral-small-24B\u003C\u002Fb> (audio+text) added. \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fblob\u002Fmain\u002Fexample_audio.py\">[sample with audio]\u003C\u002Fa> \u003C\u002Fli>\n\u003Cli>Multimodal \u003Cb>gemma3-12B\u003C\u002Fb> (image+text) added. \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fblob\u002Fmain\u002Fexample_image.py\">[sample with image]\u003C\u002Fa> \u003C\u002Fli>\n\u003Cli>\u003Cb>qwen3-next-80B\u003C\u002Fb> (160GB model) added with \u003Cspan style=\"color:blue\">⚡️1tok\u002F2s\u003C\u002Fspan> throughput (our fastest model so far)\u003C\u002Fli>\n\u003Cli>gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage \u003C\u002Fli>\n\u003Cli>gpt-oss-20B chunked MLP added to reduce VRAM usage \u003C\u002Fli>\n\u003C\u002Ful>\n\n---\n###  8GB Nvidia 3060 Ti Inference memory usage:\n\n| Model   | Weights | Context length | KV cache |  Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |\n| ------- | ------- | -------- | ------------- | ------------ | ---------------- | --------------- |\n| [qwen3-next-80B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-Next-80B-A3B-Instruct) | 160 GB (bf16) | 50k | 20 GB | ~190 GB   | ~7.5 GB | 180 GB  |\n| [gpt-oss-20B](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fgpt-oss-20b) | 13 GB 
(packed bf16) | 10k | 1.4 GB | ~40 GB   | ~7.3GB | 15 GB  |\n| [gemma3-12B](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-3-12b-it)  | 25 GB (bf16) | 50k   | 18.5 GB          | ~45 GB   | ~6.7 GB       | 43 GB  |\n| [llama3-1B-chat](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-1B-Instruct)  | 2 GB (bf16) | 100k   | 12.6 GB          | ~16 GB   | ~5 GB       | 15 GB  |\n| [llama3-3B-chat](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-3B-Instruct)  | 7 GB (bf16) | 100k  | 34.1 GB | ~42 GB   | ~5.3 GB     | 42 GB |\n| [llama3-8B-chat](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.1-8B-Instruct)  | 16 GB (bf16) | 100k  | 52.4 GB | ~71 GB   | ~6.6 GB     | 69 GB  |\n\n\u003Csmall>By \"Baseline\" we mean typical inference without any offloading\u003C\u002Fsmall>\n\nHow do we achieve this:\n\n- Loading layer weights from SSD directly to GPU one by one\n- Offloading KV cache to SSD and loading back directly to GPU, no quantization or PagedAttention\n- Offloading layer weights to CPU if needed\n- FlashAttention-2 with online softmax. Full attention matrix is never materialized. \n- Chunked MLP. 
Intermediate up-projection layers can get large, so the MLP is chunked as well \n---\nTypical use cases include:\n- Analyze contracts, regulations, and compliance reports in one pass\n- Summarize or extract insights from massive patient histories or medical literature\n- Process very large log files or threat reports locally\n- Analyze historical chats to extract the most common issues\u002Fquestions users have\n---\n**Supported GPUs**: NVIDIA (with additional performance benefits from `kvikio` and `flash-attn`), AMD, and Apple Silicon (MacBook).\n\n\n\n## Getting Started\n\nIt is recommended to create a venv or conda environment first:\n```bash\npython3 -m venv ollm_env\nsource ollm_env\u002Fbin\u002Factivate\n```\n\nInstall oLLM with `pip install --no-build-isolation ollm` or [from source](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm.git\ncd ollm\npip install --no-build-isolation -e .\n\n# for Nvidia GPUs with CUDA (optional); speeds up inference\n# replace {cuda_version} with your CUDA major version, e.g. kvikio-cu12\npip install kvikio-cu{cuda_version}\n```\n> 💡 **Note**  \n> **voxtral-small-24B** requires additional pip dependencies: `pip install \"mistral-common[audio]\"` and `pip install librosa`\n\nCheck the [Troubleshooting](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fwiki\u002FTroubleshooting) page if you run into installation issues.\n\n## Example\n\nSample code snippet:\n\n```python\nfrom ollm import Inference, file_get_contents, TextStreamer\no = Inference(\"llama3-1B-chat\", device=\"cuda:0\", logging=True) # llama3-1B\u002F3B\u002F8B-chat, gpt-oss-20B, qwen3-next-80B\no.ini_model(models_dir=\".\u002Fmodels\u002F\", force_download=False)\no.offload_layers_to_cpu(layers_num=2) # (optional) offload some layers to CPU for a speed boost\npast_key_values = o.DiskCache(cache_dir=\".\u002Fkv_cache\u002F\") # set to None if the context is small\ntext_streamer = TextStreamer(o.tokenizer, skip_prompt=True, 
skip_special_tokens=False)\n\nmessages = [{\"role\":\"system\", \"content\":\"You are a helpful AI assistant\"}, {\"role\":\"user\", \"content\":\"List planets\"}]\ninput_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort=\"minimal\", tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(o.device)\noutputs = o.model.generate(input_ids=input_ids,  past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()\nanswer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)\nprint(answer)\n```\nOr run the sample Python script with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py` \n\n```python\n# with AutoInference, you can run any Llama3\u002Fgemma3 model with PEFT adapter support\n# pip install peft \nfrom ollm import AutoInference\no = AutoInference(\".\u002Fmodels\u002Fgemma3-12B\", # any llama3 or gemma3 model\n  adapter_dir=\".\u002Fmyadapter\u002Fcheckpoint-20\", # PEFT adapter checkpoint if available\n  device=\"cuda:0\", multimodality=False, logging=True)\n...\n```\n**More samples**\n- [gemma3-12B image+text](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fblob\u002Fmain\u002Fexample_image.py)\n- [voxtral-small-24B audio+text](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fblob\u002Fmain\u002Fexample_audio.py)\n- [AutoInference + SFT](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Fpeftee?tab=readme-ov-file#usage)\n\n\n## Knowledge base\n- [Documentation](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fwiki\u002FDocumentation)\n- [Community](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fwiki\u002FCommunity) articles, videos, blogs\n- [Troubleshooting](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fwiki\u002FTroubleshooting)\n\n\n## Roadmap\n*For visibility of what's coming next (subject to change)*\n- Qwen3-Next quantized version\n- Qwen3-VL or alternative vision model\n- Qwen3-Next MultiTokenPrediction in 
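R&D\n\n\n## Contact us\nIf there’s a model you’d like to see supported, feel free to suggest it in the [discussion](https:\u002F\u002Fgithub.com\u002FMega4alik\u002Follm\u002Fdiscussions\u002F4), and I’ll do my best to make it happen.\n\n","oLLM is a lightweight Python library that provides LLM inference for large-context offline workloads. Built on Huggingface Transformers and PyTorch, it supports running models such as gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with context lengths of up to 100k tokens on a roughly $200 consumer GPU (8GB VRAM), using only fp16\u002Fbf16 precision and no quantization. Its core features include AutoInference support for various Llama3\u002Fgemma3 models and PEFT adapters, optional kvikio and flash-attn optimizations, and newly added multimodal model support, such as voxtral-small-24B (audio+text) and gemma3-12B (image+text). In addition, model-specific implementations such as the chunked MLP design for gpt-oss-20B significantly reduce VRAM usage. oLLM is well suited to researchers and individual developers who want to run large language model inference efficiently on limited hardware.",2,"2026-06-11 03:42:01","high_star"]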