[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1518":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":37,"readmeContent":38,"aiSummary":39,"trendingCount":16,"starSnapshotCount":16,"syncStatus":40,"lastSyncTime":41,"discoverSource":42},1518,"langextract","google\u002Flangextract","google","A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.","https:\u002F\u002Fpypi.org\u002Fproject\u002Flangextract\u002F",null,"Python",36872,2544,161,72,0,13,73,438,66,120,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36],"gemini","gemini-ai","gemini-api","gemini-flash","gemini-pro","information-extration","large-language-models","llm","nlp","python","structured-data","2026-06-12 04:00:10","\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\">\n    \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle\u002Flangextract\u002Fmain\u002Fdocs\u002F_static\u002Flogo.svg\" alt=\"LangExtract Logo\" width=\"128\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n# LangExtract\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Flangextract.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Flangextract\u002F)\n[![GitHub 
stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Flangextract.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract)\n![Tests](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Factions\u002Fworkflows\u002Fci.yaml\u002Fbadge.svg)\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.17015089.svg)](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.17015089)\n\n## Table of Contents\n\n- [Introduction](#introduction)\n- [Why LangExtract?](#why-langextract)\n- [Quick Start](#quick-start)\n- [Installation](#installation)\n- [API Key Setup for Cloud Models](#api-key-setup-for-cloud-models)\n- [Adding Custom Model Providers](#adding-custom-model-providers)\n- [Using OpenAI Models](#using-openai-models)\n- [Using Local LLMs with Ollama](#using-local-llms-with-ollama)\n- [More Examples](#more-examples)\n  - [*Romeo and Juliet* Full Text Extraction](#romeo-and-juliet-full-text-extraction)\n  - [Medication Extraction](#medication-extraction)\n  - [Radiology Report Structuring: RadExtract](#radiology-report-structuring-radextract)\n- [Community Providers](#community-providers)\n- [Contributing](#contributing)\n- [Testing](#testing)\n- [Disclaimer](#disclaimer)\n\n## Introduction\n\nLangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.\n\n## Why LangExtract?\n\n1.  **Precise Source Grounding:** Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.\n2.  
**Reliable Structured Outputs:** Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini to guarantee robust, structured results.\n3.  **Optimized for Long Documents:** Overcomes the \"needle-in-a-haystack\" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.\n4.  **Interactive Visualization:** Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in their original context.\n5.  **Flexible LLM Support:** Supports your preferred models, from cloud-based LLMs like the Google Gemini family to local open-source models via the built-in Ollama interface.\n6.  **Adaptable to Any Domain:** Define extraction tasks for any domain using just a few examples. LangExtract adapts to your needs without requiring any model fine-tuning.\n7.  **Leverages LLM World Knowledge:** Utilize precise prompt wording and few-shot examples to influence how the extraction task may utilize LLM knowledge. The accuracy of any inferred information and its adherence to the task specification are contingent upon the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.\n\n## Quick Start\n\n> **Note:** Using cloud-hosted models like Gemini requires an API key. See the [API Key Setup](#api-key-setup-for-cloud-models) section for instructions on how to get and configure your key.\n\nExtract structured information with just a few lines of code.\n\n### 1. Define Your Extraction Task\n\nFirst, create a prompt that clearly describes what you want to extract. Then, provide a high-quality example to guide the model.\n\n```python\nimport langextract as lx\nimport textwrap\n\n# 1. 
Define the prompt and extraction rules\nprompt = textwrap.dedent(\"\"\"\\\n    Extract characters, emotions, and relationships in order of appearance.\n    Use exact text for extractions. Do not paraphrase or overlap entities.\n    Provide meaningful attributes for each entity to add context.\"\"\")\n\n# 2. Provide a high-quality example to guide the model\nexamples = [\n    lx.data.ExampleData(\n        text=\"ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.\",\n        extractions=[\n            lx.data.Extraction(\n                extraction_class=\"character\",\n                extraction_text=\"ROMEO\",\n                attributes={\"emotional_state\": \"wonder\"}\n            ),\n            lx.data.Extraction(\n                extraction_class=\"emotion\",\n                extraction_text=\"But soft!\",\n                attributes={\"feeling\": \"gentle awe\"}\n            ),\n            lx.data.Extraction(\n                extraction_class=\"relationship\",\n                extraction_text=\"Juliet is the sun\",\n                attributes={\"type\": \"metaphor\"}\n            ),\n        ]\n    )\n]\n```\n\n> **Note:** Examples drive model behavior. Each `extraction_text` should ideally be verbatim from the example's `text` (no paraphrasing), listed in order of appearance. LangExtract raises `Prompt alignment` warnings by default if examples don't follow this pattern—resolve these for best results.\n>\n> **Grounding:** LLMs may occasionally extract content from few-shot examples rather than the input text. LangExtract automatically detects this: extractions that cannot be located in the source text will have `char_interval = None`. Filter these out with `[e for e in result.extractions if e.char_interval]` to keep only grounded results.\n\n### 2. 
Run the Extraction\n\nProvide your input text and the prompt materials to the `lx.extract` function.\n\n```python\n# The input text to be processed\ninput_text = \"Lady Juliet gazed longingly at the stars, her heart aching for Romeo\"\n\n# Run the extraction\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=prompt,\n    examples=examples,\n    model_id=\"gemini-2.5-flash\",\n)\n```\n\n> **Model Selection**: `gemini-2.5-flash` is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, `gemini-2.5-pro` may provide superior results. For large-scale or production use, a Tier 2 Gemini quota is suggested to increase throughput and avoid rate limits. See the [rate-limit documentation](https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Frate-limits#tier-2) for details.\n>\n> **Model Lifecycle**: Note that Gemini models have a lifecycle with defined retirement dates. Users should consult the [official model version documentation](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fgenerative-ai\u002Fdocs\u002Flearn\u002Fmodel-versions) to stay informed about the latest stable and legacy versions.\n\n### 3. Visualize the Results\n\nThe extractions can be saved to a `.jsonl` file, a popular format for working with language model data. 
LangExtract can then generate an interactive HTML visualization from this file to review the entities in context.\n\n```python\n# Save the results to a JSONL file\nlx.io.save_annotated_documents([result], output_name=\"extraction_results.jsonl\", output_dir=\".\")\n\n# Generate the visualization from the file\nhtml_content = lx.visualize(\"extraction_results.jsonl\")\nwith open(\"visualization.html\", \"w\") as f:\n    if hasattr(html_content, 'data'):\n        f.write(html_content.data)  # For Jupyter\u002FColab\n    else:\n        f.write(html_content)\n```\n\nThis creates an animated and interactive HTML file:\n\n![Romeo and Juliet Basic Visualization ](https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle\u002Flangextract\u002Fmain\u002Fdocs\u002F_static\u002Fromeo_juliet_basic.gif)\n\n> **Note on LLM Knowledge Utilization:** This example demonstrates extractions that stay close to the text evidence - extracting \"longing\" for Lady Juliet's emotional state and identifying \"yearning\" from \"gazed longingly at the stars.\" The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding `\"identity\": \"Capulet family daughter\"` or `\"literary_context\": \"tragic heroine\"`). 
The balance between text-evidence and knowledge-inference is controlled by your prompt instructions and example attributes.\n\n### Scaling to Longer Documents\n\nFor larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:\n\n```python\n# Process Romeo & Juliet directly from Project Gutenberg\nresult = lx.extract(\n    text_or_documents=\"https:\u002F\u002Fwww.gutenberg.org\u002Ffiles\u002F1513\u002F1513-0.txt\",\n    prompt_description=prompt,\n    examples=examples,\n    model_id=\"gemini-2.5-flash\",\n    extraction_passes=3,    # Improves recall through multiple passes\n    max_workers=20,         # Parallel processing for speed\n    max_char_buffer=1000    # Smaller contexts for better accuracy\n)\n```\n\nThis approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization seamlessly handles large result sets, making it easy to explore hundreds of entities from the output JSONL file. **[See the full *Romeo and Juliet* extraction example →](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Fblob\u002Fmain\u002Fdocs\u002Fexamples\u002Flonger_text_example.md)** for detailed results and performance insights.\n\n### Vertex AI Batch Processing\n\nSave costs on large-scale tasks by enabling Vertex AI Batch API: `language_model_params={\"vertexai\": True, \"batch\": {\"enabled\": True}}`.\n\nSee an example of the Vertex AI Batch API usage in [this example](docs\u002Fexamples\u002Fbatch_api_example.md).\n\n## Installation\n\n### From PyPI\n\n```bash\npip install langextract\n```\n\n*Recommended for most users. 
For isolated environments, consider using a virtual environment:*\n\n```bash\npython -m venv langextract_env\nsource langextract_env\u002Fbin\u002Factivate  # On Windows: langextract_env\\Scripts\\activate\npip install langextract\n```\n\n### From Source\n\nLangExtract uses modern Python packaging with `pyproject.toml` for dependency management:\n\n*Installing with `-e` puts the package in development mode, allowing you to modify the code without reinstalling.*\n\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract.git\ncd langextract\n\n# For basic installation:\npip install -e .\n\n# For development (includes linting tools):\npip install -e \".[dev]\"\n\n# For testing (includes pytest):\npip install -e \".[test]\"\n```\n\n### Docker\n\n```bash\ndocker build -t langextract .\ndocker run --rm -e LANGEXTRACT_API_KEY=\"your-api-key\" langextract python your_script.py\n```\n\n## API Key Setup for Cloud Models\n\nWhen using LangExtract with cloud-hosted models (like Gemini or OpenAI), you'll need to\nset up an API key. On-device models don't require an API key. 
For developers\nusing local LLMs, LangExtract offers built-in support for Ollama and can be\nextended to other third-party APIs by updating the inference endpoints.\n\n### API Key Sources\n\nGet API keys from:\n\n*   [AI Studio](https:\u002F\u002Faistudio.google.com\u002Fapp\u002Fapikey) for Gemini models\n*   [Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fgenerative-ai\u002Fdocs\u002Fsdks\u002Foverview) for enterprise use\n*   [OpenAI Platform](https:\u002F\u002Fplatform.openai.com\u002Fapi-keys) for OpenAI models\n\n### Setting up API key in your environment\n\n**Option 1: Environment Variable**\n\n```bash\nexport LANGEXTRACT_API_KEY=\"your-api-key-here\"\n```\n\n**Option 2: .env File (Recommended)**\n\nAdd your API key to a `.env` file:\n\n```bash\n# Add API key to .env file\ncat >> .env \u003C\u003C 'EOF'\nLANGEXTRACT_API_KEY=your-api-key-here\nEOF\n\n# Keep your API key secure\necho '.env' >> .gitignore\n```\n\nIn your Python code:\n```python\nimport langextract as lx\n\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=\"Extract information...\",\n    examples=[...],\n    model_id=\"gemini-2.5-flash\"\n)\n```\n\n**Option 3: Direct API Key (Not Recommended for Production)**\n\nYou can also provide the API key directly in your code, though this is not recommended for production use:\n\n```python\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=\"Extract information...\",\n    examples=[...],\n    model_id=\"gemini-2.5-flash\",\n    api_key=\"your-api-key-here\"  # Only use this for testing\u002Fdevelopment\n)\n```\n\n**Option 4: Vertex AI (Service Accounts)**\n\nUse [Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fdocs\u002Fstart\u002Fintroduction-unified-platform) for authentication with service accounts:\n\n```python\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=\"Extract information...\",\n    examples=[...],\n    
model_id=\"gemini-2.5-flash\",\n    language_model_params={\n        \"vertexai\": True,\n        \"project\": \"your-project-id\",\n        \"location\": \"global\"  # or regional endpoint\n    }\n)\n```\n\n## Adding Custom Model Providers\n\nLangExtract supports custom LLM providers via a lightweight plugin system. You can add support for new models without changing core code.\n\n- Add new model support independently of the core library\n- Distribute your provider as a separate Python package\n- Keep custom dependencies isolated\n- Override or extend built-in providers via priority-based resolution\n\nSee the detailed guide in [Provider System Documentation](langextract\u002Fproviders\u002FREADME.md) to learn how to:\n\n- Register a provider with `@router.register(...)` from `langextract.providers`\n- Publish an entry point for discovery\n- Optionally provide a schema with `get_schema_class()` for structured output\n- Integrate with the factory via `create_model(...)`\n\n## Using OpenAI Models\n\nLangExtract supports OpenAI models (requires optional dependency: `pip install langextract[openai]`):\n\n```python\nimport langextract as lx\n\n# OPENAI_API_KEY in the environment is picked up automatically; pass\n# api_key=... 
explicitly only if you need to override it.\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=prompt,\n    examples=examples,\n    model_id=\"gpt-4o\",  # Automatically selects OpenAI provider\n)\n```\n\nThe OpenAI provider uses JSON mode and auto-determines fence and schema behavior — leave `fence_output` and `use_schema_constraints` unset.\n\nFor OpenAI-compatible endpoints or non-GPT model IDs (which skip auto-routing), use `ModelConfig` with an explicit provider:\n\n```python\nfrom langextract.factory import ModelConfig\n\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=prompt,\n    examples=examples,\n    config=ModelConfig(\n        model_id=\"my-openai-compatible-model\",\n        provider=\"openai\",\n        provider_kwargs={\"api_key\": \"sk-...\", \"base_url\": \"https:\u002F\u002F...\"},\n    ),\n)\n```\n\n## Using Local LLMs with Ollama\nLangExtract supports local inference using Ollama, allowing you to run models without API keys:\n\n```python\nimport langextract as lx\n\nresult = lx.extract(\n    text_or_documents=input_text,\n    prompt_description=prompt,\n    examples=examples,\n    model_id=\"gemma2:2b\",  # Automatically selects Ollama provider\n    model_url=\"http:\u002F\u002Flocalhost:11434\",\n)\n```\n\nThe Ollama provider exposes `FormatModeSchema` for JSON mode. Leave `fence_output` and `use_schema_constraints` unset so the factory auto-configures from the provider's schema.\n\n**Quick setup:** Install Ollama from [ollama.com](https:\u002F\u002Follama.com\u002F), run `ollama pull gemma2:2b`, then `ollama serve`.\n\nFor detailed installation, Docker setup, and examples, see [`examples\u002Follama\u002F`](examples\u002Follama\u002F).\n\n## More Examples\n\nAdditional examples of LangExtract in action:\n\n### *Romeo and Juliet* Full Text Extraction\n\nLangExtract can process complete documents directly from URLs. 
This example demonstrates extraction from the full text of *Romeo and Juliet* from Project Gutenberg (147,843 characters), showing parallel processing, sequential extraction passes, and performance optimization for long document processing.\n\n**[View *Romeo and Juliet* Full Text Example →](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Fblob\u002Fmain\u002Fdocs\u002Fexamples\u002Flonger_text_example.md)**\n\n### Medication Extraction\n\n> **Disclaimer:** This demonstration is for illustrative purposes of LangExtract's baseline capability only. It does not represent a finished or approved product, is not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.\n\nLangExtract excels at extracting structured medical information from clinical text. These examples demonstrate both basic entity recognition (medication names, dosages, routes) and relationship extraction (connecting medications to their attributes), showing LangExtract's effectiveness for healthcare applications.\n\n**[View Medication Examples →](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Fblob\u002Fmain\u002Fdocs\u002Fexamples\u002Fmedication_examples.md)**\n\n### Radiology Report Structuring: RadExtract\n\nExplore RadExtract, a live interactive demo on HuggingFace Spaces that shows how LangExtract can automatically structure radiology reports. Try it directly in your browser with no setup required.\n\n**[View RadExtract Demo →](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fgoogle\u002Fradextract)**\n\n## Community Providers\n\nExtend LangExtract with custom model providers! Check out our [Community Provider Plugins](COMMUNITY_PROVIDERS.md) registry to discover providers created by the community or add your own.\n\nFor detailed instructions on creating a provider plugin, see the [Custom Provider Plugin Example](examples\u002Fcustom_provider_plugin\u002F).\n\n## Contributing\n\nContributions are welcome! 
See [CONTRIBUTING.md](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) to get started\nwith development, testing, and pull requests. You must sign a\n[Contributor License Agreement](https:\u002F\u002Fcla.developers.google.com\u002Fabout)\nbefore submitting patches.\n\n\n\n## Testing\n\nTo run tests locally from the source:\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract.git\ncd langextract\n\n# Install with test dependencies\npip install -e \".[test]\"\n\n# Run all tests\npytest tests\n```\n\nOr reproduce the full CI matrix locally with tox:\n\n```bash\ntox  # runs pylint + pytest on Python 3.10 and 3.11\n```\n\n### Ollama Integration Testing\n\nIf you have Ollama installed locally, you can run integration tests:\n\n```bash\n# Test Ollama integration (requires Ollama running with gemma2:2b model)\ntox -e ollama-integration\n```\n\nThis test will automatically detect if Ollama is available and run real inference tests.\n\n## Development\n\n### Code Formatting\n\nThis project uses automated formatting tools to maintain consistent code style:\n\n```bash\n# Auto-format all code\n.\u002Fautoformat.sh\n\n# Or run formatters separately\nisort langextract tests --profile google --line-length 80\npyink langextract tests --config pyproject.toml\n```\n\n### Pre-commit Hooks\n\nFor automatic formatting checks:\n```bash\npre-commit install  # One-time setup\npre-commit run --all-files  # Manual run\n```\n\n### Linting\n\nRun linting before submitting PRs:\n\n```bash\npylint --rcfile=.pylintrc langextract tests\n```\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for full development guidelines.\n\n## Disclaimer\n\nThis is not an officially supported Google product. If you use\nLangExtract in production or publications, please cite accordingly and\nacknowledge usage. 
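Use is subject to the [Apache 2.0 License](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Flangextract\u002Fblob\u002Fmain\u002FLICENSE).\nFor health-related applications, use of LangExtract is also subject to the\n[Health AI Developer Foundations Terms of Use](https:\u002F\u002Fdevelopers.google.com\u002Fhealth-ai-developer-foundations\u002Fterms).\n\n---\n\n**Happy Extracting!**\n","LangExtract is a Python library for extracting structured information from unstructured text; it uses large language models (LLMs) and provides precise source grounding and interactive visualization. Its core features include information extraction driven by user-defined instructions, precise source grounding that keeps the extracted data consistent with the source text, and support for processing long documents. It can also generate interactive HTML files for reviewing extraction results, and it works with a range of LLMs, both cloud-hosted and local open-source models. The tool is well suited to scenarios that require efficient, accurate extraction of key information from large volumes of text, such as medical report analysis and legal document processing.",2,"2026-06-11 02:44:25","top_all"]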