[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-5623":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":43,"readmeContent":44,"aiSummary":45,"trendingCount":16,"starSnapshotCount":16,"syncStatus":46,"lastSyncTime":47,"discoverSource":48},5623,"lance","lance-format\u002Flance","lance-format","Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..","https:\u002F\u002Flance.org",null,"Rust",6623,706,51,903,0,9,44,212,34,112.55,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42],"apache-arrow","computer-vision","data-analysis","data-analytics","data-centric","data-format","data-science","dataops","deep-learning","duckdb","embeddings","llms","machine-learning","mlops","python","rust","2026-06-12 04:00:26","\u003Cdiv align=\"center\">\n\u003Cp align=\"center\">\n\n\u003Cimg width=\"257\" alt=\"Lance Logo\" src=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F917119\u002F199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png\">\n\n**The Open Lakehouse Format for Multimodal AI**\u003Cbr\u002F>\n**High-performance vector search, full-text search, random access, and feature engineering capabilities for the lakehouse.**\u003Cbr\u002F>\n**Compatible with Pandas, DuckDB, Polars, PyArrow, Ray, Spark, and more integrations on the way.**\n\n\u003Ca 
href=\"https:\u002F\u002Flance.org\">Documentation\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Flance.org\u002Fcommunity\">Community\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002Flance\">Discord\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fgroups.google.com\u002Fa\u002Flance.org\u002Fg\u002Fdev\">Mailing List\u003C\u002Fa>\n\n[CI]: https:\u002F\u002Fgithub.com\u002Flance-format\u002Flance\u002Factions\u002Fworkflows\u002Frust.yml\n[CI Badge]: https:\u002F\u002Fgithub.com\u002Flance-format\u002Flance\u002Factions\u002Fworkflows\u002Frust.yml\u002Fbadge.svg\n[Docs]: https:\u002F\u002Flance.org\n[Docs Badge]: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-passing-brightgreen\n[crates.io]: https:\u002F\u002Fcrates.io\u002Fcrates\u002Flance\n[crates.io badge]: https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fv\u002Flance.svg\n[Python versions]: https:\u002F\u002Fpypi.org\u002Fproject\u002Fpylance\u002F\n[Python versions badge]: https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fpylance\n\n[![CI Badge]][CI]\n[![Docs Badge]][Docs]\n[![crates.io badge]][crates.io]\n[![Python versions badge]][Python versions]\n\n\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Chr \u002F>\n\nLance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows. Lance is perfect for:\n\n1. Building search engines and feature stores with hybrid search capabilities.\n2. Large-scale ML training requiring high performance IO and random access.\n3. 
Storing, querying, and managing multimodal data including images, videos, audio, text, and embeddings.\n\nThe key features of Lance include:\n\n* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.\n\n* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.\n\n* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.\n\n* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.\n\n* **Zero-copy versioning:** Automatic versioning with ACID transactions, time travel, tags, and branches—no extra infrastructure needed.\n\n* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).\n\nFor more details, see the full [Lance format specification](https:\u002F\u002Flance.org\u002Fformat).\n\n> [!TIP]\n> Lance is in active development and we welcome contributions. Please see our [contributing guide](https:\u002F\u002Flance.org\u002Fcommunity\u002Fcontributing\u002F) for more information.\n\n## Quick Start\n\n**Installation**\n\n```shell\npip install pylance\n```\n\nTo install a preview release:\n\n```shell\npip install --pre --extra-index-url https:\u002F\u002Fpypi.fury.io\u002Flance-format\u002Fpylance\n```\n\n> [!TIP]\n> Preview releases are released more often than full releases and contain the\n> latest features and bug fixes. They receive the same level of testing as full releases.\n> We guarantee they will remain published and available for download for at\n> least 6 months. 
When you want to pin to a specific version, prefer a stable release.\n\n**Converting to Lance**\n\n```python\nimport lance\n\nimport pandas as pd\nimport pyarrow as pa\nimport pyarrow.dataset\n\ndf = pd.DataFrame({\"a\": [5], \"b\": [10]})\nuri = \"\u002Ftmp\u002Ftest.parquet\"\ntbl = pa.Table.from_pandas(df)\npa.dataset.write_dataset(tbl, uri, format='parquet')\n\nparquet = pa.dataset.dataset(uri, format='parquet')\nlance.write_dataset(parquet, \"\u002Ftmp\u002Ftest.lance\")\n```\n\n**Reading Lance data**\n```python\ndataset = lance.dataset(\"\u002Ftmp\u002Ftest.lance\")\nassert isinstance(dataset, pa.dataset.Dataset)\n```\n\n**Pandas**\n```python\ndf = dataset.to_table().to_pandas()\ndf\n```\n\n**DuckDB**\n```python\nimport duckdb\n\n# If this segfaults, make sure you have duckdb v0.7+ installed\nduckdb.query(\"SELECT * FROM dataset LIMIT 10\").to_df()\n```\n\n**Vector search**\n\nDownload the sift1m subset\n\n```shell\nwget ftp:\u002F\u002Fftp.irisa.fr\u002Flocal\u002Ftexmex\u002Fcorpus\u002Fsift.tar.gz\ntar -xzf sift.tar.gz\n```\n\nConvert it to Lance\n\n```python\nimport lance\nfrom lance.vector import vec_to_table\nimport numpy as np\nimport struct\n\nnvecs = 1000000\nndims = 128\nwith open(\"sift\u002Fsift_base.fvecs\", mode=\"rb\") as fobj:\n    buf = fobj.read()\n    data = np.array(struct.unpack(\"\u003C128000000f\", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))\n    dd = dict(zip(range(nvecs), data))\n\ntable = vec_to_table(dd)\nuri = \"vec_data.lance\"\nsift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)\n```\n\nBuild the index\n\n```python\nsift1m.create_index(\"vector\",\n                    index_type=\"IVF_PQ\",\n                    num_partitions=256,  # IVF\n                    num_sub_vectors=16)  # PQ\n```\n\nSearch the dataset\n\n```python\n# Get top 10 similar vectors\nimport duckdb\n\ndataset = lance.dataset(uri)\n\n# Sample 100 query vectors. 
If this segfaults, make sure you have duckdb v0.7+ installed\nsample = duckdb.query(\"SELECT vector FROM dataset USING SAMPLE 100\").to_df()\nquery_vectors = np.array([np.array(x) for x in sample.vector])\n\n# Get nearest neighbors for all of them\nrs = [dataset.to_table(nearest={\"column\": \"vector\", \"k\": 10, \"q\": q})\n      for q in query_vectors]\n```\n\n## Directory structure\n\n| Directory          | Description              |\n|--------------------|--------------------------|\n| [rust](.\u002Frust)     | Core Rust implementation |\n| [python](.\u002Fpython) | Python bindings (PyO3)   |\n| [java](.\u002Fjava)     | Java bindings (JNI)      |\n| [docs](.\u002Fdocs)     | Documentation source     |\n\n## Benchmarks\n\n### Vector search\n\nWe benchmarked vector search on the SIFT dataset, which contains 1M vectors of 128 dimensions.\n\n1. For 100 randomly sampled query vectors, we get \u003C1ms average response time (on a 2023 M2 MacBook Air)\n\n![avg_latency.png](docs\u002Fsrc\u002Fimages\u002Favg_latency.png)\n\n2. ANN search is always a trade-off between recall and performance\n\n![recall_vs_latency.png](docs\u002Fsrc\u002Fimages\u002Frecall_vs_latency.png)\n\n### Vs. Parquet\n\nWe created a Lance dataset from the Oxford Pet dataset for preliminary performance testing of Lance against Parquet and raw image\u002FXML files. For analytics queries, Lance is 50-100x better than reading the raw metadata. 
For batched random access, Lance is 100x better than both Parquet and raw files.\n\n![](docs\u002Fsrc\u002Fimages\u002Flance_perf.png)\n\n## Why Lance for AI\u002FML workflows?\n\nThe machine learning development cycle involves multiple stages:\n\n```mermaid\ngraph LR\n    A[Collection] --> B[Exploration];\n    B --> C[Analytics];\n    C --> D[Feature Engineering];\n    D --> E[Training];\n    E --> F[Evaluation];\n    F --> C;\n    E --> G[Deployment];\n    G --> H[Monitoring];\n    H --> A;\n```\n\nTraditional lakehouse formats were designed for SQL analytics and struggle with AI\u002FML workloads that require:\n- **Vector search** for similarity and semantic retrieval\n- **Fast random access** for sampling and interactive exploration\n- **Multimodal data** storage (images, videos, audio alongside embeddings)\n- **Data evolution** for feature engineering without full table rewrites\n- **Hybrid search** combining vectors, full-text, and SQL predicates\n\nWhile existing formats (Parquet, Iceberg, Delta Lake) excel at SQL analytics, they require additional specialized systems for AI capabilities. 
Lance brings these AI-first features directly into the lakehouse format.\n\nA comparison of different formats across ML development stages:\n\n|                     | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |\n|---------------------|-------|---------------|------------|----------|----------|-----------|\n| Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast      |\n| Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good      |\n| Training            | Fast  | Decent        | Slow       | Fast     | N\u002FA      | N\u002FA       |\n| Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent    |\n| Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich      |\n\n","Lance is an open lakehouse format for multimodal AI that supports building a complete lakehouse on object storage to power AI workflows. Its core capabilities include efficient vector search, full-text search, random access, and feature engineering, with random access 100x faster than Parquet. Lance supports many data types (such as images, video, audio, text, and embeddings) and provides zero-copy versioning and automatic ACID transactions, making it well suited to building search engines, feature stores, and large-scale machine learning training that requires high-performance IO. Lance is also compatible with tools such as Pandas, DuckDB, Polars, and PyArrow, giving users a rich set of ecosystem integrations.",2,"2026-06-11 03:04:24","top_language"]