[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9769":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":39,"readmeContent":40,"aiSummary":41,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":42,"discoverSource":43},9769,"llm-scraper","mishushakov\u002Fllm-scraper","mishushakov","Turn any webpage into structured data using LLMs","",null,"TypeScript",6808,452,34,3,0,2,36,433,11,38.97,"MIT License",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38],"ai","artificial-intelligence","browser","browser-automation","gpt","gpt-4","langchain","llama","llm","openai","playwright","puppeteer","scraper","2026-06-12 02:02:12","# LLM Scraper\n\n\u003Cimg width=\"1800\" alt=\"Screenshot 2024-04-20 at 23 11 16\" src=\"https:\u002F\u002Fgithub.com\u002Fmishushakov\u002Fllm-scraper\u002Fassets\u002F10400064\u002Fab00e048-a9ff-43b6-81d5-2e58090e2e65\">\n\nLLM Scraper is a TypeScript library that allows you to extract structured data from **any** webpage using LLMs.\n\n> [!IMPORTANT]\n> **LLM Scraper was updated to version 2.0.**\n>\n> The new version comes with **Vercel AI SDK 6** support and updated examples.\n\n### Features\n\n- Supports GPT, Sonnet, Gemini, Llama, Qwen model series\n- Schemas defined with Zod or JSON Schema\n- Full type-safety with TypeScript\n- Based on Playwright framework\n- Streaming objects\n- [Code-generation](#code-generation)\n- Supports 6 formatting modes:\n  - `html` for loading pre-processed HTML\n  - `raw_html` for loading raw HTML (no 
processing)\n  - `markdown` for loading markdown\n  - `text` for loading extracted text (using [Readability.js](https:\u002F\u002Fgithub.com\u002Fmozilla\u002Freadability))\n  - `image` for loading a screenshot (multi-modal only)\n  - `custom` for loading custom content (using a custom function)\n\n**Make sure to give it a star!**\n\n\u003Cimg width=\"165\" alt=\"Screenshot 2024-04-20 at 22 13 32\" src=\"https:\u002F\u002Fgithub.com\u002Fmishushakov\u002Fllm-scraper\u002Fassets\u002F10400064\u002F11e2a79f-a835-48c4-9f85-5c104ca7bb49\">\n\n## Getting started\n\n1. Install the required dependencies from npm:\n\n   ```\n   npm i zod playwright llm-scraper\n   ```\n\n2. Initialize your LLM:\n\n   **OpenAI**\n\n   ```\n   npm i @ai-sdk\u002Fopenai\n   ```\n\n   ```js\n   import { openai } from '@ai-sdk\u002Fopenai'\n\n   const llm = openai('gpt-4o')\n   ```\n\n   **Anthropic**\n\n   ```\n   npm i @ai-sdk\u002Fanthropic\n   ```\n\n   ```js\n   import { anthropic } from '@ai-sdk\u002Fanthropic'\n\n   const llm = anthropic('claude-3-5-sonnet-20240620')\n   ```\n\n   **Google**\n\n   ```\n   npm i @ai-sdk\u002Fgoogle\n   ```\n\n   ```js\n   import { google } from '@ai-sdk\u002Fgoogle'\n\n   const llm = google('gemini-1.5-flash')\n   ```\n\n   **Groq**\n\n   ```\n   npm i @ai-sdk\u002Fopenai\n   ```\n\n   ```js\n   import { createOpenAI } from '@ai-sdk\u002Fopenai'\n   const groq = createOpenAI({\n     baseURL: 'https:\u002F\u002Fapi.groq.com\u002Fopenai\u002Fv1',\n     apiKey: process.env.GROQ_API_KEY,\n   })\n\n   const llm = groq('llama3-8b-8192')\n   ```\n\n   **Ollama**\n\n   ```\n   npm i ollama-ai-provider-v2\n   ```\n\n   ```js\n   import { ollama } from 'ollama-ai-provider-v2'\n\n   const llm = ollama('llama3')\n   ```\n\n3. 
Create a new scraper instance provided with the llm:\n\n   ```js\n   import LLMScraper from 'llm-scraper'\n\n   const scraper = new LLMScraper(llm)\n   ```\n\n## Example\n\nIn this example, we're extracting top stories from HackerNews:\n\n```ts\nimport { chromium } from 'playwright'\nimport { z } from 'zod'\nimport { Output } from 'ai'\nimport { openai } from '@ai-sdk\u002Fopenai'\nimport LLMScraper from 'llm-scraper'\n\n\u002F\u002F Launch a browser instance\nconst browser = await chromium.launch()\n\n\u002F\u002F Initialize LLM provider\nconst llm = openai('gpt-4o')\n\n\u002F\u002F Create a new LLMScraper\nconst scraper = new LLMScraper(llm)\n\n\u002F\u002F Open new page\nconst page = await browser.newPage()\nawait page.goto('https:\u002F\u002Fnews.ycombinator.com')\n\n\u002F\u002F Define schema to extract contents into\nconst schema = z.object({\n  top: z\n    .array(\n      z.object({\n        title: z.string(),\n        points: z.number(),\n        by: z.string(),\n        commentsURL: z.string(),\n      })\n    )\n    .length(5)\n    .describe('Top 5 stories on Hacker News'),\n})\n\n\u002F\u002F Run the scraper\nconst { data } = await scraper.run(page, Output.object({ schema }), {\n  format: 'html',\n})\n\n\u002F\u002F Show the result from LLM\nconsole.log(data.top)\n\nawait page.close()\nawait browser.close()\n```\n\nOutput\n\n```js\n[\n  {\n    title: \"Palette lighting tricks on the Nintendo 64\",\n    points: 105,\n    by: \"ibobev\",\n    commentsURL: \"https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=44014587\",\n  },\n  {\n    title: \"Push Ifs Up and Fors Down\",\n    points: 187,\n    by: \"goranmoomin\",\n    commentsURL: \"https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=44013157\",\n  },\n  {\n    title: \"JavaScript's New Superpower: Explicit Resource Management\",\n    points: 225,\n    by: \"olalonde\",\n    commentsURL: \"https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=44012227\",\n  },\n  {\n    title: \"\\\"We would be less 
confidential than Google\\\" Proton threatens to quit Switzerland\",\n    points: 65,\n    by: \"taubek\",\n    commentsURL: \"https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=44014808\",\n  },\n  {\n    title: \"OBNC – Oberon-07 Compiler\",\n    points: 37,\n    by: \"AlexeyBrin\",\n    commentsURL: \"https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=44013671\",\n  }\n]\n```\n\nMore examples can be found in the [examples](.\u002Fexamples) folder.\n\n## Streaming\n\nReplace the `run` call with `stream` to get a partial object stream.\n\n```ts\n\u002F\u002F Run the scraper in streaming mode\nconst { stream } = await scraper.stream(page, Output.object({ schema }))\n\n\u002F\u002F Stream the result from LLM\nfor await (const data of stream) {\n  console.log(data.top)\n}\n```\n\n## Code-generation\n\nUsing the `generate` function, you can generate a reusable Playwright script that scrapes the contents according to a schema.\n\n```ts\n\u002F\u002F Generate code and run it on the page\nconst { code } = await scraper.generate(page, Output.object({ schema }))\nconst result = await page.evaluate(code)\nconst data = schema.parse(result)\n\n\u002F\u002F Show the parsed result\nconsole.log(data.top)\n```\n\n## Contributing\n\nAs an open-source project, we welcome contributions from the community. If you encounter bugs or want to suggest improvements, please feel free to open an issue or pull request.\n","LLM Scraper is a TypeScript library for extracting structured data from any webpage. It supports multiple language models (such as the GPT, Sonnet, Gemini, Llama, and Qwen series) and lets users define data schemas with Zod or JSON Schema, ensuring type safety. Built on the Playwright framework, LLM Scraper provides six formatting modes (HTML, raw HTML, Markdown, text, image, and custom content loading) to suit different scraping scenarios. It also supports streaming objects and code generation, making it suitable for developers and researchers who need to extract information from webpages efficiently and flexibly.","2026-06-11 03:24:40","top_topic"]