[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80048":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},80048,"HTML2Obsidian","MaiHongPhong1902\u002FHTML2Obsidian","MaiHongPhong1902",null,"Python",71,3,22,0,1,8,37.11,false,"master",true,[],"2026-06-12 04:01:26","# HTML2Obsidian\n\nFetch any URL and produce structured [Obsidian](https:\u002F\u002Fobsidian.md\u002F) Markdown notes — ready for LLM tool calling and knowledge-graph building.\n\n---\n\n## Features\n\n| | |\n|---|---|\n| **YAML frontmatter** | title, url, domain, smart tags, entities, date |\n| **[[WikiLinks]]** | auto-extracted named entities as Obsidian graph nodes |\n| **🏗️ Page Structure** | layout sections table (`\u003Cheader>`, `\u003Cnav>`, `\u003Cmain>`, …) |\n| **🖱️ Interactive Elements** | buttons, inputs, forms, nav links with CSS selectors |\n| **🤖 Agent Context** | compact snapshot with page type, likely actions, priority links, and key controls |\n| **🌲 Site Tree Map** | optional dedicated note with hierarchical URL tree, flat table, or both |\n| **🏷️ Smart Auto Tags** | combines domain hints, metadata keywords, URL structure, and page signals |\n| **📑 Split-note mode** | one `.md` per page region in `vault\u002F{Title}\u002F` subfolder |\n| **SPA support** | Playwright renders React \u002F Vue \u002F Next.js before extraction |\n| **Low-code \u002F no-code forms** | detects 
Form.io and OutSystems rendered forms, field keys, labels, validation, and embedded schemas |\n| **YouTube** | channel, views, likes, duration, related videos — no API key |\n| **Browser profile** | reuse Chrome \u002F Edge \u002F Firefox cookies and login sessions |\n| **LLM summarisation** | optional small-model pre-summary (Ollama \u002F OpenAI-compatible) |\n| **🔍 DOM Index** | per-fetch semantic index: headings, tables, code blocks, lists, images, key-values, forms |\n| **⚡ Element Query** | `query_page_elements()` — CSS selector queries + `QUERY_SCHEMA` for direct LLM tool calling |\n| **🌐 Browser Context** | XHR\u002Ffetch capture, page metrics, embedded JSON globals, JSON-LD, lazy-image resolution |\n| **⏳ wait_for_selector** | pause DOM snapshot until a specific element appears — ensures async content is captured |\n\n---\n\n## Installation\n\n```bash\npip install -r requirements.txt\n\n# Playwright browsers (only needed when render_js=True)\nplaywright install chromium\n\n# Optional: spaCy NER for richer WikiLinks\npython -m spacy download en_core_web_sm\n```\n\n---\n\n## CLI\n\n```bash\n# Static page (fast, no browser)\npython note.py --vault .\u002Fvault --no-js https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPython\n\n# JS-rendered SPA\npython note.py --vault .\u002Fvault https:\u002F\u002Fgithub.com\u002Fowner\u002Frepo\n\n# Low-code \u002F no-code rendered forms (Form.io, OutSystems)\npython note.py --vault .\u002Fvault https:\u002F\u002Fexample.com\u002Fruntime-form\n\n# Split into sub-notes per page section\npython note.py --vault .\u002Fvault --split https:\u002F\u002Fdocs.github.com\u002Fen\n\n# Generate a dedicated site tree map (--sitemap is also accepted)\npython note.py --vault .\u002Fvault --site-map --site-map-style both https:\u002F\u002Fdocs.github.com\u002Fen\n\n# Reuse browser profile (logged-in cookies)\npython note.py --vault .\u002Fvault --profile chrome https:\u002F\u002Fmail.google.com\n\n# Use a specific Chrome \u002F Edge 
profile\npython note.py --vault .\u002Fvault --profile \"chrome:Default\" https:\u002F\u002Fmail.google.com\npython note.py --vault .\u002Fvault --profile \"edge:Profile 1\" https:\u002F\u002Fexample.com\n\n# Load or save Playwright cookies\u002FlocalStorage state\npython note.py --vault .\u002Fvault --cookies .\u002Fauth-state.json https:\u002F\u002Fexample.com\npython note.py --vault .\u002Fvault --headed --auth-wait 60 --save-cookies .\u002Fauth-state.json https:\u002F\u002Fexample.com\n\n# If .\u002Fauth-state.json exists, it is loaded automatically\npython note.py --vault .\u002Fvault https:\u002F\u002Fexample.com\u002Fprivate\n\n# Custom title + extra tags\npython note.py --vault .\u002Fvault --title \"My Note\" --tags research ai https:\u002F\u002Fexample.com\n\n# YouTube video\npython note.py --vault .\u002Fvault \"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VIDEO_ID\"\n\n# Print to stdout (no vault)\npython note.py https:\u002F\u002Fexample.com\n```\n\n### CLI Reference\n\n| Argument | Description |\n|---|---|\n| `url` | URL to fetch (positional, last) |\n| `-o`, `--vault DIR` | Obsidian vault directory. 
Omit to print to stdout |\n| `-t`, `--title TITLE` | Custom note title (auto-detected if omitted) |\n| `--no-js` | Skip Playwright — faster for static pages |\n| `--tags TAG …` | Extra frontmatter tags |\n| `--profile PROFILE` | Browser profile: `chrome` \\| `edge` \\| `firefox` \\| `\u002Fabs\u002Fpath` |\n| `--browser-channel CHANNEL` | Browser channel: `chrome`, `msedge`; optional override |\n| `--headed` | Launch visible browser window for login\u002Fcookie refresh |\n| `--auth-wait SECONDS` | Keep headed browser open before snapshot\u002Fsave so you can finish login |\n| `--cookies FILE` | Load Playwright `storage_state` JSON cookies\u002FlocalStorage |\n| `--no-auto-cookies` | Disable automatic loading from saved storage files |\n| `--save-cookies FILE` | Save Playwright `storage_state` JSON after fetch |\n| `--split` | Split note into sub-notes by page section |\n| `--site-map`, `--sitemap` | Generate a dedicated site map note |\n| `--site-map-style STYLE` | Site map rendering: `tree` \\| `table` \\| `both` |\n| `--site-map-depth N` | Maximum URL depth to expand in tree mode |\n| `--site-map-links N` | Maximum internal links to include in the site map |\n| `--site-map-external-links N` | Maximum external links to include in the site map |\n\n---\n\n## Split-note mode\n\n`--split` saves multiple linked files into `vault\u002F{Title}\u002F` instead of a single flat note:\n\n```\nvault\u002F\n└── GitHub Docs\u002F\n    ├── GitHub Docs.md                        ← index + relationships\n    ├── GitHub Docs - Navigation.md           ← \u003Cnav> section\n    ├── GitHub Docs - Main.md                 ← \u003Cmain> section (full content)\n    ├── GitHub Docs - Footer.md               ← \u003Cfooter> section\n    └── GitHub Docs - Interactive Elements.md ← all buttons \u002F inputs \u002F forms\n```\n\nFile names follow `{Title} - {Section}` so WikiLinks are **unique across the entire vault** — even when multiple sites share section names like \"Navigation\" or 
\"Footer\".\n\nSub-notes link back to their parent:\n\n```yaml\n---\ntitle: GitHub Docs - Navigation\nparent: \"[[GitHub Docs]]\"\nsection_tag: nav\npage: \"https:\u002F\u002Fdocs.github.com\u002Fen\"\ntags:\n  - web-section\n---\n```\n\n---\n\n## Python API\n\nSee [API-docs.md](API-docs.md) for full reference: parameters, return values, `browser_context` fields, `query_page_elements`, `TOOL_SCHEMA`, `QUERY_SCHEMA`, `BrowserPipeline`, DOM index, and LLM summarisation.\n\n```python\nfrom tools import create_obsidian_note, TOOL_SCHEMA\nfrom tools import query_page_elements, QUERY_SCHEMA\n\n# Save note to vault\nresult = create_obsidian_note(url=\"https:\u002F\u002Fexample.com\", vault_path=\".\u002Fmy-vault\")\nprint(result[\"path\"])\n\n# Capture browser context (XHR, metrics, JSON-LD, DOM index)\nresult = create_obsidian_note(\n    url=\"https:\u002F\u002Fshop.example.com\u002Fproduct\u002F123\",\n    capture_network=True,\n    wait_for_selector=\".product-price\",\n)\nprint(result[\"browser_context\"][\"dom_index\"][\"headings\"])\n\n# Targeted element query (LLM-friendly)\nresult = query_page_elements(\n    url=\"https:\u002F\u002Fexample.com\",\n    queries={\"title\": \"h1\", \"price\": \".product-price\"},\n)\nprint(result[\"results\"])\n```\n\n---\n\n## LLM Tool Calling\n\nTwo tool schemas — pass directly to OpenAI \u002F Anthropic \u002F Ollama.\n\n```python\nfrom tools import TOOL_SCHEMA, QUERY_SCHEMA\n```\n\n- **`TOOL_SCHEMA`** — full note creation via `create_obsidian_note()`\n- **`QUERY_SCHEMA`** — targeted element queries via `query_page_elements()`\n\nSee [API-docs.md → LLM Tool Calling](API-docs.md#llm-tool-calling) for complete examples.\n\n---\n\n## Browser Profile \u002F Cookies (Authenticated Pages)\n\nReuse an existing browser so the tool can access login-required pages:\n\n```python\nresult = create_obsidian_note(\n    url=\"https:\u002F\u002Fgithub.com\u002Fnotifications\",\n    vault_path=\".\u002Fvault\",\n    
browser_profile=\"chrome:Default\",\n)\n```\n\n| Shortcut | Profile directory |\n|---|---|\n| `chrome` | `%LOCALAPPDATA%\\Google\\Chrome\\User Data` |\n| `chrome-dev` | `%LOCALAPPDATA%\\Google\\Chrome Dev\\User Data` |\n| `edge` | `%LOCALAPPDATA%\\Microsoft\\Edge\\User Data` |\n| `firefox` | `%APPDATA%\\Mozilla\\Firefox\\Profiles\\*.default*` |\n\nYou can append a profile directory name, for example `chrome:Default` or `edge:Profile 1`. You can also pass an absolute Chromium user-data directory, or a profile directory such as `...\\User Data\\Default`.\n\nFor repeatable authenticated scraping, export cookies\u002FlocalStorage once:\n\n```bash\npython note.py --headed --auth-wait 60 --save-cookies .\u002Fauth-state.json https:\u002F\u002Fexample.com\u002Flogin\npython note.py --vault .\u002Fvault https:\u002F\u002Fexample.com\u002Fprivate\n```\n\nWhen `--cookies` is omitted, the tool automatically checks `HTML2OBSIDIAN_STORAGE_STATE`, `.\u002Fauth-state.json`, `.\u002F.auth-state.json`, then `.\u002F.html2obsidian\u002Fauth-state.json`. 
Use `--no-auto-cookies` to force a clean browser context.\n\n> **Note:** Close Chrome \u002F Edge before running — only one process can hold a profile lock at a time.\n\n---\n\n## Interactive Elements\n\nEvery note includes a `## 🖱️ Interactive Elements` section with CSS selectors ready for Playwright automation:\n\n```markdown\n**Navigation links:**\n| Label | href | selector |\n|-------|------|----------|\n| [[Explore]] | `\u002Fexplore` | `nav a[href=\"\u002Fexplore\"]` |\n\n**Buttons:**\n| Label | tag | type | id | selector |\n|-------|-----|------|----|----------|\n| [[Sign up]] | `a` | `—` | `—` | `a[href=\"\u002Fsignup\"].btn` |\n\n**Input fields:**\n| type | name \u002F id | placeholder | required | selector |\n|------|-----------|-------------|----------|----------|\n| `text` | `q` | Search | | `input[name=\"q\"][type=\"text\"]` |\n```\n\n---\n\n## Optional: LLM Summarisation\n\nPre-summarise content with a small local model (Ollama \u002F OpenAI-compatible) before passing to your main LLM. 
See [API-docs.md → LLM Summarisation](API-docs.md#llm-summarisation) for configuration details.\n\n---\n\n## Package Structure\n\n```\ntools\u002F\n├── __init__.py            # Exports: create_obsidian_note, TOOL_SCHEMA, query_page_elements, QUERY_SCHEMA, ObsidianNote\n├── obsidian_tool.py       # Tool entry point + TOOL_SCHEMA + query_page_elements + QUERY_SCHEMA\n├── obsidian_formatter.py  # PipelineResult → ObsidianNote \u002F split sub-notes\n├── pipeline.py            # fetch → extract → clean → summarize; exposes browser_context\n├── fetcher.py             # Playwright (JS \u002F profile \u002F dom_index \u002F network capture) or httpx (static)\n├── extractor.py           # Layout, interactive elements, metadata, links, YouTubeExtractor\n├── cleaner.py             # HTML → clean Markdown\n└── summarizer.py          # Optional small-LLM pre-summarisation\n```\n\n---\n\n## Dependencies\n\n| Library | Purpose |\n|---|---|\n| `httpx` | Static HTTP fetching |\n| `playwright` | JS rendering + browser profile |\n| `beautifulsoup4` + `lxml` | HTML parsing |\n| `trafilatura` | Main article extraction |\n| `markitdown` | HTML \u002F PDF \u002F DOCX → Markdown |\n| `spacy` *(optional)* | NER for richer WikiLinks |\n","HTML2Obsidian is a tool that fetches web pages and generates structured Obsidian Markdown notes, suited to knowledge-graph building and large language model tool calling. Its core features include automatic YAML frontmatter generation, automatic extraction of named entities as Obsidian graph nodes, rendering support for single-page applications (such as React, Vue, and Next.js), smart tag generation, page structure parsing, and interactive element detection. It is particularly well suited to scenarios that require quickly extracting key information from web pages and organizing it for archiving, such as collecting research material, taking online study notes, or managing internal company knowledge. The tool also provides low-code\u002Fno-code form support, YouTube video data scraping, and browser context reuse, which increase its flexibility and usefulness across different scenarios.",2,"2026-06-11 03:59:02","CREATED_QUERY"]