[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2783":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},2783,"pullmd","AeternaLabsHQ\u002Fpullmd","AeternaLabsHQ","Self-hosted URL- and file-to-Markdown service for humans and AI agents - web pages, documents, images, audio, YouTube. PWA + REST + MCP + Claude Code skill, Reddit-aware, refreshable share links.",null,"JavaScript",188,14,108,3,0,5,52,9,3.53,"GNU Affero General Public License v3.0",false,"main",true,[],"2026-06-12 02:00:43","# PullMD\n\n[![Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FAeternaLabsHQ\u002Fpullmd?color=ff5722)](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Freleases)\n[![Docker 
Pulls](https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fpulls\u002Faeternalabshq\u002Fpullmd?color=2496ed&logo=docker&logoColor=white)](https:\u002F\u002Fhub.docker.com\u002Fr\u002Faeternalabshq\u002Fpullmd)\n[![CI](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002FAeternaLabsHQ\u002Fpullmd\u002Fdocker.yml?branch=main&label=build)](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Factions\u002Fworkflows\u002Fdocker.yml)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-AGPL--3.0-blueviolet)](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fblob\u002Fmain\u002FLICENSE)\n[![MCP](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMCP-compatible-7c4dff?logo=anthropic&logoColor=white)](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd#mcp-server)\n\nSelf-hosted URL-to-Markdown service for humans and AI agents.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fscreenshot.png\" alt=\"PullMD web interface\" width=\"800\">\n\u003C\u002Fp>\n\nPullMD takes any web URL and returns clean, readable Markdown — no\nnavigation, no ads, no boilerplate. It auto-detects Reddit threads\n(with full comment trees), uses Cloudflare's native Markdown when\navailable, runs Mozilla Readability + Trafilatura on static HTML,\nand as a last resort renders JavaScript-heavy pages via headless\nChromium (Playwright sidecar) before extracting.\n\nIt ships as:\n\n- a **PWA frontend** with raw\u002Frendered view toggle, dark\u002Fpaper themes, history, archive, share links\n- a **REST API** at `GET \u002Fapi?url=…`\n- an **MCP server** at `POST \u002Fmcp` (Streamable-HTTP transport, stateless)\n- a **Claude Code skill** as a downloadable zip\n\nEvery conversion gets an 8-hex **share id** that works as a stable\nlive-endpoint: `GET \u002Fs\u002F:id` returns the cached markdown and\nre-fetches from the source if older than one hour. 
Use the share id\nas a fixed URL that always returns fresh content — useful for\nsubreddit feeds and similar.\n\n---\n\n## Quick start\n\nPre-built multi-arch images (`linux\u002Famd64`, `linux\u002Farm64`) live on Docker\nHub. Drop the compose file somewhere and run:\n\n```bash\nmkdir pullmd && cd pullmd\ncurl -O https:\u002F\u002Fraw.githubusercontent.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fmain\u002Fdocker-compose.yml\ndocker compose up -d\n# → http:\u002F\u002Flocalhost:3000\n```\n\nThat's it. No `.env` needed: every variable has a sensible default\nand PullMD listens on port `3000`. Add a `.env` next to the compose\nfile to override anything (see [Configuration](#configuration)).\n\n### `docker-compose.yml` (zero-config)\n\n```yaml\nservices:\n  pullmd:\n    image: aeternalabshq\u002Fpullmd:latest\n    container_name: pullmd\n    restart: unless-stopped\n    ports:\n      - \"${PORT:-3000}:3000\"\n    environment:\n      - PUBLIC_URL=${PUBLIC_URL:-http:\u002F\u002Flocalhost:${PORT:-3000}}\n      - TRAFILATURA_URL=http:\u002F\u002Ftrafilatura:8001\u002Fextract\n      - PLAYWRIGHT_URL=http:\u002F\u002Fplaywright:8002\u002Frender\n      - REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID:-}\n      - REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET:-}\n      - REDDIT_USER_AGENT=${REDDIT_USER_AGENT:-}\n    volumes:\n      - .\u002Fdata:\u002Fdata\n    networks:\n      - pullmd-internal\n    depends_on:\n      - trafilatura\n      - playwright\n\n  trafilatura:\n    image: aeternalabshq\u002Fpullmd-trafilatura:latest\n    container_name: pullmd-trafilatura\n    restart: unless-stopped\n    networks:\n      - pullmd-internal\n\n  playwright:\n    image: aeternalabshq\u002Fpullmd-playwright:latest\n    container_name: pullmd-playwright\n    restart: unless-stopped\n    networks:\n      - pullmd-internal\n\nnetworks:\n  pullmd-internal:\n    driver: bridge\n```\n\n> **Note:** the Playwright sidecar adds **~3.7 GB** to your image cache\n> (Chromium + Firefox + WebKit binaries from the 
official Playwright\n> base image). It's optional — leave `PLAYWRIGHT_URL` unset and the\n> `playwright` service block off, and PullMD silently degrades to\n> static extraction with a fallback note in the metadata.\n\n> **Mirror on GHCR:** `ghcr.io\u002Faeternalabshq\u002F{pullmd,pullmd-trafilatura,pullmd-playwright}`.\n> Replace the `image:` lines if you prefer GitHub's registry.\n\n### Behind Traefik\n\nFor deployments behind Traefik with TLS, use `docker-compose.traefik.yml`\ninstead. Same images, but with Traefik labels and the `proxy` external\nnetwork. Set `HOST_DOMAIN` in `.env`:\n\n```bash\ncurl -O https:\u002F\u002Fraw.githubusercontent.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fmain\u002Fdocker-compose.traefik.yml\necho \"HOST_DOMAIN=pullmd.example.com\" > .env\ndocker compose -f docker-compose.traefik.yml up -d\n```\n\n### Local development (no Docker)\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd.git\ncd pullmd\nnpm install\nnpm start             # http:\u002F\u002Flocalhost:3000\nnpm test              # node --test\n```\n\n---\n\n## Configuration\n\nAll variables go in `.env` (copy from `.env.example`):\n\n| Variable               | Required | Purpose                                                                                              |\n| ---------------------- | -------- | ---------------------------------------------------------------------------------------------------- |\n| `HOST_DOMAIN`          | yes      | Public hostname without scheme. Used by Traefik routing and as fallback for `PUBLIC_URL`.           |\n| `PUBLIC_URL`           | no       | Full public origin embedded in `\u002Fhelp` and the skill zip. Defaults to `https:\u002F\u002F${HOST_DOMAIN}`.     |\n| `TRAFILATURA_URL`      | no       | URL of the Trafilatura sidecar's `\u002Fextract` endpoint. Unset → skip Trafilatura, Readability only.    |\n| `PLAYWRIGHT_URL`       | no       | URL of the Playwright sidecar's `\u002Frender` endpoint. 
Unset → skip Playwright fallback for JS pages.   |\n| `REDDIT_CLIENT_ID`     | no       | OAuth credentials for Reddit. Without them, PullMD uses the public JSON API (lower rate limit).     |\n| `REDDIT_CLIENT_SECRET` | no       |                                                                                                      |\n| `REDDIT_USER_AGENT`    | no       | Reddit requires a unique UA. Default: `PullMD\u002F1.0 (URL-to-Markdown service)`.                       |\n| `DISABLE_PUBLIC_HISTORY` | no     | When `true`, hides the global recent-conversions list and archive (`\u002Fapi\u002Fhistory` + `\u002Fapi\u002Farchive` return 403, frontend hides the section). `\u002Fs\u002F:id` share links keep working. Default: `false`. |\n| `PULLMD_USER_AGENT`    | no       | Pin a single outbound User-Agent for every web fetch. Disables rotation. Useful for CI or when one specific UA is known to work. |\n| `PULLMD_UA_FEED_URL`   | no       | URL of a JSON feed of current real-world UAs. Default: [WinFuture23\u002Freal-world-user-agents](https:\u002F\u002Fgithub.com\u002FWinFuture23\u002Freal-world-user-agents). Set to an empty string to disable live refresh and rely on the built-in seed pool. |\n| `PULLMD_AUTH_MODE`     | no       | `disabled` (default) \u002F `single-admin` \u002F `multi-user`. See \"Authentication\" below.                   |\n| `PULLMD_ADMIN_EMAIL`   | required when AUTH_MODE != disabled, on first startup | Bootstrap email for the first admin user.                            |\n| `PULLMD_ADMIN_PASSWORD` | required when AUTH_MODE != disabled, on first startup | Bootstrap password (min 8 chars).                                    |\n| `PULLMD_AUTH_TOKEN`    | no       | Legacy bearer token compat (single-admin mode only, deprecated).                                    |\n\n`PUBLIC_URL` matters for self-hosting: the help page and downloadable\nskill embed it as the canonical endpoint. 
Set it correctly and your\nusers get a copy-paste setup that points at *your* instance.\n\nPullMD rotates its outbound User-Agent for the web fetch path from a\npool of current desktop browsers, refreshed every 48 hours from a\n[live feed of real-world UAs](https:\u002F\u002Fgithub.com\u002FWinFuture23\u002Freal-world-user-agents)\nmaintained by [@WinFuture23](https:\u002F\u002Fgithub.com\u002FWinFuture23). A built-in\nseed pool ensures rotation works even when the feed is unreachable. Set\n`PULLMD_USER_AGENT` to pin a single UA, or `PULLMD_UA_FEED_URL` to point\nat your own feed. The Reddit path keeps its dedicated `REDDIT_USER_AGENT`\nbecause Reddit's API expects a stable, identifying UA.\n\n`DISABLE_PUBLIC_HISTORY=true` is the privacy switch for shared\ninstances (multi-tenant VPS, office deployments). Conversions still\nget cached and assigned share IDs; users just can't see what *other*\nusers have fetched. Anyone with a known `\u002Fs\u002F:id` link still gets\ntheir markdown back. Use this as a stopgap until per-user scoping\nlands.\n\n---\n\n## Authentication (v2.0+)\n\n> **Pulling v2.x:** Use the explicit `:2` tag (or `:2.0`, `:2.0.0`).\n> The `:latest` tag remains on v1.x for backward compatibility \n> until v2.x has stabilized in real-world deployments.\n> \n> ```yaml\n> services:\n>   pullmd:\n>     image: aeternalabshq\u002Fpullmd:2\n> ```\n\nPullMD ships with three auth modes. Pick one with `PULLMD_AUTH_MODE`:\n\n| Mode           | Behavior                                                                    |\n| -------------- | --------------------------------------------------------------------------- |\n| `disabled`     | Default. No auth, everything open. Existing v1.x behavior.                  |\n| `single-admin` | One user, credentials from env vars. No self-signup. For homelab.           |\n| `multi-user`   | Self-signup at `\u002Fsignup`, login at `\u002Flogin`, per-user data isolation.       
|\n\nIn `single-admin` and `multi-user` modes, `PULLMD_ADMIN_EMAIL` + `PULLMD_ADMIN_PASSWORD` bootstrap the first admin user on first startup. After that, changing these env vars does **not** change the password — use the admin CLI:\n\n```bash\ndocker compose exec pullmd node scripts\u002Fadmin.js reset-password you@example.com\n```\n\n### Auth boundary\n\n| Endpoint                                                         | Auth required (when mode != disabled) |\n| ---------------------------------------------------------------- | :-----------------------------------: |\n| `\u002F`, `\u002Fhelp`, static assets, `\u002Fweb-reader.zip`                   |                  no                  |\n| `\u002Flogin`, `\u002Fsignup`, `\u002Fapi\u002Fme` (auth surface)                    |                  no                  |\n| `\u002Fs\u002F:id` (share links)                                           |                  no                  |\n| `\u002Fapi`, `\u002Fapi\u002Fstream`                                            |                 yes                  |\n| `\u002Fmcp`                                                           |                 yes                  |\n| `\u002Fapi\u002Fhistory`, `\u002Fapi\u002Farchive`                                   |                 yes                  |\n| `\u002Fapi\u002Fcache\u002F:id`, `DELETE \u002Fapi\u002Fcache`                            |                 yes                  |\n| `\u002Fapi\u002Fstats`, `\u002Fapi\u002Fstorage`, `\u002Fapi\u002Fconfig` (aggregate)          |                  no                  |\n\n### Authentication paths\n\n1. **Session cookies** — `POST \u002Flogin` sets `pullmd_session` (`HttpOnly`, `SameSite=Lax`, `Secure` over HTTPS, 7-day TTL with sliding expiry). The PWA uses this automatically.\n2. **API keys** — generate at `\u002Fsettings`, send via `Authorization: Bearer pmd_\u003C32-char-base62>`. Stored as SHA-256 hashes; only shown once at creation.\n3. 
**Legacy `PULLMD_AUTH_TOKEN`** — deprecated. `single-admin` mode only. Maps to admin user. Kept for migration compatibility, removed in v3.0.\n\n### Migration from v1.x\n\nSee `MIGRATION.md` for the full upgrade checklist. The TL;DR: leave `PULLMD_AUTH_MODE` unset and v2.0 behaves exactly like v1.x.\n\n### OAuth 2.1 (claude.ai Web Connector)\n\nPullMD ships with a full OAuth 2.1 Authorization Code flow so the **claude.ai\nweb app's Custom Connector** feature can authenticate users against your\nPullMD instance. All endpoints needed by the spec are implemented: Dynamic\nClient Registration (RFC 7591), PKCE-S256 (RFC 7636), Authorization Server\nMetadata (RFC 8414), Protected Resource Metadata (RFC 9728), and Token\nRevocation (RFC 7009).\n\n**Setup:**\n\n1. Set `PULLMD_AUTH_MODE` to `single-admin` or `multi-user` (OAuth requires Phase-1 auth).\n2. Set `OAUTH_JWT_SECRET` to a 32+ character random string (`openssl rand -hex 32`).\n3. Set `PUBLIC_URL` to your instance's public origin (e.g. `https:\u002F\u002Fpullmd.example.com`).\n4. In claude.ai → Settings → Connectors → Add custom connector, point it at `https:\u002F\u002Fpullmd.example.com\u002Fmcp` — claude.ai discovers everything else automatically via the well-known endpoints.\n5. The first time the user clicks the connector, they'll be redirected to PullMD's `\u002Flogin`, then to a consent screen, then back to claude.ai.\n\n**Tokens:**\n- Access tokens are JWTs (HS256), TTL 1 hour, audience-bound to your `\u002Fmcp` URL.\n- Refresh tokens are opaque (`pmd_rt_…`), TTL 30 days, rotated on every refresh, with reuse-detection that invalidates the entire refresh chain on replay.\n- Revoke a token via `POST \u002Foauth\u002Frevoke` (RFC 7009).\n\n**Scope:** Currently a single `mcp:full` scope (URL conversion + history read). Granular scopes are tracked for a future minor release.\n\n**Issues #6 and #10** track this work and close on the v2.1.0 release.\n\n---\n\n## AI-agent integration\n\nThree install paths. 
Once your instance is running, `${PULLMD_URL}\u002Fhelp`\nshows the same boxes with your URL pre-filled. Replace `${PULLMD_URL}`\nbelow with your hostname (e.g. `https:\u002F\u002Fpullmd.example.com`).\n\n### 1. Universal prompt\n\nDrop into any chat agent (ChatGPT, Claude, Gemini, …):\n\n```\nWhen you need to read a web page, fetch via PullMD instead of raw HTML:\n\n  GET ${PULLMD_URL}\u002Fapi?url=\u003CURL>\n\nReturns clean Markdown (text\u002Fmarkdown). Optional query params:\n\n  comments=false        skip Reddit comments\n  comment_depth=N       comment nesting depth (default 3)\n  frontmatter=true      prepend YAML metadata block\n  format=text           strip Markdown, return plain text\n  nocache=true          bypass the 1h cache and refetch\n  render=force|skip     override the auto Playwright fallback\n  lang=de|en            language for the comments section header\n\nResponse headers worth checking:\n  X-Source       reddit | cloudflare | readability | playwright\n  X-Quality      0.0-1.0 extraction confidence\n  X-Share-Id     8-hex permalink, openable as \u002Fs\u002F\u003Cid>\n\nReddit URLs are auto-detected (incl. redd.it short links and \u002Fs\u002F shares).\nUse this whenever you would otherwise fetch raw HTML — the markdown is\nmuch cleaner and saves significant context window space.\n```\n\n### 2. Claude Code skill\n\n`web-reader.zip` is auto-built with your URL embedded:\n\n```bash\ncurl -O ${PULLMD_URL}\u002Fweb-reader.zip\nmkdir -p ~\u002F.claude\u002Fskills\nunzip web-reader.zip -d ~\u002F.claude\u002Fskills\u002F\n# Restart Claude Code; the skill activates on web-reading requests.\n```\n\n### 3. MCP server\n\nRemote MCP server at `${PULLMD_URL}\u002Fmcp` (Streamable-HTTP transport, stateless).\nThree tools: `read_url`, `get_share`, `list_recent`. 
Server-side updates reach\nevery client automatically — no local install needed.\n\n**Claude Code (CLI):**\n\n```bash\nclaude mcp add --transport http pullmd ${PULLMD_URL}\u002Fmcp\n```\n\n**Claude Desktop \u002F Cursor \u002F other MCP hosts — JSON config:**\n\n```json\n{\n  \"mcpServers\": {\n    \"pullmd\": {\n      \"type\": \"http\",\n      \"url\": \"${PULLMD_URL}\u002Fmcp\"\n    }\n  }\n}\n```\n\nOnce registered, the three tools surface natively in the agent — no prompt\ninstructions needed, the LLM picks them up via their schema descriptions.\n\n### MCP client compatibility (updated for v2.0)\n\n| Client          | Bearer (`Authorization: Bearer pmd_...`) | OAuth | Notes                                  |\n| --------------- | :--------------------------------------: | :---: | -------------------------------------- |\n| Claude Code CLI |                    ✅                    |   —   | Recommended. Generate a key at `\u002Fsettings`. |\n| Cursor          |                    ✅                    |   —   | Same as CLI.                           |\n| Claude Desktop  |                    ❌                    | (#6)  | UI lacks header field. Phase 2 OAuth.  |\n| claude.ai (web) |                    ❌                    | (#6)  | Web requires OAuth. Phase 2.           |\n\nFor Phase 1, Claude Desktop \u002F claude.ai users still need the OAuth\u002Fproxy workaround documented in [#10](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fissues\u002F10). Phase 2 (#6) layers OAuth on top of this user system.\n\n#### Claude Desktop limitation\n\nThe Claude Desktop \"Add custom connector\" UI accepts URL + OAuth\nClient ID\u002FSecret but no custom-header field. 
Additionally,\n`claude_desktop_config.json` entries with `\"type\": \"http\"` are silently\nrewritten to `{}` after Desktop launches (current Desktop only honors\nstdio servers in that file).\n\nUntil OAuth support lands (see [#6](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fissues\u002F6)),\nthe practical workaround for Claude Desktop users is a reverse proxy\nthat accepts the auth token as either a bearer header (for CLI) or as a\nURL path prefix (for Desktop, which has no header field).\n\n#### Caddy workaround for Claude Desktop\n\nContributed by [@WinFuture23](https:\u002F\u002Fgithub.com\u002FAeternaLabsHQ\u002Fpullmd\u002Fissues\u002F10):\n\n```caddy\n@bearer header Authorization \"Bearer {$AUTH_TOKEN}\"\nhandle @bearer { reverse_proxy pullmd:3000 }\n\n@token_path path \u002F{$AUTH_TOKEN}\u002F* \u002F{$AUTH_TOKEN}\nhandle @token_path {\n    uri strip_prefix \u002F{$AUTH_TOKEN}\n    reverse_proxy pullmd:3000\n}\n```\n\nThen in Claude Desktop's connector dialog, use the URL with the token\npath prefix: `https:\u002F\u002Fyour-instance.com\u002F\u003CTOKEN>\u002Fmcp`. CLI clients keep\nusing the `Authorization` header as normal.\n\nThis is a stopgap pattern; native OAuth (Phase 2) will remove the need\nfor it.\n\n---\n\n## API\n\n| Endpoint               | Returns                                                                          |\n| ---------------------- | -------------------------------------------------------------------------------- |\n| `GET \u002Fapi?url=…`       | Markdown (or JSON \u002F plain text via `format=`).                                   |\n| `GET \u002Fapi\u002Fstream?url=…`| Server-Sent Events stream of extraction-stage status, ending in a `result` event. Used by the PWA. |\n| `GET \u002Fs\u002F:id`           | Cached Markdown by share id; refreshes from source if > 1 h old.                 |\n| `GET \u002Fapi\u002Fhistory`     | Recent conversions (JSON).                                                       
|\n| `GET \u002Fapi\u002Farchive`     | Paginated full archive.                                                          |\n| `GET \u002Fapi\u002Fstorage`     | Cache size \u002F hit-rate stats.                                                     |\n| `GET \u002Fapi\u002Fstats`       | Extraction telemetry (sources, quality, latency).                                |\n| `POST \u002Fmcp`            | Streamable-HTTP MCP endpoint (3 tools: `read_url`, `get_share`, `list_recent`). |\n| `GET \u002Fweb-reader.zip`  | Claude Code skill bundle, with this instance's URL baked in.                     |\n| `GET \u002Fhelp`            | Bilingual user\u002Fagent setup guide.                                                |\n\n### `\u002Fapi` parameters\n\n| Param           | Default | Notes                                                                              |\n| --------------- | ------- | ---------------------------------------------------------------------------------- |\n| `url`           | —       | Required.                                                                          |\n| `comments`      | `true`  | Include Reddit comments. Ignored for non-Reddit URLs.                              |\n| `comment_depth` | `3`     | Max nesting depth (1–10).                                                          |\n| `comment_limit` | `15`    | Max top-level comments.                                                            |\n| `frontmatter`   | `false` | Prepend YAML metadata.                                                             |\n| `format`        | `md`    | `text` strips Markdown; `json` returns structured response.                        |\n| `nocache`       | `false` | Bypass the 1-hour cache.                                                           |\n| `render`        | auto    | `force` → always render via Playwright. `skip` → never render. Bypasses cache.     |\n| `lang`          | `de`    | Comments-section header language (`de` or `en`).       
                            |\n\n### Response headers\n\n- `X-Source` — `reddit` · `cloudflare` · `readability` · `readability-fallback` · `trafilatura` · `playwright`\n- `X-Quality` — `0.0`–`1.0` extraction confidence\n- `X-Share-Id` — the 8-hex permalink id\n\n---\n\n## Cache & TTLs\n\n- **`\u002Fapi?url=…`** re-fetches from source if the cache row is older than **1 hour**.\n- **`\u002Fs\u002F:id`** does the same on-demand refresh, so share links double as live endpoints.\n- Cache rows are pruned **90 days** after the last write. `\u002Fs\u002F:id` hits keep the row alive (since they trigger refresh + write); read-only access does not extend the TTL.\n- If the source is unreachable on refresh, the last good snapshot is served — share links keep working even when the original URL dies.\n\n---\n\n## Architecture\n\n- `server.js` — Express app factory (`createApp`) with dependency injection for tests. Exposes `\u002Fapi` and `\u002Fapi\u002Fstream` (SSE).\n- `lib\u002Freddit.js` — Reddit URL normalization, redirect resolution, post + comment extraction.\n- `lib\u002Fweb.js` — Orchestrator: Cloudflare-Markdown short-circuit, then static Readability + Trafilatura with `pickBest`, then optional Playwright re-render + re-extract on body-soup \u002F low-quality output.\n- `lib\u002Frender-decision.js` — Predicate that decides when to fall back to Playwright (readability-fellback + thin, body-soup signature, or quality \u003C 0.5; plus `force` \u002F `skip` overrides).\n- `lib\u002Fplaywright-client.js` — HTTP client for the Playwright sidecar with `AbortSignal` propagation for SSE-disconnect cancellation.\n- `lib\u002Fscoring.js` — Quality scoring used to pick between extractors and as a render-trigger heuristic.\n- `lib\u002Fcache.js` — SQLite cache (`better-sqlite3`) with 90-day TTL and 8-hex share ids.\n- `lib\u002Fmcp.js` — Stateless MCP server registering the three tools.\n- `lib\u002Fdistrib.js` — Public-URL substitution in `\u002Fhelp` and 
`\u002Fweb-reader.zip`.\n- `trafilatura-sidecar\u002F` — Python sidecar (FastAPI) wrapping Trafilatura.\n- `playwright-sidecar\u002F` — Python sidecar (FastAPI + Playwright + Chromium) for JS-rendered pages.\n- `public\u002F` — PWA frontend (vanilla JS, dark\u002Fpaper themes, service worker, EventSource client for `\u002Fapi\u002Fstream`).\n- `skill\u002Fweb-reader\u002F` — Claude Code skill source (templated with `__PULLMD_URL__`).\n\n---\n\n## License\n\n[GNU AGPL v3](LICENSE) — Copyright © 2026 Aeterna Labs.\n\nPullMD is free software: you can redistribute it and modify it under the\nterms of the GNU Affero General Public License as published by the Free\nSoftware Foundation, version 3 or later. If you run a modified version\nas a network service, you must make your modifications available to its\nusers.\n","PullMD is a self-hosted URL-to-Markdown service for human users and AI agents. It converts any web URL into clean, readable Markdown, stripping navigation, ads, and other boilerplate, and it automatically detects and parses Reddit posts together with their comment trees. Under the hood, the project uses Mozilla Readability, Trafilatura, and headless Chromium (via Playwright) to handle different kinds of web content, and it is offered as a PWA frontend, a REST API, an MCP server, and a Claude Code skill package. In addition, every conversion generates an 8-character share ID whose content is refreshed hourly, so links stay valid long-term and the content stays fresh. It suits scenarios that need to extract clean text from the web and store or display it as Markdown, such as personal knowledge management and automated documentation generation.",2,"2026-06-11 02:51:12","CREATED_QUERY"]