[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79913":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":39,"discoverSource":40},79913,"ref-downloader","ltczding-gif\u002Fref-downloader","ltczding-gif","Batch-download reference PDFs from a DOI or paper PDF using Crossref and your institutional Edge session.","",null,"Python",112,3,89,0,2,7,22,6,1.81,"MIT License",false,"main",true,[26,27,28,29,30,31,32,33,34,35],"academic-tools","agent-skills","claude-code","codex-skills","crossref","literature-review","pdf-downloader","playwright","python","zotero","2026-06-12 02:03:55","# ref-downloader\n\n> **Stop losing an afternoon to chasing dozens of reference PDFs by hand.**\n> One DOI in, every reference PDF out — using your existing institutional access.\n\n[![Version: 0.4.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fversion-0.4.0-orange.svg)](CHANGELOG.md)\n[![Status: beta](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fstatus-beta-orange.svg)](#known-limitations)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](LICENSE)\n[![Python 3.11+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.11+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n![Verified on Windows + Edge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fverified%20on-Windows%20+%20Edge-success)\n\n[中文完整文档 \u002F Full Chinese 
version](README.zh.md)\n\n> **Status: beta (v0.4.0).** Windows + Microsoft Edge verified path. macOS \u002F Linux \u002F Chromium untested. Expect rough edges around supplementary downloads and publisher-site changes. PR-worthy issues welcome.\n\n> **Heads up — not a paywall bypass.** ref-downloader uses _your_ institutional access. If your university or organization subscribes to a journal, those refs work. If they don't, those refs become `manual_pending` for you to follow up on by hand.\n\n## Demo (30-second console preview)\n\n```text\n$ python run_ref_downloader.py 10.1021\u002Fjacs.5c05017\n\n=== Ref Downloader Wrapper ===\nDOI:         10.1021\u002Fjacs.5c05017\nPROJECT:     jacs.5c05017\nConfig:      config.example.toml + config.local.toml\n\n>>> extract_refs.py\n  Title: Designing Natural Cell-Inspired Heme-Spurred Membrane...\n  References found: 38\n\n>>> validate_refs.py\n  Total: 38  Verified: 38  Failed: 0  No DOI: 0\n\n>>> download_refs.py\n  [ 1] downloaded (842 KB)        Lee2016_NatEnergy.pdf\n  [ 2] downloaded (1.2 MB)        Wang2018_AdvMater.pdf\n  [ 3] manual_pending (auth_redirect)\n  [ 4] downloaded (655 KB)        Chen2019_JACS.pdf\n  [ 5] failed (challenge_timeout)\n  [ 6] ignored (ignored_institution_access)\n  ... 
31 more refs processed ...\n  [38] downloaded (956 KB)        Park2024_JElectrochemSoc.pdf\n\n========== Download report ==========\nTotal references:  38\nMain PDFs:         33 downloaded · 3 manual_pending · 1 failed · 1 ignored\nSI files:          12 captured\nPDFs land in:      .\u002Fjacs.5c05017_refs\u002Fjacs.5c05017\u002F\n=====================================\n```\n\n## Contents\n\n- [What you get](#what-you-get)\n- [Why not Zotero, scihub, or generic scrapers?](#why-not-zotero-scihub-or-generic-scrapers)\n- [Quick start](#quick-start)\n- [Requirements](#requirements)\n- [Install](#install)\n- [Usage examples](#usage-examples)\n- [Configuration](#configuration)\n- [Architecture](#architecture)\n- [Supported publishers](#supported-publishers)\n- [Known limitations](#known-limitations)\n- [Contributing](#contributing)\n- [Security](#security)\n- [License](#license)\n\n## What you get\n\n- **Paywalled refs work without setup.** _Drives your real Microsoft Edge profile, so any institutional login already in your browser carries through. No API keys, no proxies, no reverse engineering._\n- **One DOI in, every reference PDF out.** _Crossref-driven extraction + 17+ publisher-specific download paths (Wiley PDFDirect, Elsevier viewer, AIP loading-page wait — see [per-publisher reliability tier](docs\u002FSUPPORTED_PUBLISHERS.md)), not generic scraping._\n- **You always know which refs failed and why.** _`download_report.csv` gives every ref a status + reason (`manual_pending (auth_redirect)`, `failed (challenge_timeout)`, `ignored`); `events.jsonl` keeps the per-ref event trace._\n- **Pick up where you left off** after a VPN drop, browser crash, or `Ctrl+C`. _State persists per project; rerunning skips already-downloaded refs and retries only the failures._\n\n## Why not Zotero, scihub, or generic scrapers?\n\n- **vs. Zotero's _Find Available PDF_** — walks one paper at a time and silently gives up at SSO redirects. 
ref-downloader walks the whole reference list at once and treats SSO as a configurable step instead of a dead end.\n- **vs. scihub-style tools** — don't carry your institutional license, so paywalled refs you _legitimately_ have access to just fail. ref-downloader uses your authenticated browser session, so subscriptions you already pay for actually count.\n- **vs. generic web scrapers** — don't know Wiley needs PDFDirect, Elsevier needs a viewer click, or AIP serves a Chinese loading page first. ref-downloader has 17+ publisher-specific paths, plus an Elsevier popup state machine and an `--auto` mode retry queue (manual-pending refs get a second async attempt 60 s later, with the hot session preserved).\n- **vs. raw Playwright** — gets blocked on Cloudflare \u002F Radware \u002F Turnstile-heavy sites. Set `REF_DOWNLOADER_BROWSER=cloak` to swap in [cloakbrowser](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcloakbrowser\u002F)'s stealth Chromium with humanized input — no code changes, same pipeline. See [Configuration](#configuration).\n\n## Quick start\n\nThe skill is self-contained under `skills\u002Fref-downloader\u002F`. 
Pick the install path for your agent framework:\n\n```powershell\ngit clone https:\u002F\u002Fgithub.com\u002Fltczding-gif\u002Fref-downloader.git\n\n# Pick ONE install destination for your agent framework:\n#   Claude Code:        cp -r ref-downloader\u002Fskills\u002Fref-downloader ~\u002F.claude\u002Fskills\u002F\n#   Codex CLI:          cp -r ref-downloader\u002Fskills\u002Fref-downloader ~\u002F.codex\u002Fskills\u002F\n#   Copilot CLI \u002F VSC:  cp -r ref-downloader\u002Fskills\u002Fref-downloader .github\u002Fskills\u002F\n#   Project-local:      cp -r ref-downloader\u002Fskills\u002Fref-downloader .agents\u002Fskills\u002F\n\ncd ~\u002F.claude\u002Fskills\u002Fref-downloader     # or wherever you copied it\npip install playwright pymupdf\nplaywright install msedge\ncp config.example.toml config.local.toml      # then set [crossref].mailto\n\n# In your agent: just describe the task; the skill triggers via its description.\n# Direct CLI for testing: python scripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017\n```\n\nWhat you'll see: 30–80 refs discovered for a typical chemistry\u002Fphysics paper, then a mix of `downloaded` (refs your institution covers), `manual_pending` (SSO bounce or paywall), and occasional `failed` (publisher quirk). Run on a DOI from a journal your institution actually subscribes to for the highest hit rate. Details below.\n\n## Requirements\n\n- **OS**: Windows 10\u002F11 (verified). macOS \u002F Linux untested — PRs welcome.\n- **Browser**: Microsoft Edge (Stable channel). 
The script claims your persistent Edge profile, so close all Edge windows before running.\n- **Python**: 3.11 or newer (uses stdlib `tomllib`).\n- **Optional**: A Zotero installation (auto-detects DOI from a PDF's filename via Zotero's SQLite database — much faster than text extraction).\n- **Optional**: PyMuPDF (`pip install pymupdf`) for DOI extraction from PDF text when Zotero lookup is unavailable.\n\n## Install\n\n### As an agent skill (recommended)\n\nPick the install path for your agent framework:\n\n| Framework | Install command |\n|---|---|\n| Claude Code | `cp -r skills\u002Fref-downloader ~\u002F.claude\u002Fskills\u002F` |\n| Claude Agent SDK | same (auto-discovers `~\u002F.claude\u002Fskills\u002F`) |\n| Codex CLI | `cp -r skills\u002Fref-downloader ~\u002F.codex\u002Fskills\u002F` |\n| Copilot CLI \u002F VS Code agent | `cp -r skills\u002Fref-downloader .github\u002Fskills\u002F` |\n| Any framework (project-local) | `cp -r skills\u002Fref-downloader .agents\u002Fskills\u002F` |\n\nThen install Python prereqs INSIDE the copied skill folder (the skill protocol doesn't manage Python deps):\n\n```powershell\ncd ~\u002F.claude\u002Fskills\u002Fref-downloader            # or wherever you copied it\npip install playwright pymupdf                # or use the source's requirements.txt\nplaywright install msedge\n\ncp config.example.toml config.local.toml\n# Edit config.local.toml — at minimum set [crossref].mailto.\n# Windows: notepad config.local.toml\n# macOS \u002F Linux: $EDITOR config.local.toml   (or vim \u002F nano \u002F code \u002F ...)\n```\n\n### As a Python tool (for developers)\n\nIf you want to hack on the code, the skill folder _is_ a runnable Python project:\n\n```powershell\ngit clone https:\u002F\u002Fgithub.com\u002Fltczding-gif\u002Fref-downloader.git\ncd ref-downloader\n\npip install -r requirements.txt -r requirements-dev.txt\nplaywright install msedge\n\ncp skills\u002Fref-downloader\u002Fconfig.example.toml 
skills\u002Fref-downloader\u002Fconfig.local.toml\n# Edit config.local.toml — at minimum set [crossref].mailto.\n\n# Run the offline test suite\npython -m pytest tests\u002F -v\n\n# Run the tool directly\npython skills\u002Fref-downloader\u002Fscripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017\n```\n\n## Usage examples\n\n(After install — paths assume the skill is at `\u003CSKILL_DIR>`, e.g. `~\u002F.claude\u002Fskills\u002Fref-downloader\u002F`. In source, `\u003CSKILL_DIR>` = `skills\u002Fref-downloader\u002F`.)\n\n### Input: a DOI\n\n```powershell\npython \u003CSKILL_DIR>\u002Fscripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017\n```\n\nDefault output: `\u003Ccwd>\u002Fjacs.5c05017_refs\u002Fjacs.5c05017\u002F`\n\n### Input: a local PDF (with DOI in metadata or in PDF text)\n\n```powershell\npython \u003CSKILL_DIR>\u002Fscripts\u002Frun_ref_downloader.py \"C:\\path\\to\\your_paper.pdf\"\n```\n\nDefault output: `\u003Cpdf_dir>\u002Fyour_paper_refs\u002F\u003Cdoi-derived-name>\u002F`\n\n### Custom output directory\n\n```powershell\npython \u003CSKILL_DIR>\u002Fscripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017 --output-dir refs\u002F\n```\n\n### Non-interactive (CI \u002F batch)\n\n```powershell\npython \u003CSKILL_DIR>\u002Fscripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017 --yes --auto\n```\n\n### Alternate config file\n\n```powershell\npython \u003CSKILL_DIR>\u002Fscripts\u002Frun_ref_downloader.py 10.1021\u002Fjacs.5c05017 --config .\u002Falt.toml\n```\n\n## Configuration\n\nAll configuration lives in `config.local.toml` (gitignored). 
Copy `config.example.toml` to bootstrap.\n\n| Section | Key | Purpose |\n|---|---|---|\n| `[crossref]` | `mailto` | Your email — entry into Crossref polite pool |\n| `[zotero]` | `db_path` | Optional path to `zotero.sqlite` for DOI lookup from PDF filename |\n| `[browser]` | `edge_profile_dir` | Edge profile directory; empty = OS default |\n| `[browser]` | `disable_extensions` | Set `true` to launch with `--disable-extensions` |\n| `[institution]` | `auth_hosts` | Hostnames that mean \"you got bounced to SSO\" (e.g. `[\"sso.your-uni.edu\"]`) |\n| `[institution]` | `auth_url_fragments` | URL substrings indicating SSO (e.g. `[\"oauth\", \"saml\"]`) |\n| `[institution]` | `auth_page_titles` | `\u003Ctitle>` text for SSO pages (catches HTML served as PDF) |\n| `[institution]` | `auth_loading_titles` | Loading-page titles (also reused for AIP\u002FAVS publisher loading detection) |\n| `[institution]` | `ignored_access_dois` | DOIs you know are paywalled at your institution; skipped without retry |\n\nEnvironment variables override file values:\n\n| Variable | Maps to |\n|---|---|\n| `REF_DOWNLOADER_MAILTO` | `crossref.mailto` |\n| `REF_DOWNLOADER_ZOTERO_DB` | `zotero.db_path` |\n| `REF_DOWNLOADER_EDGE_PROFILE` | `browser.edge_profile_dir` |\n| `REF_DOWNLOADER_DISABLE_EXTENSIONS` | `browser.disable_extensions` (`1`\u002F`true` to enable) |\n| `REF_DOWNLOADER_CONFIG` | Path to alternate TOML file |\n\nSee [`skills\u002Fref-downloader\u002Fconfig.example.toml`](skills\u002Fref-downloader\u002Fconfig.example.toml) for full documentation.\n\n### Alternative backend: CloakBrowser (optional, for Cloudflare-heavy sites)\n\n**What it is.** [CloakBrowser](https:\u002F\u002Fgithub.com\u002FCloakHQ\u002FCloakBrowser) is a third-party Python package by CloakHQ (MIT-licensed, available on [PyPI](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcloakbrowser\u002F) as `cloakbrowser`). 
It ships a patched Chromium build with source-level anti-fingerprint changes designed to look like a normal browser to common bot-detection layers (Cloudflare Turnstile, Radware, DataDome, FingerprintJS, etc). Its `launch_persistent_context_async()` API is intentionally compatible with Playwright's — that's what lets ref-downloader swap backends with a single env var instead of rewriting the download flow.\n\n**Not a dependency of ref-downloader.** If you don't run `pip install cloakbrowser` it's never imported. The default Edge backend is unchanged. When CloakBrowser IS the active backend, ref-downloader uses Chromium under a **separate persistent profile** at `~\u002F.local\u002Fcloakbrowser\u002Fprofiles\u002Fref-downloader` (or `REF_DOWNLOADER_CLOAK_PROFILE`), so your Edge profile is not touched — Edge does NOT need to be closed.\n\n**When to use it.** Sites you'd reach for it on: CCS Chemistry (`10.31635`, Cloudflare-protected), some Elsevier paths gated by Radware, anything where the Edge backend keeps producing `manual_pending (radware_bot_manager)` or `failed (challenge_timeout)`. **Don't** reach for it as a default — the Edge backend is more reliable when your institutional access is the actual bottleneck, because Edge carries your authenticated cookies.\n\n**Caveats.** CloakBrowser is **beta** third-party software; install + use at your own discretion (review its [repo](https:\u002F\u002Fgithub.com\u002FCloakHQ\u002FCloakBrowser) before pulling it). It is **not a captcha solver** — interactive challenges still need you. 
It also does not carry your institutional cookies (separate profile), so it's most useful for open-Cloudflare sites, less useful for paywalled-but-license-covered refs.\n\n```powershell\npip install cloakbrowser                              # one-time, separate from ref-downloader\n$env:REF_DOWNLOADER_BROWSER = \"cloak\"\n$env:REF_DOWNLOADER_CLOAK_HUMAN_PRESET = \"careful\"    # optional: slower mouse\u002Fscroll\npython skills\u002Fref-downloader\u002Fscripts\u002Frun_ref_downloader.py 10.31635\u002Fccsorg...\n```\n\nCloakBrowser env vars (all optional):\n\n| Variable | Default | Purpose |\n|---|---|---|\n| `REF_DOWNLOADER_BROWSER` | `edge` | Set to `cloak` (or `cloakbrowser`) to switch backend |\n| `REF_DOWNLOADER_CLOAK_PROFILE` | `~\u002F.local\u002Fcloakbrowser\u002Fprofiles\u002Fref-downloader` | Persistent Chromium profile path |\n| `REF_DOWNLOADER_CLOAK_HUMANIZE` | `1` | `0`\u002F`false` to disable humanized input |\n| `REF_DOWNLOADER_CLOAK_HUMAN_PRESET` | `default` | `default` or `careful` (slower) |\n| `REF_DOWNLOADER_CLOAK_PROXY` | _unset_ | HTTP\u002FSOCKS proxy URL |\n| `REF_DOWNLOADER_CLOAK_GEOIP` | auto | `1` to force GeoIP rerouting (auto when proxy is set) |\n| `CLOAKBROWSER_PYTHONPATH` | _unset_ | sys.path hint for a local cloakbrowser source checkout |\n\nNotes:\n- **Edge does not need to be closed** when using the cloak backend — it uses its own Chromium.\n- A fresh cloak profile may still hit Cloudflare\u002Fsecurity pages on first visit — warm it manually with that profile before batch downloads.\n- `human_preset=careful` reduces behavior-based detection but is **not** a captcha solver.\n- cloakbrowser is NOT a hard dependency of ref-downloader. 
If you never set `REF_DOWNLOADER_BROWSER=cloak`, it's not imported.\n\n## Architecture\n\nThree-stage pipeline + a wrapper:\n\n```\nskills\u002Fref-downloader\u002F\n├── SKILL.md                            agent runbook (slim entry)\n├── references\u002Fagent-runbook.md         extended manual flow + DOI fallback\n├── config.example.toml                 config schema (copy to config.local.toml)\n└── scripts\u002F\n    ├── run_ref_downloader.py           entry — config + DOI resolution + sequencing\n    │     └─> extract_refs.py    (1) Crossref API: fetch parent's reference list\n    │     └─> validate_refs.py   (2) Crossref API: per-ref metadata + publisher classify\n    │     └─> download_refs.py   (3) Playwright\u002FEdge: download main PDF + SI per publisher\n    └── _config.py                      TOML + env-var loader\n```\n\nYou can also run the three scripts manually for debugging or partial restarts. See the agent runbook in [`skills\u002Fref-downloader\u002Freferences\u002Fagent-runbook.md`](skills\u002Fref-downloader\u002Freferences\u002Fagent-runbook.md) for the manual flow.\n\nAgent users can install or inspect the packaged skill at [`skills\u002Fref-downloader\u002FSKILL.md`](skills\u002Fref-downloader\u002FSKILL.md). The repository root remains the human-facing Python project; the skill bundle is kept separate so Codex does not treat README, changelog, tests, and source files as always-associated skill context.\n\n## Supported publishers\n\nACS, Nature, Science, Elsevier, Wiley, RSC, Springer, PNAS, ECS, IOP, AIP, AVS, IEEE, OSA, KPS, Beilstein, APS, Annual Reviews, Taylor & Francis, CCS Chemistry. Maturity varies — see [`docs\u002FSUPPORTED_PUBLISHERS.md`](docs\u002FSUPPORTED_PUBLISHERS.md) for the per-publisher tier table and known issues. CCS Chemistry sits behind Cloudflare; pair it with `REF_DOWNLOADER_BROWSER=cloak` for reliable access.\n\n## Known limitations\n\n- **Windows + Microsoft Edge only**: that's the verified path. 
macOS \u002F Linux \u002F Chromium support has not been tested. If you try, please open an issue with results.\n- **Headed mode required**: empirically, `headless=True` yields empty results for Wiley \u002F ACS supplementary downloads. The default is headed.\n- **Edge must be fully closed before running**: Playwright needs exclusive access to the persistent profile. Check Task Manager for any background `msedge.exe` processes.\n- **SSO redirects are detected, not solved**: when the script bounces to your institution's SSO, the ref becomes `manual_pending` so you can sign in interactively. Configure `[institution]` to teach it which redirects to recognize.\n- **SI download is the most fragile path**: main PDFs are reliable; SI lookup varies by publisher and is the area most likely to need a tweak when a publisher updates their site.\n- **Paywalled content needs institutional access**: this is not a bypass tool.\n- **Crossref dependency**: papers with no reference list deposited at Crossref can't be processed automatically.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for guidance on:\n- Adding a new publisher (DOI prefix → strategy)\n- Adding institutional SSO patterns\n- Reporting download failures with useful logs\n\n## Security\n\nThis tool launches your real Edge profile, with all your cookies and saved sessions. Read [SECURITY.md](SECURITY.md) before running it against a profile you also use for daily browsing.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","ref-downloader is a tool for batch-downloading reference PDFs: given a DOI or a paper PDF, it combines Crossref with the user's institutional Edge browser session. Its core functions are automatically extracting and validating references and downloading their PDFs through the user's institutional access, with support for Windows and Microsoft Edge. It is especially suited to academic researchers who need the full text of many references quickly during a literature review. It is written in Python and uses Playwright for browser automation, so it works with existing institutional subscriptions.","2026-06-11 03:58:31","CREATED_QUERY"]