[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78259":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":25,"discoverSource":26},78259,"paper-scraper","GAO-pooh\u002Fpaper-scraper","GAO-pooh",null,"Python",124,9,1,2,0,3,96,"MIT License",false,"main",true,[],"2026-06-12 02:03:46","\u003Cdiv align=\"right\">\n\n[English](README.md) | [简体中文](README_zh.md)\n\n\u003C\u002Fdiv>\n\n# Academic Paper Scraper\n\n> Automated scrapers for **ScienceDirect** and **INFORMS PubsOnLine** —\n> search papers by keyword, author, or journal, and batch-download PDFs\n> using your institutional access.\n\n![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue)\n![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.8+-green)\n![Platform](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlatform-macOS-lightgrey)\n![ScienceDirect](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FScienceDirect-supported-orange)\n![INFORMS](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FINFORMS-supported-orange)\n\n---\n\n## Supported Platforms\n\n| Script | Platform | PDF Download |\n|--------|----------|--------------|\n| `sd_scraper.py` \u002F `sd_scraper_en.py` | [ScienceDirect](https:\u002F\u002Fwww.sciencedirect.com) (Elsevier) | Via Chrome DevTools (bypasses Cloudflare) |\n| `informs_scraper.py` \u002F `informs_scraper_en.py` | [INFORMS 
PubsOnLine](https:\u002F\u002Fpubsonline.informs.org) | Direct HTTP with session cookie |\n\nFiles ending in `_en.py` are full English-interface versions; the others are Chinese-interface versions. Logic is identical.\n\n---\n\n## Features\n\n- **Multiple search modes**: keyword, journal browse, journal+keyword, author, ISSN, advanced (combine any criteria)\n- **Bulk PDF download** with institutional access (university SSO \u002F CARSI)\n- **Anti-bot measures**: Chrome TLS fingerprint spoofing via `curl_cffi`, stealth JS injection, automatic rate-limit handling\n- **ScienceDirect**: Chrome DevTools Protocol (CDP) captures PDF bytes directly — no Playwright required\n- **INFORMS**: simple direct HTTP download; Atypon does not enforce JS challenges on PDF endpoints\n- Output formats: **CSV \u002F JSON \u002F XLSX**\n- Interactive wizard mode (run with no arguments)\n\n---\n\n## Installation\n\n**Step 1 — Install dependencies**\n\n```bash\npip install curl_cffi websocket-client browser-cookie3 openpyxl\n```\n\nFor INFORMS scraper, also install:\n\n```bash\npip install beautifulsoup4 lxml\n```\n\n**Step 2 — Run**\n\n```bash\n# ScienceDirect (interactive wizard)\npython3 sd_scraper_en.py\n\n# INFORMS (interactive wizard)\npython3 informs_scraper_en.py\n```\n\n> **macOS only** (Chrome path is hardcoded to `\u002FApplications\u002FGoogle Chrome.app`).  
\n> On Linux\u002FWindows, set `CHROME_BIN` in the scraper class to your Chrome executable path.\n\nFull dependency list: [`requirements.txt`](requirements.txt)\n\n---\n\n## Quick Start\n\n### ScienceDirect\n\n```bash\n# Interactive wizard (recommended for first use)\npython sd_scraper_en.py\n\n# Keyword search — save metadata as XLSX\npython sd_scraper_en.py -m keyword -q \"machine learning\" -n 100 --browser-cookies\n\n# Keyword search + download PDFs\npython sd_scraper_en.py -m keyword -q \"deep learning\" -n 50 --browser-cookies --download-pdfs\n\n# Browse a journal (most recent first)\npython sd_scraper_en.py -m journal -j \"Energy\" -n 200 --browser-cookies --sort date\n\n# Keyword search within a specific journal\npython sd_scraper_en.py -m journal_keyword -j \"Renewable Energy\" -q \"solar cell\" -n 50 --browser-cookies\n\n# Search by author\npython sd_scraper_en.py -m author -a \"Zhang Wei\" -n 30 --browser-cookies\n\n# Advanced search (combine criteria)\npython sd_scraper_en.py -m advanced -q \"deep learning\" --date 2021-2024 --type REV -n 50 --browser-cookies --download-pdfs\n```\n\n### INFORMS PubsOnLine\n\n```bash\n# Interactive wizard\npython informs_scraper_en.py\n\n# Keyword search\npython informs_scraper_en.py -m keyword -q \"supply chain\" -n 100 --browser-cookies\n\n# Browse a journal\npython informs_scraper_en.py -m journal -j mnsc -n 200 --browser-cookies\n\n# Specific volume\u002Fissue TOC\npython informs_scraper_en.py -m toc -j mnsc -v 71 -i 3 --browser-cookies\n\n# Keyword search + download PDFs\npython informs_scraper_en.py -m keyword -q \"inventory\" -n 50 --browser-cookies --download-pdf\n\n# Best login method: pop up Chrome, log in manually\npython informs_scraper_en.py -m keyword -q \"machine learning\" -n 50 --chrome-login --download-pdf\n```\n\n#### INFORMS Journal Codes\n\n| Code | Journal |\n|------|---------|\n| `mnsc` | Management Science |\n| `opre` | Operations Research |\n| `ijoc` | INFORMS Journal on Computing |\n| `mksc` | 
Marketing Science |\n| `msom` | Manufacturing & Service Operations Management |\n| `trsc` | Transportation Science |\n| `isre` | Information Systems Research |\n| `orsc` | Organization Science |\n\n---\n\n## How PDF Download Works\n\n### ScienceDirect\n\nCloudflare blocks direct HTTP requests to PDF endpoints. The scraper works around this using Chrome DevTools Protocol (CDP):\n\n1. Script auto-launches Chrome in debug mode (copies your existing profile — no re-login needed)\n2. Navigates to the article page, establishing cookie context\n3. Intercepts the PDF response bytes via `Network` \u002F `Fetch` DevTools events\n4. Writes the PDF directly to disk — no Save dialog, no Playwright required\n\n**First-time setup**: open Chrome, log in to ScienceDirect via your institution (SSO\u002FCARSI), and open at least one article PDF to confirm access. The script handles everything else automatically.\n\n```bash\n# If Chrome isn't already logged in, use this first:\npython sd_scraper_en.py --open-browser-login\n# Then run your actual scrape:\npython sd_scraper_en.py -m keyword -q \"turbine\" -n 50 --browser-cookies --download-pdfs\n```\n\n### INFORMS\n\nINFORMS (Atypon platform) does not enforce JS challenges on PDF endpoints — a valid session cookie is sufficient.\n\n```bash\n# Option 1: read cookies from Chrome (must be logged in already)\npython informs_scraper_en.py -m keyword -q \"inventory\" -n 30 --browser-cookies --download-pdf\n\n# Option 2: pop up Chrome, log in, then auto-extract cookies\npython informs_scraper_en.py -m keyword -q \"inventory\" -n 30 --chrome-login --download-pdf\n\n# Option 3: member credentials (direct login)\npython informs_scraper_en.py -m keyword -q \"inventory\" -n 30 --member 123456 --password MyPwd --download-pdf\n```\n\n---\n\n## Output\n\nAll output goes to `.\u002Fresults\u002F` (ScienceDirect) or `.\u002Finforms_result\u002F` (INFORMS) by default.\n\n```\nresults\u002F\n└── keyword_machine_learning_20250101_120000\u002F\n    ├── 
keyword_machine_learning_20250101_120000.xlsx   ← metadata\n    └── pdfs\u002F\n        ├── 001_Zhang_2024_Deep learning for...pdf\n        ├── 002_Li_2023_Transfer learning in...pdf\n        └── ...\n```\n\n---\n\n## All CLI Options\n\n### ScienceDirect (`sd_scraper_en.py`)\n\n| Option | Description |\n|--------|-------------|\n| `-m`, `--mode` | `keyword` \u002F `journal` \u002F `journal_keyword` \u002F `author` \u002F `issn` \u002F `advanced` |\n| `-q`, `--query` | Search keywords (supports `AND` \u002F `OR` \u002F `NOT`) |\n| `-j`, `--journal` | Journal name |\n| `-a`, `--author` | Author name |\n| `--issn` | Journal ISSN |\n| `-n`, `--count` | Max papers to fetch (default: 50) |\n| `--date` | Year range, e.g. `2020-2024` |\n| `--sort` | `relevance` (default) or `date` |\n| `--type` | Article type: `FLA` \u002F `REV` \u002F `SCO` |\n| `--browser-cookies` | Auto-read cookies from local Chrome |\n| `--cookies` | Path to a cookie JSON file |\n| `--format` | Output format: `xlsx` (default) \u002F `csv` \u002F `json` \u002F `all` |\n| `--download-pdfs` | Download PDFs after saving metadata |\n| `--output` | Custom output directory |\n| `--open-browser-login` | Open Chrome for manual institutional login |\n| `--interactive` | Launch interactive wizard |\n\n### INFORMS (`informs_scraper_en.py`)\n\n| Option | Description |\n|--------|-------------|\n| `-m`, `--mode` | `keyword` \u002F `journal` \u002F `toc` \u002F `advanced` |\n| `-q`, `--query` | Search keywords |\n| `-j`, `--journal` | Journal code (e.g. `mnsc`) |\n| `-v`, `--volume` | Volume number (toc mode) |\n| `-i`, `--issue` | Issue number (toc mode) |\n| `--author` | Author name (advanced mode) |\n| `--date` | Year range, e.g. 
`2020-2024` |\n| `-n`, `--count` | Max papers (default: 100) |\n| `--chrome-login` | ★ Pop up Chrome for manual login (most reliable) |\n| `--browser-cookies` | Read cookies from local Chrome |\n| `--cookies-file` | Load cookies from a JSON file |\n| `--member` | INFORMS member ID |\n| `--password` | Account password |\n| `--format` | `csv` (default) \u002F `json` \u002F `xlsx` |\n| `--download-pdf` | Download PDFs after scraping |\n| `-o`, `--output-dir` | Output directory |\n\n---\n\n## Notes\n\n- **Institutional access required** for full-text PDF download. Open-access papers can be downloaded without login.\n- **Rate limiting**: the scraper adds random delays between requests (default 2–5 s). Do not reduce these aggressively.\n- **macOS only** in current form. The Chrome binary path (`\u002FApplications\u002FGoogle Chrome.app\u002F...`) is hardcoded. Linux\u002FWindows users: update `CHROME_BIN` in the class definition.\n- **Cookie expiry**: session cookies expire (typically days to weeks). Re-login if you encounter 403 errors.\n- Use responsibly and in accordance with your institution's and the publishers' terms of service.\n\n---\n\n## License\n\nMIT\n","This project is an academic paper scraping tool that automatically searches ScienceDirect and INFORMS PubsOnLine for papers by keyword, author, or journal and batch-downloads their PDFs. Its core features include multiple search modes (such as keyword, author, ISSN, and combined queries), bulk PDF download using institutional access, and techniques for getting around anti-bot mechanisms. The project is written in Python; it captures PDF content directly through the Chrome DevTools Protocol to avoid Cloudflare restrictions, and uses simple HTTP requests to download from INFORMS. It suits researchers or students who need to quickly collect the latest research in a particular field. It runs best on macOS, but can be used on other operating systems after adjusting the configuration.","2026-06-11 03:56:41","CREATED_QUERY"]