[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80788":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":36,"discoverSource":37},80788,"iclr2026-affiliations","DmytroLopushanskyy\u002Ficlr2026-affiliations","DmytroLopushanskyy","PDF-derived institutional affiliations for 5,356 ICLR 2026 accepted papers — full pipeline (scrape → parse → render), clean dataset (CSV + XLSX), and treemap charts.","https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdmytrolopushanskyy\u002F",null,"Python",41,5,39,1,0,2,2.33,"MIT License",false,"main",true,[24,25,26,27,28,29,30,31,32],"bibliometrics","data-visualization","dataset","iclr","iclr-2026","machine-learning-research","openreview","pdf-parser","treemap","2026-06-12 02:04:06","# ICLR 2026 — Institutional Affiliations Dataset & Analysis\n\nEnd-to-end pipeline that turns 5,356 [ICLR 2026](https:\u002F\u002Fopenreview.net\u002Fgroup?id=ICLR.cc\u002F2026) accepted papers into a clean, **PDF-derived** institutional-affiliation dataset and a publication-ready treemap of who is shaping AI research right now.\n\nThis avoids the OpenReview-profile drift problem (where authors' *current* job appears on every paper they ever wrote — e.g. listing Wyoming as the affiliation for a paper actually written at UBC). 
Affiliations come from the **paper's title block PDF**, not from author profiles.\n\n> **Follow me for more analysis like this, plus AI engineering & research insights:**\n>\n> - LinkedIn — **[linkedin.com\u002Fin\u002Fdmytrolopushanskyy](https:\u002F\u002Flinkedin.com\u002Fin\u002Fdmytrolopushanskyy)**\n> - GitHub — **[github.com\u002FDmytroLopushanskyy](https:\u002F\u002Fgithub.com\u002FDmytroLopushanskyy)**\n>\n> If this dataset or the pipeline is useful to your work, a follow \u002F star is the easiest way to encourage me to keep publishing this kind of analysis.\n\n---\n\n## The headline chart\n\n![ICLR 2026 top 50 institutions, grouped by region](charts\u002Ficlr2026_top50_treemap_unique_grouped.png)\n\nEach rectangle is one institution sized by the number of accepted papers it appears on (counted **once per paper**, regardless of how many of the paper's authors are affiliated with it). Region cells are sized by the cumulative count of their top-50 institutions. Lighter shade = academia \u002F research institute, darker shade = industry.\n\n**Square version** (for social posts):\n`charts\u002Ficlr2026_top50_treemap_unique_grouped_square.png`\n\n---\n\n## What's in `data\u002F`\n\n| File | What it is |\n|---|---|\n| `iclr2026_public.csv` \u002F `.xlsx` | **The main dataset.** 5,356 accepted papers with PDF-derived authors and institutions, normalized institution canonical names, country\u002Fregion, abstract, OpenReview URL. UTF-8 with BOM for Excel compatibility. |\n| `iclr2026_institutions_ranked_unique.csv` | Top-N institutions ranked by unique-affiliation count (each institution +1 per paper). |\n| `iclr2026_institutions_ranked_first_author.csv` | Same, but only counting the first author's institution. |\n| `iclr2026_institutions_ranked_fractional.csv` | Same, with fractional 1\u002FN credit per institution per paper. 
|\n| `iclr2026_method_sensitivity.csv` | Side-by-side rank under all three counting methods, so you can see which institutions are robust and which are method artefacts. |\n\n### Columns in `iclr2026_public.csv`\n\n| Column | Meaning |\n|---|---|\n| `Decision` | Oral \u002F Poster |\n| `Title` | Paper title (LaTeX math markup converted to Unicode — `$\\alpha$` → α, `$\\nabla$` → ∇, `$\\textrm{...}$` → plain text, etc.) |\n| `Authors` | Semicolon-separated, in author order |\n| `Institutions` | Same row order as `Authors`. PDF-extracted text per author (with OpenReview fallback for the ~6% of papers where PDF parsing failed). |\n| `Institutions_canonical` | Normalized via ~250 rules. `MIT` \u002F `Massachusetts Institute of Technology` \u002F `MIT CSAIL` all collapse to **MIT**. Deduped per paper. |\n| `Countries` | Per-paper deduped list. |\n| `Regions` | High-level region per paper (China, USA, Hong Kong, etc.). |\n| `Affiliation_source` | `pdf` (94%) \u002F `parse_fail` (6%) \u002F `no_pdf` (4 papers). Audit trail. |\n| `Primary_Area` | OpenReview track. |\n| `Keywords` | Author-supplied. |\n| `Abstract` | Full text. |\n| `OpenReview_URL` | Direct link to the paper. |\n\n---\n\n## Quick start\n\n### Just regenerate the chart\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDmytroLopushanskyy\u002Ficlr2026-affiliations.git\ncd iclr2026-affiliations\npython3 -m venv .venv && source .venv\u002Fbin\u002Factivate\npip install -r requirements.txt\npython3 make_iclr_treemap.py --source pdf\n```\n\nThis reads `data\u002Ficlr2026_public.csv` and writes the treemap PNGs\u002FSVGs into `charts\u002F`.\n\nAdd `--shape square` for a 1:1 version. Add `--source openreview` to compare against the OpenReview-profile-only version (requires running the scraper first).\n\n### Reproduce the full pipeline from scratch\n\nYou only need this if you want to re-derive the dataset (e.g., for a new conference). 
It takes ~1–2 hours of network time and ~5 GB of disk for the PDF cache.\n\n```bash\n# 1. Scrape OpenReview metadata (requires an account)\nexport OPENREVIEW_USERNAME=...\nexport OPENREVIEW_PASSWORD=...\npython3 scrape_openreview.py\n# → data\u002Ficlr2026_accepted.{csv,xlsx}\n\n# 2. Download all accepted-paper PDFs (~5 GB; rate-limited; retry script handles 429s)\npython3 download_missing_pdfs.py\npython3 retry_missing_pdfs.py     # picks up anything that hit a 429 the first time\n\n# 3. Parse PDFs and merge with OpenReview data\npython3 build_pdf_spreadsheet.py\n# → data\u002Ficlr2026_accepted_pdf.{csv,xlsx} + data\u002Fpdf_parse_summary.txt\n\n# 4. Build the public-facing CSV (sanitization + LaTeX-to-Unicode + canonical names)\npython3 build_public_spreadsheet.py\n# → data\u002Ficlr2026_public.{csv,xlsx}\n\n# 5. Render the charts\npython3 make_iclr_treemap.py --source pdf\n# → charts\u002Ficlr2026_top50_treemap_*.{png,svg}\n```\n\n---\n\n## How the parser works\n\n`parse_pdf_affiliations.py` handles four layout patterns common in ICLR template papers:\n\n| Pattern | Layout | Example |\n|---|---|---|\n| **A** | Numbered footnote markers | `Author1,2 Author1,3 ... \\n 1Inst A 2Inst B 3Inst C` |\n| **B** | No markers, single shared affiliation | `Author1, Author2 \\n Single Institution` |\n| **C** | Per-author stanzas separated by emails | `Author1 \\n Inst A \\n a@x.edu \\n Author2 \\n Inst B \\n b@y.edu` |\n| **D** | Alternating name \u002F affil pairs (no emails) | Common for industry-only papers (Apple, Anthropic, etc.) 
|\n\nA footnote-text filter also catches and discards \"Equal contribution\", \"Corresponding author\", \"Project lead\", and \"These authors contributed equally\", which leaked into affiliation strings before the filter was added.\n\nResult: **about 94% of papers parse successfully**; the remaining ~6% fall back to OpenReview profile data (transparently flagged in the `Affiliation_source` column).\n\n---\n\n## Methodology choices, briefly\n\n- **Counting**: each institution is counted **once per paper**, regardless of how many of its authors are listed. This is the same rule used by the AI World NeurIPS leaderboard. The repo also generates first-author-only and fractional 1\u002FN variants for sensitivity.\n- **Canonicalization**: ~250 regex rules collapse spelling\u002Fabbreviation variants (HKUST = Hong Kong University of Science and Technology = The Hong Kong University of Science and Technology, etc.). Institutions in the chart's top-50 are stable across all three counting methods (see `data\u002Ficlr2026_method_sensitivity.csv`).\n- **Region grouping**: countries are grouped into 17 broad regions for the treemap. Hong Kong is shown separately from mainland China because Hong Kong universities operate under a separate higher-education system (different governance, language of instruction, listed separately in QS\u002FTHE rankings).\n\n---\n\n## License\n\n[MIT](LICENSE). The data is derived from publicly available [OpenReview](https:\u002F\u002Fopenreview.net) submissions and ICLR 2026 paper PDFs; please cite this repository if you use it in published work.\n\n---\n\n## Stay in touch\n\nIf you build something on top of this, ping me. I'm always interested in seeing where this kind of pipeline gets used. 
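And if you want more posts like this (research-engineering deep dives, applied AI analysis, papers I'm reading), the best place is:\n\n- LinkedIn — **[linkedin.com\u002Fin\u002Fdmytrolopushanskyy](https:\u002F\u002Flinkedin.com\u002Fin\u002Fdmytrolopushanskyy)**\n- GitHub — **[github.com\u002FDmytroLopushanskyy](https:\u002F\u002Fgithub.com\u002FDmytroLopushanskyy)**\n\n— Dmytro Lopushanskyy\n","This project processes the 5,356 papers accepted to ICLR 2026 to produce a dataset of institutional affiliations, together with a visual analysis. Its core functions are extracting institution information from PDFs, cleaning and normalizing the data, and generating treemaps that show how the main research institutions are distributed. Technically, it builds a complete end-to-end pipeline in Python (scrape → parse → render) and parses the PDFs rather than relying on author profiles, which avoids profile-related bias. It is suited to researchers or institutions who want to understand how much each institution contributes to current AI research, and is especially useful for assessing academic impact.","2026-06-11 04:02:20","CREATED_QUERY"]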