[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2298":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":11,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":12,"stars30d":12,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":13,"rankGlobal":8,"rankLanguage":8,"license":14,"archived":15,"fork":15,"defaultBranch":16,"hasWiki":17,"hasPages":15,"topics":18,"createdAt":8,"pushedAt":8,"updatedAt":19,"readmeContent":20,"aiSummary":21,"trendingCount":12,"starSnapshotCount":12,"syncStatus":22,"lastSyncTime":23,"discoverSource":24},2298,"diff","kdyy88\u002Fdiff","kdyy88",null,"Python",101,8,0,2.86,"GNU General Public License v2.0",false,"main",true,[],"2026-06-12 02:00:40","# PDF Flow Diff\n\n`PDF Flow Diff` is a local web MVP for reviewing long-form PDF revisions when pagination reflow makes page-by-page comparison unusable.\n\nIt is designed for workflows where a tiny text change matters, but a single inserted sentence can push the next 100 pages into different physical positions. 
The project reconstructs a continuous logical text flow, coarse-aligns extracted lines with `patiencediff`, refines only changed windows with `diff-match-patch`, then projects review anchors back onto the original PDF coordinates for side-by-side visual review.\n\n## What it solves\n\n- Ignores false positives caused only by pagination reflow or layout overflow.\n- Tracks real edits at word and character granularity.\n- Projects every review anchor back to page coordinates for visual highlighting.\n- Supports a review-friendly dual-pane UI with anchor jumping instead of fragile scroll lock.\n- Separates explicit-grid table regions from body text so table edits can be reviewed at cell level.\n\n## Current scope\n\nSupported:\n\n- Text-based PDFs exported from Word or similar tools\n- Mixed body text plus explicit-grid tables\n- Local single-user review workflow\n- Browser-based review on current Chromium-class browsers\n\nNot supported in this MVP:\n\n- Scanned PDFs or OCR\n- Handwritten annotations\n- Image diffs\n- Multi-user jobs or persistent queues\n- Audit export files such as JSON download or marked-up PDFs\n\n## Architecture at a glance\n\nThe core pipeline is:\n\n1. Extract page geometry, text atoms, and table regions from each PDF with `PyMuPDF`.\n2. Reconstruct a cross-page logical text flow and normalize only what is needed for alignment.\n3. Coarse-align extracted text lines with `patiencediff` so hard `equal` windows can re-anchor the flow.\n4. Run local `diff-match-patch` only inside non-equal windows for word and character precision.\n5. Coalesce raw edit events into review-friendly anchors and mark risky large windows as low confidence.\n6. Project anchor ranges back to PDF page coordinates.\n7. 
Render side-by-side PDFs with anchor-linked highlights in the React UI.\n\nSee [docs\u002Farchitecture.md](docs\u002Farchitecture.md) for the full module breakdown.\n\n## Repository layout\n\n```text\n.\n├── backend\u002F               FastAPI service and diff engine\n├── frontend\u002F              React review UI\n├── docs\u002F                  API, integration, and architecture docs\n└── .github\u002Fworkflows\u002F     CI pipeline\n```\n\n## Tech stack\n\n- Backend: FastAPI, PyMuPDF, patiencediff, diff-match-patch, Pydantic\n- Frontend: React, Vite, TypeScript, Tailwind, react-pdf\n- Tooling: `uv` for Python dependency management, `pnpm` for frontend dependency management\n\n## Quick start\n\nRequirements:\n\n- Python `3.11+`\n- Node.js `20+`\n- `uv`\n- `pnpm`\n\n### 1. Start the backend\n\n```bash\ncd backend\nuv sync --extra dev\nuv run uvicorn app.main:app --reload --port 8000\n```\n\nThe API will be available at `http:\u002F\u002Flocalhost:8000`.\n\nUseful endpoints while developing:\n\n- `GET \u002Fhealth`\n- `GET \u002Fdocs` for Swagger UI\n- `GET \u002Fredoc` for ReDoc\n\n### 2. Start the frontend\n\n```bash\ncd frontend\npnpm install\npnpm dev\n```\n\nThe Vite dev server runs at `http:\u002F\u002Flocalhost:5173` and expects the backend at `http:\u002F\u002Flocalhost:8000`.\n\n## Backend API quick overview\n\nThe backend exposes an async job workflow:\n\n1. `POST \u002Fapi\u002Fjobs` uploads `sourcePdf` and `modifiedPdf`.\n2. `GET \u002Fapi\u002Fjobs\u002F{id}` polls job progress.\n3. `GET \u002Fapi\u002Fjobs\u002F{id}\u002Fresult` fetches the final diff anchors.\n4. 
`GET \u002Fapi\u002Fjobs\u002F{id}\u002Ffiles\u002F{side}` streams the original uploaded PDF back for rendering.\n\nPrimary anchor kinds:\n\n- `insert`\n- `delete`\n- `replace`\n- `reflow`\n\nPrimary source types:\n\n- `text`\n- `table`\n\nAnchor confidence values:\n\n- `high`\n- `low`\n\nFull request and response examples live in [docs\u002Fapi.md](docs\u002Fapi.md), and backend integration guidance lives in [docs\u002Fbackend-integration.md](docs\u002Fbackend-integration.md).\n\n## Local development\n\nBackend checks:\n\n```bash\ncd backend\nuv run pytest\n```\n\nFrontend build:\n\n```bash\ncd frontend\npnpm build\n```\n\n## How table diffs currently work\n\nFor PDFs with explicit ruling lines, the extractor uses `PyMuPDF` table detection to split those regions out of the main text flow. Table cells are then diffed independently and returned as `source_type=\"table\"` anchors so the UI can highlight a single cell instead of collapsing the whole table into one paragraph-like change.\n\nIf a table cannot be matched reliably, the system intentionally falls back to a weaker structural signal instead of returning misleading cell-level results.\n\n## Design choices\n\n- The project does use mature open-source diff libraries; it does not reimplement the core text diff algorithms.\n- Custom logic focuses on:\n  - text flow reconstruction across pages\n  - line-level coarse anchoring with `patiencediff`\n  - windowed local refinement with `diff-match-patch`\n  - minimal normalization before alignment\n  - review-anchor coalescing\n  - low-confidence marking for risky large replace windows\n  - reflow detection\n  - coordinate projection\n  - explicit-grid table extraction\n\n## License\n\nThis repository is released under `GPL-2.0-only` to stay compatible with the current `patiencediff` coarse-alignment dependency used by the backend. 
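See [LICENSE](LICENSE).\n\n## Documentation index\n\n- [API reference](docs\u002Fapi.md)\n- [Backend integration guide](docs\u002Fbackend-integration.md)\n- [Architecture notes](docs\u002Farchitecture.md)\n- [GitHub release prep](docs\u002Fgithub-release.md)\n- [Contributing guide](CONTRIBUTING.md)\n- [Security policy](SECURITY.md)\n\n## Roadmap\n\n- Better table matching for inserted\u002Fdeleted sections across multiple pages\n- Optional audit-mode output that preserves raw low-level edit events\n- Exportable machine-readable results\n- Broader browser compatibility validation\n- Better handling for merged cells and borderless tables\n","PDF Flow Diff is a local web MVP for reviewing long-form PDF revisions when pagination reflow makes page-by-page comparison unusable. The project reconstructs a continuous logical text flow, uses patiencediff for coarse alignment and diff-match-patch to refine the changed portions, then projects review anchors back onto the original PDF coordinates for side-by-side visual review. It is particularly suited to workflows where even a tiny text change matters, such as when a single inserted sentence shifts the physical position of the next hundred or more pages. The tool also provides a review-friendly dual-pane UI with anchor jumping, so users do not have to rely on fragile scroll lock. The current version supports text-based PDFs exported from Word and similar tools, as well as documents containing explicit-grid table regions, and works well in local single-user review scenarios.",2,"2026-06-11 02:49:20","CREATED_QUERY"]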