[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79937":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":15,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":14,"starSnapshotCount":14,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},79937,"UFO-USA","DenisSergeevitch\u002FUFO-USA","DenisSergeevitch","Converted Markdown archive for the public war.gov UFO\u002FPURSUE Release 01 files",null,"Python",82,21,83,0,4.03,false,"main",true,[],"2026-06-12 02:03:55","# War.gov UFO Release Markdown Archive\n\n\u003Cimg src=\"assets\u002Ffbi-september-2023-composite-sketch.jpg\" alt=\"FBI September 2023 sighting composite sketch\" width=\"720\">\n\nThis repository is the public archive for Markdown files converted from the official UFO\u002FUAP release at [war.gov\u002FUFO](https:\u002F\u002Fwww.war.gov\u002FUFO\u002F).\n\nThe source page is the Department of War's \"Presidential Unsealing and Reporting System for UAP Encounters (PURSUE)\" page. It describes a government-wide effort, supported by ODNI, to identify, review, declassify, and release unresolved UAP-related records and historical documents. Release 01 is marked \"Cleared for release - May 8, 2026.\"\n\nThe primary content of this repo is the `converted\u002F` tree. 
Each converted source file gets its own folder, and each source page gets its own Markdown file.\n\n## Archive Structure\n\n```text\nconverted\u002F\n├── 001-65_HS1-834228961_62-HQ-83894_Section_10\u002F\n│   ├── page-0001.md\n│   ├── page-0002.md\n│   └── ...\n├── 002-65_HS1-834228961_62-HQ-83894_Section_2\u002F\n│   ├── page-0001.md\n│   ├── page-0002.md\n│   └── ...\n└── manifest.jsonl\n```\n\nPage files use zero-padded page numbers so lexical order matches page order.\n\nEach `page-####.md` file includes YAML front matter:\n\n```yaml\n---\nsource_title: \"...\"\nsource_file: \"...\"\nsource_url: \"...\"\nasset_type: \"pdf\"\ndataset_row: 1\npage: 1\npage_count: 184\nmodel: \"gemini-3.1-flash-lite\"\ngenerated_at: \"...\"\n---\n```\n\n`converted\u002Fmanifest.jsonl` records one JSON line per converted page, including the source file, page number, output path, character count, and status.\n\n## Archive Status\n\nThe May 8, 2026 archive contains all `4,185` PDF pages as Markdown files.\n\n- `4,185` pages produced Markdown files.\n- `converted\u002Fmanifest.jsonl` contains one final `ok` row per converted page.\n\n## What Is Tracked\n\n```text\n.\n├── assets\u002F\n│   └── fbi-september-2023-composite-sketch.jpg\n├── converted\u002F\n│   ├── 001-...\n│   ├── 002-...\n│   └── manifest.jsonl\n├── metadata\u002F\n│   └── uap-csv.csv\n├── scripts\u002F\n│   └── process_dataset_with_gemini.py\n├── requirements.txt\n└── README.md\n```\n\n- `assets\u002F` contains small public visual assets used by this README.\n- `converted\u002F` is the destination for committed Markdown transcripts.\n- `metadata\u002Fuap-csv.csv` is the release inventory used to map source records to converted folders. 
At repo preparation time it contained 162 rows: 120 PDF rows, 28 video rows, and 14 image rows.\n- `metadata\u002Fpdf_manifest.tsv` is the corrected 120-PDF manifest used for the Markdown archive.\n- `metadata\u002Fdownload_summary.json` and `metadata\u002Fcurl_download.log` record the initial PDF download and verification pass.\n- `scripts\u002Fprocess_dataset_with_gemini.py` is the support script used to produce the Markdown archive.\n- `requirements.txt` lists the script dependencies.\n\nLocal-only folders are ignored:\n\n- `downloads\u002F` stores source PDFs\u002Fimages fetched from war.gov.\n- `outputs\u002F` stores temporary smoke-test outputs.\n- `source\u002F` stores local page snapshots used during scraping\u002Fdebugging.\n- `node_modules\u002F`, `.venv\u002F`, `.env`, caches, and `.DS_Store` are local-only.\n\n## Initial Dataset\n\nThe initial source PDF corpus was downloaded from [war.gov\u002FUFO](https:\u002F\u002Fwww.war.gov\u002FUFO\u002F) via `curl` into `downloads\u002Fwar-gov-ufo-release-1`.\n\nDownload result:\n\n- `120` PDFs downloaded.\n- `4,185` PDF pages detected locally.\n- `2.308 GiB` total PDF bytes; `du` shows `2.4G`.\n- The broader manifest's `28` video rows and `14` image rows were excluded from the PDF archive pass.\n- Three bad manifest URLs were retried and fixed by URL-encoding spaces\u002Fbrackets.\n- Verification passed: no missing files, no partial files, and all files start with `%PDF-`.\n\nMetadata\u002Flogs:\n\n- [metadata\u002Fuap-csv.csv](metadata\u002Fuap-csv.csv)\n- [metadata\u002Fpdf_manifest.tsv](metadata\u002Fpdf_manifest.tsv)\n- [metadata\u002Fdownload_summary.json](metadata\u002Fdownload_summary.json)\n- [metadata\u002Fcurl_download.log](metadata\u002Fcurl_download.log)\n\n## How The Files Were Converted\n\nThe conversion was done page by page:\n\n1. Read the corrected PDF inventory from `metadata\u002Fpdf_manifest.tsv`.\n2. 
Download or match each supported PDF source into `downloads\u002Fwar-gov-ufo-release-1\u002F`.\n3. Render every PDF page with PyMuPDF at 200 DPI.\n4. Resize the longest side to at most 3000 pixels and encode the rendered page as JPEG.\n5. Send the page image and a transcription prompt to Gemini.\n6. Save Gemini's returned Markdown as `converted\u002F\u003Csource-folder>\u002Fpage-####.md`.\n7. Append the page result to `converted\u002Fmanifest.jsonl`.\n\nThe default model used by the support script is `gemini-3.1-flash-lite` with `temperature=0`. The script supports parallel page workers and a global request-per-minute gate.\n\nGenerated Markdown should be treated as AI-assisted OCR\u002Ftranscription. Use the original source PDF\u002Fimage as the authoritative record when exact wording matters.\n\n## Rebuilding Or Continuing The Archive\n\nUse Python 3.11+.\n\n```sh\npython3 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npython3 -m pip install -r requirements.txt\n```\n\nSet the Gemini key in the shell. Do not commit API keys.\n\n```sh\nexport GEMINI_API_KEY='...'\n```\n\nDry-run the dataset without calling Gemini:\n\n```sh\npython3 scripts\u002Fprocess_dataset_with_gemini.py --dry-run\n```\n\nGenerate or continue the committed archive:\n\n```sh\npython3 scripts\u002Fprocess_dataset_with_gemini.py \\\n  --output-dir converted \\\n  --temperature 0 \\\n  --workers 16 \\\n  --rpm 10000\n```\n\nThe script is resumable. Existing `page-####.md` files are skipped unless `--force` is passed.\n\nUseful options:\n\n- `--workers N` controls page-level parallelism. It can also be set with `GEMINI_WORKERS`.\n- `--rpm N` controls the global Gemini request-per-minute gate. It can also be set with `GEMINI_RPM`.\n- `--temperature N` controls Gemini generation temperature. 
It defaults to `0`.\n- `--pages 1,4,9-12` processes selected pages only.\n- `--max-docs N` and `--max-pages-per-doc N` are useful for smoke tests.\n- `--local-only` ignores metadata downloads and processes files already present in `downloads\u002F`.\n- `--force` regenerates pages that already have Markdown outputs.\n- `--stop-on-error` stops the run after the first page error.\n\n## Notes For Public Use\n\n- The source materials are public government records from the official war.gov release page.\n- The repository is intended to hold the converted Markdown archive, not the 2+ GB downloaded source corpus.\n- Keep `.env`, API keys, downloaded records, local page snapshots, smoke-test outputs, and caches out of git.\n","This project is an archive that converts the UFO\u002FPURSUE Release 01 files published by the U.S. Department of Defense into Markdown. Its core function is to automatically convert the officially released PDF documents into Markdown files that are easy to read and process; a Python script performs the conversion, which keeps the data consistent and accurate. The project is clearly structured: each source file has its own directory and per-page files, accompanied by detailed metadata records. It is suited to research, analysis, or secondary development on publicly released government UFO-related historical documents, for example academic research or news reporting.",2,"2026-06-11 03:58:36","CREATED_QUERY"]