[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-3695":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":19,"lastSyncTime":30,"discoverSource":31},3695,"gpt-crawler","BuilderIO\u002Fgpt-crawler","BuilderIO","Crawl a site to generate knowledge files to create your own custom GPT from a URL","https:\u002F\u002Fwww.builder.io\u002Fblog\u002Fcustom-gpt",null,"TypeScript",22242,2369,132,96,0,4,17,2,73.7,"ISC License",false,"main",true,[26],"ai","2026-06-12 04:00:19","# GPT Crawler \u003C!-- omit from toc -->\n\n\u003C!-- Keep these links. Translations will automatically update with the README. 
-->\n[Deutsch](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=de) | \n[Español](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=es) | \n[français](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=fr) | \n[日本語](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=ja) | \n[한국어](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=ko) | \n[Português](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=pt) | \n[Русский](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=ru) | \n[中文](https:\u002F\u002Fwww.readme-i18n.com\u002FBuilderIO\u002Fgpt-crawler?lang=zh)\n\nCrawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs\n\n![Gif showing the crawl run](https:\u002F\u002Fgithub.com\u002FBuilderIO\u002Fgpt-crawler\u002Fassets\u002F844291\u002Ffeb8763a-152b-4708-9c92-013b5c70d2f2)\n\n- [Example](#example)\n- [Get started](#get-started)\n  - [Running locally](#running-locally)\n    - [Clone the repository](#clone-the-repository)\n    - [Install dependencies](#install-dependencies)\n    - [Configure the crawler](#configure-the-crawler)\n    - [Run your crawler](#run-your-crawler)\n  - [Alternative methods](#alternative-methods)\n    - [Running in a container with Docker](#running-in-a-container-with-docker)\n    - [Running as an API](#running-as-an-api)\n  - [Upload your data to OpenAI](#upload-your-data-to-openai)\n    - [Create a custom GPT](#create-a-custom-gpt)\n    - [Create a custom assistant](#create-a-custom-assistant)\n- [Contributing](#contributing)\n\n## Example\n\n[Here is a custom GPT](https:\u002F\u002Fchat.openai.com\u002Fg\u002Fg-kywiqipmR-builder-io-assistant) that I quickly made to help answer questions about how to use and integrate [Builder.io](https:\u002F\u002Fwww.builder.io) by simply providing the URL to the Builder docs.\n\nThis project 
crawled the docs and generated the file that I uploaded as the basis for the custom GPT.\n\n[Try it out yourself](https:\u002F\u002Fchat.openai.com\u002Fg\u002Fg-kywiqipmR-builder-io-assistant) by asking questions about how to integrate Builder.io into a site.\n\n> Note that you may need a paid ChatGPT plan to access this feature\n\n## Get started\n\n### Running locally\n\n#### Clone the repository\n\nBe sure you have Node.js >= 16 installed.\n\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002Fbuilderio\u002Fgpt-crawler\n```\n\n#### Install dependencies\n\n```sh\nnpm i\n```\n\n#### Configure the crawler\n\nOpen [config.ts](config.ts) and edit the `url` and `selector` properties to match your needs.\n\nE.g. to crawl the Builder.io docs to make our custom GPT you can use:\n\n```ts\nexport const defaultConfig: Config = {\n  url: \"https:\u002F\u002Fwww.builder.io\u002Fc\u002Fdocs\u002Fdevelopers\",\n  match: \"https:\u002F\u002Fwww.builder.io\u002Fc\u002Fdocs\u002F**\",\n  selector: `.docs-builder-container`,\n  maxPagesToCrawl: 50,\n  outputFileName: \"output.json\",\n};\n```\n\nSee [config.ts](src\u002Fconfig.ts) for all available options. 
Here is a sample of the common configuration options:\n\n```ts\ntype Config = {\n  \u002F** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap *\u002F\n  url: string;\n  \u002F** Pattern to match against for links on a page to subsequently crawl *\u002F\n  match: string;\n  \u002F** Selector to grab the inner text from *\u002F\n  selector: string;\n  \u002F** Don't crawl more than this many pages *\u002F\n  maxPagesToCrawl: number;\n  \u002F** File name for the finished data *\u002F\n  outputFileName: string;\n  \u002F** Optional resources to exclude\n   *\n   * @example\n   * ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']\n   *\u002F\n  resourceExclusions?: string[];\n  \u002F** Optional maximum file size in megabytes to include in the output file *\u002F\n  maxFileSize?: number;\n  \u002F** Optional maximum 
number of tokens to include in the output file *\u002F\n  maxTokens?: number;\n};\n```\n\n#### Run your crawler\n\n```sh\nnpm start\n```\n\n### Alternative methods\n\n#### [Running in a container with Docker](.\u002Fcontainerapp\u002FREADME.md)\n\nTo obtain the `output.json` with a containerized execution, go into the `containerapp` directory and modify the `config.ts` as shown above. The `output.json` file should be generated in the data folder. Note: the `outputFileName` property in the `config.ts` file in the `containerapp` directory is configured to work with the container.\n\n#### Running as an API\n\nTo run the app as an API server, first run `npm install` to install the dependencies. The server is written in Express.js.\n\nRun `npm run start:server` to start the server. The server runs by default on port 3000.\n\nSend a POST request to the `\u002Fcrawl` endpoint with the config JSON as the request body to run the crawler. The API docs are served with Swagger at the `\u002Fapi-docs` endpoint.\n\nTo change the environment, copy `.env.example` to `.env` and set values such as the port to override the server's variables.\n\n### Upload your data to OpenAI\n\nThe crawl will generate a file called `output.json` at the root of this project. Upload that [to OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fassistants\u002Foverview) to create your custom assistant or custom GPT.\n\n#### Create a custom GPT\n\nUse this option for UI access to your generated knowledge that you can easily share with others.\n\n> Note: you may need a paid ChatGPT plan to create and use custom GPTs right now\n\n1. Go to [https:\u002F\u002Fchat.openai.com\u002F](https:\u002F\u002Fchat.openai.com\u002F)\n2. Click your name in the bottom left corner\n3. Choose \"My GPTs\" in the menu\n4. Choose \"Create a GPT\"\n5. Choose \"Configure\"\n6. Under \"Knowledge\" choose \"Upload a file\" and upload the file you generated\n7. 
If you get an error about the file being too large, try splitting it into multiple files and uploading them separately with the `maxFileSize` option in `config.ts`, or use the `maxTokens` option in `config.ts` to reduce the file size through tokenization\n\n![Gif of how to upload a custom GPT](https:\u002F\u002Fgithub.com\u002FBuilderIO\u002Fgpt-crawler\u002Fassets\u002F844291\u002F22f27fb5-6ca5-4748-9edd-6bcf00b408cf)\n\n#### Create a custom assistant\n\nUse this option for API access to your generated knowledge that you can integrate into your product.\n\n1. Go to [https:\u002F\u002Fplatform.openai.com\u002Fassistants](https:\u002F\u002Fplatform.openai.com\u002Fassistants)\n2. Click \"+ Create\"\n3. Choose \"upload\" and upload the file you generated\n\n![Gif of how to upload to an assistant](https:\u002F\u002Fgithub.com\u002FBuilderIO\u002Fgpt-crawler\u002Fassets\u002F844291\u002F06e6ad36-e2ba-4c6e-8d5a-bf329140de49)\n\n## Contributing\n\nKnow how to make this project better? Send a PR!\n\n\u003Cbr>\n\u003Cbr>\n\n\u003Cp align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fwww.builder.io\u002Fm\u002Fdevelopers\">\n      \u003Cpicture>\n         \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F844291\u002F230786554-eb225eeb-2f6b-4286-b8c2-535b1131744a.png\">\n         \u003Cimg width=\"250\" alt=\"Made with love by Builder.io\" src=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F844291\u002F230786555-a58479e4-75f3-4222-a6eb-74c5af953eac.png\">\n       \u003C\u002Fpicture>\n   \u003C\u002Fa>\n\u003C\u002Fp>\n","BuilderIO\u002Fgpt-crawler is a tool that crawls website content to generate knowledge files, which are then used to create custom GPT models. The project is written in TypeScript, can scrape information from one or more URLs, and converts it into a knowledge file format suitable for uploading to the OpenAI platform. Its core features include flexible crawler configuration (such as the start URL and selector), support for running locally or in a Docker container, and an API server mode. The project is well suited to scenarios that call for training a personalized chat assistant or question-answering system on the content of a specific website, such as building specialized knowledge bases from internal company documentation or product manuals.","2026-06-11 02:55:38","top_language"]