[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-4072":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":17,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},4072,"webmagic","code4craft\u002Fwebmagic","code4craft","A scalable web crawler framework for Java.","http:\u002F\u002Fwebmagic.io\u002F",null,"Java",11679,4128,752,337,0,1,9,71.4,"Apache License 2.0",false,"develop",true,[25,26,27,28],"crawler","framework","java","scraping","2026-06-12 04:00:21","![logo](http:\u002F\u002Fwebmagic.io\u002Fimages\u002Flogo.jpeg)\n\n[Readme in Chinese](https:\u002F\u002Fgithub.com\u002Fcode4craft\u002Fwebmagic\u002Ftree\u002Fmaster\u002FREADME-zh.md)\n\n\n[![Maven Central](https:\u002F\u002Fmaven-badges.herokuapp.com\u002Fmaven-central\u002Fus.codecraft\u002Fwebmagic-parent\u002Fbadge.svg?subject=Maven%20Central)](https:\u002F\u002Fmaven-badges.herokuapp.com\u002Fmaven-central\u002Fus.codecraft\u002Fwebmagic-parent\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%20License%202.0-blue.svg)](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0.html)\n[![Build Status](https:\u002F\u002Ftravis-ci.org\u002Fcode4craft\u002Fwebmagic.png?branch=master)](https:\u002F\u002Ftravis-ci.org\u002Fcode4craft\u002Fwebmagic)\n\n>A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. 
It can simplify the development of a specific crawler.\n\n## Features:\n\n* Simple core with high flexibility.\n* Simple API for HTML extraction.\n* Annotate POJOs to customize a crawler, with no configuration required.\n* Multi-thread and distributed support.\n* Easy to integrate.\n\n## Install:\n  \nAdd dependencies to your pom.xml:\n\n```xml\n\u003Cdependency>\n    \u003CgroupId>us.codecraft\u003C\u002FgroupId>\n    \u003CartifactId>webmagic-core\u003C\u002FartifactId>\n    \u003Cversion>${webmagic.version}\u003C\u002Fversion>\n\u003C\u002Fdependency>\n\u003Cdependency>\n    \u003CgroupId>us.codecraft\u003C\u002FgroupId>\n    \u003CartifactId>webmagic-extension\u003C\u002FartifactId>\n    \u003Cversion>${webmagic.version}\u003C\u002Fversion>\n\u003C\u002Fdependency>\n```\n        \nWebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, exclude slf4j-log4j12.\n\n```xml\n\u003Cexclusions>\n    \u003Cexclusion>\n        \u003CgroupId>org.slf4j\u003C\u002FgroupId>\n        \u003CartifactId>slf4j-log4j12\u003C\u002FartifactId>\n    \u003C\u002Fexclusion>\n\u003C\u002Fexclusions>\n```\n\n\n## Get Started:\n\n### First crawler:\n\nWrite a class that implements PageProcessor. 
For example, I wrote a crawler for GitHub repository information.\n\n```java\npublic class GithubRepoPageProcessor implements PageProcessor {\n\n    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);\n\n    @Override\n    public void process(Page page) {\n        page.addTargetRequests(page.getHtml().links().regex(\"(https:\u002F\u002Fgithub\\\\.com\u002F\\\\w+\u002F\\\\w+)\").all());\n        page.putField(\"author\", page.getUrl().regex(\"https:\u002F\u002Fgithub\\\\.com\u002F(\\\\w+)\u002F.*\").toString());\n        page.putField(\"name\", page.getHtml().xpath(\"\u002F\u002Fh1[@class='public']\u002Fstrong\u002Fa\u002Ftext()\").toString());\n        if (page.getResultItems().get(\"name\")==null){\n            \u002F\u002Fskip this page\n            page.setSkip(true);\n        }\n        page.putField(\"readme\", page.getHtml().xpath(\"\u002F\u002Fdiv[@id='readme']\u002FtidyText()\"));\n    }\n\n    @Override\n    public Site getSite() {\n        return site;\n    }\n\n    public static void main(String[] args) {\n        Spider.create(new GithubRepoPageProcessor()).addUrl(\"https:\u002F\u002Fgithub.com\u002Fcode4craft\").thread(5).run();\n    }\n}\n```\n\n* `page.addTargetRequests(links)`\n\t\n\tAdd URLs for crawling.\n    \nYou can also use the annotation approach:\n\n```java\n@TargetUrl(\"https:\u002F\u002Fgithub.com\u002F\\\\w+\u002F\\\\w+\")\n@HelpUrl(\"https:\u002F\u002Fgithub.com\u002F\\\\w+\")\npublic class GithubRepo {\n\n    @ExtractBy(value = \"\u002F\u002Fh1[@class='public']\u002Fstrong\u002Fa\u002Ftext()\", notNull = true)\n    private String name;\n\n    @ExtractByUrl(\"https:\u002F\u002Fgithub\\\\.com\u002F(\\\\w+)\u002F.*\")\n    private String author;\n\n    @ExtractBy(\"\u002F\u002Fdiv[@id='readme']\u002FtidyText()\")\n    private String readme;\n\n    public static void main(String[] args) {\n        OOSpider.create(Site.me().setSleepTime(1000)\n                , new ConsolePageModelPipeline(), GithubRepo.class)\n                
.addUrl(\"https:\u002F\u002Fgithub.com\u002Fcode4craft\").thread(5).run();\n    }\n}\n```\n\t\t\n### Docs and samples:\n\nDocumentation: [http:\u002F\u002Fwebmagic.io\u002Fdocs\u002F](http:\u002F\u002Fwebmagic.io\u002Fdocs\u002F)\n\nThe architecture of WebMagic (modeled on [Scrapy](http:\u002F\u002Fscrapy.org\u002F))\n\n![image](http:\u002F\u002Fcode4craft.github.io\u002Fimages\u002Fposts\u002Fwebmagic.png)\n\nThere are more examples in the `webmagic-samples` package.\n\n### License:\n\nLicensed under the [Apache 2.0 license](http:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n\n### Thanks:\n\nIn writing WebMagic, I referred to the projects below:\n\n* **Scrapy**\n\n\tA crawler framework in Python.\n \n\t[http:\u002F\u002Fscrapy.org\u002F](http:\u002F\u002Fscrapy.org\u002F)\n\n* **Spiderman**\n\n\tAnother crawler framework in Java.\n\t\n\t[http:\u002F\u002Fgit.oschina.net\u002Fl-weiwei\u002Fspiderman](http:\u002F\u002Fgit.oschina.net\u002Fl-weiwei\u002Fspiderman)\n\n### Mail-list:\n\n[https:\u002F\u002Fgroups.google.com\u002Fforum\u002F#!forum\u002Fwebmagic-java](https:\u002F\u002Fgroups.google.com\u002Fforum\u002F#!forum\u002Fwebmagic-java)\n\n[http:\u002F\u002Flist.qq.com\u002Fcgi-bin\u002Fqf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988](http:\u002F\u002Flist.qq.com\u002Fcgi-bin\u002Fqf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988)\n\nQQ Group: 373225642 542327088\n\n### Related Project\n\n* \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgsh199449\u002Fspider\" target=\"_blank\">Gather Platform\u003C\u002Fa>\n\t\n\tA web console based on WebMagic for Spider configuration and management.\n\n","WebMagic is a scalable web crawler framework for Java. It supports the key steps of the crawler lifecycle, including page downloading, URL management, content extraction, and data persistence, and simplifies the development of specific crawlers. Its core features include a simple yet highly flexible design, an easy-to-use HTML extraction API, the ability to customize crawlers through POJO annotations without extra configuration, multi-thread and distributed support, and easy integration into existing projects. It suits a range of scenarios that require scraping and processing information from websites, such as data analysis, market research, and SEO optimization.",2,"2026-06-11 02:58:14","top_language"]