Peterpanpan/webmagic

Readme in Chinese

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotated POJOs to customize a crawler, with no configuration files.
  • Multi-threading and distributed crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, exclude slf4j-log4j12.

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
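
For context, an `<exclusions>` element has to sit inside the `<dependency>` that pulls in the unwanted artifact. The fragment below is a sketch that assumes you exclude the binding from webmagic-extension; if webmagic-core also brings it into your build, repeat the same `<exclusions>` element there.

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <!-- Drop the bundled log4j binding so your own slf4j implementation is used. -->
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```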

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, the following crawler collects GitHub repository information.

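import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {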

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every repository link found on this page for crawling.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract the owner from the URL and the repository name from the page.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // No repository name found, so this is not a repository page: skip it.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
  • page.addTargetRequests(links)

    Adds URLs to the crawl queue.
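
To see what the two regular expressions in the processor select, here is a small plain-Java sketch that runs the same patterns with `java.util.regex`. It does not use WebMagic; the class name `GithubRegexSketch` and its helper methods are hypothetical, and it assumes find-style matching that returns the first capture group, which may differ in detail from WebMagic's own selectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GithubRegexSketch {

    // Same pattern the processor passes to links().regex(...).
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    // Same pattern the processor uses for the "author" field.
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Keeps group 1 of the first match in each link; links without a match are dropped.
    static List<String> repoLinks(List<String> links) {
        List<String> out = new ArrayList<>();
        for (String link : links) {
            Matcher m = REPO_LINK.matcher(link);
            if (m.find()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    // Returns the owner segment of a repository URL, or null if the URL has no repository part.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://github.com/code4craft/webmagic",
                "https://github.com/code4craft",
                "https://example.com/about");
        System.out.println(repoLinks(links));
        System.out.println(author("https://github.com/code4craft/webmagic"));
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

Only the first link is kept, because the pattern requires both an owner and a repository segment. Note that `\w` does not match hyphens, so owner or repository names that contain `-` are only partially matched by these patterns.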

You can also use annotations:

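import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")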
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
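
The two URL annotations play different roles: pages matching `@TargetUrl` are the ones fields are extracted from, while pages matching `@HelpUrl` are visited only to discover more links. The plain-Java sketch below classifies a few URLs with the same two patterns. It assumes the patterns are treated as ordinary Java regexes that must match the whole URL; WebMagic may preprocess them differently, and `UrlRoleSketch` and `role` are hypothetical names.

```java
import java.util.regex.Pattern;

public class UrlRoleSketch {

    // Patterns copied from the @TargetUrl and @HelpUrl annotations above.
    static final Pattern TARGET = Pattern.compile("https://github.com/\\w+/\\w+");
    static final Pattern HELP = Pattern.compile("https://github.com/\\w+");

    // Classifies a URL by which pattern matches it in full.
    static String role(String url) {
        if (TARGET.matcher(url).matches()) {
            return "target";
        }
        if (HELP.matcher(url).matches()) {
            return "help";
        }
        return "ignored";
    }

    public static void main(String[] args) {
        System.out.println(role("https://github.com/code4craft/webmagic"));
        System.out.println(role("https://github.com/code4craft"));
        System.out.println(role("https://example.com/about"));
    }
}
```

Under these assumptions a user page such as `https://github.com/code4craft` is a help page, and each repository page linked from it is a target page.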

Docs and samples:

Documentation: http://webmagic.io/docs/

The architecture of WebMagic (modeled on Scrapy):

[Architecture diagram]
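
As a rough illustration of the crawl lifecycle this README describes (downloading, URL management, content extraction, persistence), the toy loop below wires the four stages together over an in-memory map instead of the real web. It is a sketch and not WebMagic's implementation: `ToyCrawlLoop`, `FAKE_WEB`, and the single-letter page names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class ToyCrawlLoop {

    // Stand-in for the web: page URL -> links found on that page (made-up data).
    static final Map<String, List<String>> FAKE_WEB = new HashMap<>();
    static {
        FAKE_WEB.put("a", Arrays.asList("b", "c"));
        FAKE_WEB.put("b", Arrays.asList("c", "a"));
        FAKE_WEB.put("c", Collections.singletonList("d")); // "d" does not exist
    }

    // Crawls breadth-first from the seed and returns the pages that were stored.
    static List<String> crawl(String seed) {
        Queue<String> queue = new ArrayDeque<>(); // URL management: pending URLs
        Set<String> seen = new HashSet<>();       // URL management: duplicate filter
        List<String> stored = new ArrayList<>();  // persistence stand-in
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            List<String> links = FAKE_WEB.get(url); // downloading
            if (links == null) {
                continue;                           // download failed: nothing to process
            }
            stored.add(url);                        // persistence of the processed page
            for (String link : links) {             // extraction of new links
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a"));
    }
}
```

Each URL is queued at most once, so the cycle between pages `a` and `b` does not cause an endless loop, and the missing page `d` is fetched but never stored.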

There are more examples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write WebMagic, I referred to the projects below:

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.
