= About Lego =
Lego is an advanced web crawler library written in Python. It
provides a number of methods to mine data from many kinds of sites.
You can use YAML to create crawler templates, so you do not need to
write Python code for a web crawl job.
This project is based on SuperMario (http://github.com/bububa/SuperMario/).
== License ==
BSD License
See 'LICENSE' for details.
== Requirements ==
Platform: *nix like system (Unix, Linux, Mac OS X, etc.)
Python: 2.5+
Storage: MongoDB
Other required Python modules:
- bububa.SuperMario
== Features ==
+ YAML-style templates
+ keyword IDF calculation
+ keyword COEF calculation
+ Sitemanager
+ Smart crawling
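The keyword IDF feature is not described further in this README. As an illustration only (the function and variable names below are not Lego's API), the standard inverse-document-frequency formula log(N / df) can be sketched as:

```python
import math

def idf(keyword, documents):
    # df = number of documents containing the keyword;
    # IDF = log(N / df), larger for rarer keywords.
    df = sum(1 for doc in documents if keyword in doc)
    return math.log(len(documents) / float(df)) if df else 0.0

docs = [
    "python web crawler",
    "python yaml templates",
    "mongodb storage backend",
]
print(idf("python", docs))  # log(3/2), about 0.405
```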
== DEMO ==
Sample YAML file to crawl snipplr.com
#snipplr.yaml
run: !SiteCrawler
    check: !Crawlable
        yaml_file: sites.yaml
        label: snipplr
        logger:
            filename: snipplr.log
    crawler: !YAMLStorage
        yaml_file: sites.yaml
        label: snipplr
        method: update_config
    data: !DetailCrawler
        sleep: 3
        proxies: !File
            method: readlines
            filename: /proxies
        pages: !PaginateCrawler
            proxies: !File
                method: readlines
                filename: /proxies
            url_pattern: http://snipplr.com/all/page/{NUMBER}
            start_no: 0
            end_no: 0
            multithread: True
            wrapper: '<ol class="snippets marg">([^^].*?)</ol>'
            logger:
                filename: snipplr.log
        url_pattern: '/view/\d+/[^^]*?/$'
        wrapper:
            title: '<h1>([^^]*?)</h1>'
            language: '<p class="nomarg"><span class="rgt">Published in: ([^^]*?)</span>'
            author: '<h2>Posted By</h2>\s+<p><a[^^]*?>([^^]*?)</a>'
            code: '<a rel="nofollow" href="/view.php\?codeview&id=(\d+)">'
            comment: '<div class="description">([^^]*?)</div>'
            tag: '<h2>Tagged</h2>\s+<p>([^^]*?)</p>'
        essential_fields:
            - title
            - language
            - code
        multithread: True
        remove_external_duplicate: True
        logger:
            filename: snipplr.log
        page_callback: !Document
            label: snipplr
            method: write
            page: None
            logger:
                filename: snipplr.log
        furthure:
            tag:
                parser: !Steps
                    inputs: None
                    steps:
                        - !Dict
                          dictionary: None
                          method: member
                          args:
                              - tag
                        - !Regx
                          string: None
                          pattern: <a[^^]*?>([^^]*?)</a>
                          multiple: True
            code:
                parser: !Steps
                    inputs: None
                    steps:
                        - !Dict
                          dictionary: None
                          method: member
                          args:
                              - code
                        - !String
                          args: None
                          base_str: 'http://snipplr.com/view.php?codeview&id=%s'
                        - !Init
                          inputs: None
                          obj: !URLCrawler
                              urls: None
                              params:
                                  save_output: True
                                  wrapper: '<textarea[^^]*?class="copysource">([^^]*?)</textarea>'
                        - !Array
                          arr: None
                          method: member
                          args:
                              - 0
                        - !Dict
                          dictionary: None
                          method: member
                          args:
                              - wrapper
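The wrapper entries in the template above are ordinary regular expressions with capture groups; [^^] is a character class matching any character except '^', a trick that also matches newlines, which a plain '.' would not. A quick sketch of how the title wrapper extracts a field, using Python's re module on made-up HTML (the real snipplr.com markup differs):

```python
import re

# Stand-in HTML for a snippet detail page (illustrative only).
html = '<h1>Quicksort in Python</h1>\n<div class="description">sorts a list</div>'

# The title wrapper from the template above.
title_pattern = r"<h1>([^^]*?)</h1>"
match = re.search(title_pattern, html)
print(match.group(1))  # -> Quicksort in Python

# Unlike '.', [^^] crosses line breaks without re.DOTALL:
multiline = re.search(r"<p>([^^]*?)</p>", "<p>line one\nline two</p>")
print(multiline.group(1))
```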
-------------------------------------------------------------------
#sites.yaml
snipplr:
    duration: 1800
    end_no: 1
    last_updated_at: 1263883143.0
    start_no: 1
    step: 1
--------------------------------------------------------------------
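sites.yaml holds per-site crawl state. Judging from the field names (this reading is an assumption, not documented behavior), duration and last_updated_at control when a site is due for re-crawling; a minimal sketch of that check, with the entry shown above written as a plain dict:

```python
# Parsed form of the sites.yaml entry above.
site = {
    "duration": 1800,
    "end_no": 1,
    "last_updated_at": 1263883143.0,
    "start_no": 1,
    "step": 1,
}

def is_due(site, now):
    # Assumed semantics: re-crawl once `duration` seconds have
    # passed since the last update.
    return now - site["last_updated_at"] >= site["duration"]

print(is_due(site, 1263883143.0 + 1800))  # -> True
print(is_due(site, 1263883143.0 + 10))    # -> False
```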
Run the crawler in a shell:
python lego.py -c snipplr.yaml
The results are stored in MongoDB. You can use the inserter module to
insert the data into a MySQL database.
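The inserter module itself is not documented here. As an illustration of the MongoDB-to-MySQL step only (the table name, field names, and helper below are made up, not Lego's API), a crawled document can be turned into a parameterized INSERT for a DB-API cursor such as MySQLdb's:

```python
def to_insert(table, doc):
    # Build a parameterized INSERT statement plus its value list;
    # parameterization leaves quoting/escaping to the DB driver.
    cols = sorted(doc)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), ", ".join(["%s"] * len(cols)))
    return sql, [doc[c] for c in cols]

# A document shaped like the wrapper fields in the demo template.
doc = {"title": "Quicksort in Python", "language": "Python", "code": "..."}
sql, params = to_insert("snippets", doc)
print(sql)     # -> INSERT INTO snippets (code, language, title) VALUES (%s, %s, %s)
print(params)  # -> ['...', 'Python', 'Quicksort in Python']
```

In real use the pair would be passed to cursor.execute(sql, params) on an open MySQL connection.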