hupubbs

A crawler built on Scrapy that scrapes the Hupu forums.

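Deployment

Clone the project to your local machine.
In hupubbs/pipelines.py, edit the MySQL connection parameters assigned to self.db in MySQLPipeline.open_spider so that they point to your own MySQL server.
From the command line, change into the project directory and run scrapy crawl hupubbs.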
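
As an illustration of the second step, here is a minimal sketch of how the connection parameters could be kept out of the source file. It is not the project's actual code: the HUPUBBS_MYSQL_* environment variable names, the default values, the helper mysql_settings_from_env, and the choice of PyMySQL as the driver are all assumptions made for this example. Only the class name MySQLPipeline, the method open_spider, and the attribute self.db come from the steps above.

```python
import os


def mysql_settings_from_env(env=os.environ):
    """Build MySQL connection parameters from environment variables.

    The variable names and defaults are hypothetical; adapt them to your setup.
    """
    return {
        "host": env.get("HUPUBBS_MYSQL_HOST", "127.0.0.1"),
        "port": int(env.get("HUPUBBS_MYSQL_PORT", "3306")),
        "user": env.get("HUPUBBS_MYSQL_USER", "root"),
        "password": env.get("HUPUBBS_MYSQL_PASSWORD", ""),
        "database": env.get("HUPUBBS_MYSQL_DB", "hupubbs"),
        "charset": "utf8mb4",
    }


class MySQLPipeline:
    """Sketch of a Scrapy item pipeline that opens one connection per crawl."""

    def __init__(self, connect=None):
        # `connect` can be injected for testing; by default the driver is
        # imported lazily in open_spider.
        self._connect = connect

    def open_spider(self, spider):
        connect = self._connect
        if connect is None:
            # Assumption: PyMySQL is the driver. The real project may use another.
            import pymysql
            connect = pymysql.connect
        self.db = connect(**mysql_settings_from_env())

    def close_spider(self, spider):
        # Release the connection when the crawl finishes.
        self.db.close()
```

With something like this in place, the server can be selected per run, for example by setting HUPUBBS_MYSQL_HOST before calling scrapy crawl hupubbs, instead of editing the file each time.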

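Design documentation

Usage example

Hupu lets users hide their activity, which makes it impossible to see from their profile which boards they mostly reply in. After crawling the forum with this crawler, run the following query in the database: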

select plate.url, count(*)
from reply
    left join thread on reply.thread_id = thread.id
    left join plate on thread.plate_id = plate.id
    left join user on reply.user_id = user.id
where user.url_id = 245307700327195 # `https://my.hupu.com/245307700327195`
group by plate.id;

This shows how many replies that user has posted in each board.
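
The same lookup can be wrapped in a small function so the user ID is passed as a parameter rather than pasted into the SQL. The sketch below is only an illustration. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs on its own, the table definitions contain only the columns named in the query above, the board URLs are made-up placeholders, and the order by clause is an addition. With a MySQL driver such as PyMySQL the placeholder would be %s instead of ?.

```python
import sqlite3

# Same joins and grouping as the query above, with the user ID parameterized.
# The "order by" clause is an addition so the busiest board comes first.
QUERY = """
select plate.url, count(*)
from reply
    left join thread on reply.thread_id = thread.id
    left join plate on thread.plate_id = plate.id
    left join user on reply.user_id = user.id
where user.url_id = ?
group by plate.id
order by count(*) desc
"""


def replies_per_plate(conn, url_id):
    """Return (board URL, reply count) rows for one user."""
    return conn.execute(QUERY, (url_id,)).fetchall()


def demo_connection():
    """Create a tiny stand-in database; all rows are invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table plate (id integer primary key, url text);
        create table thread (id integer primary key, plate_id integer);
        create table user (id integer primary key, url_id integer);
        create table reply (id integer primary key, thread_id integer, user_id integer);
        insert into plate values (1, 'plate-a'), (2, 'plate-b');
        insert into thread values (10, 1), (11, 2), (12, 2);
        insert into user values (100, 245307700327195), (101, 42);
        insert into reply values (1, 10, 100), (2, 11, 100), (3, 12, 100), (4, 11, 101);
    """)
    return conn


if __name__ == "__main__":
    for url, count in replies_per_plate(demo_connection(), 245307700327195):
        print(url, count)
```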
