Skip to content

A Python web scraper for extracting post content and comments from PTT website.

License

Notifications You must be signed in to change notification settings

Taiwan-Social-Media-Corpus/async-scraptt

Repository files navigation

async-ptt-crawler

This project scrapes/crawls post content and comments from PTT website, and implements neural CKIP Chinese NLP tools on the scraped data asynchronously.

Documentation

1. Installation

  1. Python version

    • python >= 3.10
  2. Clone repository

    git clone [email protected]:Taiwan-Social-Media-Corpus/async-scraptt.git
  3. Install Requirement

  • pip

    cd async-scraptt && pip install -r requirements.txt
  • poetry

    cd async-scraptt && poetry shell && poetry install

2. Usage

  1. Commands
scrapy crawl ptt -a boards=BOARDS [-a fetch_all=BOOLEAN]
            [-a index_from=NUMBER -a index_to=NUMBER]
            [-a scrape_from=YEAR] [-a data_dir=PATH]

positional arguments:
-a boards=BOARDS                          ptt board name (e.g. Soft_Job)
-a index_from=NUMBER -a index_to=NUMBER   html index number from a ptt board
-a fetch_all=BOOLEAN                     scrap all posts if true
-a scrape_from=YEAR                             scrap all posts from a given year
-a data_dir=PATH                          output file path (default: ./data)
  • Crawl all the posts of a board:

    scrapy crawl ptt -a boards=Soft_Job -a fetch_all=True
  • Crawl all the posts of a board from a year in the past:

    scrapy crawl ptt -a boards=Soft_Job -a scrape_from=2022
  • Crawl the posts of a board based on html indexes:

    scrapy crawl ptt -a boards=Soft_Job -a index_from=1722 -a index_to=1723

    Please make sure the number of index_from is greater than index_to.

  • Crawl the posts of multiple boards. For example:

    scrapy crawl ptt -a boards=Soft_Job,Gossiping -a index_from=1722 -a index_to=1723

    Note: the comma in the argument boards cannot have spaces. It cannot be boards=Soft_Job, Baseball or boards=["Soft_Job", "Baseball"].

3. Running Applications with Docker

A Docker setup is provided for the crawler.

To run the crawler, go to the docker-compose.yml file to edit the command:

version: "3"

services:
  scraptt:
    build: .
    # define your crawler here!
    command: bash -c "poetry run scrapy crawl ptt -a boards=Soft_Job -a index_from=1500 -a index_to=1500"

    volumes:
      - "./data:/app/data"

Feel free to change the command as long as it follows the command format as stated above.

Now start the crawler:

docker compose up

Contact

If you have any suggestion or question, please do not hesitate to email us at [email protected] or [email protected]

About

A Python web scraper for extracting post content and comments from PTT website.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published