feat: Rocket.Chat Docs Crawl #1

Dnouv · 2024-12-20T04:50:20Z

This pull request adds a Scrapy-based web scraping tool that extracts documentation content from Rocket.Chat's official websites. It includes the spider itself, setup documentation and dependencies, and a script that processes the extracted data and sends it to an API endpoint.

Key changes:

Implementation of the web scraping tool:

rocket_chat_docs_spider/rcspider.py: Added a Scrapy spider to crawl Rocket.Chat documentation websites, extract page titles, main content, H2 headers, and URLs, and save the data in JSONL format.
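To make the extracted fields concrete, here is a minimal sketch of turning one HTML page into one JSONL record. It is not the spider from this PR: it uses the standard library's `HTMLParser` in place of Scrapy selectors so it runs without dependencies. The field names (`url`, `title`, `h2_headers`, `content`), the assumption that main content lives in a `<main>` element, and the example URL are all hypothetical.

```python
import json
from html.parser import HTMLParser


class PageFieldsParser(HTMLParser):
    """Collects the <title> text, every <h2> heading, and text inside <main>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h2_headers = []
        self.content_parts = []
        self._stack = []  # currently open tags among title/h2/main

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h2", "main"):
            self._stack.append(tag)
        if tag == "h2":
            self.h2_headers.append("")  # start a new heading buffer

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return  # ignore whitespace and text outside tracked tags
        if self._stack[-1] == "title":
            self.title += text
        if self._stack[-1] == "h2":
            self.h2_headers[-1] += text
        if "main" in self._stack:
            self.content_parts.append(text)


def page_to_record(url, html):
    """Build one record with the four fields described above (names assumed)."""
    parser = PageFieldsParser()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title,
        "h2_headers": [h for h in parser.h2_headers if h],
        "content": " ".join(parser.content_parts),
    }


def to_jsonl_line(record):
    """Serialize a record as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)


sample_html = """<html><head><title>Install Guide</title></head>
<body><nav>Menu</nav><main><h2>Requirements</h2><p>Docker is needed.</p>
<h2>Steps</h2><p>Run the installer.</p></main></body></html>"""

record = page_to_record("https://example.invalid/install", sample_html)
print(to_jsonl_line(record))
```

In a Scrapy spider the same record would typically be yielded from the `parse` callback, with a feed export handling the JSONL writing.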

Configuration and setup:

rocket_chat_docs_spider/README.md: Added documentation for the web scraping tool, including an overview, prerequisites, configuration, output format, usage instructions, and important notes.

Data processing:

rocket_chat_docs_spider/read_and_send.py: Added a script to process the JSONL file generated by the spider, prepare the documents, and send them to a specified API endpoint.
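The read, prepare, and send flow might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `read_and_send.py`: the payload shape (`id`, `title`, `text`, wrapped in `documents`), the batching, the record field names, and the endpoint URL are hypothetical. The sender is injectable so the example runs offline; the default sender uses `requests.post`, which is only imported when it is actually called.

```python
import json
import tempfile
from pathlib import Path


def load_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def prepare_documents(records):
    """Map scraped records to an assumed payload shape, skipping empty pages."""
    docs = []
    for rec in records:
        content = rec.get("content", "").strip()
        if not content:
            continue
        docs.append({
            "id": rec["url"],
            "title": rec.get("title", ""),
            "text": content,
        })
    return docs


def post_with_requests(endpoint, payload):
    """Default sender; imports requests lazily so the rest runs without it."""
    import requests
    response = requests.post(endpoint, json=payload, timeout=30)
    response.raise_for_status()
    return response.status_code


def send_documents(docs, endpoint, post=post_with_requests, batch_size=50):
    """Send documents in batches and return the number of batches sent."""
    batches = 0
    for start in range(0, len(docs), batch_size):
        post(endpoint, {"documents": docs[start:start + batch_size]})
        batches += 1
    return batches


# Offline demonstration: write a tiny JSONL file and capture the outgoing calls.
sent = []
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "docs.jsonl"
    jsonl_path.write_text(
        '{"url": "https://example.invalid/a", "title": "A", "content": "Alpha"}\n'
        '{"url": "https://example.invalid/b", "title": "B", "content": ""}\n',
        encoding="utf-8",
    )
    docs = prepare_documents(load_records(jsonl_path))

batch_count = send_documents(
    docs,
    "https://example.invalid/ingest",
    post=lambda endpoint, payload: sent.append((endpoint, payload)),
)
print(batch_count, len(docs))
```

Passing the sender as a parameter keeps the parsing and preparation steps testable without network access; against a real endpoint, the default `post_with_requests` would be used instead of the lambda.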

Dependencies:

rocket_chat_docs_spider/requirements.txt: Added necessary dependencies for the project, including requests and scrapy.

Dnouv added 2 commits December 20, 2024 10:19

init and add spider script

fc153d5

add data use script

8f0d805

Dnouv merged commit a008d59 into main Dec 20, 2024
1 check passed

Dnouv deleted the new/rc_spider branch December 20, 2024 05:04
