feat: Rocket.Chat Docs Crawl #1

Merged
merged 2 commits into from
Dec 20, 2024

@Dnouv Dnouv commented Dec 20, 2024

This pull request introduces a new web scraping tool to extract documentation content from Rocket.Chat's official websites using Scrapy. The changes include the implementation of the spider, configuration of the environment, and a script to process and send the extracted data.

Key changes:

Implementation of the web scraping tool:

  • rocket_chat_docs_spider/rcspider.py: Added a Scrapy spider to crawl Rocket.Chat documentation websites, extract page titles, main content, H2 headers, and URLs, and save the data in JSONL format.
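As a rough illustration of the per-page record described above (title, main content, H2 headers, URL, one JSON object per line), here is a minimal standard-library sketch. It is not the spider from this PR: it deliberately swaps Scrapy's selectors for `html.parser` so it runs without extra dependencies, and the field names (`url`, `title`, `h2_headers`, `content`), the helper names, and the sample URL are all hypothetical.

```python
import json
from html.parser import HTMLParser


class DocPageParser(HTMLParser):
    """Collects the <title>, every <h2>, and the visible text of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h2_headers = []
        self.text_parts = []
        self._stack = []  # open tags we track: title, h2, script, style

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h2", "script", "style"):
            self._stack.append(tag)
            if tag == "h2":
                self.h2_headers.append("")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in ("script", "style"):
            return  # not visible text
        if current == "title":
            self.title += data
            return  # keep the title out of the body content
        if current == "h2":
            self.h2_headers[-1] += data
        text = data.strip()
        if text:
            self.text_parts.append(text)


def page_to_record(url, html):
    """Build one record; the field names are assumptions, not the PR's schema."""
    parser = DocPageParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "h2_headers": [h.strip() for h in parser.h2_headers],
        "content": " ".join(parser.text_parts),
    }


def to_jsonl_line(record):
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)


sample_html = (
    "<html><head><title>Install Guide</title></head>"
    "<body><h2>Requirements</h2><p>Docker is needed.</p>"
    "<h2>Steps</h2><p>Run the installer.</p></body></html>"
)
record = page_to_record("https://example.com/docs/install", sample_html)
print(to_jsonl_line(record))
```

In a Scrapy spider, a dictionary like `record` would typically be yielded from the parse callback and written out by a feed export; the exact selectors and settings used in `rcspider.py` are not shown in this description.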

Configuration and setup:

  • rocket_chat_docs_spider/README.md: Added documentation for the web scraping tool, including an overview, prerequisites, configuration, output format, usage instructions, and important notes.

Data processing:

Dependencies:

@Dnouv Dnouv merged commit a008d59 into main Dec 20, 2024
1 check passed
@Dnouv Dnouv deleted the new/rc_spider branch December 20, 2024 05:04