This pull request introduces a new web scraping tool to extract documentation content from Rocket.Chat's official websites using Scrapy. The changes include the implementation of the spider, configuration of the environment, and a script to process and send the extracted data.
Key changes:

Implementation of the web scraping tool:
rocket_chat_docs_spider/rcspider.py: Added a Scrapy spider to crawl Rocket.Chat documentation websites, extract page titles, main content, H2 headers, and URLs, and save the data in JSONL format.

Configuration and setup:
rocket_chat_docs_spider/README.md: Added documentation for the web scraping tool, including an overview, prerequisites, configuration, output format, usage instructions, and important notes.

Data processing:
rocket_chat_docs_spider/read_and_send.py: Added a script to process the JSONL file generated by the spider, prepare the documents, and send them to a specified API endpoint.

Dependencies:
rocket_chat_docs_spider/requirements.txt: Added necessary dependencies for the project, including requests and scrapy.
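
For reviewers who want a concrete picture of the data-processing step, the flow of reading the JSONL output, preparing each document, and sending it to an API endpoint might look roughly like the sketch below. This is not the code from read_and_send.py: the record field names (title, url, h2_headers, content), the function names, the payload shape, and the injectable post parameter are all assumptions made for illustration.

```python
import json
from typing import Callable, Iterable, Iterator, Optional


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSONL input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def prepare_document(record: dict) -> dict:
    """Normalize one scraped record (field names are assumed, not confirmed)."""
    headers = record.get("h2_headers") or []
    return {
        "title": (record.get("title") or "").strip(),
        "url": record.get("url", ""),
        # Drop empty headers and trim surrounding whitespace.
        "sections": [h.strip() for h in headers if h and h.strip()],
        # Collapse runs of whitespace and newlines into single spaces.
        "text": " ".join((record.get("content") or "").split()),
    }


def send_documents(docs: Iterable[dict], endpoint: str,
                   post: Optional[Callable] = None) -> int:
    """POST each document as JSON; return how many were sent.

    `post` defaults to requests.post and is injectable so the function
    can be exercised without network access.
    """
    if post is None:
        import requests  # imported lazily; listed in requirements.txt
        post = requests.post
    sent = 0
    for doc in docs:
        response = post(endpoint, json=doc, timeout=30)
        response.raise_for_status()  # stop on the first HTTP error
        sent += 1
    return sent
```

With a file on disk, usage could be as simple as opening the spider's output, passing the file object to read_jsonl, mapping prepare_document over the records, and handing the result to send_documents together with the target endpoint. Raising on the first HTTP error keeps the sketch short; a real script may prefer to log failures and continue.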