
Allow 'file://' as url prefix for requests #2800

Closed
jensmeichler opened this issue Jan 9, 2025 · 2 comments
Labels
feature Issues that represent new features or improvements to existing features. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

jensmeichler commented Jan 9, 2025

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/memory-storage

Feature

Sometimes we want to test our crawler against static HTML files (in a test mode). It would therefore be beneficial to accept not only 'http' and 'https' but also 'file' as a URL protocol.

I guess there are more use cases for this feature, but this is the only one I have 😅

Motivation

I am currently building a crawler for a static HTML page. Testing would be easier if I could request it via file:// instead of having to serve it over http or https.

Ideal solution or implementation, and any additional constraints

This could easily be changed in packages/memory-storage/src/resource-clients/request-queue.ts:22.

[Screenshot, 2025-01-09 09:51: the referenced line in request-queue.ts]
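As a hedged sketch of the requested change (the names below are illustrative, not the actual crawlee code — the screenshot presumably shows some form of URL/protocol validation), the idea would be to extend a protocol allow-list:

```typescript
// Hypothetical sketch: a protocol allow-list extended to also accept file://.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'file:']);

export function isAllowedRequestUrl(url: string): boolean {
    try {
        // URL.protocol includes the trailing colon, e.g. 'file:'.
        return ALLOWED_PROTOCOLS.has(new URL(url).protocol);
    } catch {
        // Not a parseable URL at all.
        return false;
    }
}
```

With this, `isAllowedRequestUrl('file:///tmp/page.html')` would pass while unrelated protocols such as `ftp://` would still be rejected.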

Alternative solutions or implementations

No response

Other context

No response

@jensmeichler jensmeichler added the feature Issues that represent new features or improvements to existing features. label Jan 9, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 9, 2025
B4nan (Member) commented Jan 9, 2025

It's a bit more complicated, since the HTTP client we use by default (got-scraping) won't work with a file:// URL either.

The usual solution to this is to start a local web server, which you can easily do e.g. via npx http-server -o /path/to/static/content or a similar dependency, and scrape from localhost instead.
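The suggested local-server approach can also be done with a few lines of Node instead of an extra dependency. This is a minimal sketch (helper names are mine, assuming Node 18+ with the built-in http module):

```typescript
import { createServer, type Server } from 'node:http';
import { readFile } from 'node:fs/promises';

// Map a request path to a file path; '/' falls back to index.html.
export function toFilePath(dir: string, urlPath: string): string {
    return `${dir}${urlPath === '/' ? '/index.html' : urlPath}`;
}

// Serve static HTML files from `dir` so a crawler can scrape localhost.
export function serveStatic(dir: string): Server {
    return createServer(async (req, res) => {
        try {
            const body = await readFile(toFilePath(dir, req.url ?? '/'));
            res.writeHead(200, { 'Content-Type': 'text/html' });
            res.end(body);
        } catch {
            res.writeHead(404);
            res.end('Not found');
        }
    });
}
```

A test run would then call `serveStatic('./fixtures').listen(8080)` and point the crawler at `http://localhost:8080/`.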

jensmeichler (Author) commented:
Okay, thanks for the quick answer. I'll try to go with your suggested approach then :D
