
Does the scraper actually support basic auth? #78

Closed
comatory opened this issue Jan 29, 2025 · 3 comments

comatory (Contributor) commented Jan 29, 2025

The documentation mentions two env variables:

  • DOCSEARCH_BASICAUTH_USERNAME
  • DOCSEARCH_BASICAUTH_PASSWORD

I was looking through the code to figure out how they are encoded into the Authorization header so I can set up my private internal site correctly. However, the only place I see them mentioned is the documentation_spider.py file.

This file reads the environment variables but does not seem to do anything with them. I see they are assigned to the class properties http_user and http_pass, so I searched the codebase for those names, but did not find anything else.
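To illustrate, the relevant part of the file boils down to something like this (a paraphrased sketch, not the verbatim source; the class name and exact lookup may differ):

import os

from scrapy.spiders import CrawlSpider


class DocumentationSpider(CrawlSpider):
    # The env variables end up as plain class attributes; nothing else
    # in this repository appears to read http_user / http_pass afterwards.
    http_user = os.environ.get("DOCSEARCH_BASICAUTH_USERNAME")
    http_pass = os.environ.get("DOCSEARCH_BASICAUTH_PASSWORD")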

Am I right to assume that this does not actually work? Is the documentation lying or did I miss some important piece?

Any clarification would help. Meanwhile, I hope #76 gets merged; that way I could specify the Authorization header directly without relying on this implementation.

comatory (Contributor, Author) commented Jan 29, 2025

Basically I'd expect that the scraper would:

  1. Dispatch all requests with an Authorization header whenever these two variables are provided.
  2. Build the Authorization header for basic HTTP auth from username:password encoded as base64, as in the sketch below.

I would also hope redirects would be respected. For example, the domain I'm trying to reach internally redirects to an opaque domain (a Cloudflare Worker), so I'd need these headers to be sent there as well.
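For reference, point 2 amounts to the standard encoding below (plain Python, just to illustrate the header value I'd expect the scraper to produce):

import base64

username, password = "username", "password"
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

# Every request would then carry:
#   Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
print(f"Authorization: Basic {token}")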

tharropoulos (Contributor) commented:

The documentation is correct and the authentication does work; it just happens behind the scenes through Scrapy's built-in HttpAuthMiddleware.

The http_user and http_pass properties you found in documentation_spider.py are actually being used by Scrapy's authentication middleware automatically. This is documented in Scrapy's HttpAuthMiddleware docs.

When you set these properties on a spider:

# Credentials sourced from DOCSEARCH_BASICAUTH_USERNAME / DOCSEARCH_BASICAUTH_PASSWORD
spider.http_user = "testuser"
spider.http_pass = "testpass"
# Limits the credentials to this domain (supported since Scrapy 2.5.1)
spider.http_auth_domain = "example.com"

Scrapy's HttpAuthMiddleware will automatically:

  1. Intercept all requests
  2. Check if they match the http_auth_domain
  3. Add the appropriate Basic Auth headers using the credentials

You don't see this explicitly in the codebase because it's handled by Scrapy's middleware pipeline. The environment variables are read and set as spider attributes, then Scrapy's built-in auth middleware uses them to add the proper Authorization: Basic <encoded_credentials> header.
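If you want to see the exact value that will be attached, the middleware builds the header via w3lib's basic_auth_header helper, which you can call directly:

from w3lib.http import basic_auth_header

# basic_auth_header is the helper Scrapy's HttpAuthMiddleware uses
# to construct the header from http_user / http_pass.
print(basic_auth_header("testuser", "testpass"))
# -> b'Basic dGVzdHVzZXI6dGVzdHBhc3M='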

I've also taken the time to introduce a request interceptor middleware to debug the headers of each request, so you can check it out yourself.
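That interceptor isn't reproduced here, but a minimal version would be an ordinary Scrapy downloader middleware along these lines (the class name and module path are illustrative, not the actual ones from the change):

class HeaderInspectionMiddleware:
    """Log the headers of every outgoing request as a debugging aid."""

    def process_request(self, request, spider):
        spider.logger.debug("Headers for %s: %r", request.url, request.headers)
        # Returning None lets the request continue through the
        # remaining middlewares.
        return None

Enable it via settings, using a priority above HttpAuthMiddleware's default of 300 so it runs after the Authorization header has been added:

DOWNLOADER_MIDDLEWARES = {
    "scraper.middlewares.HeaderInspectionMiddleware": 901,
}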

So your authentication should work as expected; the recently added auth_test.py verifies this behavior by checking that these properties are set correctly.

Let me know if you need any clarification or have additional questions.

comatory closed this as completed Feb 5, 2025
comatory (Contributor, Author) commented Feb 5, 2025

Thank you for the clarification 👍
