
Does the scraper actually support basic auth? #78

Closed
comatory opened this issue Jan 29, 2025 · 3 comments

comatory (Contributor) commented Jan 29, 2025

The documentation mentions two env variables:

  • DOCSEARCH_BASICAUTH_USERNAME
  • DOCSEARCH_BASICAUTH_PASSWORD

I was looking through the code to figure out how they are encoded into the Authorization header so I can set up my private internal site correctly. However, the only place I see them mentioned is the documentation_spider.py file.

This file reads the environment variables but does not seem to do anything with them. I see they are assigned to the class properties http_user and http_pass, so I searched the codebase for those names, but did not find anything else.
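To illustrate, the relevant part of the file boils down to something like this (a paraphrased sketch, not the verbatim source; the class name and exact lookup may differ):

import os

from scrapy.spiders import CrawlSpider


class DocumentationSpider(CrawlSpider):
    # The env variables end up as plain class attributes; nothing else
    # in this repository appears to read http_user / http_pass afterwards.
    http_user = os.environ.get("DOCSEARCH_BASICAUTH_USERNAME")
    http_pass = os.environ.get("DOCSEARCH_BASICAUTH_PASSWORD")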

Am I right to assume that this does not actually work? Is the documentation lying or did I miss some important piece?

Any clarification would help. Meanwhile, I hope #76 gets merged; that way I could specify the Authorization header directly without relying on this implementation.

comatory (Contributor, Author) commented Jan 29, 2025

Basically I'd expect that the scraper would:

  1. Dispatch all requests with an Authorization header whenever these two variables are provided.
  2. Build the Authorization header for basic HTTP auth from username:password encoded as base64, as in the sketch below.

I would also hope redirects would be respected. For example, the domain I'm trying to reach internally redirects to an opaque domain (a Cloudflare Worker), so I'd need these headers to be sent there as well.
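For reference, point 2 amounts to the standard encoding below (plain Python, just to illustrate the header value I'd expect the scraper to produce):

import base64

username, password = "username", "password"
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

# Every request would then carry:
#   Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
print(f"Authorization: Basic {token}")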

tharropoulos (Contributor) commented:

The documentation is correct and the authentication does work; it just happens behind the scenes through Scrapy's built-in HttpAuthMiddleware.

The http_user and http_pass properties you found in documentation_spider.py are actually being used by Scrapy's authentication middleware automatically. This is documented in Scrapy's HttpAuthMiddleware docs.

When you set these properties on a spider:

# Credentials sourced from DOCSEARCH_BASICAUTH_USERNAME / DOCSEARCH_BASICAUTH_PASSWORD
spider.http_user = "testuser"
spider.http_pass = "testpass"
# Limits the credentials to this domain (supported since Scrapy 2.5.1)
spider.http_auth_domain = "example.com"

Scrapy's HttpAuthMiddleware will automatically:

  1. Intercept all requests
  2. Check if they match the http_auth_domain
  3. Add the appropriate Basic Auth headers using the credentials

You don't see this explicitly in the codebase because it's handled by Scrapy's middleware pipeline. The environment variables are read and set as spider attributes, then Scrapy's built-in auth middleware uses them to add the proper Authorization: Basic <encoded_credentials> header.
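If you want to see the exact value that will be attached, the middleware builds the header via w3lib's basic_auth_header helper, which you can call directly:

from w3lib.http import basic_auth_header

# basic_auth_header is the helper Scrapy's HttpAuthMiddleware uses
# to construct the header from http_user / http_pass.
print(basic_auth_header("testuser", "testpass"))
# -> b'Basic dGVzdHVzZXI6dGVzdHBhc3M='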

I've also taken the time to introduce a request interceptor middleware to debug the headers of each request, so you can check it out yourself.
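That interceptor isn't reproduced here, but a minimal version would be an ordinary Scrapy downloader middleware along these lines (the class name and module path are illustrative, not the actual ones from the change):

class HeaderInspectionMiddleware:
    """Log the headers of every outgoing request as a debugging aid."""

    def process_request(self, request, spider):
        spider.logger.debug("Headers for %s: %r", request.url, request.headers)
        # Returning None lets the request continue through the
        # remaining middlewares.
        return None

Enable it via settings, using a priority above HttpAuthMiddleware's default of 300 so it runs after the Authorization header has been added:

DOWNLOADER_MIDDLEWARES = {
    "scraper.middlewares.HeaderInspectionMiddleware": 901,
}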

So your authentication should work as expected; the recently added auth_test.py verifies this behavior by checking that these properties are set correctly.

Let me know if you need any clarification or have additional questions.

comatory closed this as completed Feb 5, 2025
comatory (Contributor, Author) commented Feb 5, 2025

Thank you for the clarification 👍
