Fixed hanging on websites with never-ending streaming content #140
Conversation
The code now uses the streaming variant of python-requests, which delays retrieval of the content until we access Response.content. This will:

* save us time and bandwidth;
* stop pshtt from hanging on URLs that stream never-ending data, like webcams (see #138).

Since we are not actually reading the content of the request, we have to be careful to ensure that the close() method is called on the Response object returned by the ping method. That is the ONLY way the connection can be closed and released back into the pool. One way to ensure this happens is to use the "with" Python construct.
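A minimal, self-contained sketch of the pattern described above (the endless local server is a stand-in for a webcam-style URL, not pshtt's actual code): with stream=True, requests.get() returns as soon as the headers arrive, and the "with" block guarantees close() is called so the connection is released back into the pool.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class EndlessHandler(BaseHTTPRequestHandler):
    """Serves a response whose body never ends, like a webcam stream."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        try:
            while True:  # stream bytes forever
                self.wfile.write(b"x" * 1024)
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client hung up, which is exactly what we expect

    def log_message(self, *args):
        pass  # keep the demo quiet


# Start the never-ending server on an ephemeral port in a daemon thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), EndlessHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# stream=True: get() returns once the headers are in. The body is never
# downloaded because we never touch response.content, and the "with"
# block calls close() for us when we leave it.
with requests.get(url, stream=True, timeout=5) as response:
    status = response.status_code

print(status)
```

Without stream=True, the requests.get() call above would block forever trying to read the body; with it, only an explicit access to response.content would do so.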
Added a comment (and a commit with code comments) about future response body parsing.
I also tested out a pshtt scan across the ~1300 federal parent .gov domains, from master and from this branch, to check whether there were any regressions or changes observed with streaming mode enabled. While I didn't check every single change in detail, the only meaningful change I observed is that a connection to http://www.altusandc.gov succeeds with this branch, where it fails on master after a connection error (which presumably occurs during body transfer but not header transfer).
This looks good to me. Great catch and great fix, @jsf9k!
# Setting this to true delays the retrieval of the content
# until we access Response.content. Since we aren't
# interested in the actual content of the request, this will
# save us time and bandwidth.
This is awesome, and sounds like it will definitely speed up pshtt all around.

Worth noting that if we decide to tackle #52 and look at <meta> tags when calculating redirects (which I'm split on), this would necessitate reading in the content, at least in some cases. We could still be conditional in those cases, such as only doing so for 2XX response codes, or when the Content-Length is > [0 or some small number].
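A hypothetical helper sketching the conditional read discussed above (the function name and the 512 KiB cutoff are illustrative assumptions, not anything from this PR): only download the body when the status is 2XX and the advertised Content-Length is plausibly a real HTML page.

```python
# Assumed size cutoff for "worth parsing" bodies; not from the PR.
MAX_BODY_BYTES = 512 * 1024


def should_read_body(response):
    """Return True if it is worth accessing response.content.

    ``response`` only needs ``status_code`` and a ``headers`` mapping,
    like a requests.Response object.
    """
    # Only 2XX responses could carry a <meta> redirect worth parsing.
    if not 200 <= response.status_code < 300:
        return False
    length = response.headers.get("Content-Length")
    if length is None:
        # Unknown length: the caller could still cap how much it reads.
        return True
    return 0 < int(length) <= MAX_BODY_BYTES
```

This keeps the streaming behavior as the default: the body is pulled only for the small subset of responses that pass these checks.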
Also, in my testing on the ~1300 federal domains, I didn't notice any time difference in the overall scan: it was 225 seconds vs. 227 seconds, done with 900 Lambda workers.
Update the Dependabot configuration
This should resolve #138.