
Fixed hanging on websites with never-ending streaming content #140

Merged 4 commits into master on Nov 19, 2017

Conversation

@jsf9k (Member) commented Nov 16, 2017

This should resolve #138.

The code now uses the streaming variant of python-requests. This variant
delays retrieval of the content until we access Response.content.
This will:
 * Save us time and bandwidth
 * Stop pshtt from hanging on URLs that stream never-ending data, like
   webcams. See #138.

Since we are not actually reading the content of the response, we have
to be careful to ensure that the close() method is called on the
Response object returned by the ping method. That is the ONLY way the
connection can be closed and released back into the pool. One way to
ensure this happens is to use Python's "with" construct.
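A minimal sketch of the pattern described above, assuming python-requests (the helper names here are illustrative, not pshtt's actual implementation):

```python
import requests


def ping(url, timeout=5):
    # stream=True defers the body download until Response.content is
    # accessed, so a URL that streams never-ending data (e.g. a webcam)
    # cannot hang us while we only inspect the status line and headers.
    return requests.get(url, stream=True, timeout=timeout)


def check_url(url):
    # The "with" block guarantees close() is called on the Response,
    # which releases the connection back into the pool even though the
    # body is never read.
    with ping(url) as response:
        return response.status_code
```

Without the `with` (or an explicit `close()`), an unread streaming response would hold its pooled connection open indefinitely.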
@konklone (Collaborator) left a comment:

Added a comment (and a commit with code comments) about future response body parsing.

I also tested out a pshtt scan across the ~1300 federal parent .gov domains, from master and from this branch, to check if there were any regressions or changes observed with streaming mode enabled. While I didn't check every single change in detail, the only meaningful change I observed is that a connection to http://www.altusandc.gov succeeds with this branch, where it fails with master after a connection error (which presumably occurs during body transfer but not header transfer).

This looks good to me. Great catch and great fix, @jsf9k!

# Setting this to true delays the retrieval of the content
# until we access Response.content. Since we aren't
# interested in the actual content of the request, this will
# save us time and bandwidth.
@konklone (Collaborator) commented:

This is awesome, and sounds like it will definitely speed up pshtt all around.

Worth noting that if we decide to tackle #52 and look at <meta> tags when calculating redirects (which I'm split on), this would necessitate reading in the content, at least in some cases. We could still be conditional in those cases, such as only doing so for 2XX response codes, or when the Content-Length is > [0 or some small number].
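The conditional-read idea above could be sketched as a small duck-typed helper (the name `maybe_read_body` and the 64 KiB threshold are hypothetical, not pshtt code):

```python
def maybe_read_body(response, max_length=64 * 1024):
    """Return the response body only when it is worth parsing (e.g. for
    <meta> redirect detection); otherwise return None without reading it."""
    # Skip non-2XX responses: error pages and HTTP-level redirects do not
    # need <meta http-equiv="refresh"> parsing.
    if not (200 <= response.status_code < 300):
        return None
    # Skip bodies the server declares as large; with stream=True they have
    # not been downloaded yet, and we want to keep it that way.
    length = response.headers.get("Content-Length")
    if length is not None and int(length) > max_length:
        return None
    # Accessing .content here performs the deferred download.
    return response.content
```

This keeps the streaming fix's bandwidth savings in the common case while still allowing body inspection when #52 calls for it.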

@konklone konklone merged commit 23ea541 into master Nov 19, 2017
@konklone konklone deleted the bugfix/hanging_on_streaming_content branch November 19, 2017 22:18
@konklone (Collaborator) commented:
Also, in my testing on the ~1300 federal domains, I didn't notice any time difference in the overall scan - it was 225 seconds vs 227 seconds, done with 900 Lambda workers.

cisagovbot pushed a commit that referenced this pull request Dec 19, 2023:

Update the Dependabot configuration