
Fixed hanging on websites with never-ending streaming content #140

Merged 4 commits into master on Nov 19, 2017

Conversation

@jsf9k (Member) commented Nov 16, 2017

This should resolve #138.

The code now uses the streaming variant of python-requests. This variant
delays retrieval of the content until we access Response.content.
This will:
 * Save us time and bandwidth
 * Stop pshtt from hanging on URLs that stream never-ending data, like
   webcams. See #138.

Since we are not actually reading the content of the response, we have
to be careful to ensure that the close() method is called on the
Response object returned by the ping method. That is the ONLY way the
connection can be closed and released back into the pool. One way to
ensure this happens is to use Python's "with" construct.
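A minimal sketch of the pattern described above, assuming python-requests (the helper names here are illustrative, not pshtt's actual implementation):

```python
import requests


def ping(url, timeout=5):
    # stream=True defers the body download until Response.content is
    # accessed, so a URL that streams never-ending data (e.g. a webcam)
    # cannot hang us while we only inspect the status line and headers.
    return requests.get(url, stream=True, timeout=timeout)


def check_url(url):
    # The "with" block guarantees close() is called on the Response,
    # which releases the connection back into the pool even though the
    # body is never read.
    with ping(url) as response:
        return response.status_code
```

Without the `with` (or an explicit `close()`), an unread streaming response would hold its pooled connection open indefinitely.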
@konklone (Collaborator) left a comment:

Added a comment (and a commit with code comments) about future response body parsing.

I also tested out a pshtt scan across the ~1300 federal parent .gov domains, from master and from this branch, to check if there were any regressions or changes observed with streaming mode enabled. While I didn't check every single change in detail, the only meaningful change I observed is that a connection to http://www.altusandc.gov succeeds with this branch, where it fails with master after a connection error (which presumably occurs during body transfer but not header transfer).

This looks good to me. Great catch and great fix, @jsf9k!

# Setting this to true delays the retrieval of the content
# until we access Response.content. Since we aren't
# interested in the actual content of the request, this will
# save us time and bandwidth.
@konklone (Collaborator) commented:

This is awesome, and sounds like it will definitely speed up pshtt all around.

Worth noting that if we decide to tackle #52 and look at <meta> tags when calculating redirects (which I'm split on), this would necessitate reading in the content, at least in some cases. We could still be conditional in those cases, such as only doing so for 2XX response codes, or when the Content-Length is > [0 or some small number].
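The conditional-read idea above could be sketched as a small duck-typed helper (the name `maybe_read_body` and the 64 KiB threshold are hypothetical, not pshtt code):

```python
def maybe_read_body(response, max_length=64 * 1024):
    """Return the response body only when it is worth parsing (e.g. for
    <meta> redirect detection); otherwise return None without reading it."""
    # Skip non-2XX responses: error pages and HTTP-level redirects do not
    # need <meta http-equiv="refresh"> parsing.
    if not (200 <= response.status_code < 300):
        return None
    # Skip bodies the server declares as large; with stream=True they have
    # not been downloaded yet, and we want to keep it that way.
    length = response.headers.get("Content-Length")
    if length is not None and int(length) > max_length:
        return None
    # Accessing .content here performs the deferred download.
    return response.content
```

This keeps the streaming fix's bandwidth savings in the common case while still allowing body inspection when #52 calls for it.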

@konklone konklone merged commit 23ea541 into master Nov 19, 2017
@konklone konklone deleted the bugfix/hanging_on_streaming_content branch November 19, 2017 22:18
@konklone (Collaborator) commented:
Also, in my testing on the ~1300 federal domains, I didn't notice any time difference in the overall scan - it was 225 seconds vs 227 seconds, done with 900 Lambda workers.

cisagovbot pushed a commit that referenced this pull request Dec 19, 2023:

Update the Dependabot configuration