No handler for status code != 200 #62

Open
lanzer opened this issue Oct 27, 2015 · 4 comments

lanzer commented Oct 27, 2015

When a status code other than "200 OK" is received, the process halts. This can be caused by a "404 Not Found", a server-side problem such as exceeded bandwidth, or a permission error. It's a problem for me because I am working with a big list of URLs, some of which are potentially outdated.

I noticed that the basic renderer (there is a headless renderer, but it isn't called even with the -h parameter) doesn't handle status codes other than 200:

basic.js (14)

  request(conf, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      renderer.emit('renderer.urlRendered', url, body);
    } else if (error) {
      this.emit('error', error);
    } 
  });

Also, scraper.js does not have a listener for an abnormal status:

scraper.js (252)

  renderer.on('renderer.urlRendered', function(theUrl, html) {

I've added a few lines to make things work for me:

basic.js (14)

  request(conf, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      renderer.emit('renderer.urlRendered', url, body);
    } else if (error) {
      this.emit('error', error);
    } else if (response.statusCode != 200) {
      renderer.emit('renderer.status', response.statusMessage);
    }
  });

scraper.js (252)

  renderer.on('renderer.status', function(message) {
    scraper.emit('urlRendered', message);
    scraper.ticker.tick();
  });
  renderer.on('renderer.urlRendered', function(theUrl, html) {
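
For anyone who wants to try the idea without hitting the network, here is a minimal sketch of the same event flow using Node's built-in `events` module. `handleResponse` is a hypothetical helper that stands in for the `request` callback in basic.js, and passing the URL and numeric status code along with the message is my own assumption, not what the patch above does.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: mirrors the shape of the request() callback in basic.js,
// but takes the callback arguments directly so no HTTP request is made.
function handleResponse(renderer, url, error, response, body) {
  if (error) {
    renderer.emit('error', error);
  } else if (response.statusCode === 200) {
    renderer.emit('renderer.urlRendered', url, body);
  } else {
    // Assumption: report the URL and code too, not just the status message.
    renderer.emit('renderer.status', url, response.statusCode, response.statusMessage);
  }
}

const renderer = new EventEmitter();
const events = [];

// Listeners that a scraper might register; here they only record what happened.
renderer.on('renderer.urlRendered', (url, html) => events.push(['rendered', url]));
renderer.on('renderer.status', (url, code, message) => events.push(['status', url, code, message]));
renderer.on('error', (err) => events.push(['error', err.message]));

// Simulated outcomes: a good page, a missing page, and a connection failure.
handleResponse(renderer, 'http://example.org/ok', null, { statusCode: 200, statusMessage: 'OK' }, '<html></html>');
handleResponse(renderer, 'http://example.org/gone', null, { statusCode: 404, statusMessage: 'Not Found' }, '');
handleResponse(renderer, 'http://example.org/down', new Error('ECONNREFUSED'), undefined, undefined);

console.log(events);
```

Every outcome produces exactly one event, so a listener can always advance the ticker and the run never stalls on a bad URL.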

Quickscrape does not read the result as an error and reports "0/0 elements captured (0 capture failed)", when it should read "0/2 elements" or whatever number is configured in the JSON. I haven't looked into how reporting is handled.

For the time being, I noticed something in thresher.js:

thresher.js (75)

    if (keyscaptured = 0) {

That is an assignment, so the condition is always false. It should probably be a comparison operator (== or ===).
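
A small sketch of why this matters. `buggyCheck` and `fixedCheck` are hypothetical stand-ins, and the return strings are made up for illustration; I haven't reproduced the surrounding thresher.js logic.

```javascript
// Hypothetical stand-in for the check as currently written.
function buggyCheck(keyscaptured) {
  // Assignment: sets keyscaptured to 0 and evaluates to 0 (falsy),
  // so this branch can never run and the original count is lost.
  if (keyscaptured = 0) {
    return 'nothing captured';
  }
  return 'captured ' + keyscaptured;
}

// Hypothetical stand-in for the check with a comparison operator.
function fixedCheck(keyscaptured) {
  if (keyscaptured === 0) {
    return 'nothing captured';
  }
  return 'captured ' + keyscaptured;
}

console.log(buggyCheck(0)); // "captured 0"
console.log(buggyCheck(2)); // "captured 0" (the real count was overwritten)
console.log(fixedCheck(0)); // "nothing captured"
console.log(fixedCheck(2)); // "captured 2"
```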

Hope this helps!

blahah self-assigned this Oct 28, 2015
Member

blahah commented Oct 28, 2015

lots of good stuff in here! thanks

lanzer pushed a commit to lanzer/thresher that referenced this issue Nov 4, 2015
Abnormal status will be returned with the status message string

ContentMine/quickscrape#62
Author

lanzer commented Nov 4, 2015

The fix was actually for thresher and not quickscrape. I pushed the changes and they seem to have been merged into my last pull request for another bug. I'm a total noob, so I might have gotten the procedure wrong. Please let me know if I need to make any changes on my end.

Member

blahah commented Nov 12, 2015

Thanks for this @lanzer and sorry for the slow reply - I've been away at various events. I will be incorporating these fixes in new releases in the next few days.

tarrow added Backlog and removed ready labels Sep 22, 2016
tarrow self-assigned this Sep 23, 2016
Contributor

tarrow commented Sep 23, 2016

I'm going to take over looking at this in the next few days; I also wrote a patch to fix this because I didn't realise there had been one in the pipeline for a while.
