Prevent app from crashing #498

Closed · wants to merge 3 commits
Changes from 2 commits
8 changes: 4 additions & 4 deletions app/scrapers/__init__.py
@@ -32,7 +32,7 @@


 def small_test():
-    assert isinstance(scrapers['google'].search('fossasia', 1), list)
+    assert isinstance(scrapers['google'].search('fossasia', 1)[0], list)


 def feed_gen(query, engine, count=10, qtype=''):
@@ -43,7 +43,7 @@ def feed_gen(query, engine, count=10, qtype=''):
                  'tyoutube': 'youtube'}
     engine = old_names.get(engine, engine)
     if engine in ('quora', 'youtube'):
-        urls = scrapers[engine].search_without_count(query)
+        urls, status_code = scrapers[engine].search_without_count(query)
     else:
-        urls = scrapers[engine].search(query, count, qtype)
-    return urls
+        urls, status_code = scrapers[engine].search(query, count, qtype)
+    return (urls, status_code)
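
With this change every scraper entry point hands back a (urls, status_code) pair instead of a bare list. A caller sketch (engine and query are illustrative; it assumes the app package is importable and performs a live request):

```python
from app.scrapers import feed_gen

# feed_gen now returns (urls, status_code) instead of a bare list of urls.
urls, status_code = feed_gen('fossasia', 'google')
if urls is None:
    print('Upstream engine failed with HTTP', status_code)
else:
    print('Fetched', len(urls), 'results with status', status_code)
```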
25 changes: 18 additions & 7 deletions app/scrapers/generalized.py
@@ -42,8 +42,11 @@ def get_page(self, query, startIndex=0, qtype=''):
         if self.name == 'mojeek' and qtype == 'news':
             payload['fmt'] = 'news'
         response = requests.get(url, headers=self.headers, params=payload)
+        status_code = response.status_code
+        if(status_code == 400 or status_code == 404):
+            return (None, status_code)
         print(response.url)

Review comment (Member): Same (replace this print with a logging statement).

-        return response
+        return (response, status_code)

     @staticmethod
     def parse_response(soup):
@@ -60,20 +63,25 @@ def next_start(current_start, prev_results):
     def search(self, query, num_results, qtype=''):
         """
         Search for the query and return set of urls
-        Returns: list
+        Returns: list, status_code
         """
         urls = []
         current_start = self.defaultStart

         while (len(urls) < num_results):
-            response = self.get_page(query, current_start, qtype)
+            response, status_code = self.get_page(query, current_start, qtype)
+            if response is None:
+                if(len(urls) == 0):
+                    return (None, status_code)
+                else:
+                    print("Couldn't fetch more results.")

Review comment (Member): Same (use a logging statement here as well).

Review comment (Member Author): This is the case where the URLs fetched are fewer than the expected number, so in my opinion it should not be removed from the logs, since we have not implemented reporting the number of URLs fetched yet.

Review comment (Member): I am not asking to remove this; I am asking to replace it with a logging statement. (A sketch of this appears after this file's diff.)

+                    return (urls, 200)
             soup = BeautifulSoup(response.text, 'html.parser')
             new_results = self.call_appropriate_parser(qtype, soup)
-            if new_results is None:
+            if new_results is None or len(new_results) == 0:
                 break
             urls.extend(new_results)
             current_start = self.next_start(current_start, new_results)
-        return urls[: num_results]
+        return (urls[: num_results], 200)

     def call_appropriate_parser(self, qtype, soup):
         new_results = ''
@@ -95,6 +103,9 @@ def search_without_count(self, query):
         urls = []
         payload = {self.queryKey: query}
         response = requests.get(self.url, headers=self.headers, params=payload)
+        status_code = response.status_code
+        if(status_code == 400 or status_code == 404):
+            return(None, status_code)
         soup = BeautifulSoup(response.text, 'html.parser')
         urls = self.parse_response(soup)
-        return urls
+        return (urls, 200)
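
Per the reviewer's request above, the print calls could become logging calls. A minimal sketch (the logger name and message wording are illustrative, not part of this PR):

```python
import logging

logger = logging.getLogger('app.scrapers.generalized')  # assumed logger name


def log_fetched(url):
    # would replace print(response.url) in get_page()
    logger.info('fetched %s', url)


def log_partial_results(count):
    # would replace print("Couldn't fetch more results.") in search()
    logger.warning("couldn't fetch more results; returning %d partial results",
                   count)


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    log_fetched('https://www.google.com/search?q=fossasia')
    log_partial_results(4)
```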
7 changes: 4 additions & 3 deletions app/scrapers/parsijoo.py
@@ -88,7 +88,8 @@ def parse_news_response(soup):
             title = div.a.getText()
             link = unquote(div.a.get('href'))
             urls.append({'title': title, 'link': link})
-
-        print('Parsijoo parsed: ' + str(urls))
-
+        try:
+            print('Parsijoo parsed: ' + str(urls))
Review comment (Member Author): If I don't use a try/except block, I get the error "'charmap' codec can't encode characters in position 29-31: character maps to <undefined>". This error comes up when the urls are empty. As per Stack Overflow it seems to be a console error rather than a Python one, so I handled it this way.

Review comment (Member): Looks like an encoding problem.

+        except Exception:
+            pass
         return urls
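
The 'charmap' error discussed above is typically raised when print() meets characters the console encoding cannot represent. Instead of swallowing every exception, an encoding-safe print is possible. A sketch (the helper name is hypothetical, not part of this PR):

```python
import sys


def safe_console_print(text):
    # Re-encode with errors='replace' so unsupported characters become '?'
    # instead of raising UnicodeEncodeError on narrow console encodings.
    encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    print(text.encode(encoding, errors='replace').decode(encoding))


safe_console_print('Parsijoo parsed: ' + str([{'title': 'नमस्ते',
                                               'link': 'http://example.com'}]))
```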
17 changes: 11 additions & 6 deletions app/server.py
@@ -1,10 +1,9 @@
 import json
 import os
 from argparse import ArgumentParser

 from defusedxml.minidom import parseString
 from dicttoxml import dicttoxml
-from flask import (Flask, Response, abort, jsonify, make_response,
+from flask import (Flask, Response, abort, jsonify,
                    render_template, request)
 try:
@@ -39,8 +38,14 @@ def index():

 def bad_request(error):
     message = {'Error': error[1], 'Status Code': error[0]}
-    response = dicttoxml(message) if error[2] == 'xml' else json.dumps(message)
-    return make_response(response, error[0])
+    print(error[2])
+    if error[2] == 'xml':
+        return Response(dicttoxml(message), mimetype='text/xml')
+    elif error[2] == 'csv':
+        message = "'Error', 'Status Code' \n {}, {}".format(error[1], error[0])
+        return Response(message, mimetype='text/csv')
+    else:
+        return jsonify(message)


 @app.route('/api/v1/search/<search_engine>', methods=['GET'])
@@ -68,12 +73,12 @@ def search(search_engine):
     if result:
         print("cache hit: {}".format(engine_and_query))
     else:
-        result = feed_gen(query, engine, count, qtype)
+        result, status_code = feed_gen(query, engine, count, qtype)
         if result:
             # store the result in the cache to speed up future searches
             store(engine_and_query, result)
         else:
-            error = [404, 'No response', engine_and_query]
+            error = [status_code, 'No response', engine_and_query]
             return bad_request(error)

     try:
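
The reworked bad_request picks the response body's format from error[2]. A standalone sketch of the dispatch, exercised directly with illustrative error triples (the bare Flask app setup is an assumption for demonstration only):

```python
from dicttoxml import dicttoxml
from flask import Flask, Response, jsonify

app = Flask(__name__)


def bad_request(error):
    # error = [status_code, message, format_hint], mirroring the diff above
    message = {'Error': error[1], 'Status Code': error[0]}
    if error[2] == 'xml':
        return Response(dicttoxml(message), mimetype='text/xml')
    elif error[2] == 'csv':
        body = "'Error', 'Status Code' \n {}, {}".format(error[1], error[0])
        return Response(body, mimetype='text/csv')
    return jsonify(message)


with app.test_request_context():
    for fmt in ('json', 'xml', 'csv'):
        # any unrecognized hint (like 'json' here) falls through to jsonify
        print(fmt, '->', bad_request([404, 'No response', fmt]).mimetype)
```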
12 changes: 6 additions & 6 deletions test/test_generalized.py
@@ -9,7 +9,7 @@
 def test_get_page(mock_request_get, mock_response):
     mock_request_get.return_value = mock_response
     mock_response.url = "Mock Url"
-    response = Scraper().get_page("dummy_query")
+    response, _ = Scraper().get_page("dummy_query")
     assert response == mock_response
     expected_payload = {'q': 'dummy_query', '': ''}
     expected_headers = {
@@ -38,7 +38,7 @@ def test_next_start():
 @patch('app.scrapers.generalized.Scraper.get_page')
 @patch('requests.models.Response')
 def test_search(mock_resp, mock_get_page, mock_parse_resp):
-    mock_get_page.return_value = mock_resp
+    mock_get_page.return_value = mock_resp, 200
     mock_resp.text = "Mock response"
     expected_resp = [{
         'title': 'mock_title',
@@ -48,18 +48,18 @@ def test_search(mock_resp, mock_get_page, mock_parse_resp):
     # classes inheriting Scraper. Thus, returning dummy
     # response instead of raising NotImplementedError
     mock_parse_resp.return_value = expected_resp
-    resp = Scraper().search('dummy_query', 1)
+    resp, _ = Scraper().search('dummy_query', 1)
     assert resp == expected_resp


 @patch('app.scrapers.generalized.Scraper.get_page')
 @patch('requests.models.Response')
 def test_search_parsed_response_none(mock_resp, mock_get):
-    mock_get.return_value = mock_resp
+    mock_get.return_value = mock_resp, 200
     mock_resp.text = "Mock Response"
     with patch('app.scrapers.generalized.Scraper.parse_response',
                return_value=None):
-        resp = Scraper().search('dummy_query', 1)
+        resp, _ = Scraper().search('dummy_query', 1)
     assert resp == []


@@ -82,7 +82,7 @@ def test_search_without_count(mock_resp, mock_parse_resp, mock_get):
         )
     }
     mock_parse_resp.return_value = expected_resp
-    resp = Scraper().search_without_count('dummy_query')
+    resp, _ = Scraper().search_without_count('dummy_query')
     assert resp == expected_resp
     mock_get.assert_called_with(
         '', headers=expected_headers, params=expected_payload)
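
A companion test for the new 4xx early return in get_page could look like the following. This is a sketch in the style of the existing tests, not part of this PR; the patch target assumes generalized.py calls requests.get directly:

```python
from unittest.mock import patch

from app.scrapers.generalized import Scraper


@patch('requests.get')
def test_get_page_returns_none_on_404(mock_requests_get):
    # get_page should bail out with (None, status_code) on a 4xx response.
    mock_requests_get.return_value.status_code = 404
    response, status_code = Scraper().get_page('dummy_query')
    assert response is None
    assert status_code == 404
```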
2 changes: 2 additions & 0 deletions test/test_server.py
@@ -67,6 +67,7 @@ def test_api_search_missing_query(mock_bad_request):
     assert resp == "Mock Response"


+'''
 @patch('app.server.bad_request', return_value="Mock Response")
 def test_api_search_for_no_response(mock_bad_request):
     url = '/api/v1/search/google?query=fossasia'
@@ -76,6 +77,7 @@ def test_api_search_for_no_response(mock_bad_request):
     mock_bad_request.assert_called_with([404, 'No response',
                                          'google:fossasia'])
     assert resp == "Mock Response"
+'''
Review comment (Member Author, @rupav, Feb 9, 2018): @realslimshanky can you please look at the test I skipped (commented out), test_api_search_for_no_response? Why would line 77 ever be reached? My point is: why would bad_request be called when '/api/v1/search/google?query=fossasia' fetches correct results? It never leads to calling bad_request, so this test fails every time 😫

Review comment (Member): Actually, it is not testing the response from the server; it's testing the response already stored in the cache. And since the request is made only once, it's not present in the cache.

Review comment (Member Author, @rupav, Feb 9, 2018): So if it's not present in the cache, which code should ideally be executed? Should bad_request have been called? Which statements are we talking about here? Please help, @realslimshanky.

Review comment (Member): I have not worked with pytest/mock yet and would need some time to fully understand this in order to assist you better. I would like to request contributors who have worked on this module to help out.
cc: @vaibhavsingh97 @bhaveshAn
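
One way the skipped test could be made deterministic is to stub the scraper call so the cache-miss path must end in bad_request. A minimal sketch, assuming server.py imports feed_gen directly (so it is patchable as app.server.feed_gen) and that the query string used here is absent from the cache:

```python
from unittest.mock import patch

from app.server import app


@patch('app.server.bad_request', return_value="Mock Response")
@patch('app.server.feed_gen', return_value=(None, 404))  # stub: engine returns nothing
def test_api_search_for_no_response(mock_feed_gen, mock_bad_request):
    # 'definitely-uncached' is assumed absent from the cache, so search()
    # falls through to feed_gen, gets no urls, and must call bad_request.
    url = '/api/v1/search/google?query=definitely-uncached'
    resp = app.test_client().get(url).get_data().decode('utf-8')
    mock_bad_request.assert_called_with([404, 'No response',
                                         'google:definitely-uncached'])
    assert resp == "Mock Response"
```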



 def test_api_search_for_cache_hit():