Spider error processing #59

Open
lamvien13 opened this issue Mar 27, 2020 · 11 comments

Comments

@lamvien13

Sorry to open this issue; I've tried Googling it but still can't find a solution.

When I run the following command:
scrapy crawl fb -a email="@gmail.com" -a password="_" -a page="DonaldTrump" -a date="2018-01-01" -a lang="en" -o output.csv

I get this error:
2020-03-27 19:34:59 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: fbcrawl)
2020-03-27 19:34:59 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-03-27 19:34:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'fbcrawl',
'DOWNLOAD_DELAY': 3,
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'FEED_EXPORT_FIELDS': ['source',
'shared_from',
'date',
'text',
'reactions',
'likes',
'ahah',
'love',
'wow',
'sigh',
'grrr',
'comments',
'post_id',
'url'],
'FEED_FORMAT': 'csv',
'FEED_URI': 'DUMPFILE.csv',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'fbcrawl.spiders',
'SPIDER_MODULES': ['fbcrawl.spiders'],
'URLLENGTH_LIMIT': 99999,
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2020-03-27 19:34:59 [scrapy.extensions.telnet] INFO: Telnet Password: 41ca2711c3d9f1ce
2020-03-27 19:34:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-03-27 19:34:59 [fb] INFO: Email and password provided, will be used to log in
2020-03-27 19:34:59 [fb] INFO: Date attribute provided, fbcrawl will stop crawling at 2018-01-01
2020-03-27 19:34:59 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2020-03-27 19:35:00 [scrapy.core.engine] INFO: Spider opened
2020-03-27 19:35:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-27 19:35:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-27 19:35:07 [fb] INFO: Going through the "save-device" checkpoint
2020-03-27 19:35:15 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump
2020-03-27 19:35:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump> (referer: https://mbasic.facebook.com/?_rdr)
Traceback (most recent call last):
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\core\downloader\middleware.py", line 42, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\twisted\internet\defer.py", line 1362, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://mbasic.facebook.com/DonaldTrump>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\utils\defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\core\spidermw.py", line 60, in process_spider_input
return scrape_func(response, request, spider)
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\core\scraper.py", line 148, in call_spider
warn_on_generator_with_return_value(spider, callback)
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\utils\misc.py", line 202, in warn_on_generator_with_return_value
if is_generator_with_return_value(callable):
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\site-packages\scrapy\utils\misc.py", line 187, in is_generator_with_return_value
tree = ast.parse(dedent(inspect.getsource(callable)))
File "c:\users\lam vien\appdata\local\programs\python\python38-32\lib\ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "", line 1
def parse_page(self, response):
^
IndentationError: unexpected indent
2020-03-27 19:35:18 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-27 19:35:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3867,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 57037,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 2,
'elapsed_time_seconds': 18.180147,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 3, 27, 12, 35, 18, 403258),
'log_count/ERROR': 1,
'log_count/INFO': 12,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'spider_exceptions/IndentationError': 1,
'start_time': datetime.datetime(2020, 3, 27, 12, 35, 0, 223111)}
2020-03-27 19:35:18 [scrapy.core.engine] INFO: Spider closed (finished)

As I said, I think the problem starts at this line: ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump> (referer: https://mbasic.facebook.com/?_rdr)
Could anyone please tell me how to fix it? Many thanks!
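The traceback itself points at the likely cause: Scrapy's warn_on_generator_with_return_value helper calls ast.parse(dedent(inspect.getsource(callable))) on the spider callback. inspect.getsource returns the method still indented at class level, and textwrap.dedent only strips whitespace that is common to every line; if parse_page in fbcrawl.py mixes tabs and spaces, nothing is stripped, ast.parse sees an indented def, and you get exactly the "IndentationError: unexpected indent" above. A minimal sketch reproducing the failure, assuming mixed indentation is the culprit:

```python
import ast
import inspect
from textwrap import dedent


class Spider:
    def parse_page(self, response):
        pass


# Consistent indentation: dedent() strips the common class-level
# indent and ast.parse() succeeds.
ast.parse(dedent(inspect.getsource(Spider.parse_page)))

# Mixed tabs and spaces: there is no common leading-whitespace prefix,
# so dedent() strips nothing, the "def" stays indented, and ast.parse()
# raises the same error seen in the log above.
mixed = "    def parse_page(self, response):\n\t\tpass\n"
try:
    ast.parse(dedent(mixed))
except IndentationError as err:
    print(err)  # unexpected indent (<unknown>, line 1)
```

This would also explain why retyping the whitespace or reformatting the file (as suggested in the comments below) makes the error go away: it normalizes the indentation so dedent can strip it again.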

@vtgiang141

I have the same error. Did you solve this problem?

@georgevak

georgevak commented Apr 4, 2020

In my case it returns:

(spiders-env) gvak@gvak-H61M-D2-B3:~/spiders-env/fbcrawl-master/fbcrawl/spiders$ scrapy crawl fb -a email="@gmail.com" -a password="*" -a page="DonaldTrump" -a lang="it" -o test.csv
2020-04-04 21:36:01 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: fbcrawl)
2020-04-04 21:36:01 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.5.2 (default, Oct 8 2019, 13:06:37) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.0.2g 1 Mar 2016), cryptography 2.9, Platform Linux-4.15.0-91-generic-i686-with-Ubuntu-16.04-xenial
2020-04-04 21:36:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'fbcrawl',
'DOWNLOAD_DELAY': 3,
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'FEED_EXPORT_FIELDS': ['source',
'shared_from',
'date',
'text',
'reactions',
'likes',
'ahah',
'love',
'wow',
'sigh',
'grrr',
'comments',
'post_id',
'url'],
'FEED_FORMAT': 'csv',
'FEED_URI': 'test.csv',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'fbcrawl.spiders',
'SPIDER_MODULES': ['fbcrawl.spiders'],
'URLLENGTH_LIMIT': 99999,
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2020-04-04 21:36:01 [scrapy.extensions.telnet] INFO: Telnet Password: ee814f34085d5a10
2020-04-04 21:36:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.feedexport.FeedExporter']
2020-04-04 21:36:01 [fb] INFO: Email and password provided, will be used to log in
2020-04-04 21:36:01 [fb] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2020-04-04 21:36:01 [fb] INFO: Language attribute recognized, using "it" for the facebook interface
2020-04-04 21:36:01 [scrapy.core.engine] INFO: Spider opened
2020-04-04 21:36:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-04 21:36:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-04 21:36:10 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump
2020-04-04 21:36:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump> (referer: https://mbasic.facebook.com/home.php?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
Traceback (most recent call last):
File "/home/gvak/spiders-env/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "/home/gvak/spiders-env/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://mbasic.facebook.com/DonaldTrump>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/core/spidermw.py", line 60, in process_spider_input
return scrape_func(response, request, spider)
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/core/scraper.py", line 148, in call_spider
warn_on_generator_with_return_value(spider, callback)
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/utils/misc.py", line 202, in warn_on_generator_with_return_value
if is_generator_with_return_value(callable):
File "/home/gvak/spiders-env/lib/python3.5/site-packages/scrapy/utils/misc.py", line 187, in is_generator_with_return_value
tree = ast.parse(dedent(inspect.getsource(callable)))
File "/usr/lib/python3.5/ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "", line 1
def parse_page(self, response):
^
IndentationError: unexpected indent
2020-04-04 21:36:12 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-04 21:36:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2298,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 41733,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 11.239089,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 4, 18, 36, 12, 720698),
'log_count/ERROR': 1,
'log_count/INFO': 11,
'memusage/max': 36761600,
'memusage/startup': 36761600,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/IndentationError': 1,
'start_time': datetime.datetime(2020, 4, 4, 18, 36, 1, 481609)}
2020-04-04 21:36:12 [scrapy.core.engine] INFO: Spider closed (finished)

Any help, please?

@georgevak

My machine is 32-bit, but I saw that the terminal says:
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

Does fbcrawl run on 32-bit machines?

@huynv161846

huynv161846 commented Apr 29, 2020

I have the same problem. I deleted the space between 'def' and 'parse_page' in 'def parse_page' and then retyped a single space character between them, and that worked for me :)
Also, if the crawler fails to stop at the given date or at the max number of posts on a page, you may want to change '//div[contains(@data-ft,'top_level_post_id')]' to '//article[contains(@data-ft,'top_level_post_id')]' (see the sketch below).
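For reference, a sketch of where that XPath change lands, assuming the loop in parse_page in fbcrawl/spiders/fbcrawl.py resembles the upstream version; the loop body here is illustrative only, not the actual fbcrawl code:

```python
def parse_page(self, response):
    # Old selector; on newer mbasic.facebook.com markup each post sits in an
    # <article> element, so the //div variant matches nothing and the
    # date / max-posts stopping logic is never reached:
    #   "//div[contains(@data-ft,'top_level_post_id')]"
    for post in response.xpath("//article[contains(@data-ft,'top_level_post_id')]"):
        # Illustrative body only; the real parse_page extracts many fields.
        self.logger.info('found post container: %s', post.attrib.get('data-ft'))
```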

@proshock1509

> I have the same problem. I deleted the space between 'def' and 'parse_page' in 'def parse_page' and then retyped a single space character between them, and that worked for me :)
> Also, if the crawler fails to stop at the given date or at the max number of posts on a page, you may want to change '//div[contains(@data-ft,'top_level_post_id')]' to '//article[contains(@data-ft,'top_level_post_id')]'.

Please help me; I did the same as you, but the error persists. Could you email me, please?
[email protected]

@talhalatiforakzai

Open fbcrawl.py and reformat the code (Ctrl+Alt+L); this will solve the issue.

@natsinger

It didn't work for me; I still have the same issue.
\softwaer\fbcrawl-master\fbcrawl>scrapy crawl fb -a email="" -a password="" -a page="https://mbasic.facebook.com/groups/1732237450380058?bacr=1590752528%3A2660531107550683%3A2660531107550683%2C0%2C3%3A7%3AKw%3D%3D&multi_permalinks&refid=18" -a date="2004-01-01" -a lang="en" -o test.csv
2020-06-07 00:40:05 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: fbcrawl)
2020-06-07 00:40:05 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-06-07 00:40:05 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'shared_from', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'post_id', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'nodular_prorigo_group1.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'URLLENGTH_LIMIT': 99999, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2020-06-07 00:40:05 [scrapy.extensions.telnet] INFO: Telnet Password: 90bc392de7c46659
2020-06-07 00:40:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-06-07 00:40:05 [fb] INFO: Email and password provided, will be used to log in
2020-06-07 00:40:05 [fb] INFO: Date attribute provided, fbcrawl will stop crawling at 2004-01-01
2020-06-07 00:40:05 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2020-06-07 00:40:06 [scrapy.core.engine] INFO: Spider opened
2020-06-07 00:40:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-07 00:40:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-07 00:40:15 [fb] INFO: Going through the "save-device" checkpoint
2020-06-07 00:40:25 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/groups/1732237450380058?bacr=1590752528%3A2660531107550683%3A2660531107550683%2C0%2C3%3A7%3AKw%3D%3D&multi_permalinks&refid=18
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 1, post_date = 2016-05-29
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 2, post_date = 2020-05-28
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 3, post_date = 2019-04-29 15:33:15
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 4, post_date = 2020-05-27
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 5, post_date = 2020-04-25 11:18:11
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 6, post_date = 2020-05-28
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 7, post_date = 2020-05-27 17:00:27
2020-06-07 00:40:28 [fb] INFO: Parsing post n = 8, post_date = 2020-05-28
2020-06-07 00:40:28 [fb] INFO: [!] "more" link not found, will look for a "year" link
2020-06-07 00:40:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/groups/1732237450380058?bacr=1590752528%3A2660531107550683%3A2660531107550683%2C0%2C3%3A7%3AKw%3D%3D&multi_permalinks&refid=18> (referer: https://mbasic.facebook.com/?_rdr)
Traceback (most recent call last):
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "c:\users\nathanaels\appdata\local\programs\python\python36\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "C:\Users\nathanaels\Desktop\nat folder\softwaer\fbcrawl-master\fbcrawl\spiders\fbcrawl.py", line 199, in parse_page
if response.meta['flag'] == self.k and self.k >= self.year:
KeyError: 'flag'
2020-06-07 00:41:06 [scrapy.extensions.logstats] INFO: Crawled 14 pages (at 14 pages/min), scraped 2 items (at 2 items/min)
2020-06-07 00:41:27 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-07 00:41:27 [scrapy.extensions.feedexport] INFO: Stored csv feed (8 items) in: nodular_prorigo_group1.csv
2020-06-07 00:41:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 26658,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 20,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 176073,
'downloader/response_count': 22,
'downloader/response_status_count/200': 20,
'downloader/response_status_count/302': 2,
'elapsed_time_seconds': 80.836427,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 6, 21, 41, 27, 137961),
'item_scraped_count': 8,
'log_count/ERROR': 1,
'log_count/INFO': 23,
'request_depth_max': 5,
'response_received_count': 20,
'scheduler/dequeued': 22,
'scheduler/dequeued/memory': 22,
'scheduler/enqueued': 22,
'scheduler/enqueued/memory': 22,
'spider_exceptions/KeyError': 1,
'start_time': datetime.datetime(2020, 6, 6, 21, 40, 6, 301534)}
2020-06-07 00:41:27 [scrapy.core.engine] INFO: Spider closed (finished)

Any piece of advice?

@talhalatiforakzai

> It didn't work for me; I still have the same issue. […] Any piece of advice?

Try this: in the parse_page function in fbcrawl.py, change the row
for post in response.xpath("//div[contains(@data-ft,'top_level_post_id')]"):
to
for post in response.xpath("//article[contains(@data-ft,'top_level_post_id')]"):

@natsinger

@talhalatiforakzai I tried, but for non-public pages (ones that I am a member of) it doesn't work as it did before.

@Rahulsunny11

> It didn't work for me; I still have the same issue. […] Any piece of advice?

> Try this: in the parse_page function in fbcrawl.py, change the row //div[contains(@data-ft,'top_level_post_id')] to //article[contains(@data-ft,'top_level_post_id')].

It worked for me, thank you.
I saw there is a spider for profiles (the profiles.py file). Does it work? If so, what command starts it?

@TheSeedMan

> date="2004-01-01"

I'm having the same issue as others in this thread, although the //div to //article change hasn't fixed it, even when I plug in a public page. @talhalatiforakzai, any help would be much appreciated. I have pasted my terminal output below.

scrapy crawl fb -a email="x" -a password="x" -a page="DonaldTrump" -a lang="it" -o test.csv

2020-06-12 13:48:56 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: fbcrawl)
2020-06-12 13:48:56 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, Mar 10 2020, 15:43:03) - [Clang 11.0.0 (clang-1100.0.33.17)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-06-12 13:48:56 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'fbcrawl',
'DOWNLOAD_DELAY': 3,
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'FEED_EXPORT_FIELDS': ['source',
'shared_from',
'date',
'text',
'reactions',
'likes',
'ahah',
'love',
'wow',
'sigh',
'grrr',
'comments',
'post_id',
'url'],
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'fbcrawl.spiders',
'SPIDER_MODULES': ['fbcrawl.spiders'],
'URLLENGTH_LIMIT': 99999,
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2020-06-12 13:48:56 [scrapy.extensions.telnet] INFO: Telnet Password: 7e64297a2a66e18d
2020-06-12 13:48:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-06-12 13:48:56 [fb] INFO: Email and password provided, will be used to log in
2020-06-12 13:48:56 [fb] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2020-06-12 13:48:56 [fb] INFO: Language attribute recognized, using "it" for the facebook interface
2020-06-12 13:48:56 [scrapy.core.engine] INFO: Spider opened
2020-06-12 13:48:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-12 13:48:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-12 13:49:04 [fb] INFO: Going through the "save-device" checkpoint
2020-06-12 13:49:13 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump
2020-06-12 13:49:16 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump> (referer: https://mbasic.facebook.com/?_rdr)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: <200 https://mbasic.facebook.com/DonaldTrump>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/usr/local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 60, in process_spider_input
return scrape_func(response, request, spider)
File "/usr/local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 152, in call_spider
warn_on_generator_with_return_value(spider, callback)
File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 202, in warn_on_generator_with_return_value
if is_generator_with_return_value(callable):
File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 187, in is_generator_with_return_value
tree = ast.parse(dedent(inspect.getsource(callable)))
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "", line 1
def parse_page(self, response):
^
IndentationError: unexpected indent
2020-06-12 13:49:16 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-12 13:49:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3855,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 49932,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 2,
'elapsed_time_seconds': 20.065659,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 12, 17, 49, 16, 886653),
'log_count/ERROR': 1,
'log_count/INFO': 12,
'memusage/max': 52617216,
'memusage/startup': 52617216,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'spider_exceptions/IndentationError': 1,
'start_time': datetime.datetime(2020, 6, 12, 17, 48, 56, 820994)}
2020-06-12 13:49:16 [scrapy.core.engine] INFO: Spider closed (finished)
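For anyone still hitting the IndentationError on Scrapy 2.0/2.1: the crash happens inside Scrapy's own source-inspection helper, warn_on_generator_with_return_value, not in the spider logic. Besides reformatting fbcrawl.py to consistent indentation, a blunt workaround is to neutralize that helper before the crawl starts. This is a sketch, untested here and not an official fix; newer Scrapy releases catch this error internally, so upgrading is the cleaner option:

```python
# Place near the top of fbcrawl/spiders/fbcrawl.py (or any module imported
# before crawling). scrapy.core.scraper imports the helper by name, so both
# references must be replaced for the patch to take effect.
import scrapy.utils.misc
import scrapy.core.scraper


def _noop_warn(spider, callable):  # same (spider, callable) signature as the original
    pass


scrapy.utils.misc.warn_on_generator_with_return_value = _noop_warn
scrapy.core.scraper.warn_on_generator_with_return_value = _noop_warn
```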
