Twitter Advanced Search crawler using Selenium WebDriver. For example, to find all tweets related to the San Bernardino shooting in 2015, open the Advanced Search website and follow these steps:
- Fill the field "All of these words" with the word shooting.
- In the field "Written in", select English (English).
- In the field "From this date", select the dates 2015-12-02 (since) to 2015-12-03 (until).
The result will be this link.
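The three form fields above end up encoded in a single search URL. The following is a minimal sketch of that mapping, assuming the URL layout shown in the parameters.json example of this README; the function name `build_search_url` and its arguments are hypothetical and not part of the crawler.

```python
from urllib.parse import quote

SEARCH_BASE = "https://twitter.com/search"


def build_search_url(words, since, until, lang="en"):
    """Compose a search URL from the Advanced Search form fields (illustrative helper)."""
    # The form fields are folded into one query string using the
    # "since:" and "until:" search operators.
    query = "{} since:{} until:{}".format(words, since, until)
    # safe="" percent-encodes spaces and ":" so the result matches
    # the layout of the example URL in parameters.json.
    return "{}?q={}&src=typd&lang={}".format(SEARCH_BASE, quote(query, safe=""), lang)


print(build_search_url("shooting", "2015-12-02", "2015-12-03"))
```

Building the URL this way is only a convenience; copying it from the browser after filling in the form gives the same kind of link.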
Based on the instructions of this link.
- Install selenium:
  $ pip install selenium==3.0.1
- Install Google Chrome.
- Install ChromeDriver:
  - Download version 2.24 (64 bits).
  - Extract the zip file and run these commands:
    $ sudo mv -f chromedriver /usr/local/share/chromedriver
    $ sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
    $ sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
- Create the file parameters.json in the current directory, with these attributes:
  - Attribute url is the URL produced by the Advanced Search website.
  - Attribute numScrolls is the number of scrolls triggered once the web browser (Chrome) is open. Recommended: 800.
  Example:
  {
    "url": "https://twitter.com/search?q=shooting%20since%3A2015-12-02%20until%3A2015-12-03&src=typd&lang=en",
    "numScrolls": 50
  }
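Before starting a long crawl it can help to check that the parameters file is well formed. The sketch below is one possible check, not the crawler's own loading code (which is not shown here); the helper names `parse_parameters` and `load_parameters` are hypothetical.

```python
import json


def parse_parameters(text):
    """Parse the JSON text of parameters.json and check its two attributes."""
    params = json.loads(text)
    url = params.get("url")
    num_scrolls = params.get("numScrolls")
    if not isinstance(url, str) or not url.startswith("http"):
        raise ValueError("'url' must be the Advanced Search result URL")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(num_scrolls, bool) or not isinstance(num_scrolls, int) or num_scrolls < 0:
        raise ValueError("'numScrolls' must be a non-negative integer")
    return url, num_scrolls


def load_parameters(path):
    """Read a parameters file from disk and validate it."""
    with open(path) as handle:
        return parse_parameters(handle.read())
```

For example, `load_parameters("parameters.json")` would return the URL and the scroll count as a tuple, or raise `ValueError` if either attribute is missing or malformed.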
Run command:
$ python crawler.py -p parameters.json -o san_bernardino_shooting.json
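The source of crawler.py is not reproduced here, so the following is only a sketch of how a scroll-based crawl with Selenium might look, assuming the page loads more tweets each time the browser scrolls to the bottom. The names `scroll_results` and `crawl`, the scroll script, and the one-second pause are assumptions for illustration.

```python
import time

# JavaScript that jumps to the bottom of the page (assumed scrolling strategy).
SCROLL_SCRIPT = "window.scrollTo(0, document.body.scrollHeight);"


def scroll_results(driver, num_scrolls, pause=1.0):
    """Scroll the results page num_scrolls times so that more tweets are loaded."""
    for _ in range(num_scrolls):
        driver.execute_script(SCROLL_SCRIPT)
        time.sleep(pause)  # give the page time to fetch the next batch


def crawl(url, num_scrolls):
    """Open url in Chrome, scroll, and return the resulting page HTML."""
    # Imported here so scroll_results can be used without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # requires chromedriver on the PATH
    try:
        driver.get(url)
        scroll_results(driver, num_scrolls)
        return driver.page_source
    finally:
        driver.quit()
```

A larger numScrolls value loads more of the result list at the cost of a longer run, which is consistent with the recommendation of 800 for a full crawl.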
Filter tweets by a list of keywords and remove similar tweets.
- Install nltk:
  $ pip install nltk==3.2.1
- In Python, download the stopwords NLTK package:
  >>> import nltk
  >>> nltk.download("stopwords")
- Create the file keywords.json in the current directory. Below is an example:
  {
    "keywords": ["tragedy", "bomb", "bombs", "bombing", "pulse", "dead", "injured", "victim", "victims", "hurt", "hurting", "kill", "fire", "police", "attack", "terrorist", "terrorists", "detonated", "detonate", "running", "runner", "explosion", "explosions", "blast", "blasts", "terror", "innocent", "shot", "shots", "shoot", "shoots", "shooting", "shootings", "racist", "homophobic", "gun", "violence", "islam", "isis", "muslim", "horrifying", "killed", "wounded", "armed", "blood", "affected", "killers", "killing", "horrific", "murder", "murdered", "incident", "guns", "terrorism", "mass"],
    "threshold": 3
  }
- Run command:
  $ python filter.py -k keywords.json --tweets san_bernardino_shooting.json
- Attribute keywords lists the keywords that a tweet has to contain to be selected. If the list is empty ([]), tweets are not filtered.
- Attribute threshold is the number of keywords considered when filtering a tweet.
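The two attributes can be illustrated with a small sketch. It is not the code of filter.py, which is not shown here. It assumes that threshold means the minimum number of distinct keywords a tweet must contain, and that two tweets count as similar when they share the same set of words once stopwords are removed. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def keep_tweet(text, keywords, threshold):
    """Return True if the tweet passes the keyword filter."""
    if not keywords:
        return True  # an empty keyword list disables filtering
    hits = tokenize(text) & set(k.lower() for k in keywords)
    # Assumption: threshold is the minimum number of distinct keywords matched.
    return len(hits) >= threshold


def drop_similar(texts, stopwords=frozenset()):
    """Keep the first tweet of each group that shares the same non-stopword words."""
    seen = set()
    unique = []
    for text in texts:
        key = frozenset(tokenize(text) - set(stopwords))
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

With NLTK installed and the stopwords package downloaded, `nltk.corpus.stopwords.words("english")` could be passed as the `stopwords` argument, so that tweets differing only in common words are treated as similar.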