Twitter Advanced Search Crawler

A crawler for Twitter Advanced Search, built on Selenium WebDriver. For example, to find tweets related to the 2015 San Bernardino shooting, open the Advanced Search website and follow these steps:

  1. Fill in the field "All of these words" with the word shooting;
  2. In the field "Written in", select English (English);
  3. In the field "From this date", select the dates 2015-12-02 (since) to 2015-12-03 (until).

The result is a search URL like the one shown in the parameters.json example under "Running crawler".
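The same search URL can also be composed without the web form. The sketch below is only an illustration: build_search_url is a hypothetical helper (it is not part of this repository), and the URL layout is inferred from the example URL in parameters.json.

```python
from urllib.parse import quote


def build_search_url(words, since, until, lang="en"):
    """Compose an Advanced Search URL from the three form fields.

    Hypothetical helper: the query layout mirrors the example URL in
    parameters.json and is an assumption, not a documented API.
    """
    query = f"{words} since:{since} until:{until}"
    # With safe="", quote() encodes spaces as %20 and colons as %3A.
    encoded = quote(query, safe="")
    return f"https://twitter.com/search?q={encoded}&src=typd&lang={lang}"


url = build_search_url("shooting", "2015-12-02", "2015-12-03")
print(url)
```

The printed value can be pasted into the url attribute of parameters.json.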

Basic requirements

Based on the instructions at this link.

  1. Install selenium:

    $ pip install selenium==3.0.1
    
  2. Install Google Chrome.

  3. Install ChromeDriver:

    1. Download version 2.24 (64-bit).

    2. Extract the zip file and run these commands:

         $ sudo mv -f chromedriver /usr/local/share/chromedriver
         $ sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
         $ sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
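After creating the symbolic links, it may help to confirm that the driver is reachable from PATH. This is a minimal sketch using only the standard library; find_chromedriver is a hypothetical name chosen for illustration.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver binary if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
print(path if path else "chromedriver not found on PATH")
```

If the links above were created correctly, the printed path should point to one of the two link locations.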
      

Running crawler

Create a file named parameters.json in the current directory. Its attributes are described below, followed by an example:

  • Attribute url is the URL produced by the Advanced Search website.

  • Attribute numScrolls is the number of scrolls triggered once the web browser (Chrome) is open. Recommended: 800.

    {
      "url": "https://twitter.com/search?q=shooting%20since%3A2015-12-02%20until%3A2015-12-03&src=typd&lang=en",
      "numScrolls": 50
    }
    

Run command:

    $ python crawler.py -p parameters.json -o san_bernardino_shooting.json
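To illustrate what numScrolls controls, here is a rough sketch of how a crawler might read parameters.json and scroll the results page. It is an assumption about the general approach, not the code in crawler.py; load_parameters, scroll_page, and pause are hypothetical names. The only Selenium call used is driver.execute_script.

```python
import json
import time


def load_parameters(path):
    """Read parameters.json and return (url, num_scrolls)."""
    with open(path, encoding="utf-8") as fh:
        params = json.load(fh)
    return params["url"], int(params["numScrolls"])


def scroll_page(driver, num_scrolls, pause=1.0):
    """Scroll to the bottom of the page num_scrolls times.

    Each scroll gives the page a chance to load more tweets; pause is a
    guessed delay (in seconds) between scrolls, not a value from the project.
    """
    for _ in range(num_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)


# Possible usage with a real browser (requires selenium and ChromeDriver):
#   from selenium import webdriver
#   url, num_scrolls = load_parameters("parameters.json")
#   driver = webdriver.Chrome()
#   driver.get(url)
#   scroll_page(driver, num_scrolls)
#   driver.quit()
```

A larger numScrolls loads more of the result page at the cost of a longer run, which is presumably why a value as high as 800 is recommended.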

Filter tweets

Filter the crawled tweets against a list of keywords and remove similar tweets.

  1. Install nltk:

    $ pip install nltk==3.2.1
    
  2. In Python, download the NLTK stopwords package:

    >>> import nltk
    >>> nltk.download("stopwords")
    
  3. Create a file named keywords.json in the current directory. Below is an example:

    {
      "keywords": ["tragedy", "bomb", "bombs", "bombing", "pulse", "dead", "injured", "victim", "victims", "hurt", "hurting", "kill", "fire", "police", "attack", "terrorist", "terrorists", "detonated", "detonate", "running", "runner", "explosion", "explosions", "blast", "blasts", "terror", "innocent", "shot", "shots", "shoot", "shoots", "shooting", "shootings", "racist", "homophobic", "gun", "violence", "islam", "isis", "muslim", "horrifying", "killed", "wounded", "armed", "blood", "affected", "killers", "killing", "horrific", "murder", "murdered", "incident", "guns", "terrorism", "mass"],
      "threshold": 3
    }
    
  4. Run command:

    $ python filter.py -k keywords.json --tweets san_bernardino_shooting.json
    
  • Attribute keywords is the list of keywords that a tweet has to contain to be selected. If the list is empty ([]), tweets are not filtered.
  • Attribute threshold is the number of keywords considered when filtering a tweet.
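One possible reading of these two attributes is sketched below. It assumes that each tweet is a plain string and that threshold is the minimum number of distinct keywords a tweet must contain; both are assumptions, and filter.py may behave differently. The names count_keyword_hits and filter_tweets are hypothetical.

```python
import re


def count_keyword_hits(text, keywords):
    """Count how many distinct keywords appear as whole words in text."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return len(tokens & set(keywords))


def filter_tweets(tweets, keywords, threshold):
    """Keep tweets that contain at least `threshold` distinct keywords.

    Assumed semantics: an empty keyword list disables filtering.
    """
    if not keywords:
        return list(tweets)
    return [t for t in tweets if count_keyword_hits(t, keywords) >= threshold]


sample = [
    "Police say shooting victims were wounded",
    "Nice weather today",
    "Gun violence",
]
keywords = ["police", "shooting", "victims", "wounded", "gun", "violence"]
print(filter_tweets(sample, keywords, threshold=3))
```

Under this reading, raising threshold makes the filter stricter, since a tweet needs more matching keywords to be kept.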