alessandrobokan/TwitterCrawler

Twitter Advanced Search Crawler

A crawler for Twitter Advanced Search built on Selenium WebDriver. For example, to find all tweets related to the San Bernardino shooting in 2015, open the Advanced Search website and follow these steps:

  1. Fill in the field "All of these words" with the word shooting;
  2. In the field "Written in", select English (English);
  3. In the field "From this date", select the dates 2015-12-02 (since) to 2015-12-03 (until).

The result will be this link.
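
The three form fields above appear to map directly onto the search URL used in the parameters.json example in this README. The sketch below rebuilds that URL from the fields. The function name `build_search_url` is hypothetical, and the `src=typd` parameter is simply copied from the example URL rather than documented behavior.

```python
from urllib.parse import quote


def build_search_url(words, since, until, lang="en"):
    """Build an Advanced Search URL from the form fields (hypothetical helper)."""
    # The search terms and the date range travel together in the `q` parameter.
    query = f"{words} since:{since} until:{until}"
    # safe="" percent-encodes spaces as %20 and colons as %3A.
    encoded = quote(query, safe="")
    # `src=typd` is copied from the README's example URL (assumption).
    return f"https://twitter.com/search?q={encoded}&src=typd&lang={lang}"


print(build_search_url("shooting", "2015-12-02", "2015-12-03"))
```

Building the URL in code makes it easier to crawl several date ranges without filling in the form by hand each time.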

Basic requirements

Based on the instructions of this link.

  1. Install selenium:

    $ pip install selenium==3.0.1
    
  2. Install Google Chrome.

  3. Install ChromeDriver:

    1. Download version 2.24 (64-bit):

    2. Extract the zip file and run these commands:

         $ sudo mv -f chromedriver /usr/local/share/chromedriver
         $ sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
         $ sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
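
After creating the symbolic links, it can help to confirm that the executable is visible on the PATH before starting the crawler. This is a minimal check using only the Python standard library; the helper name `find_executable` is hypothetical.

```python
import shutil


def find_executable(name="chromedriver"):
    """Return the full path of `name` if it is on the PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("chromedriver was not found on the PATH")
else:
    print("chromedriver found at", path)
```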
      

Running crawler

Create the file parameters.json in the current directory. Below is an example:

  • Attribute url is the URL produced by the Advanced Search website.

  • Attribute numScrolls is the number of scrolls triggered once the web browser (Chrome) is open. The recommended value is 800.

    {
      "url": "https://twitter.com/search?q=shooting%20since%3A2015-12-02%20until%3A2015-12-03&src=typd&lang=en",
      "numScrolls": 50
    }
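
To illustrate how the two attributes might be used, here is a minimal sketch that reads parameters.json and scrolls the results page numScrolls times. It is not the code from crawler.py: the function names and the `pause` parameter are assumptions, and only the Selenium calls `get`, `execute_script`, `page_source` and `quit` are relied on.

```python
import json
import time


def load_parameters(path):
    """Read the parameters file and return (url, num_scrolls)."""
    with open(path, encoding="utf-8") as handle:
        params = json.load(handle)
    return params["url"], int(params["numScrolls"])


def scroll_results(driver, url, num_scrolls, pause=1.0):
    """Open `url`, scroll to the bottom `num_scrolls` times, return the page HTML."""
    driver.get(url)
    for _ in range(num_scrolls):
        # Jump to the bottom so the page requests the next batch of tweets.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # assumed delay that gives new tweets time to load
    return driver.page_source


def open_chrome():
    """Start Chrome through ChromeDriver (needs selenium and chromedriver)."""
    from selenium import webdriver  # imported lazily; only needed for a real run
    return webdriver.Chrome()


# Example run (opens a real browser window):
#   url, num_scrolls = load_parameters("parameters.json")
#   driver = open_chrome()
#   html = scroll_results(driver, url, num_scrolls)
#   driver.quit()
```

Each scroll loads a further batch of results, so a larger numScrolls collects more tweets at the cost of a longer run.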
    

Run the command:

$ python crawler.py -p parameters.json -o san_bernardino_shooting.json

Filter tweets

Filter the crawled tweets by a list of keywords and remove similar tweets.

  1. Install nltk:

    $ pip install nltk==3.2.1
    
  2. In Python, download the NLTK stopwords package:

    >>> import nltk
    >>> nltk.download("stopwords")
    
  3. Create the file keywords.json in the current directory. Below is an example:

    {
      "keywords": ["tragedy", "bomb", "bombs", "bombing", "pulse", "dead", "injured", "victim", "victims", "hurt", "hurting", "kill", "fire", "police", "attack", "terrorist", "terrorists", "detonated", "detonate", "running", "runner", "explosion", "explosions", "blast", "blasts", "terror", "innocent", "shot", "shots", "shoot", "shoots", "shooting", "shootings", "racist", "homophobic", "gun", "violence", "islam", "isis", "muslim", "horrifying", "killed", "wounded", "armed", "blood", "affected", "killers", "killing", "horrific", "murder", "murdered", "incident", "guns", "terrorism", "mass"],
      "threshold": 3
    }
    
  4. Run the command:

    $ python filter.py -k keywords.json --tweets san_bernardino_shooting.json
    
  • Attribute keywords is the list of keywords a tweet has to contain to be selected. If the list is empty ([]), the tweets are not filtered.
  • Attribute threshold is the number of keywords considered when filtering a tweet.
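
The sketch below shows one way the two attributes could work together. It is not the code from filter.py and rests on several assumptions: a tweet is kept when it contains at least `threshold` distinct keywords, similar tweets are detected with Jaccard similarity over token sets, the `max_similarity` cutoff of 0.8 is arbitrary, and a tiny hard-coded stop-word set stands in for the NLTK stopwords list.

```python
import re

# Tiny stand-in for the NLTK English stop-word list (assumption, for illustration).
STOPWORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def matches_keywords(text, keywords, threshold):
    """Assumed rule: keep a tweet with at least `threshold` distinct keywords."""
    if not keywords:  # an empty list disables keyword filtering
        return True
    return len(tokenize(text) & set(keywords)) >= threshold


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_tweets(tweets, keywords, threshold, max_similarity=0.8):
    """Apply the keyword filter, then drop tweets too similar to one already kept."""
    kept, kept_tokens = [], []
    for text in tweets:
        if not matches_keywords(text, keywords, threshold):
            continue
        tokens = tokenize(text)
        if any(jaccard(tokens, seen) >= max_similarity for seen in kept_tokens):
            continue  # near-duplicate of an earlier tweet
        kept.append(text)
        kept_tokens.append(tokens)
    return kept
```

Under this reading, a higher threshold keeps only tweets that mention several of the listed keywords, which trades recall for precision.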
