Setting Up
Alexander O. Smith edited this page Jul 6, 2019
This documentation assumes the following:
- You know how to use ssh.
- Your server has MongoDB already installed.
- You understand how to edit files using vim (“vi”).
- You have rights and know how to install Python libraries.
In addition, this doc is geared toward working on a Linux system (for testing we use Ubuntu). Where installation diverges on other systems, we've tried to link to external documentation.
NOTE: The following information is also reiterated in the readme.txt file.
To run this package:
1. Clone the code to your server using `git clone https://github.com/bitslabsyr/twitter-scraper-mongo.git`.
2. Rename `config_template.py` to `config.py`.
3. Modify the parameters in `config.py` to match your preferences. This file contains configuration information that will always be used for this installation of this package.
   - Modify the Mongo credentials to match your Mongo instance.
   - Modify COLLECT_FROM with the date of the earliest tweet you want to collect. To collect all tweets with no date restriction, replace the datetime object with `False`.
4. Make a copy of `input.txt` with an informative name. This file contains configuration information specific to a particular process. You can run multiple instances of `main.py` at the same time, each with a different input file.
5. Modify the parameters in your input file. Note that this is a plain-text file, not a Python file, so do not use Python syntax; in particular, do not put quotes around anything.
   - Modify the name of the Mongo database where this package will insert data.
   - Modify the Twitter credentials this package will use to pull data.
   - Modify TERMS_LIST with the list of accounts whose timelines you want to collect. This should be a comma-separated list of account usernames.
6. Run with `sudo python3 main.py {input_filename.txt} >> {log_filename.txt} 2>&1 &`.
7. If you want to collect an additional set of users with a different set of Twitter credentials (for example, to reduce the chance of hitting rate limits), repeat steps 4 through 6 as many times as you want.
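Similarly, a hypothetical input file might look like the sketch below. The field names besides TERMS_LIST are invented for illustration (check the shipped `input.txt` for the real ones); the point is that no value is quoted, since this is a plain-text file rather than Python.

```text
DB_NAME = candidates_2019
CONSUMER_KEY = your-consumer-key
CONSUMER_SECRET = your-consumer-secret
ACCESS_TOKEN = your-access-token
ACCESS_TOKEN_SECRET = your-access-token-secret
TERMS_LIST = SenSanders,SpeakerPelosi,senatemajldr
```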
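As an illustration of the `config.py` step, a minimal sketch is shown below. Every variable name except COLLECT_FROM is a hypothetical placeholder, not necessarily the repository's actual name, so check it against the shipped `config_template.py`.

```python
# Hypothetical config.py sketch -- names other than COLLECT_FROM are
# illustrative placeholders; match them to the shipped config_template.py.
from datetime import datetime

# Mongo credentials (placeholders -- set these to match your Mongo instance)
MONGO_HOST = "localhost"
MONGO_PORT = 27017
MONGO_USER = "scraper"
MONGO_PASSWORD = "change-me"

# Earliest tweet to collect; replace the datetime object with False
# to collect all tweets with no date restriction.
COLLECT_FROM = datetime(2019, 1, 1)
# COLLECT_FROM = False
```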
Since this code checks whether each tweet it collects already exists in the database before inserting it, it can be computationally expensive. It is a good idea to index TW_cand on "id" (the tweet id) and TW_cand_info on "id" (the user id). This will both speed up the process and reduce CPU load.
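The indexing advice above can be sketched in Python. `SUGGESTED_INDEXES` and `create_indexes` are names invented here, and `db` is assumed to be a pymongo `Database` object for your instance:

```python
# Sketch: the suggested indexes, as (collection, field) pairs.
SUGGESTED_INDEXES = [
    ("TW_cand", "id"),       # "id" here is the tweet id
    ("TW_cand_info", "id"),  # "id" here is the user id
]

def create_indexes(db):
    """Create the suggested indexes on a pymongo Database object."""
    for collection, field in SUGGESTED_INDEXES:
        db[collection].create_index(field)
```

The same thing can be done once by hand in the mongo shell with `db.TW_cand.createIndex({id: 1})` and `db.TW_cand_info.createIndex({id: 1})`.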
This code was developed and tested with Python3.