BITS Lab Twitter Scraper Mongo

This package fetches Twitter timelines for a specified set of Twitter users and writes the data it obtains from Twitter to a Mongo database. After collecting all timelines for the specified users, it waits one hour and then starts over.
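
Conceptually, each pass looks something like the sketch below. This is a minimal illustration, not the package's actual code: it assumes tweepy and pymongo, the credentials, account list, and database name are placeholders, and it upserts the whole tweet document where the real package refreshes only the sharing metrics on tweets it has already stored.

```python
import time

import tweepy
from pymongo import MongoClient

# Placeholder credentials and names, for illustration only.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

db = MongoClient("localhost", 27017)["example_database"]

while True:
    for screen_name in ["example_user_1", "example_user_2"]:
        for tweet in api.user_timeline(screen_name=screen_name, count=200):
            # Upsert by tweet id: new tweets are inserted, and tweets that
            # already exist have their fields refreshed with the latest values.
            db["TW_cand"].update_one(
                {"id": tweet.id}, {"$set": tweet._json}, upsert=True
            )
    time.sleep(3600)  # wait one hour, then collect all timelines again
```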

This package creates three collections inside the Mongo database (hypothetical example documents follow this list):

  1. TW_cand:
    • This collection contains 1 document for each tweet collected. If the package collects a tweet that already exists in this collection, it updates the sharing metrics at the tweet level (retweet count and favorite count) and at the user level (followers count, friends count, and listed count) with the latest values.
  2. TW_cand_crawl_history:
    • This collection contains 1 document for each time each tweet is collected, so it will contain many more documents than there are unique tweets. Each document contains a tweet ID, the time the tweet's information was collected, and the sharing metrics at the tweet level (retweet count and favorite count) and at the user level (followers count, friends count, and listed count) as of that collection time. This collection allows us to do longitudinal analysis of sharing metrics at a fairly granular level.
  3. TW_cand_info:
    • This collection contains 1 document for each user whose timeline is being collected. Each document contains the user's name, screen name, and ID number, the latest sharing metrics (friends count, followers count, and listed count), and the time this information was collected. This collection provides easy access to this core user information.
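
To make those shapes concrete, here are hypothetical example documents for each collection. The field names follow Twitter's REST API conventions (retweet_count, followers_count, and so on), but the exact schema the package stores may differ.

```python
tw_cand = {  # TW_cand: one document per unique tweet
    "id": 978301471814041600,
    "text": "Example tweet text",
    "retweet_count": 42,   # refreshed whenever the tweet is collected again
    "favorite_count": 17,
    "user": {
        "id": 12345678,
        "followers_count": 1200,
        "friends_count": 310,
        "listed_count": 8,
    },
}

tw_cand_crawl_history = {  # TW_cand_crawl_history: one document per tweet per pass
    "tweet_id": 978301471814041600,
    "collected_at": "2018-06-18T14:02:00Z",
    "retweet_count": 42,
    "favorite_count": 17,
    "followers_count": 1200,
    "friends_count": 310,
    "listed_count": 8,
}

tw_cand_info = {  # TW_cand_info: one document per tracked user
    "id": 12345678,
    "name": "Example User",
    "screen_name": "example_user",
    "followers_count": 1200,
    "friends_count": 310,
    "listed_count": 8,
    "collected_at": "2018-06-18T14:02:00Z",
}
```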

Installation and setup

To run this package:

  1. Clone the code to your server using git clone https://github.com/bitslabsyr/twitter-scraper-mongo.git.
  2. Rename config_template.py to config.py.
  3. Modify the parameters in config.py to match your preferences. This file contains configuration that applies to every process run from this installation of the package (a hypothetical example follows this list).
    1. Modify Mongo credentials to match your Mongo instance.
    2. Set COLLECT_FROM to the date of the earliest tweet you want to collect. If you want all tweets with no date restriction, replace the datetime object with False.
  4. Make a copy of input.txt with an informative name. This file contains configuration specific to a particular process: you can run multiple instances of main.py at the same time, each with a different input file.
  5. Modify the parameters of your input file (a hypothetical example follows this list). Note that this is a plain-text file, not a Python file, so do not use Python syntax; in particular, don't put quotes around anything.
    1. Modify the name of the Mongo database where this package will insert data.
    2. Modify the Twitter credentials this package will use to pull data.
    3. Modify TERMS_LIST with the list of accounts whose timelines you want to collect. This should be a comma-separated list of account usernames.
  6. Run with sudo python3 main.py {input_filename.txt} >> {log_filename.txt} 2>&1 & to run the process in the background, appending stdout and stderr to the log file.
  7. If you want to collect an additional set of users with a different set of Twitter credentials (for example, to reduce the chance of rate limits), repeat steps 4 through 6 as many times as you want.
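
For step 3, a filled-in config.py might look something like the sketch below. COLLECT_FROM is the only variable named above; the Mongo credential variable names here are assumptions, so follow whatever config_template.py actually defines.

```python
import datetime

# Hypothetical variable names; match them to config_template.py.
MONGO_HOST = "localhost"
MONGO_PORT = 27017
MONGO_USER = "mongo_user"
MONGO_PASSWORD = "mongo_password"

# Earliest tweet to collect. Replace with False to collect all tweets
# with no date restriction.
COLLECT_FROM = datetime.datetime(2018, 1, 1)
```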
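
For steps 4 and 5, an input file might look like the sketch below: plain text, no quotes, with TERMS_LIST as a comma-separated list of usernames. Only TERMS_LIST is named above; the other keys are placeholders, so follow the format of the input.txt template shipped with the repo.

```
DATABASE_NAME=my_twitter_db
CONSUMER_KEY=xxxxxxxx
CONSUMER_SECRET=xxxxxxxx
ACCESS_TOKEN=xxxxxxxx
ACCESS_SECRET=xxxxxxxx
TERMS_LIST=example_user_1,example_user_2,example_user_3
```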

A word about Mongo

Because this code checks whether each collected tweet already exists in the database before inserting it, it can be computationally expensive. It is a good idea to index TW_cand on "id" (the tweet ID) and TW_cand_info on "id" (the user ID). This will both speed up the process and reduce CPU load.
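
For example, with pymongo (the mongo-shell equivalent would be db.TW_cand.createIndex({id: 1})):

```python
from pymongo import MongoClient

# Hypothetical connection details; point this at your own Mongo instance
# and database name.
db = MongoClient("localhost", 27017)["example_database"]

# Index the tweet id in TW_cand and the user id in TW_cand_info.
db["TW_cand"].create_index("id")
db["TW_cand_info"].create_index("id")
```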

Requirements

This code was developed and tested with Python 3.