Recent X Platform Changes & Scweet Updates

Note: Scweet has recently encountered issues due to changes on X (formerly Twitter). We’re committed to updating the library so it continues working for smaller or personal scraping tasks. However, keeping Scweet fully operational at large scale now requires near-daily maintenance, given X’s frequent policy and technical shifts.

For those needing robust, continuous, or high-volume scraping, consider using Scweet on Apify. It automatically handles the scaling and infrastructure behind the scenes—fetching up to 1000 tweets per minute and storing results in a neat dataset. It’s still the Scweet experience you know, just supercharged in the cloud.

(Responsible Use Reminder: Whether local or cloud-based, please scrape tweets ethically, lawfully, and respectfully.)

A simple and unlimited Twitter scraper written in Python.

Recently, Twitter has banned almost every Twitter scraper. This repository presents an alternative tool to scrape Twitter, built around the following functions:

  • scrape: Scrapes all the information regarding tweets between two given dates, for a given language and list of words or account name, in the form of a csv file containing retrieved data (more storage methods will be added).
  • get_user_information: Scrapes user information, including number of following and followers, location and description.
  • get_users_followers and get_users_following: Scrapes followers and following accounts for a given list of users.

It is also possible to download the images shown in tweets by passing the argument save_images = True. If you only want to scrape images, it is recommended to set the argument display_type = image so that only tweets containing images are shown.

Authentication is required for scraping followers/following. It is recommended to log in with a new account, otherwise the account could be banned if the list of followers is very long. To log in to your account, set your username (SCWEET_USERNAME) and password (SCWEET_PASSWORD) in the .env file. You can adjust the wait parameter in the get_users_followers and get_users_following functions according to your internet speed.

Requirements :

pip install -r requirements.txt

Note : You must have Chrome installed on your system.

Results :

Tweets :

The CSV file contains the following features (for each tweet) :

  • 'UserScreenName' :
  • 'UserName' : UserName
  • 'Timestamp' : timestamp of the tweet
  • 'Text' : tweet text
  • 'Embedded_text' : embedded text written above the tweet. This can be an image, a video or even another tweet if the tweet in question is a reply
  • 'Emojis' : emojis in the tweet
  • 'Comments' : number of comments
  • 'Likes' : number of likes
  • 'Retweets' : number of retweets
  • 'Image link' : link of the image in the tweet
  • 'Tweet URL' : tweet URL
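
As a minimal sketch of working with this output, the snippet below reads rows with Python's csv module and keeps a few of the columns listed above. The sample rows are made up for illustration and the helper name `load_tweets` is hypothetical; for a real run, pass an open handle to the CSV file that scrape wrote.

```python
import csv
import io

# Columns to keep; the names are taken from the feature list above.
COLUMNS = ("UserName", "Timestamp", "Text", "Likes", "Tweet URL")


def load_tweets(handle):
    """Read a Scweet-style CSV from an open text handle.

    Returns a list of dicts restricted to COLUMNS; a missing or empty
    column comes back as an empty string.
    """
    reader = csv.DictReader(handle)
    return [{col: (row.get(col) or "") for col in COLUMNS} for row in reader]


# Made-up sample data standing in for a real output file.
sample = io.StringIO(
    "UserScreenName,UserName,Timestamp,Text,Likes,Tweet URL\n"
    "Alice,@alice,2021-10-01T10:00:00.000Z,hello bitcoin,12,https://twitter.com/alice/status/1\n"
    "Bob,@bob,2021-10-02T11:30:00.000Z,ethereum news,,https://twitter.com/bob/status/2\n"
)

tweets = load_tweets(sample)
for tweet in tweets:
    # Fall back to "0" for display when the Likes cell is empty.
    print(tweet["UserName"], tweet["Likes"] or "0", tweet["Text"])
```

Counts are left as strings here, since the exact formatting of large numbers in the file is not specified above.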

Following / Followers :

The get_users_following and get_users_followers functions in the user module return the following and followers of each user in a given list of users.

Usage :

Library :

The library is now available. To install the library, run :

pip install Scweet==1.8

After the installation, you can import and use the functions as follows:

from Scweet.scweet import scrape
from Scweet.user import get_user_information, get_users_following, get_users_followers

Scrape top tweets with the words 'bitcoin', 'ethereum' geolocated less than 200 km from Alicante, Spain (Lat=38.3452, Long=-0.481006) and without replies:
The smaller the interval, the slower the process (choose an interval that evenly divides the period between the start date and the end date)

data = scrape(words=['bitcoin', 'ethereum'], since="2021-10-01", until="2021-10-05", from_account=None,
              interval=1, headless=False, display_type="Top", save_images=False, lang="en",
              resume=False, filter_replies=False, proximity=False, geocode="38.3452,-0.481006,200km")
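
To see why a smaller interval slows things down, the sketch below splits a date range into consecutive windows of `interval` days. It assumes one search query per window, which illustrates the idea rather than reproducing Scweet's actual implementation; the helper name `date_windows` is hypothetical.

```python
from datetime import date, timedelta


def date_windows(since, until, interval):
    """Split [since, until) into consecutive windows of `interval` days.

    The last window is clipped to `until` when the interval does not
    divide the range evenly.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1 day")
    windows = []
    start = since
    while start < until:
        end = min(start + timedelta(days=interval), until)
        windows.append((start, end))
        start = end
    return windows


since = date(2021, 10, 1)
until = date(2021, 10, 5)

print(len(date_windows(since, until, 1)))  # 4 windows of one day each
print(len(date_windows(since, until, 2)))  # 2 windows of two days each
print(date_windows(since, until, 3)[-1])   # clipped last window: Oct 4 to Oct 5
```

Under this assumption, halving the interval doubles the number of windows, which is why an interval that divides the period evenly avoids a short leftover window at the end.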

Scrape top tweets with the hashtag #bitcoin, in proximity and without replies:
The smaller the interval, the slower the process (choose an interval that evenly divides the period between the start date and the end date)

data = scrape(hashtag="bitcoin", since="2021-08-05", until=None, from_account = None, interval=1, 
              headless=True, display_type="Top", save_images=False, 
              resume=False, filter_replies=True, proximity=True)

Get the main information of a given list of users:
These users follow me on Twitter

users = ['nagouzil', '@yassineaitjeddi', 'TahaAlamIdrissi', 
         '@Nabila_Gl', 'geceeekusuu', '@pabu232', '@av_ahmet', '@x_born_to_die_x']

This function will return a list that contains:
["no. of following", "no. of followers", "join date", "date of birth", "location", "website", "description"]

users_info = get_user_information(users, headless=True)
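
The positional list is easier to work with once its fields are named. The sketch below assumes that each user's entry is a list in the order given above; the container shape, the helper name `label_user_info`, and the sample values are assumptions for illustration, so check them against what get_user_information actually returns.

```python
# Field order as listed above for the values returned per user.
FIELDS = (
    "following",
    "followers",
    "join_date",
    "date_of_birth",
    "location",
    "website",
    "description",
)


def label_user_info(values):
    """Pair a positional list of user details with field names."""
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


# Hypothetical values standing in for one user's scraped details.
example = ["120", "3,456", "Joined May 2015", "", "Alicante, Spain",
           "example.com", "Data enthusiast"]
labeled = label_user_info(example)
print(labeled["followers"], labeled["location"])
```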

Get followers and following of a given list of users. Enter your username and password in the .env file. I recommend that you do not use your main account.
Increase the wait argument to avoid getting your account banned and to keep the crawl complete when the internet is slow. I used 1 and it was safe.

Set your .env file with the SCWEET_EMAIL, SCWEET_USERNAME and SCWEET_PASSWORD variables and provide its path

env_path = ".env"

following = get_users_following(users=users, env=env_path, verbose=0, headless=True, wait=2, limit=50, file_path=None)

followers = get_users_followers(users=users, env=env_path, verbose=0, headless=True, wait=2, limit=50, file_path=None)
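
It can help to check the .env file before starting a browser session. The sketch below parses simple KEY=VALUE lines with the standard library and reports which of the three variables named above are absent or empty. The helper name `missing_credentials` is hypothetical, and the parser is deliberately simplistic (no quoting or export handling).

```python
import os
import tempfile

# Variable names taken from the instructions above.
REQUIRED = ("SCWEET_EMAIL", "SCWEET_USERNAME", "SCWEET_PASSWORD")


def missing_credentials(env_path):
    """Return the required variable names that are absent or empty in a KEY=VALUE file."""
    found = {}
    with open(env_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            found[key.strip()] = value.strip()
    return [name for name in REQUIRED if not found.get(name)]


# Demonstration with a throwaway file containing placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".env")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("SCWEET_USERNAME=my_test_account\nSCWEET_PASSWORD=\n")
    missing = missing_credentials(demo_path)

print(missing)
```

An empty result means all three variables are present and non-empty; it says nothing about whether the credentials are valid.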

Terminal :

Scrape tweets.

optional arguments:
  -h, --help            show this help message and exit
  --words WORDS         Words to search for. they should be separated by "//" : Cat//Dog.
  --from_account FROM_ACCOUNT
                        Tweets posted by "from_account" account.
  --to_account TO_ACCOUNT
                        Tweets posted in response to "to_account" account.
  --mention_account MENTION_ACCOUNT
                        Tweets that mention "mention_account" account.         
  --hashtag HASHTAG
                        Tweets containing #hashtag
  --until UNTIL         End date for search query. example : %Y-%m-%d.
  --since SINCE
                        Start date for search query. example : %Y-%m-%d.
  --interval INTERVAL   Interval days between each start date and end date for
                        search queries. example : 5.
  --lang LANG           Tweets language. Example : "en" for english and "fr"
                        for french.
  --headless HEADLESS   Headless webdrives or not. True or False
  --limit LIMIT         Limit tweets to be scraped.
  --display_type DISPLAY_TYPE
                        Display type of Twitter page : Latest or Top tweets
  --resume RESUME       Resume the last scraping. specify the csv file path.
  --proxy PROXY         Proxy server
  --proximity PROXIMITY Proximity
  --geocode GEOCODE     Geographical location coordinates to center the
                        search (), radius. No compatible with proximity
  --minreplies MINREPLIES
                        Min. number of replies to the tweet
  --minlikes MINLIKES   Min. number of likes to the tweet
  --minretweets MINRETWEETS
                        Min. number of retweets to the tweet

To run the script :

python scweet.py --words "excellente//car" --to_account "tesla" --until 2020-01-05 --since 2020-01-01 --limit 10 --interval 1 --display_type Latest --lang="en" --headless True
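
When the search terms live in a Python list, the command can be assembled programmatically. The sketch below joins the words with "//" as the help text requires and builds an argument list using only flags shown above. The helper name `build_command` is hypothetical, and the command is only printed, not executed.

```python
import shlex
import sys


def build_command(words, since, until, limit=10, interval=1, lang="en"):
    """Build the argument list for scweet.py from Python values."""
    return [
        sys.executable, "scweet.py",
        "--words", "//".join(words),  # words are separated by "//"
        "--since", since,
        "--until", until,
        "--limit", str(limit),
        "--interval", str(interval),
        "--lang", lang,
        "--headless", "True",
    ]


command = build_command(["excellente", "car"], since="2020-01-01", until="2020-01-05")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the directory that contains scweet.py.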