Initial commit
jphalip committed Feb 6, 2017
1 parent e718e81 commit c38bc61
Showing 7 changed files with 487 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
google-api-key.txt
136 changes: 136 additions & 0 deletions README.md
@@ -0,0 +1,136 @@
This repository contains all the code I wrote to support a case study published on my blog:

The study aims to evaluate bias in the media using sentiment analysis of the video titles published by some prominent
American TV channels on their YouTube accounts.

Setup
=====

A bit of setting up is required before you can run this code.

Google API key
--------------

First, you need to get an API key from Google by following the steps described here: https://developers.google.com/api-client-library/python/guide/aaa_apikeys

This key will be used for two services:
- Google Cloud Natural Language API
- YouTube Data API v3

Once you've acquired a key, save it into a file named `google-api-key.txt` at the root of this repository.

Python environment
------------------

The following packages need to be installed in your Python environment:

ipython==5.1.0
pandas==0.19.1
google-api-python-client==1.5.5
unicodecsv==0.14.1
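If you use pip, one way to install them is to save the pins above to a `requirements.txt` file (filename assumed here, not part of this commit) and run `pip install -r requirements.txt`:

```
ipython==5.1.0
pandas==0.19.1
google-api-python-client==1.5.5
unicodecsv==0.14.1
```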

Acquiring data
==============

Four types of datasets must be generated: channels, topics, videos and sentiment scores.

Channels
--------

Create a `channels.csv` file using the structure detailed in this example:

```python
import pandas

channels = pandas.DataFrame.from_records([
    {'title': 'Fox News', 'slug': 'fox-news', 'youtube_id': 'UCXIJgqnII2ZOINSWNOGFThA', 'playlist_id': 'UUXIJgqnII2ZOINSWNOGFThA', 'url': 'https://www.youtube.com/user/FoxNewsChannel', 'color': '#5975a4'},
    {'title': 'CNN', 'slug': 'cnn', 'youtube_id': 'UCupvZG-5ko_eiXAupbDfxWw', 'playlist_id': 'UUupvZG-5ko_eiXAupbDfxWw', 'url': 'https://www.youtube.com/user/CNN', 'color': '#b55d60'},
    {'title': 'MSNBC', 'slug': 'msnbc', 'youtube_id': 'UCaXkIU1QidjPwiAYu6GcHjg', 'playlist_id': 'UUaXkIU1QidjPwiAYu6GcHjg', 'url': 'https://www.youtube.com/user/msnbcleanforward', 'color': '#5f9e6e'},
    {'title': 'CBS News', 'slug': 'cbs-news', 'youtube_id': 'UC8p1vwvWtl6T73JiExfWs1g', 'playlist_id': 'UU8p1vwvWtl6T73JiExfWs1g', 'url': 'https://www.youtube.com/user/CBSNewsOnline', 'color': '#666666'},
])

channels.to_csv('channels.csv', index=False, encoding='utf-8')
```

The `youtube_id` is the channel's unique YouTube ID. Finding a channel's ID is a little tricky:

- Go to the channel's page (e.g. https://www.youtube.com/user/CNN).
- View the HTML source of the page.
- Look for "data-channel-external-id" in the HTML source. The value associated with it is the channel's YouTube ID.
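The lookup above can also be automated. Here is a minimal sketch that pulls the attribute out of a downloaded page source (the attribute name comes from the steps above, but YouTube's page markup may change at any time):

```python
import re

def extract_channel_id(html):
    """Pull the channel ID out of a channel page's HTML source, or None if absent."""
    match = re.search(r'data-channel-external-id="([^"]+)"', html)
    return match.group(1) if match else None
```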

The `playlist_id` corresponds to the channel's default playlist, where all of its videos are published. To retrieve a channel's `playlist_id`:

- Visit this URL after replacing "CHANNEL-ID" with the channel's ID: https://developers.google.com/apis-explorer/#search/youtube/youtube/v3/youtube.channels.list?part=contentDetails&id=CHANNEL-ID
- Click the "Execute without OAuth" link at the bottom of the page.
- The playlist ID is shown in the field `items[0].contentDetails.relatedPlaylists.uploads`.
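As the example rows in `channels.csv` above suggest, each `playlist_id` is simply the `youtube_id` with its `UC` prefix replaced by `UU`. A small helper based on that observation (it is a pattern seen in the sample data, not an official API guarantee — the API Explorer lookup above remains the authoritative method):

```python
def uploads_playlist_id(channel_id):
    """Derive the uploads playlist ID from a channel ID (UC... -> UU...)."""
    if not channel_id.startswith('UC'):
        raise ValueError('Unexpected channel ID format: %s' % channel_id)
    return 'UU' + channel_id[2:]
```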

Topics
------

Create a `topics.csv` file using the structure detailed in this example:

```python
import pandas

topics = pandas.DataFrame.from_records([
    {'title': 'Obama', 'slug': 'obama', 'variant1': 'Obama', 'variant2': 'Obamas'},
    {'title': 'Clinton', 'slug': 'clinton', 'variant1': 'Clinton', 'variant2': 'Clintons'},
    {'title': 'Trump', 'slug': 'trump', 'variant1': 'Trump', 'variant2': 'Trumps'},
    {'title': 'Democrats', 'slug': 'democrats', 'variant1': 'Democrat', 'variant2': 'Democrats'},
    {'title': 'Republicans', 'slug': 'republicans', 'variant1': 'Republican', 'variant2': 'Republicans'},
    {'title': 'Liberals', 'slug': 'liberals', 'variant1': 'Liberal', 'variant2': 'Liberals'},
    {'title': 'Conservatives', 'slug': 'conservatives', 'variant1': 'Conservative', 'variant2': 'Conservatives'},
])

topics.to_csv('topics.csv', index=False, encoding='utf-8')
```

The variants are the different terms that will be searched for in the video titles to match videos with your topics of choice.
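The actual matching logic lives in `code/utils.py` (`create_topic_columns`, one of the 7 changed files but not shown in this excerpt), so the real implementation may differ; purely as an illustration, whole-word matching against a topic's variants could look like this:

```python
import re

def matches_topic(title, variants):
    # Case-insensitive whole-word match against any of the topic's variants.
    pattern = r'\b(?:%s)\b' % '|'.join(re.escape(v) for v in variants)
    return re.search(pattern, title, re.IGNORECASE) is not None
```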

Videos
------

Run the following snippets of code to download all the video metadata from YouTube for your channels of choice.

First, download all video information and create a separate CSV file for each channel (e.g. `videos-cnn.csv`):

```python
from code.youtube_api import download_channels_videos

download_channels_videos(channels)
```

Second, merge all the CSV files generated above into a single `videos-MERGED.csv` file:

```python
from code.youtube_api import merge_channel_videos

merge_channel_videos(channels)
```

Lastly, create an extra column for each topic, flagging the videos whose titles match that topic:

```python
import pandas

from code.utils import create_topic_columns

videos = pandas.read_csv('videos-MERGED.csv')
# Load the topics defined in the previous section
topics = pandas.read_csv('topics.csv')
create_topic_columns(videos, topics)
videos.to_csv('videos.csv', index=False, encoding='utf-8')
```

You now have a `videos.csv` file containing all the video metadata for all channels.

Sentiment scores
----------------

The last step is to download sentiment scores from the Google Natural Language API. **Note that this API is not free.**
Make sure to first refer to the API's [pricing page](https://cloud.google.com/natural-language/pricing) for adequate budgeting.

Run the following:

```python
from code.language_api import download_sentiments

download_sentiments(videos)
```

You now have a `sentiments.csv` file containing the sentiment scores for all relevant videos.
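The analysis section below is still to come; in the meantime, one plausible way to join the scores back onto the video metadata is a left merge on the shared ID column (a sketch, not part of the repo's documented workflow — the `sentiments.csv` column names are taken from `code/language_api.py`):

```python
import pandas

def merge_sentiments(videos, sentiments):
    # Left-join so videos without a sentiment score are kept, with NaN scores.
    return videos.merge(sentiments, on='youtube_id', how='left')
```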

Exploring and analysing the data
================================

Coming soon...
Empty file added code/__init__.py
Empty file.
82 changes: 82 additions & 0 deletions code/language_api.py
@@ -0,0 +1,82 @@
import os
import time
import json
import unicodecsv as csv
from apiclient.discovery import build
from apiclient.errors import HttpError


# Strip the trailing newline that text editors typically add to the key file
with open('google-api-key.txt') as key_file:
    API_KEY = key_file.read().strip()

language_service = build('language', 'v1', developerKey=API_KEY)


def analyze_sentiment(text):
    """
    Sends a request to the Google Natural Language API to analyze
    the sentiment of the given piece of text.
    """
    request = language_service.documents().analyzeSentiment(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text,
            }
        })
    return request.execute()


def download_sentiments(videos, output_file='sentiments.csv'):
    """
    Downloads sentiment scores from the Google Natural Language API
    for the given videos, then stores the results in a CSV file.
    """

    # Time to wait when we get rate-limited
    wait_time = 120

    # Create new (or open existing) CSV file to hold the sentiment analysis values.
    # unicodecsv writes encoded bytes, so the file is opened in binary mode.
    if os.path.isfile(output_file):
        # Open existing file in "append" mode
        f = open(output_file, 'ab')
        writer = csv.writer(f, encoding='utf-8')
    else:
        # Open new file in "write" mode and add the headers
        f = open(output_file, 'wb')
        writer = csv.writer(f, encoding='utf-8')
        writer.writerow(['youtube_id', 'sentiment', 'sentiment_score', 'sentiment_magnitude'])

    i = 0
    n_videos = videos.shape[0]
    print('Start processing %s videos...' % n_videos)
    while i < n_videos:
        video = videos.iloc[i]
        try:
            # Send request to the Google Natural Language API for the current video
            sentiment = analyze_sentiment(video['title'])
            # Add result to the CSV file
            writer.writerow([
                video['youtube_id'],
                json.dumps(sentiment),
                sentiment['documentSentiment']['score'],
                sentiment['documentSentiment']['magnitude'],
            ])
            # Move on to the next video
            i += 1
        except HttpError as e:
            if e.resp.status == 429:
                print('Processed %s/%s videos so far...' % (i, n_videos))
                # We got rate-limited, so wait a bit before trying again with the same video
                time.sleep(wait_time)
            elif e.resp.status == 400:
                # Bad request. Probably something wrong with the video's text
                error_content = json.loads(e.content)['error']
                print('Error [%s] for video %s: %s' % (
                    error_content['code'], video['youtube_id'], error_content['message']))
                # Move on to the next video
                i += 1
            else:
                print('Unhandled error for video %s: %s' % (
                    video['youtube_id'], video['title']))
                raise
    f.close()
    print('Finished processing %s videos.' % n_videos)
136 changes: 136 additions & 0 deletions code/plotting.py
@@ -0,0 +1,136 @@
from __future__ import division
import math
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns


def plot_channel_stats(stats, topics, channels, fig_height=8, y_center=False, title=None):
    """
    Plots bar charts for the given channel stats.
    A separate subplot is generated for each given topic.
    """
    fig, axes = plt.subplots(nrows=int(math.ceil(topics.shape[0]/2)), ncols=2, figsize=(8, fig_height))
    fig.subplots_adjust(hspace=.5)

    for i, topic in topics.iterrows():
        ax = fig.axes[i]

        # If requested, center all axes around 0
        if y_center:
            # Calculate the approximate amplitude of the given stats values
            amplitude = math.ceil(stats.abs().values.max()*10)/10
            ax.set_ylim(-amplitude, amplitude)

        # If we have negative values, grey out the negative space for better contrast
        if stats.values.min() < 0:
            ax.axhspan(0, ax.get_ylim()[0], facecolor='0.2', alpha=0.15)

        color = channels.sort_values('title').color
        ax.bar(range(len(stats.index)), stats[topic.slug], tick_label=stats.index, color=color, align='center')
        ax.set_title(topic.title, size=11)

    # Hide potential last empty subplot
    if topics.shape[0] % 2:
        fig.axes[-1].axis('off')

    # Optional title at the top
    if title is not None:
        multiline = '\n' in title
        y = 1. if multiline else .96
        plt.suptitle(title, size=14, y=y)

    plt.show()


def plot_compressed_channel_stats(stats, color=None, y_center=False, title=None):
    """
    Similar to plot_channel_stats except everything is represented
    in a single plot (i.e. no subplots).
    """
    plt.figure(figsize=(6, 4))
    ax = plt.gca()

    # If requested, center all axes around 0
    if y_center:
        # Calculate the approximate amplitude of the given stats values
        amplitude = math.ceil(stats.abs().values.max()*10)/10
        ax.set_ylim(-amplitude, amplitude)

    # If we have negative values, grey out the negative space
    # for better contrast
    if stats.values.min() < 0:
        ax.axhspan(0, ax.get_ylim()[0], facecolor='0.2', alpha=0.15)

    # The actual plot
    stats.plot(kind='bar', color=color, width=0.6, ax=ax)

    # Presentation cleanup
    plt.xlabel('')
    plt.xticks(rotation=0)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # Optional title at the top
    if title is not None:
        plt.title(title)

    plt.show()


def plot_sentiment_series(videos, topics, channels, start_date=None, title=None):
    """
    Plot linear timeseries of sentiment scores for the given videos:
    One separate subplot is generated for each topic. Each subplot
    has one timeseries for each channel, and one timeseries for the
    average values across all channels.
    """
    fig, axes = plt.subplots(nrows=topics.shape[0], ncols=1, figsize=(8, 4*topics.shape[0]))
    fig.subplots_adjust(hspace=.3)

    # Resample rule: 2-week buckets
    resample_rule = '2W'

    # Calculate the approximate amplitude of the given sentiment values
    amplitude = math.ceil(videos.sentiment_score.abs().max()*10)/10

    for i, topic in topics.reset_index().iterrows():
        ax = fig.axes[i]
        # Grey out the negative sentiment area
        ax.axhspan(0, -1, facecolor='0.2', alpha=0.15)

        # Plot a timeseries for the average sentiment across all channels
        topic_mask = videos[topic.slug]
        if start_date is not None:
            topic_mask = topic_mask & (videos.published_at >= start_date)
        ts = videos[topic_mask].set_index('published_at').resample(resample_rule)['sentiment_score'].mean().interpolate()
        sns.tsplot(ts, ts.index, color='#fcef99', linewidth=6, ax=ax)

        # Plot a separate time-series for each channel
        for _, channel in channels.iterrows():
            channel_mask = topic_mask & (videos.channel == channel.title)
            ts = videos[channel_mask].set_index('published_at').resample(resample_rule)['sentiment_score'].mean().interpolate()
            if len(ts) > 1:
                sns.tsplot(ts, ts.index, color=channel['color'], linewidth=1, ax=ax)

        # Format x-axis labels as dates
        xvalues = ax.xaxis.get_majorticklocs()
        xlabels = [datetime.utcfromtimestamp(x/1e9).strftime("%Y.%m") for x in xvalues]
        ax.set_xticklabels(xlabels)

        # A little extra presentation cleanup
        ax.set_xlabel('')
        ax.set_title(topic['title'], size=11)
        ax.set_ylim(-amplitude, amplitude)

        # Add legend
        handles = [Patch(color='#fcef99', label='Average')]
        for _, channel in channels.iterrows():
            handles.append(Patch(color=channel['color'], label=channel['title']))
        ax.legend(handles=handles, fontsize=8)

    # Optional title at the top
    if title is not None:
        plt.suptitle(title, size=14, y=.92)

    plt.show()
