Showing 7 changed files with 487 additions and 0 deletions.

@@ -0,0 +1 @@
google-api-key.txt

@@ -0,0 +1,136 @@

This repository contains all the code I wrote to support a case study published on my blog:

The study aims to evaluate bias in the media using sentiment analysis of video titles published by some prominent American TV channels on their YouTube accounts.

Setup
=====

A bit of setting up is required before you can run this code.

Google API key
--------------

First, you need to get an API key from Google by following the steps described here: https://developers.google.com/api-client-library/python/guide/aaa_apikeys

This key will be used for two services:

- Google Cloud Natural Language API
- YouTube Data API v3

Once you've acquired a key, save it into a file named `google-api-key.txt` at the root of this repository.

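For reference, the code in this repository reads the key straight from that file; a minimal sketch of that pattern (the `.strip()` guard against a trailing newline is my addition, not part of the original code):

```python
# Load the Google API key saved at the root of the repository
API_KEY = open('google-api-key.txt', 'r').read().strip()
```
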
Python environment
------------------

The following Python packages need to be installed in your Python environment:

    ipython==5.1.0
    pandas==0.19.1
    google-api-python-client==1.5.5
    unicodecsv==0.14.1

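For example, assuming you manage dependencies with pip, they can all be installed in one go:

```
pip install ipython==5.1.0 pandas==0.19.1 google-api-python-client==1.5.5 unicodecsv==0.14.1
```
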
Acquiring data
==============

Four types of datasets must be generated: channels, topics, videos and sentiment scores.

Channels
--------

Create a `channels.csv` file using the structure detailed in this example:

```python
import pandas

channels = pandas.DataFrame.from_records([
    {'title': 'Fox News', 'slug': 'fox-news', 'youtube_id': 'UCXIJgqnII2ZOINSWNOGFThA', 'playlist_id': 'UUXIJgqnII2ZOINSWNOGFThA', 'url': 'https://www.youtube.com/user/FoxNewsChannel', 'color': '#5975a4'},
    {'title': 'CNN', 'slug': 'cnn', 'youtube_id': 'UCupvZG-5ko_eiXAupbDfxWw', 'playlist_id': 'UUupvZG-5ko_eiXAupbDfxWw', 'url': 'https://www.youtube.com/user/CNN', 'color': '#b55d60'},
    {'title': 'MSNBC', 'slug': 'msnbc', 'youtube_id': 'UCaXkIU1QidjPwiAYu6GcHjg', 'playlist_id': 'UUaXkIU1QidjPwiAYu6GcHjg', 'url': 'https://www.youtube.com/user/msnbcleanforward', 'color': '#5f9e6e'},
    {'title': 'CBS News', 'slug': 'cbs-news', 'youtube_id': 'UC8p1vwvWtl6T73JiExfWs1g', 'playlist_id': 'UU8p1vwvWtl6T73JiExfWs1g', 'url': 'https://www.youtube.com/user/CBSNewsOnline', 'color': '#666666'},
])

channels.to_csv('channels.csv', index=False, encoding='utf-8')
```

The `youtube_id` is the channel's unique YouTube ID. Finding out a channel's ID is a little tricky:

- Go to the channel's page (e.g. https://www.youtube.com/user/CNN).
- View the HTML source of the page.
- Look for "data-channel-external-id" in the HTML source. The value associated with it is the channel's YouTube ID.

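If you'd rather script that lookup, here is a rough sketch of the same scraping approach (assuming the channel page still embeds a `data-channel-external-id` attribute):

```python
import re
import urllib2

# Fetch the channel page and pull the channel ID out of its HTML source
html = urllib2.urlopen('https://www.youtube.com/user/CNN').read()
match = re.search(r'data-channel-external-id="([^"]+)"', html)
channel_id = match.group(1) if match else None
```
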
The `playlist_id` corresponds to a channel's default playlist, where all its videos are published. To retrieve a channel's `playlist_id`:

- Visit this URL after replacing "CHANNEL-ID" with the channel's ID: https://developers.google.com/apis-explorer/#search/youtube/youtube/v3/youtube.channels.list?part=contentDetails&id=CHANNEL-ID
- Click the "Execute without OAuth" link at the bottom of the page.
- The playlist ID is now presented in the field `items[0].contentDetails.relatedPlaylists.uploads`.

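The same lookup can be done from Python; a minimal sketch using the YouTube Data API v3 through the same client library this repository uses (assuming `API_KEY` holds your key):

```python
from apiclient.discovery import build

# Ask the YouTube Data API for the channel's default "uploads" playlist
youtube = build('youtube', 'v3', developerKey=API_KEY)
response = youtube.channels().list(part='contentDetails', id='UCupvZG-5ko_eiXAupbDfxWw').execute()
playlist_id = response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
```
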
Topics
------

Create a `topics.csv` file using the structure detailed in this example:

```python
topics = pandas.DataFrame.from_records([
    {'title': 'Obama', 'slug': 'obama', 'variant1': 'Obama', 'variant2': 'Obamas'},
    {'title': 'Clinton', 'slug': 'clinton', 'variant1': 'Clinton', 'variant2': 'Clintons'},
    {'title': 'Trump', 'slug': 'trump', 'variant1': 'Trump', 'variant2': 'Trumps'},
    {'title': 'Democrats', 'slug': 'democrats', 'variant1': 'Democrat', 'variant2': 'Democrats'},
    {'title': 'Republicans', 'slug': 'republicans', 'variant1': 'Republican', 'variant2': 'Republicans'},
    {'title': 'Liberals', 'slug': 'liberals', 'variant1': 'Liberal', 'variant2': 'Liberals'},
    {'title': 'Conservatives', 'slug': 'conservatives', 'variant1': 'Conservative', 'variant2': 'Conservatives'},
])

topics.to_csv('topics.csv', index=False, encoding='utf-8')
```

The variants are the different terms that will be searched for in the video titles in order to match videos with your topics of choice.

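The matching itself is done by `create_topic_columns` from `code.utils` (used in the Videos step below; its source is not shown in this diff). A minimal sketch of the idea, assuming case-insensitive whole-word matching and one boolean column per topic slug:

```python
import re

def create_topic_columns_sketch(videos, topics):
    # For each topic, flag videos whose title mentions any variant as a whole word
    for _, topic in topics.iterrows():
        variants = [topic['variant1'], topic['variant2']]
        pattern = r'\b(%s)\b' % '|'.join(re.escape(v) for v in variants)
        videos[topic['slug']] = videos['title'].str.contains(pattern, case=False, na=False)
```
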
Videos
------

Run the following snippets of code in order to download all the video metadata from YouTube for your channels of choice.

First, this will download all video information and create a separate CSV file for each channel (e.g. `videos-cnn.csv`):

```python
from code.youtube_api import download_channels_videos

download_channels_videos(channels)
```

Second, this will merge all the CSV files generated above into a single `videos-MERGED.csv` file:

```python
from code.youtube_api import merge_channel_videos

merge_channel_videos(channels)
```

Lastly, this will create extra columns for each topic:

```python
import pandas as pd

from code.utils import create_topic_columns

videos = pd.read_csv('videos-MERGED.csv')
create_topic_columns(videos, topics)
videos.to_csv('videos.csv', index=False, encoding='utf-8')
```

You now have a `videos.csv` file containing all the video metadata for all channels.

Sentiment scores
----------------

The last step is to download sentiment scores from the Google Natural Language API. **Note that this API is not free.**
Make sure to first refer to the API's [pricing page](https://cloud.google.com/natural-language/pricing) for adequate budgeting.

Run the following:

```python
from code.language_api import download_sentiments

download_sentiments(videos)
```

You now have a `sentiments.csv` file containing the sentiment scores for all relevant videos.

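Each row of `sentiments.csv` is keyed by `youtube_id` (see `code.language_api` below), so one convenient way to combine the scores with the video metadata (my suggestion, not a step from the original instructions) is a plain merge:

```python
import pandas as pd

# Join sentiment scores onto the video metadata by YouTube ID
videos = pd.read_csv('videos.csv')
sentiments = pd.read_csv('sentiments.csv')
videos_with_sentiment = videos.merge(sentiments, on='youtube_id')
```
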
Exploring and analysing the data
================================

Coming soon...

Empty file.
@@ -0,0 +1,82 @@

import os
import time
import json
import unicodecsv as csv
from apiclient.discovery import build
from apiclient.errors import HttpError


# Read the API key, stripping any trailing newline from the file
API_KEY = open('google-api-key.txt', 'r').read().strip()
language_service = build('language', 'v1', developerKey=API_KEY)


def analyze_sentiment(text):
    """
    Sends a request to the Google Natural Language API to analyze
    the sentiment of the given piece of text.
    """
    request = language_service.documents().analyzeSentiment(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text,
            }
        })
    return request.execute()


def download_sentiments(videos, output_file='sentiments.csv'):
    """
    Downloads sentiment scores from the Google Natural Language API
    for the given videos, then stores the results in a CSV file.
    """

    # Time to wait when we get rate-limited
    wait_time = 120

    # Create new (or open existing) CSV file to hold the sentiment analysis values
    if os.path.isfile(output_file):
        # Open existing file in "append" mode
        f = open(output_file, 'a')
        writer = csv.writer(f, encoding='utf-8')
    else:
        # Open new file in "write" mode and add the headers
        f = open(output_file, 'w')
        writer = csv.writer(f, encoding='utf-8')
        writer.writerow(['youtube_id', 'sentiment', 'sentiment_score', 'sentiment_magnitude'])

    i = 0
    n_videos = videos.shape[0]
    print 'Start processing %s videos...' % n_videos
    while i < n_videos:
        video = videos.iloc[i]
        try:
            # Send request to the Google Natural Language API for the current video
            sentiment = analyze_sentiment(video['title'])
            # Add result to the CSV file
            writer.writerow([
                video['youtube_id'],
                json.dumps(sentiment),
                sentiment['documentSentiment']['score'],
                sentiment['documentSentiment']['magnitude'],
            ])
            # Move on to the next video
            i += 1
        except HttpError, e:
            if e.resp.status == 429:
                print 'Processed %s/%s videos so far...' % (i, n_videos)
                # We got rate-limited, so wait a bit before trying again with the same video
                time.sleep(wait_time)
            elif e.resp.status == 400:
                # Bad request. Probably something wrong with the video's text
                error_content = json.loads(e.content)['error']
                print 'Error [%s] for video %s: %s' % (
                    error_content['code'], video['youtube_id'], error_content['message'])
                # Move on to the next video
                i += 1
            else:
                print 'Unhandled error for video %s: %s' % (
                    video['youtube_id'], video['title'])
                raise
    f.close()
    print 'Finished processing %s videos.' % n_videos

@@ -0,0 +1,136 @@

from __future__ import division
import math
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns


def plot_channel_stats(stats, topics, channels, fig_height=8, y_center=False, title=None):
    """
    Plots bar charts for the given channel stats.
    A separate subplot is generated for each given topic.
    """
    fig, axes = plt.subplots(nrows=int(math.ceil(topics.shape[0]/2)), ncols=2, figsize=(8, fig_height))
    fig.subplots_adjust(hspace=.5)

    # reset_index() so subplot positions follow row order, whatever the frame's index
    for i, topic in topics.reset_index().iterrows():
        ax = fig.axes[i]

        # If requested, center all axes around 0
        if y_center:
            # Calculate the approximate amplitude of the given stats values
            amplitude = math.ceil(stats.abs().values.max()*10)/10
            ax.set_ylim(-amplitude, amplitude)

        # If we have negative values, grey out the negative space for better contrast
        if stats.values.min() < 0:
            ax.axhspan(0, ax.get_ylim()[0], facecolor='0.2', alpha=0.15)

        color = channels.sort_values('title').color
        ax.bar(range(len(stats.index)), stats[topic.slug], tick_label=stats.index, color=color, align='center')
        ax.set_title(topic.title, size=11)

    # Hide potential last empty subplot
    if topics.shape[0] % 2:
        fig.axes[-1].axis('off')

    # Optional title at the top
    if title is not None:
        multiline = '\n' in title
        y = 1. if multiline else .96
        plt.suptitle(title, size=14, y=y)

    plt.show()


def plot_compressed_channel_stats(stats, color=None, y_center=False, title=None):
    """
    Similar to plot_channel_stats except everything is represented
    in a single plot (i.e. no subplots).
    """
    plt.figure(figsize=(6, 4))
    ax = plt.gca()

    # If requested, center all axes around 0
    if y_center:
        # Calculate the approximate amplitude of the given stats values
        amplitude = math.ceil(stats.abs().values.max()*10)/10
        ax.set_ylim(-amplitude, amplitude)

    # If we have negative values, grey out the negative space
    # for better contrast
    if stats.values.min() < 0:
        ax.axhspan(0, ax.get_ylim()[0], facecolor='0.2', alpha=0.15)

    # The actual plot
    stats.plot(kind='bar', color=color, width=0.6, ax=ax)

    # Presentation cleanup
    plt.xlabel('')
    plt.xticks(rotation=0)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # Optional title at the top
    if title is not None:
        plt.title(title)

    plt.show()


def plot_sentiment_series(videos, topics, channels, start_date=None, title=None):
    """
    Plot linear timeseries of sentiment scores for the given videos:
    One separate subplot is generated for each topic. Each subplot
    has one timeseries for each channel, and one timeseries for the
    average values across all channels.
    """
    fig, axes = plt.subplots(nrows=topics.shape[0], ncols=1, figsize=(8, 4*topics.shape[0]))
    fig.subplots_adjust(hspace=.3)

    # Resample rule: 2-week buckets
    resample_rule = '2W'

    # Calculate the approximate amplitude of the given sentiment values
    amplitude = math.ceil(videos.sentiment_score.abs().max()*10)/10

    for i, topic in topics.reset_index().iterrows():
        ax = fig.axes[i]
        # Grey out the negative sentiment area
        ax.axhspan(0, -1, facecolor='0.2', alpha=0.15)

        # Plot a timeseries for the average sentiment across all channels
        topic_mask = videos[topic.slug]
        if start_date is not None:
            topic_mask = topic_mask & (videos.published_at >= start_date)
        ts = videos[topic_mask].set_index('published_at').resample(resample_rule)['sentiment_score'].mean().interpolate()
        sns.tsplot(ts, ts.index, color='#fcef99', linewidth=6, ax=ax)

        # Plot a separate time-series for each channel
        for _, channel in channels.iterrows():
            channel_mask = topic_mask & (videos.channel == channel.title)
            ts = videos[channel_mask].set_index('published_at').resample(resample_rule)['sentiment_score'].mean().interpolate()
            if len(ts) > 1:
                sns.tsplot(ts, ts.index, color=channel['color'], linewidth=1, ax=ax)

        # Format x-axis labels as dates
        xvalues = ax.xaxis.get_majorticklocs()
        xlabels = [datetime.utcfromtimestamp(x/1e9).strftime('%Y.%m') for x in xvalues]
        ax.set_xticklabels(xlabels)

        # A little extra presentation cleanup
        ax.set_xlabel('')
        ax.set_title(topic['title'], size=11)
        ax.set_ylim(-amplitude, amplitude)

        # Add legend
        handles = [Patch(color='#fcef99', label='Average')]
        for _, channel in channels.iterrows():
            handles.append(Patch(color=channel['color'], label=channel['title']))
        ax.legend(handles=handles, fontsize=8)

    # Optional title at the top
    if title is not None:
        plt.suptitle(title, size=14, y=.92)

    plt.show()
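

# Example usage sketch (my assumption, not part of the original file). It expects
# the merged videos DataFrame to carry 'published_at' parsed as datetimes, a
# 'channel' column, a 'sentiment_score' column and one boolean column per topic slug:
#
#   plot_sentiment_series(videos, topics, channels, title='Sentiment of video titles over time')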