Martin Stoffers edited this page Nov 12, 2015 · 8 revisions

Building statistics for shownot.es

Ideas for Angular frontend definitions should go to https://github.com/shownotes/snotes20-angular-webapp/wiki/Statistics

All final definitions should go to https://github.com/shownotes/snotes20-restapi/wiki/API#apistatistics

Idea

Analyze all publications on shownot.es and generate statistics from that data. To do so, we need to generate word frequency tables and tf-idf tables for each episode of each podcast. We also need to build an overall corpus and generate the same tables for it. These analyses can be run on all text and on all URLs separately. First, the tables must be defined in the database. To keep the structure clear, the statistics feature should be implemented as a separate Django application named statistic.

Later on, the data from the extracted features must be reachable via REST. To achieve this, we need to build a proper API that generates and delivers the data for each graph to the Angular frontend. The graphs will be implemented with the d3.js library, so we need to find out which JSON data each graph requires.

Algorithms

We need:

  • Word frequency and tf-idf analysis on
    • all episodes
    • all episodes from a podcast (combined result)
    • the whole corpus
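As a rough illustration of the two analyses listed above, here is a minimal sketch in plain Python. The function names, the simple regex tokenizer, and the sample texts are assumptions made for this example and not part of the project; a real implementation would likely rely on a library instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(text):
    """Absolute word counts for one document, e.g. one episode."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / tokens in the document
    idf = log(number of documents / documents containing the word)
    """
    counts = [word_frequencies(doc) for doc in documents]
    # Number of documents each word appears in.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(documents)
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        })
    return scores


# Hypothetical shownotes of two episodes.
episodes = ["python podcast about python", "podcast about cooking"]
scores = tf_idf(episodes)
```

Words that occur in every document (here "podcast" and "about") get a score of zero, while words specific to one episode score higher. The combined result for a podcast could be obtained by concatenating its episodes into one document.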

Updating word frequency tables and tf-idf tables

The application must not affect the running web application. Assuming that most episodes are quite short, we could calculate the word frequencies in threads after a publication is generated or updated. For this, the last state_id must be used. This way, the data would be available to the search feature (word cloud) immediately. The calculation for the whole corpus and for each podcast as a whole could be done by a cron job once per day.
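The threaded update could look roughly like the following sketch. It is not the project's implementation: `frequency_store`, `update_word_frequencies`, and `on_publication_saved` are hypothetical names, the dictionary stands in for the database table, and the lookup of the last state_id is left out.

```python
import threading
from collections import Counter

# Hypothetical in-memory store standing in for the database table.
frequency_store = {}
store_lock = threading.Lock()


def update_word_frequencies(publication_id, text):
    """Count the words of one publication and store the result."""
    counts = Counter(text.lower().split())
    with store_lock:
        frequency_store[publication_id] = counts


def on_publication_saved(publication_id, text):
    """Start the calculation in the background and return the thread."""
    worker = threading.Thread(
        target=update_word_frequencies,
        args=(publication_id, text),
        daemon=True,
    )
    worker.start()
    return worker


# The caller returns immediately; join() is only used here to wait
# for the result in this example.
worker = on_publication_saved(1, "Hello world hello")
worker.join()
```

The request that creates or updates a publication only pays the cost of starting the thread, so the web application stays responsive while the counts are computed.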

Required changes in the Django backend in general

  • Build a database model for word frequencies and tf-idfs
  • Implementation of word frequencies and tf-idfs as algorithms
    • tf-idf as a task in manage.py
      • external update mechanism should be discussed
    • word-frequencies on create publication
  • Design serializers with Django REST framework for each feature in the frontend
    • Definitions of needed data must be provided

Word frequency and tf-idf calculation in python

  • Implementation using the NLTK and scikit-learn (built on SciPy) packages
    • NLTK could be used to build the word frequency tables
      • Stopword removal is included
    • scikit-learn could be used for TF-IDF
      • TfidfVectorizer is faster than the NLTK version of TF-IDF

Database

Word frequencies

  • A table for each publication

  • A combined table for each podcast

  • A combined table over all shownotes from all publications as corpus

    • needed by TF-IDF
  • Columns

    • word
    • absolute frequency
    • relative frequency with respect to all shownotes in the publication
    • rank?
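To make the proposed columns concrete, the sketch below builds one row per word with these four fields. `WordFrequencyRow` and `build_rows` are hypothetical names, and since the rank column is still an open question, ranking by descending absolute frequency (ties broken alphabetically) is only an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class WordFrequencyRow:
    word: str
    absolute_frequency: int
    relative_frequency: float  # share of all words in the publication
    rank: int                  # 1 = most frequent word (assumed meaning)


def build_rows(words):
    """Turn a list of words into table rows ordered by rank."""
    counts = Counter(words)
    total = sum(counts.values())
    # Most frequent first; ties are ordered alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        WordFrequencyRow(word, n, n / total, rank)
        for rank, (word, n) in enumerate(ordered, start=1)
    ]


# Hypothetical words taken from one publication.
rows = build_rows(["feed", "audio", "feed", "rss"])
```

In a Django model the same fields would map to a character field, an integer field, a float field, and another integer field, plus a reference to the publication, podcast, or corpus the row belongs to.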

TF-IDF


Interactive TimeLine-Plot for publications and podcasts (@felipedsp)

Generate a timeline at the top of the page that lets the user manually select a time period. A user could use this to search for specific publications by time.

  • Selecting a time range in the overall overview will show a list of podcasts that have publications in that time period.
  • Selecting a time range in the overview of a specific podcast will update the list of episodes in the same manner.
  • The time range selected in the overall overview should persist if the user clicks on a podcast to visit its detailed publication list.
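The filtering behind these selections can be sketched as follows. The record layout (`podcast`, `episode`, `date`) and both function names are assumptions for illustration only; in the backend this would more likely be a database query.

```python
from datetime import date

# Hypothetical publication records.
publications = [
    {"podcast": "podcast-a", "episode": "a-1", "date": date(2015, 1, 10)},
    {"podcast": "podcast-a", "episode": "a-2", "date": date(2015, 6, 2)},
    {"podcast": "podcast-b", "episode": "b-1", "date": date(2015, 2, 20)},
]


def episodes_in_range(publications, start, end, podcast=None):
    """Publications inside [start, end], optionally for one podcast."""
    return [
        p for p in publications
        if start <= p["date"] <= end
        and (podcast is None or p["podcast"] == podcast)
    ]


def podcasts_in_range(publications, start, end):
    """Names of podcasts with at least one publication in the range."""
    matches = episodes_in_range(publications, start, end)
    return sorted({p["podcast"] for p in matches})


start, end = date(2015, 1, 1), date(2015, 3, 31)
overview = podcasts_in_range(publications, start, end)
detail = episodes_in_range(publications, start, end, podcast="podcast-a")
```

Passing the same `start` and `end` to both functions mirrors the requirement that the selected range persists when the user moves from the overall overview to a single podcast.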

Backend API Resources

Possible problems

  • All episodes have a created_date but not a date
    • date is the date on which the episode was live (discovered via the hoersuppe API)
    • create_date is the date on which the episode was manually created by a user
    • Maybe we should use the number of publications in a month as X on the graph
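If the number of publications per month is used, the aggregation could be as simple as the sketch below. The function name and the sample dates are assumptions, and it groups by the creation date because that is the only date every episode has.

```python
from collections import Counter
from datetime import date


def publications_per_month(created_dates):
    """Count publications per calendar month, oldest month first."""
    counts = Counter(d.strftime("%Y-%m") for d in created_dates)
    return sorted(counts.items())


# Hypothetical creation dates of three publications.
monthly = publications_per_month([
    date(2015, 1, 3),
    date(2015, 1, 20),
    date(2015, 3, 1),
])
```

Months without any publication are simply missing from the result, so the frontend would have to fill those gaps if the timeline should be continuous.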

Data needed

Similarity of Podcasts and Episodes (@bratwurscht)

Wordclouds beside search (@bratwurscht)
