Added contents of Week 9 - Hierarchical Clustering of documents
raj1603chdry committed Dec 6, 2018
1 parent 01f9b10 commit 5fef27f
Showing 16 changed files with 152 additions and 0 deletions.
Binary file modified .DS_Store
Binary file not shown.
Binary file added WEEK 9/.DS_Store
Binary file not shown.
17 changes: 17 additions & 0 deletions WEEK 9/README.md
@@ -0,0 +1,17 @@
# WEEK 9 - Hierarchical Clustering of documents

**Cosine similarity is used as the similarity metric.**

## Available programs:

* _document_clustering.py_ - This program reads all the documents from the __documents__ folder, as specified in the file dictionary. It then builds a vector space model from them, using the word list specified in the code. From the vector space model it forms a pairwise cosine-similarity matrix of the documents and uses that matrix to perform hierarchical (single-linkage) clustering.
* _web_document_clustering.py_ - This program does the same as the one above, but instead of using pre-downloaded files it scrapes the contents from the websites specified in the dictionary.
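
The first two steps described above (word counts per document, then cosine similarity between the count vectors) can be sketched as follows. This is a minimal illustration rather than the code from the commit: the helper names `term_vector` and `cosine_similarity` are assumptions made for this example, and the two sentences are the contents of `doc2.txt` and `doc4.txt` from the __documents__ folder.

```python
import math
import string

def term_vector(text, word_list):
    """Count how often each word of word_list occurs in text (case-insensitive)."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [tokens.count(w.lower()) for w in word_list]

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

word_list = ['Automotive', 'Car', 'motorcycles', 'self-drive', 'IoT', 'hire', 'Dhoni']
doc2 = 'Automotive major Mahindra likely to introduce driverless car'
doc4 = 'Just drive, a self-drive car rental firm uses smart vehicle technology based on IoT'

v2 = term_vector(doc2, word_list)  # [1, 1, 0, 0, 0, 0, 0]
v4 = term_vector(doc4, word_list)  # [0, 1, 0, 1, 1, 0, 0]
print(v2, v4, round(cosine_similarity(v2, v4), 4))  # similarity is 1/sqrt(6), about 0.4082
```

The two documents share only the word "car", so their similarity is well below 1; documents with no listed word in common get a similarity of 0.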

## Sample output:
![document clustering output](output.png)
![document clustering dendrogram](dendrogram1.png)

### To run a program, run the following command in a terminal opened in this directory

```bash
python document_clustering.py
```
Binary file added WEEK 9/assignment.doc
Binary file not shown.
Binary file added WEEK 9/dendrogram1.png
126 changes: 126 additions & 0 deletions WEEK 9/document_clustering.py
@@ -0,0 +1,126 @@
# Implementing Vector Space Model and performing nearest neighbourhood clustering of the documents.

# Importing the libraries
import string
import pandas as pd
import math
import matplotlib.pyplot as plt

class document_clustering(object):
    """Implementing the document clustering class.
    It creates the vector space model of the passed documents and then
    creates a Hierarchical Cluster to organize them.

    Parameters:
    -------------
    file_dict: dictionary
        Contains the path of the different files to be read.
        Format: {file_index: path}
    word_list: list
        Contains the list of words using which the vector space model is to be
        created.

    Attributes:
    -----------
    listing_dict_: dictionary
        Contains the frequency of the words in each document, with file_index
        as key and frequency list as value.
    distance_matrix_ : pandas-dataframe
        Contains the square matrix of documents. Each entry is the square root
        of the cosine similarity between the two documents.
    labels_: list
        Contains the labels for document names
    """

    def __init__(self, file_dict, word_list):
        self.file_dict = file_dict
        self.word_list = word_list

    def tokenize_document(self, document):
        """Returns a list of words contained in the document after converting
        it to lowercase and stripping punctuation marks"""
        terms = document.lower().split()
        return [term.strip(string.punctuation) for term in terms]

    def create_word_listing(self):
        """Function to create the word listing of the documents"""

        # Dictionary to hold the frequency of words in word_list with file_index as key
        self.listing_dict_ = {}

        for file_id in self.file_dict:
            # Read the whole document; the with-block closes the file afterwards
            with open(self.file_dict[file_id], 'r') as f:
                document = f.read()
            terms = self.tokenize_document(document)
            # Count the occurrences of every word of word_list in this document
            self.listing_dict_[file_id] = [terms.count(term.lower()) for term in self.word_list]

        print('Word listing of each document')
        for file_id in self.listing_dict_:
            print('%d: %s' % (file_id, self.listing_dict_[file_id]))

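    def create_document_matrix(self):
        """Function to create the document matrix. Each entry is the square
        root of the cosine similarity between two documents, so 1.0 means the
        word-count vectors point in the same direction and 0.0 means the
        documents share no listed word."""
        self.labels_ = ['doc%d' % (file_id) for file_id in self.file_dict]
        main_list = []
        for id1 in self.file_dict:
            temp_list = []
            for id2 in self.file_dict:
                dist = 0
                l1 = 0
                l2 = 0
                for term1, term2 in zip(self.listing_dict_[id1], self.listing_dict_[id2]):
                    l1 += term1**2
                    l2 += term2**2
                    dist += term1 * term2
                # Cosine similarity = dot product / product of the vector lengths;
                # a document containing none of the listed words gets similarity 0
                norm = math.sqrt(l1) * math.sqrt(l2)
                dist = dist / norm if norm else 0.0
                temp_list.append(round(math.sqrt(dist), 4))
            main_list.append(temp_list)

        self.distance_matrix_ = pd.DataFrame(main_list, index = self.labels_, columns = self.labels_)
        print('\nDistance Matrix')
        print(self.distance_matrix_)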

    def cluster(self):
        """Perform Hierarchical Clustering on the document matrix and plot the
        dendrogram. Call create_word_listing() and create_document_matrix() first."""
        from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
        # Each row of the matrix is treated as an observation vector; rows are
        # compared with the cosine metric and merged by single linkage
        row_cluster = linkage(self.distance_matrix_.values,
                              method = 'single',
                              metric = 'cosine')
        # Flat clusters with threshold 0.8 (fcluster's default criterion is 'inconsistent')
        clusters = fcluster(row_cluster, 0.8)
        print('\nClusters Based on Cosine Similarity')
        for i in sorted(set(clusters)):
            print('Cluster %d:' % i)
            for j in range(len(clusters)):
                if i == clusters[j]:
                    # Use the stored label so that file indices need not be 1..n
                    print(self.labels_[j])
        dendrogram(row_cluster, labels = self.labels_)
        plt.ylabel('Cosine Similarity')
        plt.xticks(rotation = 90)
        plt.savefig('dendrogram1.png', dpi = 300)
        plt.show()


# Dictionary containing the file_index and path
file_dict = {1: 'documents/doc1.txt',
             2: 'documents/doc2.txt',
             3: 'documents/doc3.txt',
             4: 'documents/doc4.txt',
             5: 'documents/doc5.txt',
             6: 'documents/doc6.txt',
             7: 'documents/doc7.txt',
             8: 'documents/doc8.txt',
             9: 'documents/doc9.txt'}
# List containing the words using which the vector space model is to be created
word_list = ['Automotive', 'Car', 'motorcycles', 'self-drive', 'IoT', 'hire', 'Dhoni']

# Creating class instance and calling appropriate functions
document_cluster = document_clustering(file_dict = file_dict, word_list = word_list)
document_cluster.create_word_listing()
document_cluster.create_document_matrix()
document_cluster.cluster()
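
The `cluster` method above delegates the merging to SciPy. As a rough illustration of what single-linkage (nearest-neighbour) clustering does, the sketch below implements it in plain Python with a simple distance cut-off. It is not the commit's code: the function name `single_linkage_clusters` and the 4x4 distance matrix are hypothetical, and the cut-off rule differs from the `fcluster(row_cluster, 0.8)` call in the script, which uses SciPy's default `inconsistent` criterion.

```python
def single_linkage_clusters(dist, threshold):
    """Merge clusters while the closest pair of clusters is within threshold.

    dist is a symmetric square matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest cluster pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = smallest member-to-member distance
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # the closest remaining pair is too far apart to merge
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical cosine distances (1 - similarity) between four documents
dist = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.5, 1.0],
    [0.9, 0.5, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
print(single_linkage_clusters(dist, threshold=0.6))  # [[0, 1, 2], [3]]
```

Document 2 joins the cluster of documents 0 and 1 because its nearest member (document 1) is within the cut-off, even though it is far from document 0. This chaining behaviour is characteristic of single linkage.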
1 change: 1 addition & 0 deletions WEEK 9/documents/doc1.txt
@@ -0,0 +1 @@
Electric automotive maker Tesla Inc. is likely to introduce its products in India sometime in the summer of 2017.
1 change: 1 addition & 0 deletions WEEK 9/documents/doc2.txt
@@ -0,0 +1 @@
Automotive major Mahindra likely to introduce driverless car
1 change: 1 addition & 0 deletions WEEK 9/documents/doc3.txt
@@ -0,0 +1 @@
BMW plans to introduce its own motorcycles in india
1 change: 1 addition & 0 deletions WEEK 9/documents/doc4.txt
@@ -0,0 +1 @@
Just drive, a self-drive car rental firm uses smart vehicle technology based on IoT
1 change: 1 addition & 0 deletions WEEK 9/documents/doc5.txt
@@ -0,0 +1 @@
Automotive industry going to hire thousands in 2018
1 change: 1 addition & 0 deletions WEEK 9/documents/doc6.txt
@@ -0,0 +1 @@
Famous cricket player Dhoni brought his priced car Hummer which is an SUV
1 change: 1 addition & 0 deletions WEEK 9/documents/doc7.txt
@@ -0,0 +1 @@
Dhoni led india to its second world cup victory
1 change: 1 addition & 0 deletions WEEK 9/documents/doc8.txt
@@ -0,0 +1 @@
IoT in car will lead to more safety and make driverless vehicle revolution possible
1 change: 1 addition & 0 deletions WEEK 9/documents/doc9.txt
@@ -0,0 +1 @@
Sachin recommended Dhoni for the indian skipper post
Binary file added WEEK 9/output.png
