Does basic house information reflect house's description?

Joanna Broniarek, Vikranth Ale, Matteo Cavalletti

The aim of this project was to perform a clustering analysis of house announcements in Rome from Immobiliare.it.

Data Source

Creating two datasets (Information Dataset and Description Dataset) according to data scrapped from websites. Retrieving at least 10000 announcements.
Clustering the house announcements using K-mean++ method.In order to choose the best number of clusters per dataset the Elbow method was used.
Comparison among clusters. Do both datasets will lead to similar clusters?

Finding similar clusters according to the Jaccard-Similarity.
Returning the 3-most similar couples of clusters (information clusters vs description clusters)
Creating Wordclouds of house descriptions. The represented words extracted from the description of the houses that are in the relative couple.

Main_HouseClustering - main notebook, where most of the analysis is contained
TFIDF_matrix - notebook for generating TFIDF matrix needed for Description Dataset.

Some functions used in analysis was located in the external files : collect_data.py, functions.py.

Folder DataFiles contains files, which were created and used during the analysis.

Python 3.6.4 (beautifulsoup, sklearn, wordcloud)

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
DataFiles		DataFiles
Main_HouseClustering.ipynb		Main_HouseClustering.ipynb
README.md		README.md
TFIDF_matrix.ipynb		TFIDF_matrix.ipynb
collect_data.py		collect_data.py
functions.py		functions.py