Housing data analyses🏠🪴 #21

Open
MaherAssaf19 opened this issue Jan 1, 2025 · 2 comments

MaherAssaf19 commented Jan 1, 2025

This script aims to predict housing prices based on features like the size of the house, number of bedrooms, bathrooms, year built, and location score. Here's a simplified breakdown of the process:

  1. Loading the Data: It loads a small dataset containing information about different houses.
  2. Cleaning the Data: Rows with missing values are dropped so the data is ready for analysis.
  3. Exploratory Data Analysis (EDA): The script provides a quick look at the data with summary statistics and visual plots to understand how the features relate to the price.
  4. Training the Model: A linear regression model is trained to learn the relationship between the features and the house price.
  5. Evaluating the Model: The model's accuracy is checked using metrics like Mean Squared Error and R-squared.
  6. Visualizing Results: The script plots the actual prices against the predicted ones and charts the model's coefficient for each feature as a rough indication of how each one affects the price.

In short, this process builds a predictive model that estimates house prices and helps identify what factors most influence those prices.
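As a reference for step 5, here is a minimal sketch of how the two metrics are defined, written in plain Python rather than with the scikit-learn functions the script uses. The helper names and the sample prices are made up for illustration; they are not output from the script.

```python
def mse_by_hand(actual, predicted):
    """Mean Squared Error: the average of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


def r2_by_hand(actual, predicted):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical prices, chosen only to exercise the formulas.
actual = [300000, 450000, 350000]
predicted = [310000, 440000, 360000]

mse = mse_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2:.4f}")
```

MSE is in squared price units, so it grows quickly with large errors, while R-squared compares the model against simply predicting the mean price.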

Here is the code: 👇

# housing_prices_analysis.py

def load_data():
    """
    Load a predefined housing dataset.

    Returns:
        pd.DataFrame: Loaded dataset as a Pandas DataFrame.
    """
    import pandas as pd
    from io import StringIO

    # Embedded dataset (rows stay flush left so no whitespace leaks into the CSV)
    data = """SquareFeet,Bedrooms,Bathrooms,YearBuilt,LocationScore,Price
1500,3,2,2000,85,300000
2000,4,3,2010,90,450000
1800,3,2,2005,88,350000
2400,4,3,2020,92,500000
1600,3,2,1995,80,280000
1200,2,1,1980,70,200000
"""
    return pd.read_csv(StringIO(data))

def preprocess_data(data):
    """
    Preprocess the housing dataset by dropping rows with missing values.

    Parameters:
        data (pd.DataFrame): Raw dataset.

    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    data = data.dropna()
    return data

def analyze_data(data):
    """
    Perform exploratory data analysis on the dataset.

    Parameters:
        data (pd.DataFrame): Dataset to analyze.

    Returns:
        None: Prints summary statistics and shows plots.
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    print("Dataset Summary:")
    print(data.describe())

    sns.pairplot(data[['SquareFeet', 'Bedrooms', 'Bathrooms', 'YearBuilt', 'LocationScore', 'Price']])
    plt.show()

def train_model(data):
    """
    Train a predictive model using the dataset.

    Parameters:
        data (pd.DataFrame): Preprocessed dataset.

    Returns:
        model: Trained model.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test target values.
    """
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    features = ['SquareFeet', 'Bedrooms', 'Bathrooms', 'YearBuilt', 'LocationScore']
    target = 'Price'

    X = data[features]
    y = data[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

    return model, X_test, y_test

def evaluate_model(model, X_test, y_test):
    """
    Evaluate the trained model using Mean Squared Error and R-squared metrics.

    Parameters:
        model: Trained model.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test target values.

    Returns:
        None: Prints evaluation metrics.
    """
    from sklearn.metrics import mean_squared_error, r2_score

    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")

def visualize_results(model, X_test, y_test):
    """
    Visualize the actual vs predicted prices and the feature coefficients.

    Parameters:
        model: Trained model.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test target values.

    Returns:
        None: Displays plots.
    """
    import matplotlib.pyplot as plt
    import pandas as pd

    # Actual vs Predicted Prices
    y_pred = model.predict(X_test)
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.6, color='blue')
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
    plt.xlabel('Actual Prices')
    plt.ylabel('Predicted Prices')
    plt.title('Actual vs Predicted Prices')
    plt.show()

    # Feature Coefficients
    coefficients = pd.Series(model.coef_, index=X_test.columns)
    plt.figure(figsize=(8, 4))
    coefficients.plot(kind='bar', color='skyblue')
    plt.title('Feature Coefficients')
    plt.ylabel('Coefficient Value')
    plt.show()

# Example Usage

if __name__ == "__main__":
    data = load_data()
    data = preprocess_data(data)
    analyze_data(data)
    model, X_test, y_test = train_model(data)
    evaluate_model(model, X_test, y_test)
    visualize_results(model, X_test, y_test)
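
One thing worth checking before reading much into the metrics: the embedded dataset has only six rows, so the `test_size=0.2` split leaves very little to train on. The sketch below works out the sizes in plain Python. It assumes scikit-learn rounds a fractional test size up and gives the remaining rows to training, which is my understanding of `train_test_split` but is not stated anywhere in the script.

```python
import math

n_rows = 6        # rows in the embedded dataset
test_size = 0.2   # value passed to train_test_split in train_model
n_features = 5    # SquareFeet, Bedrooms, Bathrooms, YearBuilt, LocationScore

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# A linear regression fits one coefficient per feature plus an intercept.
n_params = n_features + 1

print(f"train rows: {n_train}, test rows: {n_test}, parameters: {n_params}")
if n_train < n_params:
    print("Fewer training rows than parameters: the fit is not uniquely determined.")
```

Under that assumption the model is fitted on four rows and scored on two, so the printed MSE and R-squared are best treated as a smoke test of the pipeline rather than a measure of accuracy.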

@MaRia19280

This script is an excellent end-to-end solution for predicting housing prices, covering data cleaning, EDA, model training, evaluation, and visualization. Its structured workflow ensures reliability, while linear regression provides interpretability. Adding cross-validation, handling outliers, or testing advanced models could further enhance performance. Overall, it's a solid foundation for real estate analytics. Well put, Maher 😊
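
A minimal sketch of the cross-validation idea, assuming the same six rows as the script's embedded dataset. Leave-one-out is used here because the dataset is so small, and only two features are kept so that each fold has more rows than parameters; both choices are assumptions for illustration, not part of the original script. Mean absolute error is used as the score because R-squared is not defined on a single held-out row.

```python
from io import StringIO

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Same rows as the embedded dataset in load_data().
csv_text = """SquareFeet,Bedrooms,Bathrooms,YearBuilt,LocationScore,Price
1500,3,2,2000,85,300000
2000,4,3,2010,90,450000
1800,3,2,2005,88,350000
2400,4,3,2020,92,500000
1600,3,2,1995,80,280000
1200,2,1,1980,70,200000
"""
data = pd.read_csv(StringIO(csv_text))

# Assumption: a reduced feature set, chosen only to keep the fit determined.
X = data[["SquareFeet", "LocationScore"]]
y = data["Price"]

# Each fold trains on five rows and predicts the one row left out.
scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    cv=LeaveOneOut(),
    scoring="neg_mean_absolute_error",
)
mae_per_fold = -scores  # scikit-learn negates error metrics so that higher is better

print(f"Folds: {len(mae_per_fold)}")
print(f"Mean absolute error across folds: {mae_per_fold.mean():.0f}")
```

Every row gets used for testing exactly once, which makes the estimate less dependent on which two rows a single random split happens to hold out.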

@MuhannadGTR

Hi Maher,

Please follow the instructions for adding your file.
