This repository contains a comprehensive analysis of customer segmentation using various clustering techniques. The goal of this project is to identify distinct customer segments based on their purchasing behavior and visualize the results.
- Introduction
 - Dataset
 - Data Preprocessing
 - Exploratory Data Analysis
 - Clustering
 - Principal Component Analysis (PCA)
 - Mean Shift Clustering
 - Results
 - Conclusion
 - Dependencies
 - Usage
 
Customer segmentation is a crucial task in marketing and business strategy. By identifying distinct customer segments, businesses can tailor their marketing efforts and improve customer satisfaction. In this project, we use various clustering techniques to segment customers based on their purchasing behavior.
The dataset used in this analysis is the "Online Retail" dataset, which contains transactional data for a UK-based online retail store. The dataset includes information such as invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country.
- Loading the Dataset: The dataset is loaded from an Excel file using pandas.
- Cleaning the Data: Missing values are removed, and the index is reset.
- Encoding Categorical Features: Categorical features are encoded into numerical values using LabelEncoder.
- Normalizing the Data: Features are scaled to a range of 1 to 5 using MinMaxScaler.
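
The preprocessing steps above could look roughly like the following sketch. The `preprocess` helper, the column names, and the tiny in-memory DataFrame are illustrative assumptions, not the notebook's actual code; in the project the data would come from the Excel file via `pd.read_excel`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clean, encode, and scale a raw DataFrame."""
    # Remove rows with missing values and reset the index.
    cleaned = df.dropna().reset_index(drop=True)

    # Encode every non-numeric column as integer labels.
    encoded = cleaned.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        encoded[col] = LabelEncoder().fit_transform(encoded[col].astype(str))

    # Scale all features to the range 1 to 5.
    scaler = MinMaxScaler(feature_range=(1, 5))
    scaled = scaler.fit_transform(encoded)
    return pd.DataFrame(scaled, columns=encoded.columns)


# Small stand-in for the real dataset (assumed columns, made-up values).
raw = pd.DataFrame({
    "Quantity": [6, 2, 12, 3, None],
    "UnitPrice": [2.55, 3.39, 1.25, 7.95, 4.25],
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany", "France"],
})
normalized = preprocess(raw)
print(normalized)
```

Note that `LabelEncoder` assigns labels in sorted order, so the resulting integers impose an arbitrary ordering on categories such as country.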
- Histograms: Histograms are created for all features to visualize their distributions.
 - Pair Plots: Pair plots are generated to visualize relationships between pairs of features.
 - Correlation Heatmap: A heatmap is created to visualize the correlation between features.
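
As a rough illustration of these three plots, the sketch below uses pandas and matplotlib on synthetic data. The plotting library, the column names, and the data are assumptions; the notebook may use a different library for the pair plots and heatmap.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the normalized features (assumed column names).
rng = np.random.default_rng(0)
quantity = rng.uniform(1, 5, 200)
df = pd.DataFrame({
    "Quantity": quantity,
    "UnitPrice": rng.uniform(1, 5, 200),
    "Total": np.clip(quantity + rng.normal(0, 0.3, 200), 1, 5),
})

# Histograms: one panel per feature.
hist_axes = df.hist(bins=20, figsize=(8, 6))

# Pair plots: scatter plot for every pair of features.
pair_axes = pd.plotting.scatter_matrix(df, figsize=(8, 8))

# Correlation heatmap: Pearson correlation between features.
corr = df.corr()
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
fig.tight_layout()
```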
 
- K-Means Clustering: K-Means clustering is performed for different values of K (2 to 11). The Elbow method is used to determine the optimal number of clusters.
 - Evaluation Metrics: Clustering performance is evaluated using Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Score.
 - Visualization: Scatter plots are created to visualize the clusters and their centroids.
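
A minimal sketch of the K-Means sweep and the three metrics, using scikit-learn on synthetic blobs rather than the project's data. The blob centers, `random_state` values, and the choice to pick K by the highest Silhouette Score are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three well-separated groups (stand-in for real features).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [0, 6]],
    cluster_std=0.6,
    random_state=42,
)

results = []
for k in range(2, 12):  # K = 2 .. 11
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    results.append({
        "k": k,
        "inertia": model.inertia_,  # within-cluster sum of squares (Elbow method)
        "silhouette": silhouette_score(X, labels),  # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    })

for row in results:
    print(row)

# One simple selection rule: the K with the highest Silhouette Score.
best = max(results, key=lambda row: row["silhouette"])
print("best K by silhouette:", best["k"])
```

For the Elbow method, the `inertia` values would be plotted against K and the point where the curve flattens taken as a candidate; the metrics give an independent check on that choice.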
 
- Dimensionality Reduction: PCA is performed to reduce the dimensionality of the data while retaining most of the variance.
 - Explained Variance: The explained variance for different numbers of principal components is evaluated.
 - Visualization: 2D and 3D scatter plots are created to visualize the PCA results.
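
The PCA steps can be sketched with scikit-learn as follows. The synthetic data and the 95% variance threshold are assumptions; the source does not state which threshold, if any, the notebook uses.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

# Fit PCA with all components to inspect the explained variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("cumulative explained variance:", cumulative.round(4))

# Smallest number of components reaching the assumed 95% threshold.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print("components for 95% variance:", n_components)

# Project to 2D and 3D for scatter plots.
X_2d = PCA(n_components=2).fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X)
```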
 
- Sampling the Data: A random sample of 5000 rows is taken from the normalized DataFrame.
 - Estimating Bandwidth: The bandwidth parameter for the Mean Shift algorithm is estimated.
 - Clustering: Mean Shift clustering is performed on the sampled data.
 - Visualization: Scatter plots are created to visualize the clusters and their centroids.
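
A sketch of the sampling, bandwidth estimation, and Mean Shift steps using scikit-learn. The synthetic `normalized` DataFrame, the `quantile` and `n_samples` values, and `bin_seeding=True` are assumptions chosen to keep the example fast, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Stand-in for the normalized DataFrame (values roughly within 1 to 5).
X, _ = make_blobs(
    n_samples=6000,
    centers=[[1.5, 1.5], [4.0, 4.0], [1.5, 4.5]],
    cluster_std=0.2,
    random_state=0,
)
normalized = pd.DataFrame(X, columns=["feature_1", "feature_2"])

# Random sample of 5000 rows.
sample = normalized.sample(n=5000, random_state=42)

# Estimate the kernel bandwidth from pairwise distances in the sample.
bandwidth = estimate_bandwidth(sample, quantile=0.2, n_samples=500, random_state=42)

# Mean Shift does not need the number of clusters in advance.
mean_shift = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = mean_shift.fit_predict(sample)
centers = mean_shift.cluster_centers_

print("bandwidth:", round(float(bandwidth), 3))
print("clusters found:", len(centers))
```

The bandwidth strongly affects how many clusters Mean Shift finds, so it is worth trying a few `quantile` values and comparing the results.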
 
The analysis identified distinct customer segments based on their purchasing behavior. The optimal number of clusters was determined using the Elbow method together with the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores. The clusters were visualized using scatter plots and PCA projections, which show how data points are distributed across the different clusters.
This project demonstrates how clustering techniques can be applied to customer segmentation. The resulting segments give businesses a basis for tailoring their marketing efforts and improving customer satisfaction. PCA and the evaluation metrics serve as checks on whether the clustering results are reliable and meaningful.
- Clone the repository:
git clone https://github.com/your-username/your-repository.git
- Navigate to the repository directory:
cd your-repository
- Install the required dependencies:
pip install -r requirements.txt
- Run the Jupyter notebook to perform the analysis and visualize the results.

Feel free to explore the code and modify it to suit your needs. If you have any questions or suggestions, please open an issue or submit a pull request.
 
Happy analyzing!