Predict the monthly profit of potential coffee-shop locations using SQL (SQLite) for data preparation and Python for regression modeling and visualization.
This project simulates a data-driven site selection workflow, helping retail managers or founders decide where to open their next store based on real-world features such as rent, income, foot traffic, and local events.
- Goal: Estimate monthly profit for candidate locations.
- Tech Stack: SQLite, pandas, scikit-learn, matplotlib.
- Core Methods: Feature engineering in SQL, ElasticNet regression for balanced interpretability and performance.
- Outputs:
- Metrics (
R²,MAE) - Profit predictions
- Feature importance
- Diagnostic plots
- Metrics (
coffee-shop-profit-predictor/
├─ README.md
├─ requirements.txt
├─ data/
│ ├─ locations_train.csv
│ └─ locations_candidates.csv
├─ src/
│ ├─ create_db.py
│ ├─ queries.sql
│ ├─ train_regression.py
│ ├─ score_new_sites.py
│ └─ utils.py
└─ outputs/
├─ metrics.json
├─ feature_importance.csv
├─ predictions_train.csv
├─ scored_candidates.csv
└─ charts/
├─ actual_vs_predicted.png
├─ residuals_hist.png
└─ feature_importance.png
| Column | Description |
|---|---|
lat, lon |
Coordinates of the shop location |
foot_traffic |
Average daily pedestrians |
rent_per_sqm |
Monthly rent per square meter (€) |
competition |
Number of similar cafés within 500 m |
median_income |
Median neighborhood income (€) |
office_density |
Offices within 500 m |
weekend_activity |
Weekend leisure index (0–1) |
events_per_month |
Number of local events monthly |
coffee_price |
Average sale price (€) |
promo_spend |
Monthly local promo budget (€) |
profit |
Monthly profit (€), target variable |
Feature creation is done in queries.sql, producing two views:
features_train and features_candidates.
Key engineered features:
demand_adj= foot traffic / (1 + competition)wknd_traffic= weekend activity × trafficprice_income= coffee price × income factorpromo_comp_adj= promo spend / (1 + competition)
These normalized interaction features help the regression model capture competitive intensity, affordability, and demand dynamics.
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Load CSVs into SQLite
python src/create_db.py --train data/locations_train.csv --candidates data/locations_candidates.csv --db coffee.db
# Train and evaluate the model
python src/train_regression.py --db coffee.db --sql src/queries.sql --outdir outputs
# Score new candidate sites
python src/score_new_sites.py --db coffee.db --sql src/queries.sql --model outputs/model.joblib --outdir outputs
Each point represents a store location.
- The diagonal line (y = x) indicates perfect prediction.
- Points close to the line show accurate profit estimation.
- A strong linear trend confirms the model’s predictive ability.
- Outliers correspond to unusual cases (very high rent or event density).
Insight:
The model captures real-world profitability patterns.
Occasional underestimation of high-profit stores suggests nonlinear effects that could be explored later.
| Feature | Effect | Interpretation |
|---|---|---|
events_per_month |
↑ | More events → higher profits. |
rent_per_sqm |
↓ | Rent strongly decreases profit. |
price_income |
↑ | High-income areas tolerate higher prices. |
demand_adj |
↑ | Traffic normalized by competition. |
promo_comp_adj |
↑ | Promotional spending offsets competition. |
Insight:
Rent and local events dominate the financial outcome.
Marketing and demographics play measurable supporting roles.
- Residuals are roughly bell-shaped and centered near 0, meaning the model has no systematic bias.
- Few outliers → stable performance.
- Balanced tails indicate good generalization.
Insight:
Model errors are random and well-distributed.
Future work could explore nonlinear regressors or tree-based ensembles.
| Metric | Description | Value (example) |
|---|---|---|
R² |
Explained variance (goodness of fit) | ≈ 0.85 |
MAE |
Average absolute prediction error (€) | ≈ 180 € |
ElasticNet combines L1 and L2 regularization, preventing overfitting while keeping coefficients interpretable.
- Event density is the top profit driver, opening near cultural or office events maximizes demand.
- Rent remains the most significant cost factor. Even small rent changes have major profit impacts.
- Wealthier neighborhoods sustain higher pricing strategies.
- Balanced marketing spending helps counter intense competition.
- Database: SQLite (
coffee.db) - Model: ElasticNet (α = 0.05, L₁ = 0.2)
- Preprocessing: Standard scaling via
ColumnTransformer - Evaluation: R², MAE, diagnostic plots
- Reproducibility: Random seed = 42
This project showcases:
- Practical SQL + Python integration
- Realistic data-science workflow: data → SQL features → ML → interpretability
- Business storytelling through visuals
- Model transparency and actionable insights
Ideal for demonstrating retail analytics, forecasting, and data-driven decision-making in a professional portfolio.