50.038 Computational Data Science — Spring 2026
End-to-end pipeline to relate Spotify track data, lyrics, YouTube matches, librosa audio features, and engagement signals for viral / popularity modeling, ensembling (RF, XGBoost, LightGBM, stacking), and exploratory notebooks.
- Python 3.10+ (3.12 and later, including 3.14, also work; for the notebooks and ML stack, a stable 3.12 venv is often smoother than the newest release)
- Git
- Make (optional but recommended for setup and data downloads)
- A Hugging Face account and token for datasets and pretrained artifacts (see below)
Note: You can always check by running these commands:
- python3 --version
- python3 -m pip --version (or pip3 --version)
- make --version (usually installed on macOS/Linux)
- git --version
- After make setup and make setup-huggingface: .venv/bin/hf version or .venv/bin/hf auth whoami

Or run this from the repository root and it will run the checks for you:
make prereq
From the repository root:
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
Or use the Makefile helper (creates .venv with python3 -m venv if missing, then installs dependencies):
make setup
To remove only the virtual environment (keeps data/), use make rm-venv and then make setup again when you want a clean env.
requirements.txt in this repo is a fully pinned dependency set (suitable for pip install -r in a fresh venv). If you maintain a shorter “loose” list elsewhere, keep it under another filename so installs stay reproducible.
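As a quick sanity check after the steps above, a short script can confirm that the interpreter meets the 3.10 minimum and that it is running inside a virtual environment. This is a minimal sketch and not part of the repository; the names version_ok and in_virtualenv are illustrative.

```python
import sys

MIN_VERSION = (3, 10)  # minimum stated in the prerequisites


def version_ok(version=None, minimum=MIN_VERSION):
    """Return True if a (major, minor, ...) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


def in_virtualenv():
    """Return True when running inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    status = "OK" if version_ok() else "too old"
    print(f"Python {sys.version.split()[0]}: {status}")
    print(f"Virtual environment active: {in_virtualenv()}")
```

Run it with the venv's interpreter (for example .venv/bin/python) so that the check reflects the environment the notebooks will actually use.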
Install the Hub CLI via the same dependency set, then authenticate:
make setup-huggingface
This installs huggingface_hub, runs hf auth login interactively, and stores your token for hf download / uploads. If you already have the Hub installed but need to sign in again (expired token, new machine, or you skipped the Makefile step), run:
.venv/bin/hf auth login
Use hf auth login instead if the hf executable is already on your PATH.
Verify with:
.venv/bin/hf auth whoami
Create a token under Hugging Face → Settings → Access Tokens if you do not have one.
All download scripts pull from the vancenceho Hugging Face org into data/ or notebooks/models/. Run make setup first so .venv contains hf.
Suggested order (matches the pipeline: raw → cleaned → processed → models):
make download-spotify # data/raw/spotify_tracks.csv
make download-lyrics # data/raw/spotify_millsongdata.csv
make download-youtube-audio-features # data/raw/audio_features.csv
make download-spotify-tracks-clean # data/cleaned/spotify_tracks_cleaned.csv
make download-spotify-lyrics-clean # data/cleaned/lyrics_cleaned.csv
make download-youtube-features-clean # data/processed/youtube_features_cleaned.csv
make download-vcp-combined-features # data/processed/combined_features_cleaned.csv
make download-vcp-combined-ensemble-stacking # notebooks/models/*.joblib (stacking + base learners)
To run the same sequence in one go:
make download-all
To force a refresh, re-run a single target after setting HF_SKIP_EXISTING=0. See make help for short descriptions of each target.
Note: Large files stay out of Git (see .gitignore). Local-only CSVs and models are optional if you regenerate everything from notebooks instead.
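To see which download targets still need to run, the output paths listed beside the targets above can be checked from Python. The following is a minimal sketch, not a script shipped in the repository; EXPECTED_OUTPUTS and missing_targets are illustrative names, and the mapping simply restates the paths from the command comments.

```python
from pathlib import Path

# Output paths listed next to each download target above.
# The models target is a glob because it writes several *.joblib files.
EXPECTED_OUTPUTS = {
    "download-spotify": "data/raw/spotify_tracks.csv",
    "download-lyrics": "data/raw/spotify_millsongdata.csv",
    "download-youtube-audio-features": "data/raw/audio_features.csv",
    "download-spotify-tracks-clean": "data/cleaned/spotify_tracks_cleaned.csv",
    "download-spotify-lyrics-clean": "data/cleaned/lyrics_cleaned.csv",
    "download-youtube-features-clean": "data/processed/youtube_features_cleaned.csv",
    "download-vcp-combined-features": "data/processed/combined_features_cleaned.csv",
    "download-vcp-combined-ensemble-stacking": "notebooks/models/*.joblib",
}


def missing_targets(root="."):
    """Return the make targets whose expected output is absent under root."""
    root = Path(root)
    missing = []
    for target, pattern in EXPECTED_OUTPUTS.items():
        # Path.glob also matches plain relative paths without wildcards.
        if not any(root.glob(pattern)):
            missing.append(target)
    return missing


if __name__ == "__main__":
    # Print the make commands that would fill the gaps.
    for target in missing_targets():
        print(f"make {target}")
```

Run from the repository root, this prints one make command per missing artifact and nothing when every expected file is present.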
The project includes a small Streamlit UI under app/ that loads the stacking ensemble (RF, XGBoost, LightGBM, meta-learner, scaler) from Hugging Face Hub (vancenceho/vcp-combined-ensemble-stacking) and serves predictions plus static EDA figures from app/assets/.
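The stacking flow described above can be sketched as follows. This is a minimal illustration and not the code in app/app.py: it assumes the scaler exposes transform and that each base learner and the meta-learner expose predict in the scikit-learn style, with the meta-learner taking one column per base learner. Whether the real ensemble stacks class labels, probabilities, or regression outputs is not stated here, so stacked_predict and the stub classes are hypothetical.

```python
def stacked_predict(scaler, base_models, meta_model, rows):
    """Scale rows, collect base-learner predictions, and pass them to the meta-learner."""
    scaled = scaler.transform(rows)
    # One sequence of predictions per base learner (e.g. RF, XGBoost, LightGBM).
    per_model = [model.predict(scaled) for model in base_models]
    # Transpose so each row holds one sample's prediction from every base learner.
    meta_features = [list(sample) for sample in zip(*per_model)]
    return meta_model.predict(meta_features)


# Tiny stand-ins so the sketch runs without the real joblib artifacts.
class HalveScaler:
    def transform(self, rows):
        return [[value / 2 for value in row] for row in rows]


class SumModel:
    def predict(self, rows):
        return [sum(row) for row in rows]


class MaxModel:
    def predict(self, rows):
        return [max(row) for row in rows]


class MeanMeta:
    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


if __name__ == "__main__":
    features = [[2, 4], [6, 2]]
    print(stacked_predict(HalveScaler(), [SumModel(), MaxModel()], MeanMeta(), features))
```

The same function shape would accept objects loaded with joblib in place of the stubs, provided they follow the assumed transform and predict interface.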
- Complete make setup and make setup-huggingface first (Hub token required for the first model download).
- From the repository root:

make app
This target runs make setup to ensure the venv and dependencies exist, then starts Streamlit (default URL http://localhost:8501).
To run the same entrypoint manually with the venv active:
source .venv/bin/activate
streamlit run app/app.py
High-level flow (ingest → Spotify branch → YouTube/audio → metadata → feature join → training):
- Ingest: Spotify tracks and Million Song–style lyrics land under data/raw/.
- Spotify branch: 01a cleans → 01b engineers parquets under data/processed/.
- YouTube + audio: matching, MP3 download, 02b librosa features → audio_features.csv.
- YouTube metadata: 02c builds youtube_metadata.csv (and related tables).
- Feature engineering: joins yield in_progress_dataset / EDA outputs → combined_features_cleaned.csv.
- Training: 03_combined_model_training fits stacking + RF / XGB / LGBM + scaler (and saves figures under notebooks/images/).

An exploratory track (lyrics TF‑IDF, Model A/B preds, meta-combiner) lives under notebooks/exploratory/ and feeds optional ensemble predictions (preds_model_a, preds_model_b, preds_ensemble).
Run Jupyter from the repo with the venv activated. Paths in notebooks assume execution with working directory notebooks/ unless a cell sets Path relative to the project root—check the first path cells in each notebook.
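Because notebook paths depend on the working directory, one way to make a path cell robust is to locate the repository root by walking upward until known root files are found. This is a minimal sketch under the assumption that Makefile and requirements.txt sit together only at the project root; find_project_root is an illustrative helper, not something the notebooks are known to define.

```python
from pathlib import Path

# Files assumed to mark the repository root (both appear in the project tree).
ROOT_MARKERS = ("Makefile", "requirements.txt")


def find_project_root(start=None, markers=ROOT_MARKERS):
    """Walk upward from start until a directory contains every marker file."""
    start = Path.cwd() if start is None else Path(start)
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).is_file() for name in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {markers} at or above {start}")
```

In a notebook cell, DATA_DIR = find_project_root() / "data" would then resolve to the same directory whether Jupyter was started in notebooks/, notebooks/exploratory/, or the repository root.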
| Order | Notebook | Purpose |
|---|---|---|
| 1 | 01a_clean_spotify.ipynb | Clean raw Spotify CSV → data/cleaned/spotify_tracks_cleaned.csv |
| 2 | 01b_process_spotify.ipynb | Parquets + baselines under data/processed/ |
| 3 | 01c_spotify_eda.ipynb | EDA on Spotify / processed tables |
| 4 | 01d_spotify_model_baseline.ipynb | Model A (track features) |
| 5 | 02a_youtube_parallel_match_download.ipynb | YouTube match + audio download |
| 6 | 02b_youtube_audio_feature_extraction.ipynb | Librosa features → data/raw/audio_features.csv |
| 7 | 02c_youtube_collect_metadata.ipynb | YouTube metadata merge |
| 8 | 02d_youtube_eda.ipynb | EDA on YouTube-side features |
| 9 | 02e_youtube_model_baseline.ipynb | YouTube / engagement baseline |
| 10 | 03_combined_model_training.ipynb | Combined features → stacking ensemble + plots |

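The numeric prefixes mean that a plain filename sort reproduces the order in the table above. The sketch below lists the numbered notebooks in that order; it is an illustration only, and both the NUMBERED pattern and the pipeline_notebooks helper are assumptions based on the filenames shown here.

```python
import re
from pathlib import Path

# Matches numbered pipeline notebooks such as 01a_clean_spotify.ipynb or
# 03_combined_model_training.ipynb; other notebooks are ignored.
NUMBERED = re.compile(r"^\d{2}[a-z]?_.+\.ipynb$")


def pipeline_notebooks(notebooks_dir="notebooks"):
    """Return numbered notebook filenames sorted so they follow the run order."""
    directory = Path(notebooks_dir)
    names = [path.name for path in directory.iterdir() if NUMBERED.match(path.name)]
    return sorted(names)


if __name__ == "__main__":
    for position, name in enumerate(pipeline_notebooks(), start=1):
        print(position, name)
```

The resulting list could be handed to a runner such as jupyter nbconvert --to notebook --execute, bearing in mind that the YouTube download notebook is likely to be slow and network-dependent.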
Exploratory / optional (can run after cleaned data exists):
- mid-term-spotify-eda.ipynb (mid-term EDA snapshot)
- exploratory/explore_lyrics_clean.ipynb, explore_lyrics_preprocess.ipynb
- exploratory/explore_spotify_tracks_model_A.ipynb, explore_spotify_lyrics_model_B.ipynb
- exploratory/explore_youtube_engagement_ensemble.ipynb, explore_combined_enesmble_voting.ipynb, explore_meta_ensemble_combiner.ipynb

- docs/project-report.pdf: full project write-up (methods, experiments, results).
- docs/project-description.pdf, docs/project-guideline.pdf: coursework briefs and expectations.

viral-content-predictor/
├── README.md
├── LICENSE
├── Makefile # setup, venv/rm-venv, app (Streamlit), download-* targets
├── requirements.txt # pinned transitive deps (full lock-style list)
├── app/
│ ├── app.py # Streamlit app (Hub models + JSON feature input)
│ └── assets/ # EDA figures for the UI
├── docs/
│ ├── pipeline-architecture.png
│ ├── project-report.pdf
│ ├── project-description.pdf
│ └── project-guideline.pdf
├── data/
│ ├── raw/ # spotify_tracks, lyrics, audio_features, … (gitignored blobs)
│ ├── cleaned/ # *_cleaned.csv (gitignored; HF or notebooks)
│ └── processed/ # parquets, combined_features_cleaned.csv, preds, …
├── notebooks/
│ ├── 01a_clean_spotify.ipynb … 03_combined_model_training.ipynb
│ ├── exploratory/ # optional ensemble / lyrics / EDA notebooks
│ ├── images/ # training plots (often gitignored *.png)
│ └── models/ # joblib artifacts (gitignored; HF download target)
└── scripts/
    ├── setup-huggingface.sh
    └── download-*.sh # Hugging Face CLI wrappers per dataset/model repo

| Command | Description |
|---|---|
| make help | List all targets |
| make prereq | Print versions for Python, pip, make, git, and hf (if venv exists) |
| make venv | Create .venv with python3 -m venv if it does not exist |
| make rm-venv | Delete .venv only (does not remove data/) |
| make setup | Create venv + pip install -r requirements.txt |
| make setup-huggingface | Install huggingface_hub + hf auth login |
| make app | Run make setup, then the Streamlit app (app/app.py → http://localhost:8501) |
| make download-all | Run all eight download-* targets in pipeline order |
| make download-<name> | See § Download datasets |
| make cleanup | Remove venv and scratch data files under data/ (destructive) |

- Course: 50.038 Computational Data Science, Spring 2026, Information Systems Technology and Design (ISTD), Singapore University of Technology and Design (SUTD).
- Data & tools: Spotify Web API / track features conventions, Kaggle-style public dumps used in early ingestion, YouTube metadata policies, librosa, scikit-learn, XGBoost, LightGBM, Hugging Face Hub for reproducible artifacts.
See LICENSE in the repository root.
