This repo contains the code for both MPA assignments.
First, install the dependencies. Also make sure you have Spark installed.

    pip install -r requirements.txt

Download the McDonalds dataset, then extract the CSV file into data/mcdonalds/raw.csv.
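
Before moving on, it can help to confirm the CSV ended up where the later steps expect it. The snippet below is a minimal sketch of such a check, not part of this repo: the helper name `raw_csv_ready` is hypothetical, and it assumes you run it from the repository root so the relative path resolves.

```python
from pathlib import Path

# Path taken from the instructions above, relative to the repository root.
RAW_CSV = Path("data/mcdonalds/raw.csv")


def raw_csv_ready(path: Path = RAW_CSV) -> bool:
    """Return True if the extracted CSV exists and is not empty.

    Hypothetical helper for illustration; the repo's scripts do not call it.
    """
    return path.is_file() and path.stat().st_size > 0
```

Calling `raw_csv_ready()` with no arguments checks the default location. It only tests for presence and size, not whether the file contents are valid.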
Then run

    spark-submit prepare_data.py

to prepare the data and download the other datasets.
You can run the MST edge sampling with, for example,

    spark-submit mst_edge_sampling.py housing

Finally, you can run the k-means clustering with, for example,

    spark-submit scalable_kmeans++.py housing 100

which runs the k-means algorithm on the housing dataset with k=100.
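
If you want to run all the steps in one go, the commands above can be chained from a small Python driver. This is only a sketch: `build_commands` and `run_all` are hypothetical names that do not exist in this repo, and it assumes `spark-submit` is on your PATH and that you start it from the repository root.

```python
import subprocess
from typing import List


def build_commands(dataset: str, k: int) -> List[List[str]]:
    """Return the spark-submit invocations from this README, in order."""
    return [
        ["spark-submit", "prepare_data.py"],
        ["spark-submit", "mst_edge_sampling.py", dataset],
        ["spark-submit", "scalable_kmeans++.py", dataset, str(k)],
    ]


def run_all(dataset: str, k: int) -> None:
    """Run each step in turn; check=True stops at the first failing step."""
    for cmd in build_commands(dataset, k):
        subprocess.run(cmd, check=True)
```

For example, `run_all("housing", 100)` would issue the same three commands shown in this README. Passing the arguments as a list avoids shell quoting issues with the `++` in the script name.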