This project implements an end-to-end MLOps pipeline on AWS for the California Housing dataset, focusing on linear regression prediction. The architecture leverages AWS Glue for ETL preprocessing, Lambda and EventBridge for orchestration, SageMaker for model training and deployment, and a Streamlit frontend for user interaction.
## Infrastructure Deployment

All AWS resources are provisioned using Terraform modules in the `iac/` directory.

1. Configure your AWS credentials:

   ```sh
   export AWS_ACCESS_KEY_ID="<your-access-key-id>"
   export AWS_SECRET_ACCESS_KEY="<your-secret>"
   export AWS_DEFAULT_REGION="<region-of-your-deployment>"
   export SAGEMAKER_ENDPOINT_NAME="<tfvars-endpoint-name>"
   ```

2. Run `terraform init` in the `iac/` folder.
3. Edit `iac/tfvars/prod.tfvars` with your configuration.
4. Run `terraform apply --var-file=./tfvars/prod.tfvars` in the `iac/` folder to build the infrastructure in your AWS account.
## Data Upload

- By default, `/dataset/housing.csv` is loaded into the S3 bucket. This upload triggers the pipeline.
- Optional: after the first pipeline run, you can upload new raw California Housing data to the designated S3 data bucket. This automatically triggers the ETL pipeline and trains a new model.
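The optional re-upload can also be scripted. The sketch below assumes a `RAW_DATA_BUCKET` environment variable and a `dataset/` key layout mirroring the repository; the real bucket name comes from the Terraform modules in `iac/`.

```python
import os

# Hypothetical bucket variable -- substitute the data bucket created by Terraform.
RAW_DATA_BUCKET = os.environ.get("RAW_DATA_BUCKET", "<your-raw-data-bucket>")

def object_key(filename: str) -> str:
    """Build the object key; the dataset/ prefix mirrors the repo layout (assumption)."""
    return f"dataset/{filename}"

if __name__ == "__main__" and not RAW_DATA_BUCKET.startswith("<"):
    import boto3  # deferred so object_key() is usable without the AWS SDK

    # Uploading the file fires the EventBridge rule that starts the ETL pipeline.
    s3 = boto3.client("s3")
    s3.upload_file("dataset/housing.csv", RAW_DATA_BUCKET, object_key("housing.csv"))
```

Nothing runs unless `RAW_DATA_BUCKET` is set, so the script is safe to import.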
## Model Training and Deployment

- Once preprocessing is complete, the pipeline triggers SageMaker for training and deployment.
- The trained model is registered and deployed as an endpoint.
## API

- Use the provided API to make predictions.
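A minimal prediction call against the SageMaker endpoint might look like the sketch below. The endpoint name is read from the `SAGEMAKER_ENDPOINT_NAME` variable set earlier; the feature order is an assumption — the order the endpoint actually expects is fixed by `training_preprocessing.py`.

```python
import os

# Assumed feature order for the California Housing dataset; verify against
# the column layout produced by training_preprocessing.py.
FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude", "Longitude"]

def build_csv_payload(record: dict) -> str:
    """Serialize one feature record into the text/csv row format XGBoost endpoints accept."""
    return ",".join(str(record[name]) for name in FEATURES)

if __name__ == "__main__" and os.environ.get("SAGEMAKER_ENDPOINT_NAME"):
    import boto3  # deferred so build_csv_payload() is usable without the AWS SDK

    payload = build_csv_payload({
        "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.98, "AveBedrms": 1.02,
        "Population": 322.0, "AveOccup": 2.55, "Latitude": 37.88, "Longitude": -122.23,
    })
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT_NAME"],
        ContentType="text/csv",
        Body=payload,
    )
    print(response["Body"].read().decode())  # predicted median house value
```

The invocation only runs when `SAGEMAKER_ENDPOINT_NAME` is set, so the payload helper can be reused (e.g. by the Streamlit frontend) without credentials.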
## Frontend

You can launch the frontend in different ways. Before launching it, ensure that all required environment variables are properly set. You can configure these variables in one of the following ways:

- By passing them directly to the Docker container using the `-e` flag.
- By specifying them in the Helm chart values or Kubernetes secrets/manifests when deploying on Kubernetes.
### Docker

1. Build the Docker image:

   ```sh
   docker build -t mlops-frontend ./frontend/src/
   ```

   Or pull the prebuilt image from GitHub Container Registry:

   ```sh
   docker pull ghcr.io/umbertocicciaa/mlops-frontend:latest
   ```

2. Run the container:

   ```sh
   docker run -p 8501:8501 mlops-frontend
   ```
### Kubernetes

- Deploy using the Helm chart:

  ```sh
  chmod u+x fe-helm/install.sh
  ./install.sh install
  ```

- Or apply the Kubernetes manifests directly:

  ```sh
  chmod u+x k8s/start.sh
  ./start.sh
  ```
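For the Kubernetes route, the environment variables can be supplied through a Secret like the fragment below. The secret name and the exact variable names the frontend reads are assumptions — they mirror the credentials listed in the infrastructure section and must match what `app.py` and the Helm chart values expect.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlops-frontend-env   # hypothetical name; reference it from the chart values
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret>"
  AWS_DEFAULT_REGION: "<region-of-your-deployment>"
  SAGEMAKER_ENDPOINT_NAME: "<tfvars-endpoint-name>"
```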
## Project structure

- `data-preprocessing/`: AWS Glue ETL scripts for data preprocessing.
  - `pre_processing.py`: Main ETL preprocessing script for the California Housing dataset.
- `frontend/`: Streamlit-based frontend application for user interaction.
  - `app.py`: Main Streamlit app file.
  - `requirements.txt`: Python dependencies for the frontend.
- `fe-helm/`: Helm charts for deploying the frontend and related services on Kubernetes.
- `iac/`: Infrastructure as Code (IaC) using Terraform to provision AWS resources.
- `pipeline/`: Source code for the SageMaker MLOps pipeline.
  - `training_preprocessing.py`: Data preprocessing script used during model training.
- `resources/`: Images and other resources for documentation.
- `docs/`: Project documentation and reports.
  - `umbertodomenico_ciccia_summary.pdf`: Project report (in Italian).
  - `ciccia-assignement.pdf`: Project assignment (in Italian).
- `clean-aws.sh`: Script for deleting non-architectural elements from AWS.
## ETL Pipeline with AWS Glue

- Raw data is uploaded to an S3 bucket.
- When new data lands in the bucket, an EventBridge rule triggers a Lambda function, which starts the AWS Glue Crawler to update the data catalog.
- After the crawler completes successfully, an ETL job preprocesses the California Housing dataset (see `pre_processing.py`).
- Cleaned data is written to a final preprocessed S3 bucket.
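The Lambda step above can be sketched roughly as follows. The crawler name is a hypothetical placeholder (the real one is defined by the Terraform modules in `iac/`), and the event shape assumes the standard EventBridge "Object Created" notification that S3 emits.

```python
import os

# Hypothetical crawler name; substitute the one provisioned by Terraform.
CRAWLER_NAME = os.environ.get("GLUE_CRAWLER_NAME", "housing-raw-crawler")

def uploaded_object(event: dict) -> tuple:
    """Extract (bucket, key) from an EventBridge 'Object Created' S3 event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event: dict, context=None) -> dict:
    """Lambda entry point: kick off the Glue Crawler for the newly uploaded object."""
    bucket, key = uploaded_object(event)
    import boto3  # deferred so uploaded_object() is testable without the AWS SDK
    boto3.client("glue").start_crawler(Name=CRAWLER_NAME)
    return {"started_crawler": CRAWLER_NAME, "source": f"s3://{bucket}/{key}"}
```

`start_crawler` is asynchronous, which matches the flow above: the crawler's own completion event is what moves the pipeline to the ETL job.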
## Triggering SageMaker MLOps Pipeline

- Upload of the preprocessed file to the S3 bucket triggers another EventBridge rule.
- This rule starts the SageMaker MLOps pipeline, which:
  - Runs further data processing (`training_preprocessing.py`)
  - Trains a linear regression model using XGBoost
  - Registers and deploys the model as an endpoint
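The training step could be realized with the SageMaker built-in XGBoost container roughly as below. Using `booster="gblinear"` is an assumption about how "linear regression with XGBoost" is implemented here — the actual hyperparameters, role, and S3 paths live in `pipeline/` and the Terraform configuration.

```python
def linear_xgboost_hyperparameters() -> dict:
    """Hyperparameters that make XGBoost fit a (regularized) linear model.

    booster="gblinear" swaps the default tree booster for a linear one;
    reg:squarederror is plain least-squares regression.
    """
    return {
        "booster": "gblinear",
        "objective": "reg:squarederror",
        "num_round": "100",
    }

def launch_training_job(role_arn: str, train_s3_uri: str):
    """Sketch of the training launch; not invoked in this snippet."""
    import sagemaker  # deferred: requires an AWS session to use
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    image = sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.7-1"
    )
    estimator = Estimator(
        image_uri=image,
        role=role_arn,                  # e.g. the SageMaker execution role from iac/
        instance_count=1,
        instance_type="ml.m5.large",    # illustrative instance choice
        hyperparameters=linear_xgboost_hyperparameters(),
        sagemaker_session=session,
    )
    estimator.fit({"train": sagemaker.inputs.TrainingInput(
        train_s3_uri, content_type="text/csv")})
    return estimator
```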
## Model Serving API

- After training, the model is deployed as a SageMaker endpoint.
- An API is exposed for making predictions using this endpoint.
## Frontend

- The frontend is built with Streamlit, allowing users to interact visually with the model and make predictions.
