This project implements an end-to-end MLOps pipeline on AWS for the California Housing dataset, focusing on linear regression prediction. The architecture leverages AWS Glue for ETL preprocessing, Lambda and EventBridge for orchestration, SageMaker for model training and deployment, and a Streamlit frontend for user interaction.
## Infrastructure Deployment

All AWS resources are provisioned using Terraform modules in the `iac/` directory.

1. Configure your AWS credentials:

   ```sh
   export AWS_ACCESS_KEY_ID="<your-access-key-id>"
   export AWS_SECRET_ACCESS_KEY="<your-secret>"
   export AWS_DEFAULT_REGION="<region-of-your-deployment>"
   export SAGEMAKER_ENDPOINT_NAME="<tfvars-endpoint-name>"
   ```

2. Run `terraform init` in the `iac/` folder.
3. Edit `iac/tfvars/prod.tfvars` with your configuration.
4. Run `terraform apply --var-file=./tfvars/prod.tfvars` in the `iac/` folder to build the infrastructure in your AWS account.
## Data Upload

- By default, `/dataset/housing.csv` is loaded into the S3 bucket. This upload triggers the pipeline.
- Optional: after the first pipeline run, you can upload new raw California Housing data to the designated S3 data bucket. This automatically triggers the ETL pipeline and trains a new model.
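The optional re-upload can also be scripted. The sketch below assumes a `RAW_DATA_BUCKET` environment variable and a `dataset/` key layout mirroring the repository; the real bucket name comes from the Terraform modules in `iac/`.

```python
import os

# Hypothetical bucket variable -- substitute the data bucket created by Terraform.
RAW_DATA_BUCKET = os.environ.get("RAW_DATA_BUCKET", "<your-raw-data-bucket>")

def object_key(filename: str) -> str:
    """Build the object key; the dataset/ prefix mirrors the repo layout (assumption)."""
    return f"dataset/{filename}"

if __name__ == "__main__" and not RAW_DATA_BUCKET.startswith("<"):
    import boto3  # deferred so object_key() is usable without the AWS SDK

    # Uploading the file fires the EventBridge rule that starts the ETL pipeline.
    s3 = boto3.client("s3")
    s3.upload_file("dataset/housing.csv", RAW_DATA_BUCKET, object_key("housing.csv"))
```

Nothing runs unless `RAW_DATA_BUCKET` is set, so the script is safe to import.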
## Model Training and Deployment

- Once preprocessing is complete, the pipeline triggers SageMaker for training and deployment.
- The trained model is registered and deployed as an endpoint.
## API

- Use the provided API to make predictions.
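A minimal prediction call against the SageMaker endpoint might look like the sketch below. The endpoint name is read from the `SAGEMAKER_ENDPOINT_NAME` variable set earlier; the feature order is an assumption — the order the endpoint actually expects is fixed by `training_preprocessing.py`.

```python
import os

# Assumed feature order for the California Housing dataset; verify against
# the column layout produced by training_preprocessing.py.
FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude", "Longitude"]

def build_csv_payload(record: dict) -> str:
    """Serialize one feature record into the text/csv row format XGBoost endpoints accept."""
    return ",".join(str(record[name]) for name in FEATURES)

if __name__ == "__main__" and os.environ.get("SAGEMAKER_ENDPOINT_NAME"):
    import boto3  # deferred so build_csv_payload() is usable without the AWS SDK

    payload = build_csv_payload({
        "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.98, "AveBedrms": 1.02,
        "Population": 322.0, "AveOccup": 2.55, "Latitude": 37.88, "Longitude": -122.23,
    })
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT_NAME"],
        ContentType="text/csv",
        Body=payload,
    )
    print(response["Body"].read().decode())  # predicted median house value
```

The invocation only runs when `SAGEMAKER_ENDPOINT_NAME` is set, so the payload helper can be reused (e.g. by the Streamlit frontend) without credentials.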
## Frontend

You can launch the frontend in different ways. Before launching it, ensure that all required environment variables are properly set. You can configure these variables in one of the following ways:

- By passing them directly to the Docker container using the `-e` flag.
- By specifying them in the Helm chart values or Kubernetes secrets/manifests when deploying on Kubernetes.
### Docker

1. Build the Docker image:

   ```sh
   docker build -t mlops-frontend ./frontend/src/
   ```

   Or pull the prebuilt image from GitHub Container Registry:

   ```sh
   docker pull ghcr.io/umbertocicciaa/mlops-frontend:latest
   ```

2. Run the container:

   ```sh
   docker run -p 8501:8501 mlops-frontend
   ```
### Kubernetes

- Deploy using the Helm chart:

  ```sh
  chmod u+x fe-helm/install.sh
  ./install.sh install
  ```

- Or apply the Kubernetes manifests directly:

  ```sh
  chmod u+x k8s/start.sh
  ./start.sh
  ```
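For the Kubernetes route, the environment variables can be supplied through a Secret like the fragment below. The secret name and the exact variable names the frontend reads are assumptions — they mirror the credentials listed in the infrastructure section and must match what `app.py` and the Helm chart values expect.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlops-frontend-env   # hypothetical name; reference it from the chart values
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret>"
  AWS_DEFAULT_REGION: "<region-of-your-deployment>"
  SAGEMAKER_ENDPOINT_NAME: "<tfvars-endpoint-name>"
```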
## Project structure

- `data-preprocessing/`: AWS Glue ETL scripts for data preprocessing.
  - `pre_processing.py`: Main ETL preprocessing script for the California Housing dataset.
- `frontend/`: Streamlit-based frontend application for user interaction.
  - `app.py`: Main Streamlit app file.
  - `requirements.txt`: Python dependencies for the frontend.
- `fe-helm/`: Helm charts for deploying the frontend and related services on Kubernetes.
- `iac/`: Infrastructure as Code (IaC) using Terraform to provision AWS resources.
- `pipeline/`: Source code for the SageMaker MLOps pipeline.
  - `training_preprocessing.py`: Data preprocessing script used during model training.
- `resources/`: Images and other resources for documentation.
- `docs/`: Project documentation and reports.
  - `umbertodomenico_ciccia_summary.pdf`: Project report (in Italian).
  - `ciccia-assignement.pdf`: Project assignment (in Italian).
- `clean-aws.sh`: Script for deleting non-architectural elements from AWS.
## ETL Pipeline with AWS Glue

- Raw data is uploaded to an S3 bucket.
- When new data lands in the bucket, an EventBridge rule triggers a Lambda function, which starts the AWS Glue Crawler to update the data catalog.
- After the crawler completes successfully, an ETL job preprocesses the California Housing dataset (see `pre_processing.py`).
- Cleaned data is written to a final preprocessed S3 bucket.
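The Lambda step above can be sketched roughly as follows. The crawler name is a hypothetical placeholder (the real one is defined by the Terraform modules in `iac/`), and the event shape assumes the standard EventBridge "Object Created" notification that S3 emits.

```python
import os

# Hypothetical crawler name; substitute the one provisioned by Terraform.
CRAWLER_NAME = os.environ.get("GLUE_CRAWLER_NAME", "housing-raw-crawler")

def uploaded_object(event: dict) -> tuple:
    """Extract (bucket, key) from an EventBridge 'Object Created' S3 event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event: dict, context=None) -> dict:
    """Lambda entry point: kick off the Glue Crawler for the newly uploaded object."""
    bucket, key = uploaded_object(event)
    import boto3  # deferred so uploaded_object() is testable without the AWS SDK
    boto3.client("glue").start_crawler(Name=CRAWLER_NAME)
    return {"started_crawler": CRAWLER_NAME, "source": f"s3://{bucket}/{key}"}
```

`start_crawler` is asynchronous, which matches the flow above: the crawler's own completion event is what moves the pipeline to the ETL job.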
## Triggering SageMaker MLOps Pipeline

- Upload of the preprocessed file to the S3 bucket triggers another EventBridge rule.
- This rule starts the SageMaker MLOps pipeline, which:
  - Runs further data processing (`training_preprocessing.py`)
  - Trains a linear regression model using XGBoost
  - Registers and deploys the model as an endpoint
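The training step could be realized with the SageMaker built-in XGBoost container roughly as below. Using `booster="gblinear"` is an assumption about how "linear regression with XGBoost" is implemented here — the actual hyperparameters, role, and S3 paths live in `pipeline/` and the Terraform configuration.

```python
def linear_xgboost_hyperparameters() -> dict:
    """Hyperparameters that make XGBoost fit a (regularized) linear model.

    booster="gblinear" swaps the default tree booster for a linear one;
    reg:squarederror is plain least-squares regression.
    """
    return {
        "booster": "gblinear",
        "objective": "reg:squarederror",
        "num_round": "100",
    }

def launch_training_job(role_arn: str, train_s3_uri: str):
    """Sketch of the training launch; not invoked in this snippet."""
    import sagemaker  # deferred: requires an AWS session to use
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    image = sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.7-1"
    )
    estimator = Estimator(
        image_uri=image,
        role=role_arn,                  # e.g. the SageMaker execution role from iac/
        instance_count=1,
        instance_type="ml.m5.large",    # illustrative instance choice
        hyperparameters=linear_xgboost_hyperparameters(),
        sagemaker_session=session,
    )
    estimator.fit({"train": sagemaker.inputs.TrainingInput(
        train_s3_uri, content_type="text/csv")})
    return estimator
```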
## Model Serving API

- After training, the model is deployed as a SageMaker endpoint.
- An API is exposed for making predictions using this endpoint.
## Frontend

- The frontend is built with Streamlit, allowing users to interact visually with the model and make predictions.
