
πŸš€ TensorFleet - Distributed ML Training Platform

TensorFleet is a production-ready, cloud-native platform for distributed machine learning training. It schedules ML workloads across multiple compute nodes using a microservices architecture, gRPC for inter-service communication, and Kubernetes for orchestration.

πŸ‘₯ Team

Project Team: TensorFleet Development Team

| Team Member | Student ID | Role | Focus Areas |
|---|---|---|---|
| Aditya Suryawanshi | 25211365 | Backend Infrastructure Lead | API Gateway, Orchestrator, Worker Nodes, gRPC |
| Rahul Mirashi | 25211365 | ML & Data Services Lead | ML Worker, Model Service, Storage, MongoDB |
| Soham Maji | 25204731 | Frontend & Monitoring Lead | React Dashboard, Monitoring, DevOps, Documentation |

πŸ“– Detailed Work Distribution: See docs/TEAM_WORK_DIVISION.md for comprehensive breakdown of responsibilities, tasks, and contributions.


πŸ—οΈ System Architecture

TensorFleet implements a distributed microservices architecture optimized for machine learning workloads with horizontal scalability and fault tolerance.

graph TB
    subgraph "Client Layer"
        UI[React Frontend<br/>Material-UI Dashboard]
        CLI[CLI Tools<br/>Python/Go Clients]
    end
    
    subgraph "API Layer"
        GW[API Gateway<br/>Go + Gin<br/>Port 8080]
    end
    
    subgraph "Orchestration Layer"
        ORCH[Orchestrator Service<br/>Go + gRPC<br/>Port 9090]
        SCHED[Job Scheduler<br/>Task Distribution]
    end
    
    subgraph "Compute Layer"
        W1[Worker Node 1<br/>Go + gRPC<br/>Port 9091]
        W2[Worker Node 2<br/>Go + gRPC<br/>Port 9092] 
        W3[Worker Node N<br/>Go + gRPC<br/>Port 909N]
        MLW[ML Worker<br/>Python + Flask<br/>Port 8085]
    end
    
    subgraph "Storage & Registry Layer"
        MODELS[Model Service<br/>Python + Flask<br/>Port 8084]
        STORAGE[Storage Service<br/>Python + Flask<br/>Port 8082]
        MINIO[MinIO S3<br/>Object Storage<br/>Port 9000]
        MONGO[MongoDB<br/>GridFS + Collections<br/>Port 27017]
    end
    
    subgraph "Observability Layer"
        MON[Monitoring Service<br/>Python + Flask<br/>Port 8081]
        PROM[Prometheus<br/>Metrics Collection<br/>Port 9090]
        GRAF[Grafana<br/>Visualization<br/>Port 3000]
    end
    
    subgraph "Infrastructure Layer"
        REDIS[Redis<br/>Caching & Queues<br/>Port 6379]
        NGINX[Nginx<br/>Load Balancer<br/>Port 80/443]
    end

    %% Client connections
    UI --> GW
    CLI --> GW
    
    %% API Gateway routing
    GW --> ORCH
    GW --> MODELS
    GW --> STORAGE
    GW --> MON
    
    %% Orchestration
    ORCH --> W1
    ORCH --> W2
    ORCH --> W3
    ORCH --> MLW
    SCHED --> ORCH
    
    %% Storage connections
    MODELS --> MONGO
    STORAGE --> MINIO
    MLW --> MONGO
    W1 --> REDIS
    W2 --> REDIS
    W3 --> REDIS
    
    %% Monitoring connections
    W1 --> PROM
    W2 --> PROM
    W3 --> PROM
    MLW --> PROM
    MON --> PROM
    GRAF --> PROM
    
    %% Load balancing
    NGINX --> GW
    
    style UI fill:#e1f5fe
    style GW fill:#fff3e0
    style ORCH fill:#f3e5f5
    style W1 fill:#e8f5e8
    style W2 fill:#e8f5e8
    style W3 fill:#e8f5e8
    style MLW fill:#e8f5e8
    style MODELS fill:#fff8e1
    style STORAGE fill:#fff8e1
    style MON fill:#fce4ec

🎯 Core Architectural Principles

  • Microservices Architecture: Each service is independently deployable and scalable
  • Event-Driven Communication: Asynchronous processing with gRPC and REST APIs
  • Cloud-Native Design: Kubernetes-first deployment with service mesh capabilities
  • Observability-First: Comprehensive monitoring, logging, and distributed tracing
  • Fault Tolerance: Circuit breakers, retries, and graceful degradation
  • Security by Design: API authentication, TLS encryption, and RBAC controls
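The fault-tolerance principle above usually translates into retry policies around inter-service calls. Below is a minimal, illustrative Python sketch of retries with exponential backoff and jitter; it is not TensorFleet's actual implementation, and `flaky` is just a stand-in for a gRPC or REST call to a worker.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.1):
    """Retry fn with exponential backoff and jitter; re-raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Sleep base_delay * 2^(attempt-1), plus up to 50% random jitter.
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * (1 + random.random() * 0.5))

# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("worker unreachable")
    return "ok"

result = call_with_retries(flaky, base_delay=0.01)
```

A real deployment would layer a circuit breaker on top so a persistently failing worker stops receiving traffic instead of being retried forever.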

πŸ”§ Service Documentation

Each TensorFleet service is documented with comprehensive setup, API, and deployment guides:

Core Services

| Service | Technology | Port | Documentation | Purpose |
|---|---|---|---|---|
| API Gateway | Go + Gin | 8080 | πŸ“– README | HTTP/REST entry point, request routing, authentication |
| Orchestrator | Go + gRPC | 9090 | πŸ“– README | Job scheduling, task distribution, worker management |
| Worker | Go + gRPC | 9091+ | πŸ“– README | Distributed compute nodes, training execution |
| ML Worker | Python + Flask | 8085 | πŸ“– README | Machine learning training engine with MongoDB |

Data & Storage Services

| Service | Technology | Port | Documentation | Purpose |
|---|---|---|---|---|
| Model Service | Python + Flask | 8084 | πŸ“– README | Model registry, versioning, GridFS storage |
| Storage Service | Python + Flask | 8082 | πŸ“– README | S3-compatible object storage, dataset management |

Platform Services

| Service | Technology | Port | Documentation | Purpose |
|---|---|---|---|---|
| Frontend | React + Vite | 3000 | πŸ“– README | Web dashboard, real-time monitoring UI |
| Monitoring | Python + Flask | 8081 | πŸ“– README | Metrics aggregation, health monitoring, analytics |

Infrastructure Dependencies

| Component | Technology | Port | Purpose |
|---|---|---|---|
| MongoDB | Document DB | 27017 | Model metadata, GridFS file storage |
| Redis | In-Memory DB | 6379 | Caching, job queues, session storage |
| MinIO | Object Storage | 9000 | S3-compatible file storage for datasets |
| Prometheus | Monitoring | 9090 | Metrics collection and alerting |
| Grafana | Visualization | 3000 (accessed at host port 3001) | Metrics dashboards and analytics |

✨ Platform Features

🎯 Core ML Platform Capabilities

  • πŸš€ Distributed Training: Automatically scale ML training across multiple worker nodes
  • 🧠 Model Registry: Version-controlled model storage with metadata and GridFS integration
  • πŸ’Ύ Automatic Model Saving: Models are automatically saved to storage when training jobs complete
  • πŸ“Š Real-time Monitoring: Live training metrics, system health, and performance dashboards
  • πŸ”„ Job Orchestration: Intelligent task scheduling with load balancing and fault tolerance
  • πŸ—„οΈ Data Management: S3-compatible storage for datasets, models, and artifacts
  • πŸ”’ Security & Auth: JWT authentication, RBAC, and secure service-to-service communication
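As a rough illustration of the model-registry capability above, the sketch below builds the kind of metadata document a registry might store alongside a model blob. The real Model Service uses MongoDB GridFS; the field names here are assumptions for illustration, not TensorFleet's actual schema.

```python
import hashlib
from datetime import datetime, timezone

def build_model_document(name, version, algorithm, weights: bytes, metrics=None):
    """Assemble an illustrative metadata document for a versioned model.
    The checksum and size let the registry verify the stored blob later."""
    return {
        "name": name,
        "version": version,
        "algorithm": algorithm,
        "sha256": hashlib.sha256(weights).hexdigest(),
        "size_bytes": len(weights),
        "metrics": metrics or {},
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical model blob; a real one would be serialized estimator weights.
doc = build_model_document("iris-rf", "1.0.0", "random_forest",
                           b"\x00" * 16, metrics={"accuracy": 0.97})
```

In GridFS terms, the blob would go into the file store and this document into the metadata collection, keyed by name and version.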

πŸ› οΈ Developer Experience

  • πŸ“‘ RESTful APIs: Comprehensive REST endpoints for all platform interactions
  • πŸ”Œ gRPC Services: High-performance inter-service communication
  • 🐳 Docker-First: Container-native development and deployment
  • ☸️ Kubernetes-Ready: Production-grade orchestration with Helm charts
  • πŸ“ˆ Observability: Prometheus metrics, distributed tracing, and structured logging
  • πŸ§ͺ Testing Suite: Comprehensive unit, integration, and load testing

🎨 Frontend Dashboard Features

  • πŸ“Š Real-time Analytics: Live job status, worker utilization, and training progress
  • πŸŽ›οΈ Job Management: Submit, monitor, and manage ML training jobs
  • πŸ“ˆ Metrics Visualization: Interactive charts and graphs for training metrics
  • πŸ–₯️ System Health: Service status monitoring and health checks
  • πŸ‘€ User Management: Role-based access control and user authentication
  • πŸ”” Notifications: Real-time alerts and job completion notifications

⚑ Quick Start

πŸš€ One-Line Setup

Get TensorFleet running in under 2 minutes with Docker Compose:

# Clone repository
git clone https://github.com/aditya2907/TensorFleet.git
cd TensorFleet

# Start all services
docker-compose up -d

# Verify deployment
make status-check

# Access the dashboard
open http://localhost:3000

βœ… Prerequisites

  • Docker Desktop (latest) with Docker Engine >= 24
  • Docker Compose v2 (bundled with Docker Desktop)
  • macOS, Linux, or Windows WSL2
  • Open local ports (default):
    • 3000 (Frontend), 8080 (API Gateway), 8081 (Monitoring), 8082 (Storage), 8084 (Model Service), 8085 (ML Worker), 27017 (MongoDB), 6379 (Redis), 9000/9001 (MinIO), 9090 (Prometheus)
  • Optional: Make (GNU Make) for helper targets

🧰 First-Run Checklist

  1. Copy environment defaults (if present):
    cp .env.example .env || true
  2. Start infrastructure and core services:
    docker-compose -f docker-compose.development.yml up -d
  3. Validate service health:
    make status-check || docker ps
  4. Run a demo workflow:
    ./demo-mongodb-ml.sh
  5. Verify endpoints:

🎯 Quick Demo

Run a complete ML training workflow:

# Submit a training job via API
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "random_forest",
    "dataset": "iris",
    "hyperparameters": {
      "n_estimators": 100,
      "max_depth": 5
    }
  }'

# Monitor job progress
curl http://localhost:8080/api/v1/jobs/{job_id}/status

# View training metrics
open http://localhost:3000/jobs/{job_id}
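The curl workflow above can also be scripted. Here is a hedged, stdlib-only Python sketch that builds the same payload; `submit_job` only works against a running gateway, and the shape of the gateway's response (e.g. a job id field) is an assumption, not documented API.

```python
import json
import urllib.request

API = "http://localhost:8080/api/v1"  # gateway base URL from the Quick Start

def build_job_payload(algorithm, dataset, **hyperparameters):
    """Mirror the JSON body used in the curl example above."""
    return {
        "algorithm": algorithm,
        "dataset": dataset,
        "hyperparameters": hyperparameters,
    }

def submit_job(payload):
    """POST the job to the gateway; requires the stack to be running."""
    req = urllib.request.Request(
        f"{API}/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_job_payload("random_forest", "iris",
                            n_estimators=100, max_depth=5)
# submit_job(payload) would return the gateway's JSON response.
```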

πŸ“± Access Points

| Service | URL | Credentials |
|---|---|---|
| Frontend Dashboard | http://localhost:3000 | - |
| API Gateway | http://localhost:8080 | - |
| Model Registry | http://localhost:8084/api/v1/models | - |
| Grafana Metrics | http://localhost:3001 | admin/admin |
| MinIO Console | http://localhost:9001 | admin/password123 |

🐳 Docker Development

πŸƒ Development Mode

Start services in development mode with hot reloading:

# Start core infrastructure
docker-compose -f docker-compose.development.yml up -d mongodb redis minio

# Start services in development mode
make dev-start

# View logs
make logs

# Run specific service
docker-compose -f docker-compose.development.yml up api-gateway

πŸ”§ Service Management

# Start all services
make start

# Stop all services  
make stop

# Restart specific service
make restart SERVICE=api-gateway

# View service status
make status

# Clean up resources
make cleanup

πŸ› οΈ Development Utilities

# Install dependencies
make install-deps

# Run tests
make test

# Build all images
make build

# Format code
make format

# Run linting
make lint

☸️ Kubernetes Deployment

πŸš€ Production Deployment

Deploy TensorFleet to a Kubernetes cluster:

# Create namespace
kubectl apply -f k8s/namespace.yaml

# Deploy infrastructure
kubectl apply -f k8s/infrastructure.yaml

# Deploy core services
kubectl apply -f k8s/deployment.yaml

# Setup ingress
kubectl apply -f k8s/ingress.yaml

# Verify deployment
kubectl get pods -n tensorfleet

πŸ“Š Scaling Configuration

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
  namespace: tensorfleet
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

πŸ” Security & Secrets

# Create TLS certificates
kubectl create secret tls tensorfleet-tls \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key \
  -n tensorfleet

# Create database credentials
kubectl create secret generic mongodb-secret \
  --from-literal=connection-string="mongodb://admin:password123@mongodb:27017/tensorfleet?authSource=admin" \
  -n tensorfleet

# Create API keys
kubectl create secret generic api-keys \
  --from-literal=jwt-secret="your-jwt-secret" \
  --from-literal=admin-key="your-admin-api-key" \
  -n tensorfleet

πŸ“Š Monitoring & Observability

🎯 Key Metrics

TensorFleet provides comprehensive observability across all layers:

Application Metrics

  • Job Metrics: Success rate, completion time, queue depth
  • Training Metrics: Accuracy, loss, convergence rate, resource utilization
  • Worker Metrics: Task throughput, error rate, resource consumption
  • API Metrics: Request latency, error rate, throughput by endpoint

Infrastructure Metrics

  • Service Health: Availability, response time, resource usage
  • Database Performance: Connection pool, query performance, storage usage
  • Storage Metrics: Object storage usage, transfer rates, availability
  • Network Metrics: Service-to-service communication, latency, errors
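To make "error rate" and latency metrics concrete, here is a small self-contained sketch of the sliding-window aggregation a monitoring service might perform per endpoint. It is illustrative only; the actual Monitoring Service exposes such figures through Prometheus rather than computing them like this.

```python
from collections import deque

class EndpointStats:
    """Keep a sliding window of request outcomes for one endpoint and
    derive the error rate and p95 latency a dashboard would chart."""
    def __init__(self, window=100):
        self.samples = deque(maxlen=window)  # (latency_ms, status_code)

    def record(self, latency_ms, status_code):
        self.samples.append((latency_ms, status_code))

    def error_rate(self):
        """Fraction of requests in the window that returned a 5xx status."""
        if not self.samples:
            return 0.0
        errors = sum(1 for _, s in self.samples if s >= 500)
        return errors / len(self.samples)

    def p95_latency(self):
        """95th-percentile latency over the window (nearest-rank)."""
        if not self.samples:
            return 0.0
        ordered = sorted(l for l, _ in self.samples)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

stats = EndpointStats()
for i in range(100):  # every 10th request fails, latency grows 10..109 ms
    stats.record(latency_ms=10 + i, status_code=500 if i % 10 == 0 else 200)
```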

πŸ“ˆ Grafana Dashboards

Pre-configured dashboards available at http://localhost:3001:

  • TensorFleet Overview: High-level platform metrics and health
  • Job Analytics: Training job performance and success rates
  • Worker Performance: Distributed worker utilization and efficiency
  • Infrastructure Health: System resources and service availability
  • API Performance: Gateway metrics and endpoint analytics

πŸ”” Alerting Rules

# Critical system alerts
groups:
  - name: tensorfleet-critical
    rules:
    - alert: ServiceDown
      expr: up{job="tensorfleet"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TensorFleet service {{ $labels.instance }} is down"
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.service }}"

πŸ§ͺ Testing & Validation

πŸ” Test Suite Overview

TensorFleet includes comprehensive testing at multiple levels:

# Run all tests
make test-all

# Unit tests
make test-unit

# Integration tests  
make test-integration

# API tests
make test-api

# Load tests
make test-load

# Security tests
make test-security

πŸš€ Continuous Integration

# GitHub Actions workflow
name: TensorFleet CI/CD
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Run Tests
      run: |
        docker-compose -f docker-compose.test.yml up --abort-on-container-exit
    - name: Build Images
      run: make build
    - name: Security Scan
      run: make security-scan

πŸ“Š Test Coverage

| Service | Unit Tests | Integration Tests | API Tests |
|---|---|---|---|
| API Gateway | 95%+ | βœ… | βœ… |
| Orchestrator | 90%+ | βœ… | βœ… |
| Worker | 88%+ | βœ… | βœ… |
| Model Service | 92%+ | βœ… | βœ… |
| Storage | 89%+ | βœ… | βœ… |
| Monitoring | 87%+ | βœ… | βœ… |
| Frontend | 85%+ | βœ… | βœ… |

πŸ“ Project Structure

TensorFleet follows a microservices architecture with clear separation of concerns:

TensorFleet/
β”œβ”€β”€ 🌐 api-gateway/          # HTTP/REST API Gateway (Go + Gin)
β”‚   β”œβ”€β”€ main.go              # Gateway server implementation  
β”‚   β”œβ”€β”€ handlers/            # HTTP request handlers
β”‚   β”œβ”€β”€ middleware/          # Authentication, CORS, logging
β”‚   β”œβ”€β”€ go.mod               # Go dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ 🎯 orchestrator/         # Job Orchestration Service (Go + gRPC)
β”‚   β”œβ”€β”€ main.go              # Orchestrator server
β”‚   β”œβ”€β”€ scheduler/           # Task scheduling logic
β”‚   β”œβ”€β”€ worker_manager/      # Worker registration & health
β”‚   β”œβ”€β”€ go.mod               # Go dependencies  
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ ⚑ worker/               # Distributed Worker Nodes (Go + gRPC)
β”‚   β”œβ”€β”€ main.go              # Worker server implementation
β”‚   β”œβ”€β”€ executor/            # Task execution engine
β”‚   β”œβ”€β”€ metrics/             # Performance monitoring
β”‚   β”œβ”€β”€ go.mod               # Go dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ€– worker-ml/            # ML Training Engine (Python + Flask)
β”‚   β”œβ”€β”€ main.py              # ML worker API server
β”‚   β”œβ”€β”€ models/              # ML algorithm implementations
β”‚   β”œβ”€β”€ datasets/            # Dataset loaders and preprocessors
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ—„οΈ model-service/        # Model Registry (Python + Flask + MongoDB)
β”‚   β”œβ”€β”€ main.py              # Model management API
β”‚   β”œβ”€β”€ storage/             # GridFS integration
β”‚   β”œβ”€β”€ metadata/            # Model metadata handling
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ“¦ storage/              # Object Storage Service (Python + Flask + MinIO)
β”‚   β”œβ”€β”€ main.py              # Storage API server
β”‚   β”œβ”€β”€ storage_manager.py   # MinIO client wrapper
β”‚   β”œβ”€β”€ handlers/            # File upload/download logic
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ“Š monitoring/           # Observability Service (Python + Flask)
β”‚   β”œβ”€β”€ main.py              # Monitoring API server  
β”‚   β”œβ”€β”€ collectors/          # Metrics collection
β”‚   β”œβ”€β”€ aggregators/         # Data aggregation logic
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ 🎨 frontend/             # Web Dashboard (React + Vite + Material-UI)
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ components/      # React components
β”‚   β”‚   β”œβ”€β”€ pages/           # Application pages
β”‚   β”‚   β”œβ”€β”€ hooks/           # Custom React hooks
β”‚   β”‚   β”œβ”€β”€ services/        # API client services
β”‚   β”‚   └── utils/           # Utility functions
β”‚   β”œβ”€β”€ public/              # Static assets
β”‚   β”œβ”€β”€ package.json         # Node.js dependencies
β”‚   β”œβ”€β”€ vite.config.js       # Vite configuration
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ”Œ proto/                # gRPC Protocol Definitions
β”‚   β”œβ”€β”€ gateway.proto        # API Gateway service definitions
β”‚   β”œβ”€β”€ orchestrator.proto   # Orchestrator service definitions  
β”‚   β”œβ”€β”€ worker.proto         # Worker service definitions
β”‚   └── generate.sh          # Protocol buffer generation script
β”‚
β”œβ”€β”€ ☸️ k8s/                  # Kubernetes Deployment Manifests
β”‚   β”œβ”€β”€ namespace.yaml       # TensorFleet namespace
β”‚   β”œβ”€β”€ infrastructure.yaml  # MongoDB, Redis, MinIO
β”‚   β”œβ”€β”€ configmap.yaml       # Configuration management
β”‚   β”œβ”€β”€ deployment.yaml      # Core service deployments
β”‚   β”œβ”€β”€ ingress.yaml         # External access configuration
β”‚   β”œβ”€β”€ monitoring.yaml      # Prometheus & Grafana
β”‚   └── storage.yaml         # Persistent volume claims
β”‚
β”œβ”€β”€ πŸ§ͺ tests/                # Comprehensive Testing Suite
β”‚   β”œβ”€β”€ unit/                # Unit tests for each service
β”‚   β”œβ”€β”€ integration/         # Integration tests  
β”‚   β”œβ”€β”€ api/                 # API endpoint tests
β”‚   β”œβ”€β”€ load/                # Load testing scripts
β”‚   └── e2e/                 # End-to-end testing
β”‚
β”œβ”€β”€ πŸ“œ scripts/              # Automation & Utility Scripts
β”‚   β”œβ”€β”€ demo-*.sh            # Demo and testing scripts
β”‚   β”œβ”€β”€ cleanup-*.sh         # Environment cleanup utilities
β”‚   β”œβ”€β”€ setup-*.sh           # Environment setup scripts
β”‚   └── test-*.sh            # Testing automation
β”‚
β”œβ”€β”€ 🐳 Docker Configurations
β”‚   β”œβ”€β”€ docker-compose.yml           # Production deployment
β”‚   β”œβ”€β”€ docker-compose.development.yml # Development environment
β”‚   └── docker-compose.test.yml      # Testing environment
β”‚
β”œβ”€β”€ πŸ“‹ Configuration Files
β”‚   β”œβ”€β”€ Makefile             # Build automation and common tasks
β”‚   β”œβ”€β”€ .env.example         # Environment variable template
β”‚   β”œβ”€β”€ .gitignore           # Git ignore patterns
β”‚   β”œβ”€β”€ netlify.toml         # Frontend deployment config
β”‚   └── vercel.json          # Alternative frontend deployment
β”‚
└── πŸ“š Documentation
    β”œβ”€β”€ README.md            # Main project documentation (this file)
    β”œβ”€β”€ docs/                # Additional documentation
    └── postman/             # API testing collections

πŸ—οΈ Architecture Highlights

  • πŸ”„ Service Communication: gRPC for internal services, REST for external APIs
  • πŸ“Š Data Flow: MongoDB → GridFS → Model Registry → API Gateway → Frontend
  • πŸš€ Scalability: Horizontal pod autoscaling for workers and compute nodes
  • πŸ”’ Security: JWT authentication, TLS encryption, network policies
  • πŸ“ˆ Observability: Prometheus metrics, Grafana dashboards, structured logging
  • 🐳 Deployment: Container-native with Kubernetes orchestration

πŸš€ Production Deployment

🌊 Cloud Platform Support

TensorFleet supports deployment on major cloud platforms:

Amazon Web Services (AWS)

# Deploy to EKS
eksctl create cluster --name tensorfleet-production --region us-west-2
kubectl apply -f k8s/

# Configure ALB Ingress: the controller's IAM policy is created with the AWS CLI, not kubectl
curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.4.4/docs/install/iam_policy.json
aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy --policy-document file://iam_policy.json

Google Cloud Platform (GCP)

# Deploy to GKE
gcloud container clusters create tensorfleet-production --zone us-central1-a
kubectl apply -f k8s/

# Configure Cloud Load Balancer
kubectl apply -f k8s/ingress-gcp.yaml

Microsoft Azure

# Deploy to AKS  
az aks create --resource-group tensorfleet-rg --name tensorfleet-production
kubectl apply -f k8s/

# Configure Azure Load Balancer
kubectl apply -f k8s/ingress-azure.yaml

πŸ”§ Production Configuration

Environment Variables

# Production environment configuration
export ENVIRONMENT=production
export LOG_LEVEL=INFO
export MONGODB_URL=mongodb+srv://prod:password@cluster.mongodb.net/tensorfleet
export REDIS_URL=redis://prod-redis.tensorfleet.svc.cluster.local:6379
export MINIO_ENDPOINT=s3.amazonaws.com
export JWT_SECRET=your-production-jwt-secret
export API_RATE_LIMIT=1000
export WORKER_MAX_REPLICAS=100
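Services typically read these variables at startup with typed defaults. A minimal sketch of that pattern (the key names mirror the exports above; the defaults here are illustrative, not TensorFleet's actual values):

```python
import os

# Illustrative development defaults; production overrides come from the environment.
DEFAULTS = {
    "ENVIRONMENT": "development",
    "LOG_LEVEL": "INFO",
    "API_RATE_LIMIT": "100",
    "WORKER_MAX_REPLICAS": "10",
}

def load_config(env=None):
    """Merge an environment mapping over the defaults, coercing numeric values."""
    merged = {**DEFAULTS, **(os.environ if env is None else env)}
    return {
        "environment": merged["ENVIRONMENT"],
        "log_level": merged["LOG_LEVEL"],
        "api_rate_limit": int(merged["API_RATE_LIMIT"]),
        "worker_max_replicas": int(merged["WORKER_MAX_REPLICAS"]),
    }

cfg = load_config({"ENVIRONMENT": "production", "API_RATE_LIMIT": "1000"})
```

Coercing and validating values once at startup means a bad setting fails fast rather than surfacing mid-training.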

Resource Limits

# Production resource configuration
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi" 
    cpu: "1000m"

High Availability Setup

# Multi-zone deployment
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - tensorfleet-worker
      topologyKey: "topology.kubernetes.io/zone"

πŸ” Security Hardening

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorfleet-network-policy
spec:
  podSelector:
    matchLabels:
      app: tensorfleet
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: tensorfleet
    ports:
    - protocol: TCP
      port: 8080

RBAC Configuration

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tensorfleet-worker
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]

πŸ“Š Production Monitoring

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'tensorfleet'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Grafana Dashboards

  • TensorFleet Overview: High-level system metrics
  • Training Jobs: ML job performance and progress
  • Infrastructure: Kubernetes cluster health
  • Application Performance: Service response times and errors

🚨 Alerting Rules

groups:
  - name: tensorfleet-production
    rules:
    - alert: TensorFleetServiceDown
      expr: up{job="tensorfleet"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "TensorFleet service is down"
        description: "Service {{ $labels.instance }} has been down for more than 5 minutes"
    
    - alert: HighJobFailureRate  
      expr: rate(tensorfleet_jobs_failed_total[10m]) > 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High job failure rate detected"

πŸ”„ Disaster Recovery

Backup Strategy

# Automated MongoDB backup
kubectl create cronjob mongodb-backup \
  --image=mongo:5.0 \
  --schedule="0 2 * * *" \
  -- mongodump --uri="$MONGODB_URI" --out=/backup/$(date +%Y%m%d)

# MinIO backup to S3
kubectl create cronjob minio-backup \
  --image=minio/mc \
  --schedule="0 3 * * *" \
  -- mc mirror local-minio s3-backup/tensorfleet-backup

Recovery Procedures

# Restore MongoDB from backup
kubectl exec -it mongodb-pod -- mongorestore --uri="$MONGODB_URI" /backup/20241208

# Restore MinIO from S3 backup  
kubectl exec -it minio-pod -- mc mirror s3-backup/tensorfleet-backup local-minio

⚑ Performance Optimization

Database Optimization

// MongoDB indexes for production workloads
db.jobs.createIndex({ "status": 1, "created_at": -1 })
db.models.createIndex({ "algorithm": 1, "metrics.accuracy": -1 })
db.workers.createIndex({ "status": 1, "last_heartbeat": -1 })

Caching Strategy

# Redis configuration for production
redis:
  maxmemory: 2gb
  maxmemory-policy: allkeys-lru
  save: "900 1 300 10 60 10000"
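The `allkeys-lru` policy above means Redis evicts the least-recently-used key once `maxmemory` is reached. A toy Python sketch of that eviction behavior, with an entry count standing in for the memory budget:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal allkeys-lru sketch: evict the least-recently-used key
    once the entry budget (standing in for Redis's maxmemory) is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # capacity exceeded: "b" is evicted
```

This is why `allkeys-lru` suits pure-cache workloads: anything evicted can be recomputed or refetched, which holds for TensorFleet's cached job metadata.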

Auto-scaling Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorfleet-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorfleet-worker
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

🀝 Contributing

Development Workflow

  1. Fork & Clone

    git clone https://github.com/your-username/tensorfleet.git
    cd tensorfleet
  2. Create Feature Branch

    git checkout -b feature/amazing-new-feature
  3. Development Setup

    # Install development dependencies
    make install-dev
    
    # Start development environment
    make dev-start
    
    # Run tests
    make test
  4. Code Quality

    # Format code
    make format
    
    # Run linting
    make lint
    
    # Security scan
    make security-check
  5. Submit Changes

    git add .
    git commit -m "feat: add amazing new feature"
    git push origin feature/amazing-new-feature

Code Standards

  • Go: Follow gofmt, golint, and go vet standards
  • Python: PEP 8 compliance with black formatting
  • JavaScript: ESLint with Prettier formatting
  • Documentation: Clear comments and comprehensive README updates
  • Testing: Minimum 85% code coverage for new features

Pull Request Process

  1. Ensure all tests pass and code coverage meets requirements
  2. Update documentation for any new features or API changes
  3. Add integration tests for new endpoints or services
  4. Request review from maintainers
  5. Address feedback and maintain clean commit history

πŸ“ž Support & Community

Getting Help

Community Resources

Enterprise Support

For production deployments and enterprise features:

  • 🏒 Enterprise Consulting: Custom deployment assistance
  • πŸ”’ Security Audits: Professional security assessments
  • πŸ“ˆ Performance Tuning: Optimization for large-scale workloads
  • πŸŽ“ Training Programs: Team training and certification

πŸ“„ License

TensorFleet is released under the MIT License. See the LICENSE file for full terms.

πŸ‘₯ Team & Contributors

Development Team:

  • Aditya Suryawanshi (25211365) - Backend Infrastructure Lead
  • Rahul Mirashi (25211365) - ML & Data Services Lead
  • Soham Maji (25204731) - Frontend & Monitoring Lead

πŸ“– Project Documentation:

πŸ™ Acknowledgments

Built with ❤️ by the TensorFleet team using:

  • Languages: Go, Python, JavaScript/TypeScript
  • Frameworks: Gin (Go), Flask (Python), React (JavaScript)
  • Infrastructure: Kubernetes, Docker, gRPC, MongoDB, Redis, MinIO
  • Monitoring: Prometheus, Grafana, OpenTelemetry
  • Cloud Platforms: AWS, GCP, Azure support

⭐ Star this repository if TensorFleet helps your ML workflows!

  • βœ… Task Queuing - Orchestrator manages task distribution
  • βœ… Auto-scaling - Kubernetes HPA for worker nodes
  • βœ… Object Storage - MinIO for models, datasets, and checkpoints
  • βœ… Real-time Metrics - Prometheus + Grafana monitoring
  • βœ… Health Checks - Liveness and readiness probes
  • βœ… Horizontal Scaling - Workers scale from 2-10 pods

Production Features

  • πŸ”’ Secure Defaults - No hardcoded credentials
  • πŸ“Š Observability - Structured logging, metrics, traces
  • 🐳 Containerized - Docker images for all services
  • ☸️ Kubernetes-native - Complete K8s manifests
  • πŸ”„ High Availability - Stateful sets for infrastructure
  • 🎯 Load Balancing - Service discovery and routing

πŸ“¦ Prerequisites

Local Development

  • Docker 20.10+
  • Docker Compose 2.0+
  • Go 1.21+ (for proto generation)
  • Node.js 16+ (for frontend development)

Kubernetes Deployment

  • Kubernetes cluster 1.24+
  • kubectl configured
  • Helm 3.0+ (optional, for Prometheus/Grafana)
  • Container registry access (Docker Hub, GitHub Container Registry, etc.)

πŸ”¬ Project Reproducibility Instructions

System Requirements

Ensure your system meets the following requirements for consistent reproduction:

Hardware:

  • Minimum: 4 GB RAM, 2 CPU cores
  • Recommended: 8 GB RAM, 4 CPU cores
  • Disk Space: 5 GB free space

Operating System:

  • macOS 10.15+ / Ubuntu 18.04+ / Windows 10+ with WSL2
  • Docker Desktop or Docker Engine installed and running

Step-by-Step Reproduction Guide

1. Environment Setup

# Clone the repository
git clone https://github.com/your-username/TensorFleet.git
cd TensorFleet

# Verify Docker installation
docker --version
docker-compose --version

# Ensure Docker is running
docker ps

2. Build Protocol Buffers (Optional)

# Install Protocol Buffer compiler (if modifying .proto files)
# macOS
brew install protobuf protoc-gen-go protoc-gen-go-grpc

# Ubuntu
sudo apt-get install -y protobuf-compiler
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest

# Generate proto files (only if modified)
cd proto && ./generate.sh && cd ..

3. Initial Setup & Validation

# Pull all required Docker images
docker-compose pull

# Build all services (this may take 5-10 minutes on first run)
docker-compose build

# Verify all images are built
docker images | grep tensorfleet

4. Start the Platform

# Start all services
docker-compose up -d

# Wait for all services to be healthy (30-60 seconds)
# You should see all services as "healthy"
docker-compose ps

# Verify health status
curl http://localhost:8080/health  # API Gateway
curl http://localhost:8081/health  # Monitoring Service
curl http://localhost:8082/health  # Storage Service
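A script or dashboard usually folds these per-service checks into a single platform status. A minimal sketch (the status names and the degraded/down distinction are illustrative, not TensorFleet's actual health model):

```python
def overall_status(checks):
    """Collapse per-service health results into one platform status.
    checks maps service name -> bool (healthy or not)."""
    down = [name for name, ok in checks.items() if not ok]
    if not down:
        return "healthy", []
    if len(down) == len(checks):
        return "down", down
    return "degraded", down  # some services up, some down

status, failing = overall_status({
    "api-gateway": True,
    "monitoring": True,
    "storage": False,
})
```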

5. Smoke Test - Verify Everything Works

# Run the automated demo (tests all endpoints)
./quick-api-demo.sh

# Expected output should show:
# βœ“ All services healthy
# βœ“ Storage operations working
# βœ“ Job submission and monitoring working
# βœ“ Metrics collection working

6. Access Web Interfaces

Open these URLs in your browser:

  • Frontend Dashboard: http://localhost:3000
  • API Gateway: http://localhost:8080
  • Grafana: http://localhost:3001 (admin/admin)
  • MinIO Console: http://localhost:9001 (admin/password123)

Troubleshooting Common Issues

Issue: Services fail to start

# Check logs for specific service
docker-compose logs <service-name>

# Common solutions:
# 1. Restart Docker Desktop
# 2. Clear Docker cache: docker system prune -a
# 3. Check port conflicts: lsof -i :8080

Issue: "Connection refused" errors

# Wait for health checks to pass
watch docker-compose ps

# Services need 30-60 seconds to fully initialize
# Redis and MinIO must be healthy before other services start

Issue: Port conflicts

# Check what's using the ports
lsof -i :3000 -i :8080 -i :8081 -i :8082 -i :9000 -i :9001

# Stop conflicting services or modify docker-compose.yml ports

Issue: Out of disk space

# Clean up Docker resources
docker system prune -a --volumes

# Remove old containers and images
docker container prune
docker image prune -a

Reproducible Demo Scenarios

Scenario 1: Submit and Monitor Training Job

# Submit a ResNet50 training job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model_type": "resnet50",
    "dataset_path": "s3://tensorfleet/datasets/imagenet",
    "hyperparameters": {"learning_rate": "0.001"},
    "num_workers": 3,
    "epochs": 10
  }'

# Monitor progress in real-time
# Save the job_id from above response, then:
watch -n 2 "curl -s http://localhost:8080/api/v1/jobs/YOUR_JOB_ID | jq"
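
The watch loop above can also be written in Python. This sketch assumes only that the job endpoint reports a `status` field; the terminal state names are a guess based on the statuses shown elsewhere in this README, not a confirmed API contract:

```python
import time

# Assumed terminal states; adjust to what the API actually returns.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def poll_job(get_status, interval=2.0, max_polls=1800):
    """Call `get_status()` (any callable returning a status string)
    until a terminal state is seen; return the final status."""
    status = get_status()
    for _ in range(max_polls):
        if status.upper() in TERMINAL_STATES:
            break
        time.sleep(interval)
        status = get_status()
    return status
```

Wire it up with e.g. `get_status = lambda: requests.get(f"http://localhost:8080/api/v1/jobs/{job_id}").json()["status"]`.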

Scenario 2: Upload and Download Files

# Create a sample dataset
echo "epoch,loss,accuracy" > sample_dataset.csv
echo "1,2.5,0.3" >> sample_dataset.csv
echo "2,1.8,0.5" >> sample_dataset.csv

# Upload to storage
curl -X POST http://localhost:8081/api/v1/upload/datasets/sample.csv \
  -F "file=@sample_dataset.csv"

# List files
curl http://localhost:8081/api/v1/list/datasets | jq

# Download file
curl http://localhost:8081/api/v1/download/datasets/sample.csv

Scenario 3: View Monitoring Dashboard

  1. Open http://localhost:3001 in browser
  2. Login with admin/admin
  3. Navigate to "TensorFleet Dashboard"
  4. Submit some jobs and watch metrics update

Environment Variables for Customization

Create a .env file to customize settings:

# Optional: Customize ports
API_GATEWAY_PORT=8080
STORAGE_PORT=8081
MONITORING_PORT=8082
FRONTEND_PORT=3000

# Optional: Customize MinIO credentials
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=password123

# Optional: Worker scaling
WORKER_REPLICAS=3

Data Persistence

All data is persisted in Docker volumes:

# View volumes
docker volume ls | grep tensorfleet

# To reset all data (WARNING: destroys all jobs/files)
docker-compose down -v

# To backup data
docker run --rm -v tensorfleet_minio_data:/data -v $(pwd):/backup ubuntu tar czf /backup/minio_backup.tar.gz /data

Cleanup Instructions

# Stop all services
docker-compose down

# Remove all data (optional)
docker-compose down -v

# Clean up Docker resources
docker system prune -a

# Remove built images
docker rmi $(docker images | grep tensorfleet | awk '{print $3}')

Testing Reproducibility

To verify the setup works on a clean system:

#!/bin/bash
# Test script for CI/CD or clean environment
set -e

echo "Testing TensorFleet reproducibility..."
docker-compose up -d
sleep 60  # Wait for services to initialize

# Test health endpoints
curl -f http://localhost:8080/health
curl -f http://localhost:8081/health  
curl -f http://localhost:8082/health

# Test job submission
JOB_ID=$(curl -s -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{"model_type":"test","dataset_path":"test","num_workers":1,"epochs":1}' | jq -r .job_id)

# Verify job was created
curl -f "http://localhost:8080/api/v1/jobs/$JOB_ID"

echo "βœ… Reproducibility test passed!"

πŸš€ Quick Start

1. Clone and Setup

git clone <repository-url>
cd TensorFleet

2. Generate gRPC Stubs

make proto

3. Run Locally with Docker Compose

# Build and start all services
make compose-up

# Or manually:
docker-compose up --build

4. Access Services

| Service           | URL                   | Credentials             |
|-------------------|-----------------------|-------------------------|
| Frontend          | http://localhost:3000 | -                       |
| API Gateway       | http://localhost:8080 | -                       |
| Storage API       | http://localhost:8081 | -                       |
| Monitoring API    | http://localhost:8082 | -                       |
| Model Service API | http://localhost:8083 | -                       |
| ML Worker API     | http://localhost:8000 | -                       |
| Grafana           | http://localhost:3001 | admin/admin             |
| Prometheus        | http://localhost:9090 | -                       |
| MinIO Console     | http://localhost:9001 | minioadmin/minioadmin   |
| MongoDB           | localhost:27017       | admin/password123       |

πŸ’» Local Development

Start Services

# Start all services in background
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
make compose-down

Submit a Training Job

# Using curl
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-User-ID: demo-user" \
  -d '{
    "model_type": "cnn",
    "dataset_path": "/data/mnist",
    "num_workers": 3,
    "epochs": 10,
    "hyperparameters": {
      "learning_rate": "0.001",
      "batch_size": "64"
    }
  }'

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "RUNNING",
  "num_tasks": 100,
  "message": "Job created with 100 tasks"
}

Check Job Status

curl http://localhost:8080/api/v1/jobs/550e8400-e29b-41d4-a716-446655440000

View Metrics

# Dashboard metrics
curl http://localhost:8082/api/v1/dashboard

# Prometheus metrics
curl http://localhost:8082/metrics
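
The `/metrics` endpoint serves the Prometheus text exposition format. For quick scripting against it, here is a deliberately minimal parser sketch that handles only un-labelled `name value` samples and skips `# HELP`/`# TYPE` comments and labelled series:

```python
def parse_simple_metrics(text):
    """Parse un-labelled `name value` lines of the Prometheus text
    format into a dict; comments and labelled series are skipped."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            try:
                metrics[parts[0]] = float(parts[1])
            except ValueError:
                pass  # non-numeric value, e.g. NaN formatting quirks
    return metrics
```

For anything beyond a smoke test, the official `prometheus_client` parser is the better choice.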

πŸ€– MongoDB ML Training

TensorFleet supports machine learning training with MongoDB for dataset storage and model persistence.

Quick ML Training Demo

# Run the automated ML training demo
./demo-mongodb-ml.sh

This demo will:

  1. βœ… Train 3 different ML models (RandomForest, LogisticRegression, SVM)
  2. βœ… Store trained models in MongoDB using GridFS
  3. βœ… Save model metadata (hyperparameters, metrics, version)
  4. βœ… Download and save models locally
  5. βœ… Display model statistics

Manual ML Training

1. List Available Datasets

curl http://localhost:8000/datasets | jq

Response:

{
  "datasets": [
    {
      "name": "iris",
      "description": "Iris flower dataset",
      "n_samples": 150,
      "n_features": 4,
      "target_column": "species"
    },
    {
      "name": "wine",
      "description": "Wine classification dataset",
      "n_samples": 178,
      "n_features": 13,
      "target_column": "wine_class"
    }
  ]
}

2. Train a Model

curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "my_training_job_001",
    "dataset_name": "iris",
    "algorithm": "random_forest",
    "target_column": "species",
    "model_name": "iris_rf_model",
    "hyperparameters": {
      "n_estimators": 100,
      "max_depth": 5,
      "random_state": 42
    }
  }' | jq

Response:

{
  "job_id": "my_training_job_001",
  "model_id": "507f1f77bcf86cd799439011",
  "status": "completed",
  "metrics": {
    "train_accuracy": 0.9833,
    "test_accuracy": 0.9667,
    "training_time": 0.234
  },
  "model_name": "iris_rf_model",
  "version": "v1701964800"
}

3. List Trained Models

curl "http://localhost:8083/api/v1/models?page=1&limit=10" | jq

Response:

{
  "models": [
    {
      "id": "507f1f77bcf86cd799439011",
      "name": "iris_rf_model",
      "algorithm": "random_forest",
      "metrics": {
        "test_accuracy": 0.9667,
        "train_accuracy": 0.9833
      },
      "version": "v1701964800",
      "created_at": "2025-12-07T10:30:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 10,
    "total": 1,
    "pages": 1
  }
}
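
The `pagination` block makes it straightforward to walk every page of the model list. A sketch, where the `fetch_page` callable is a hypothetical stand-in for a real HTTP call such as `requests.get(...).json()`:

```python
def iter_models(fetch_page, limit=10):
    """Yield every model across all pages. `fetch_page(page, limit)`
    must return a response shaped like the JSON above:
    {"models": [...], "pagination": {"page": ..., "pages": ...}}."""
    page = 1
    while True:
        body = fetch_page(page, limit)
        for model in body.get("models", []):
            yield model
        pg = body["pagination"]
        if pg["page"] >= pg["pages"]:
            break
        page += 1
```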

4. Get Model Metadata

curl http://localhost:8083/api/v1/models/<model_id> | jq

Response:

{
  "id": "507f1f77bcf86cd799439011",
  "name": "iris_rf_model",
  "algorithm": "random_forest",
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 5,
    "random_state": 42
  },
  "metrics": {
    "train_accuracy": 0.9833,
    "test_accuracy": 0.9667,
    "training_time": 0.234
  },
  "version": "v1701964800",
  "dataset_name": "iris",
  "target_column": "species",
  "features": ["sepal length", "sepal width", "petal length", "petal width"],
  "created_at": "2025-12-07T10:30:00Z"
}

5. Download a Model

# Download model file
curl http://localhost:8083/api/v1/models/<model_id>/download \
  -o my_model.pkl

# Verify download
ls -lh my_model.pkl

6. Use Downloaded Model (Python)

import pickle
import numpy as np

# Load the model
with open('my_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Make predictions
sample_data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(sample_data)
print(f"Prediction: {prediction}")

Supported ML Algorithms

| Algorithm             | Description               | Best For                               |
|-----------------------|---------------------------|----------------------------------------|
| random_forest         | Random Forest Classifier  | Classification, robust to overfitting  |
| logistic_regression   | Logistic Regression       | Binary/multi-class classification      |
| svm                   | Support Vector Machine    | Non-linear classification              |
| decision_tree         | Decision Tree Classifier  | Interpretable models                   |

Model Hyperparameters

Random Forest:

{
  "n_estimators": 100,
  "max_depth": null,
  "min_samples_split": 2,
  "random_state": 42
}

Logistic Regression:

{
  "max_iter": 1000,
  "C": 1.0,
  "random_state": 42
}

SVM:

{
  "kernel": "rbf",
  "C": 1.0,
  "gamma": "scale",
  "random_state": 42
}

Decision Tree:

{
  "max_depth": null,
  "min_samples_split": 2,
  "random_state": 42
}
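
Taken together, the defaults above are applied by merging any user-supplied hyperparameters over them. A sketch of that merge (the dictionaries mirror the tables above, with JSON `null` becoming Python `None`; `resolve_hyperparameters` is an illustrative helper name, not the ML worker's actual API):

```python
DEFAULTS = {
    "random_forest": {"n_estimators": 100, "max_depth": None,
                      "min_samples_split": 2, "random_state": 42},
    "logistic_regression": {"max_iter": 1000, "C": 1.0, "random_state": 42},
    "svm": {"kernel": "rbf", "C": 1.0, "gamma": "scale", "random_state": 42},
    "decision_tree": {"max_depth": None, "min_samples_split": 2,
                      "random_state": 42},
}

def resolve_hyperparameters(algorithm, overrides=None):
    """Return the defaults for `algorithm` with user overrides applied."""
    if algorithm not in DEFAULTS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return {**DEFAULTS[algorithm], **(overrides or {})}
```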

Python Client for ML Operations

Use the provided Python client for easy interaction:

# Run complete demo
python ml_client.py demo

# List all models
python ml_client.py list

# Get statistics
python ml_client.py stats

# Download a specific model
python ml_client.py download <model_id> output.pkl

Model Storage Architecture

  • Datasets: Stored in MongoDB collections or GridFS for large files
  • Trained Models: Serialized with pickle and stored in GridFS
  • Metadata: Model information stored in MongoDB collections
    • Model name, algorithm, hyperparameters
    • Training metrics (accuracy, loss)
    • Version information
    • Dataset reference
    • Feature names
    • Training timestamp

Model Versioning

Each trained model is automatically versioned using a timestamp:

  • Format: v{unix_timestamp}
  • Example: v1701964800
  • Allows multiple versions of the same model
  • Easy rollback to previous versions
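
The scheme is simple enough to generate and decode by hand. A small illustrative sketch of the `v{unix_timestamp}` format described above (the helper names are ours, not part of the model service):

```python
import time
from datetime import datetime, timezone

def make_version(ts=None):
    """Build a version tag in the v{unix_timestamp} format."""
    return f"v{int(ts if ts is not None else time.time())}"

def version_timestamp(version):
    """Recover the UTC creation time encoded in a version tag."""
    if not version.startswith("v") or not version[1:].isdigit():
        raise ValueError(f"not a v{{unix_timestamp}} tag: {version}")
    return datetime.fromtimestamp(int(version[1:]), tz=timezone.utc)
```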

☸️ Kubernetes Deployment

πŸš€ Production Deployment

Deploy TensorFleet to a Kubernetes cluster:

# Create namespace
kubectl apply -f k8s/namespace.yaml

# Deploy infrastructure
kubectl apply -f k8s/infrastructure.yaml

# Deploy core services
kubectl apply -f k8s/deployment.yaml

# Setup ingress
kubectl apply -f k8s/ingress.yaml

# Verify deployment
kubectl get pods -n tensorfleet

πŸ“Š Scaling Configuration

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
  namespace: tensorfleet
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

πŸ” Security & Secrets

# Create TLS certificates
kubectl create secret tls tensorfleet-tls \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key \
  -n tensorfleet

# Create database credentials
kubectl create secret generic mongodb-secret \
  --from-literal=connection-string="mongodb://admin:password123@mongodb:27017/tensorfleet?authSource=admin" \
  -n tensorfleet

# Create API keys
kubectl create secret generic api-keys \
  --from-literal=jwt-secret="your-jwt-secret" \
  --from-literal=admin-key="your-admin-api-key" \
  -n tensorfleet

πŸ“Š Monitoring & Observability

🎯 Key Metrics

TensorFleet provides comprehensive observability across all layers:

Application Metrics

  • Job Metrics: Success rate, completion time, queue depth
  • Training Metrics: Accuracy, loss, convergence rate, resource utilization
  • Worker Metrics: Task throughput, error rate, resource consumption
  • API Metrics: Request latency, error rate, throughput by endpoint

Infrastructure Metrics

  • Service Health: Availability, response time, resource usage
  • Database Performance: Connection pool, query performance, storage usage
  • Storage Metrics: Object storage usage, transfer rates, availability
  • Network Metrics: Service-to-service communication, latency, errors

πŸ“ˆ Grafana Dashboards

Pre-configured dashboards available at http://localhost:3001:

  • TensorFleet Overview: High-level platform metrics and health
  • Job Analytics: Training job performance and success rates
  • Worker Performance: Distributed worker utilization and efficiency
  • Infrastructure Health: System resources and service availability
  • API Performance: Gateway metrics and endpoint analytics

πŸ”” Alerting Rules

# Critical system alerts
groups:
  - name: tensorfleet-critical
    rules:
    - alert: ServiceDown
      expr: up{job="tensorfleet"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TensorFleet service {{ $labels.instance }} is down"
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.service }}"

πŸ§ͺ Testing & Validation

πŸ” Test Suite Overview

TensorFleet includes comprehensive testing at multiple levels:

# Run all tests
make test-all

# Unit tests
make test-unit

# Integration tests  
make test-integration

# API tests
make test-api

# Load tests
make test-load

# Security tests
make test-security

πŸš€ Continuous Integration

# GitHub Actions workflow
name: TensorFleet CI/CD
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Run Tests
      run: |
        docker-compose -f docker-compose.test.yml up --abort-on-container-exit
    - name: Build Images
      run: make build
    - name: Security Scan
      run: make security-scan

πŸ“Š Test Coverage

| Service       | Unit Tests | Integration Tests | API Tests |
|---------------|------------|-------------------|-----------|
| API Gateway   | 95%+       | βœ…                | βœ…        |
| Orchestrator  | 90%+       | βœ…                | βœ…        |
| Worker        | 88%+       | βœ…                | βœ…        |
| Model Service | 92%+       | βœ…                | βœ…        |
| Storage       | 89%+       | βœ…                | βœ…        |
| Monitoring    | 87%+       | βœ…                | βœ…        |
| Frontend      | 85%+       | βœ…                | βœ…        |

πŸ“ Project Structure

TensorFleet follows a microservices architecture with clear separation of concerns:

TensorFleet/
β”œβ”€β”€ 🌐 api-gateway/          # HTTP/REST API Gateway (Go + Gin)
β”‚   β”œβ”€β”€ main.go              # Gateway server implementation  
β”‚   β”œβ”€β”€ handlers/            # HTTP request handlers
β”‚   β”œβ”€β”€ middleware/          # Authentication, CORS, logging
β”‚   β”œβ”€β”€ go.mod               # Go dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ 🎯 orchestrator/         # Job Orchestration Service (Go + gRPC)
β”‚   β”œβ”€β”€ main.go              # Orchestrator server
β”‚   β”œβ”€β”€ scheduler/           # Task scheduling logic
β”‚   β”œβ”€β”€ worker_manager/      # Worker registration & health
β”‚   β”œβ”€β”€ go.mod               # Go dependencies  
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ ⚑ worker/               # Distributed Worker Nodes (Go + gRPC)
β”‚   β”œβ”€β”€ main.go              # Worker server implementation
β”‚   β”œβ”€β”€ executor/            # Task execution engine
β”‚   β”œβ”€β”€ metrics/             # Performance monitoring
β”‚   β”œβ”€β”€ go.mod               # Go dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ€– worker-ml/            # ML Training Engine (Python + Flask)
β”‚   β”œβ”€β”€ main.py              # ML worker API server
β”‚   β”œβ”€β”€ models/              # ML algorithm implementations
β”‚   β”œβ”€β”€ datasets/            # Dataset loaders and preprocessors
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ—„οΈ model-service/        # Model Registry (Python + Flask + MongoDB)
β”‚   β”œβ”€β”€ main.py              # Model management API
β”‚   β”œβ”€β”€ storage/             # GridFS integration
β”‚   β”œβ”€β”€ metadata/            # Model metadata handling
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ“¦ storage/              # Object Storage Service (Python + Flask + MinIO)
β”‚   β”œβ”€β”€ main.py              # Storage API server
β”‚   β”œβ”€β”€ storage_manager.py   # MinIO client wrapper
β”‚   β”œβ”€β”€ handlers/            # File upload/download logic
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ“Š monitoring/           # Observability Service (Python + Flask)
β”‚   β”œβ”€β”€ main.py              # Monitoring API server  
β”‚   β”œβ”€β”€ collectors/          # Metrics collection
β”‚   β”œβ”€β”€ aggregators/         # Data aggregation logic
β”‚   β”œβ”€β”€ requirements.txt     # Python dependencies
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ 🎨 frontend/             # Web Dashboard (React + Vite + Material-UI)
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ components/      # React components
β”‚   β”‚   β”œβ”€β”€ pages/           # Application pages
β”‚   β”‚   β”œβ”€β”€ hooks/           # Custom React hooks
β”‚   β”‚   β”œβ”€β”€ services/        # API client services
β”‚   β”‚   └── utils/           # Utility functions
β”‚   β”œβ”€β”€ public/              # Static assets
β”‚   β”œβ”€β”€ package.json         # Node.js dependencies
β”‚   β”œβ”€β”€ vite.config.js       # Vite configuration
β”‚   └── Dockerfile           # Container configuration
β”‚
β”œβ”€β”€ πŸ”Œ proto/                # gRPC Protocol Definitions
β”‚   β”œβ”€β”€ gateway.proto        # API Gateway service definitions
β”‚   β”œβ”€β”€ orchestrator.proto   # Orchestrator service definitions  
β”‚   β”œβ”€β”€ worker.proto         # Worker service definitions
β”‚   └── generate.sh          # Protocol buffer generation script
β”‚
β”œβ”€β”€ ☸️ k8s/                  # Kubernetes Deployment Manifests
β”‚   β”œβ”€β”€ namespace.yaml       # TensorFleet namespace
β”‚   β”œβ”€β”€ infrastructure.yaml  # MongoDB, Redis, MinIO
β”‚   β”œβ”€β”€ configmap.yaml       # Configuration management
β”‚   β”œβ”€β”€ deployment.yaml      # Core service deployments
β”‚   β”œβ”€β”€ ingress.yaml         # External access configuration
β”‚   β”œβ”€β”€ monitoring.yaml      # Prometheus & Grafana
β”‚   └── storage.yaml         # Persistent volume claims
β”‚
β”œβ”€β”€ πŸ§ͺ tests/                # Comprehensive Testing Suite
β”‚   β”œβ”€β”€ unit/                # Unit tests for each service
β”‚   β”œβ”€β”€ integration/         # Integration tests  
β”‚   β”œβ”€β”€ api/                 # API endpoint tests
β”‚   β”œβ”€β”€ load/                # Load testing scripts
β”‚   └── e2e/                 # End-to-end testing
β”‚
β”œβ”€β”€ πŸ“œ scripts/              # Automation & Utility Scripts
β”‚   β”œβ”€β”€ demo-*.sh            # Demo and testing scripts
β”‚   β”œβ”€β”€ cleanup-*.sh         # Environment cleanup utilities
β”‚   β”œβ”€β”€ setup-*.sh           # Environment setup scripts
β”‚   └── test-*.sh            # Testing automation
β”‚
β”œβ”€β”€ 🐳 Docker Configurations
β”‚   β”œβ”€β”€ docker-compose.yml           # Production deployment
β”‚   β”œβ”€β”€ docker-compose.development.yml # Development environment
β”‚   └── docker-compose.test.yml      # Testing environment
β”‚
β”œβ”€β”€ πŸ“‹ Configuration Files
β”‚   β”œβ”€β”€ Makefile             # Build automation and common tasks
β”‚   β”œβ”€β”€ .env.example         # Environment variable template
β”‚   β”œβ”€β”€ .gitignore           # Git ignore patterns
β”‚   β”œβ”€β”€ netlify.toml         # Frontend deployment config
β”‚   └── vercel.json          # Alternative frontend deployment
β”‚
└── πŸ“š Documentation
    β”œβ”€β”€ README.md            # Main project documentation (this file)
    β”œβ”€β”€ docs/                # Additional documentation
    └── postman/             # API testing collections

πŸ—οΈ Architecture Highlights

  • πŸ”„ Service Communication: gRPC for internal services, REST for external APIs
  • πŸ“Š Data Flow: MongoDB β†’ GridFS β†’ Model Registry β†’ API Gateway β†’ Frontend
  • πŸš€ Scalability: Horizontal pod autoscaling for workers and compute nodes
  • πŸ”’ Security: JWT authentication, TLS encryption, network policies
  • πŸ“ˆ Observability: Prometheus metrics, Grafana dashboards, structured logging
  • 🐳 Deployment: Container-native with Kubernetes orchestration

πŸš€ Production Deployment

🌊 Cloud Platform Support

TensorFleet supports deployment on major cloud platforms:

Amazon Web Services (AWS)

# Deploy to EKS
eksctl create cluster --name tensorfleet-production --region us-west-2
kubectl apply -f k8s/

# Configure ALB Ingress: the controller's IAM policy is created with the AWS CLI, not kubectl
curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.4.4/docs/install/iam_policy.json
aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam_policy.json

Google Cloud Platform (GCP)

# Deploy to GKE
gcloud container clusters create tensorfleet-production --zone us-central1-a
kubectl apply -f k8s/

# Configure Cloud Load Balancer
kubectl apply -f k8s/ingress-gcp.yaml

Microsoft Azure

# Deploy to AKS  
az aks create --resource-group tensorfleet-rg --name tensorfleet-production
kubectl apply -f k8s/

# Configure Azure Load Balancer
kubectl apply -f k8s/ingress-azure.yaml

πŸ”§ Production Configuration

Environment Variables

# Production environment configuration
export ENVIRONMENT=production
export LOG_LEVEL=INFO
export MONGODB_URL=mongodb+srv://prod:password@cluster.mongodb.net/tensorfleet
export REDIS_URL=redis://prod-redis.tensorfleet.svc.cluster.local:6379
export MINIO_ENDPOINT=s3.amazonaws.com
export JWT_SECRET=your-production-jwt-secret
export API_RATE_LIMIT=1000
export WORKER_MAX_REPLICAS=100

Resource Limits

# Production resource configuration
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi" 
    cpu: "1000m"

High Availability Setup

# Multi-zone deployment
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - tensorfleet-worker
      topologyKey: "topology.kubernetes.io/zone"

πŸ” Security Hardening

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorfleet-network-policy
spec:
  podSelector:
    matchLabels:
      app: tensorfleet
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: tensorfleet
    ports:
    - protocol: TCP
      port: 8080

RBAC Configuration

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tensorfleet-worker
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]

πŸ“Š Production Monitoring

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'tensorfleet'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Grafana Dashboards

  • TensorFleet Overview: High-level system metrics
  • Training Jobs: ML job performance and progress
  • Infrastructure: Kubernetes cluster health
  • Application Performance: Service response times and errors

🚨 Alerting Rules

groups:
  - name: tensorfleet-production
    rules:
    - alert: TensorFleetServiceDown
      expr: up{job="tensorfleet"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "TensorFleet service is down"
        description: "Service {{ $labels.instance }} has been down for more than 5 minutes"
    
    - alert: HighJobFailureRate  
      expr: rate(tensorfleet_jobs_failed_total[10m]) > 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High job failure rate detected"

πŸ”„ Disaster Recovery

Backup Strategy

# Automated MongoDB backup
kubectl create cronjob mongodb-backup \
  --image=mongo:5.0 \
  --schedule="0 2 * * *" \
  -- mongodump --uri="$MONGODB_URI" --out=/backup/$(date +%Y%m%d)

# MinIO backup to S3
kubectl create cronjob minio-backup \
  --image=minio/mc \
  --schedule="0 3 * * *" \
  -- mc mirror local-minio s3-backup/tensorfleet-backup

Recovery Procedures

# Restore MongoDB from backup
kubectl exec -it mongodb-pod -- mongorestore --uri="$MONGODB_URI" /backup/20241208

# Restore MinIO from S3 backup  
kubectl exec -it minio-pod -- mc mirror s3-backup/tensorfleet-backup local-minio

⚑ Performance Optimization

Database Optimization

// MongoDB indexes for production workloads
db.jobs.createIndex({ "status": 1, "created_at": -1 })
db.models.createIndex({ "algorithm": 1, "metrics.accuracy": -1 })
db.workers.createIndex({ "status": 1, "last_heartbeat": -1 })

Caching Strategy

# Redis configuration for production
redis:
  maxmemory: 2gb
  maxmemory-policy: allkeys-lru
  save: "900 1 300 10 60 10000"

Auto-scaling Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorfleet-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorfleet-worker
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

🀝 Contributing

Development Workflow

  1. Fork & Clone

    git clone https://github.com/your-username/tensorfleet.git
    cd tensorfleet
  2. Create Feature Branch

    git checkout -b feature/amazing-new-feature
  3. Development Setup

    # Install development dependencies
    make install-dev
    
    # Start development environment
    make dev-start
    
    # Run tests
    make test
  4. Code Quality

    # Format code
    make format
    
    # Run linting
    make lint
    
    # Security scan
    make security-check
  5. Submit Changes

    git add .
    git commit -m "feat: add amazing new feature"
    git push origin feature/amazing-new-feature

Code Standards

  • Go: Follow gofmt, golint, and go vet standards
  • Python: PEP 8 compliance with black formatting
  • JavaScript: ESLint with Prettier formatting
  • Documentation: Clear comments and comprehensive README updates
  • Testing: Minimum 85% code coverage for new features

Pull Request Process

  1. Ensure all tests pass and code coverage meets requirements
  2. Update documentation for any new features or API changes
  3. Add integration tests for new endpoints or services
  4. Request review from maintainers
  5. Address feedback and maintain clean commit history

πŸ“ž Support & Community

Getting Help

Community Resources

Enterprise Support

For production deployments and enterprise features:

  • 🏒 Enterprise Consulting: Custom deployment assistance
  • πŸ”’ Security Audits: Professional security assessments
  • πŸ“ˆ Performance Tuning: Optimization for large-scale workloads
  • πŸŽ“ Training Programs: Team training and certification

πŸ“„ License

TensorFleet is released under the MIT License. See the LICENSE file for full terms.

πŸ‘₯ Team & Contributors

Development Team:

  • Aditya Suryawanshi (25211365) - Backend Infrastructure Lead
  • Rahul Mirashi (25211365) - ML & Data Services Lead
  • Soham Maji (25204731) - Frontend & Monitoring Lead

πŸ“– Project Documentation:

πŸ™ Acknowledgments

Built with ❀️ by the TensorFleet team using:

  • Languages: Go, Python, JavaScript/TypeScript
  • Frameworks: Gin (Go), Flask (Python), React (JavaScript)
  • Infrastructure: Kubernetes, Docker, gRPC, MongoDB, Redis, MinIO
  • Monitoring: Prometheus, Grafana, OpenTelemetry
  • Cloud Platforms: AWS, GCP, Azure support

⭐ Star this repository if TensorFleet helps your ML workflows!

  • βœ… Task Queuing - Orchestrator manages task distribution
  • βœ… Auto-scaling - Kubernetes HPA for worker nodes
  • βœ… Object Storage - MinIO for models, datasets, and checkpoints
  • βœ… Real-time Metrics - Prometheus + Grafana monitoring
  • βœ… Health Checks - Liveness and readiness probes
  • βœ… Horizontal Scaling - Workers scale from 2-10 pods

Production Features

  • πŸ”’ Secure Defaults - No hardcoded credentials
  • πŸ“Š Observability - Structured logging, metrics, traces
  • 🐳 Containerized - Docker images for all services
  • ☸️ Kubernetes-native - Complete K8s manifests
  • πŸ”„ High Availability - Stateful sets for infrastructure
  • 🎯 Load Balancing - Service discovery and routing

πŸ“¦ Prerequisites

Local Development

  • Docker 20.10+
  • Docker Compose 2.0+
  • Go 1.21+ (for proto generation)
  • Node.js 16+ (for frontend development)

Kubernetes Deployment

  • Kubernetes cluster 1.24+
  • kubectl configured
  • Helm 3.0+ (optional, for Prometheus/Grafana)
  • Container registry access (Docker Hub, GitHub Container Registry, etc.)

οΏ½ Project Reproducibility Instructions

System Requirements

Ensure your system meets the following requirements for consistent reproduction:

Hardware:

  • Minimum: 4 GB RAM, 2 CPU cores
  • Recommended: 8 GB RAM, 4 CPU cores
  • Disk Space: 5 GB free space

Operating System:

  • macOS 10.15+ / Ubuntu 18.04+ / Windows 10+ with WSL2
  • Docker Desktop or Docker Engine installed and running

Step-by-Step Reproduction Guide

1. Environment Setup

# Clone the repository
git clone https://github.com/your-username/TensorFleet.git
cd TensorFleet

# Verify Docker installation
docker --version
docker-compose --version

# Ensure Docker is running
docker ps

2. Build Protocol Buffers (Optional)

# Install Protocol Buffer compiler (if modifying .proto files)
# macOS
brew install protobuf protoc-gen-go protoc-gen-go-grpc

# Ubuntu
sudo apt-get install -y protobuf-compiler
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest

# Generate proto files (only if modified)
cd proto && ./generate.sh && cd ..

3. Initial Setup & Validation

# Pull all required Docker images
docker-compose pull

# Build all services (this may take 5-10 minutes on first run)
docker-compose build

# Verify all images are built
docker images | grep tensorfleet

4. Start the Platform

# Start all services
docker-compose up -d

# Wait for all services to be healthy (30-60 seconds)
# You should see all services as "healthy"
docker-compose ps

# Verify health status
curl http://localhost:8080/health  # API Gateway
curl http://localhost:8081/health  # Storage Service  
curl http://localhost:8082/health  # Monitoring Service

5. Smoke Test - Verify Everything Works

# Run the automated demo (tests all endpoints)
./quick-api-demo.sh

# Expected output should show:
# βœ“ All services healthy
# βœ“ Storage operations working
# βœ“ Job submission and monitoring working
# βœ“ Metrics collection working

6. Access Web Interfaces

Open these URLs in your browser:

Troubleshooting Common Issues

Issue: Services fail to start

# Check logs for specific service
docker-compose logs <service-name>

# Common solutions:
# 1. Restart Docker Desktop
# 2. Clear Docker cache: docker system prune -a
# 3. Check port conflicts: lsof -i :8080

Issue: "Connection refused" errors

# Wait for health checks to pass
watch docker-compose ps

# Services need 30-60 seconds to fully initialize
# Redis and MinIO must be healthy before other services start

Issue: Port conflicts

# Check what's using the ports
lsof -i :3000 -i :8080 -i :8081 -i :8082 -i :9000 -i :9001

# Stop conflicting services or modify docker-compose.yml ports

Issue: Out of disk space

# Clean up Docker resources
docker system prune -a --volumes

# Remove old containers and images
docker container prune
docker image prune -a

Reproducible Demo Scenarios

Scenario 1: Submit and Monitor Training Job

# Submit a ResNet50 training job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model_type": "resnet50",
    "dataset_path": "s3://tensorfleet/datasets/imagenet",
    "hyperparameters": {"learning_rate": "0.001"},
    "num_workers": 3,
    "epochs": 10
  }'

# Monitor progress in real-time
# Save the job_id from above response, then:
watch -n 2 "curl -s http://localhost:8080/api/v1/jobs/YOUR_JOB_ID | jq"

Scenario 2: Upload and Download Files

# Create a sample dataset
echo "epoch,loss,accuracy" > sample_dataset.csv
echo "1,2.5,0.3" >> sample_dataset.csv
echo "2,1.8,0.5" >> sample_dataset.csv

# Upload to storage
curl -X POST http://localhost:8081/api/v1/upload/datasets/sample.csv \
  -F "file=@sample_dataset.csv"

# List files
curl http://localhost:8081/api/v1/list/datasets | jq

# Download file
curl http://localhost:8081/api/v1/download/datasets/sample.csv

Scenario 3: View Monitoring Dashboard

  1. Open http://localhost:3001 in browser
  2. Login with admin/admin
  3. Navigate to "TensorFleet Dashboard"
  4. Submit some jobs and watch metrics update

Environment Variables for Customization

Create a .env file to customize settings:

# Optional: Customize ports
API_GATEWAY_PORT=8080
STORAGE_PORT=8081
MONITORING_PORT=8082
FRONTEND_PORT=3000

# Optional: Customize MinIO credentials
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=password123

# Optional: Worker scaling
WORKER_REPLICAS=3
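Each variable falls back to a built-in default when unset. As a sketch of that lookup pattern (`env_int` is a hypothetical helper, not part of TensorFleet; the variable names and defaults follow the .env example above):

```python
import os

def env_int(name: str, default: int) -> int:
    """Return an integer environment variable, falling back to a default."""
    value = os.environ.get(name)
    return int(value) if value is not None else default

api_gateway_port = env_int("API_GATEWAY_PORT", 8080)
worker_replicas = env_int("WORKER_REPLICAS", 3)
```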

Data Persistence

All data is persisted in Docker volumes:

# View volumes
docker volume ls | grep tensorfleet

# To reset all data (WARNING: destroys all jobs/files)
docker-compose down -v

# To backup data
docker run --rm -v tensorfleet_minio_data:/data -v $(pwd):/backup ubuntu tar czf /backup/minio_backup.tar.gz /data

Cleanup Instructions

# Stop all services
docker-compose down

# Remove all data (optional)
docker-compose down -v

# Clean up Docker resources
docker system prune -a

# Remove built images
docker rmi $(docker images | grep tensorfleet | awk '{print $3}')

Testing Reproducibility

To verify the setup works on a clean system:

#!/bin/bash
# Test script for CI/CD or a clean environment
set -e

echo "Testing TensorFleet reproducibility..."
docker-compose up -d
sleep 60  # Wait for services to initialize

# Test health endpoints
curl -f http://localhost:8080/health
curl -f http://localhost:8081/health  
curl -f http://localhost:8082/health

# Test job submission
JOB_ID=$(curl -s -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{"model_type":"test","dataset_path":"test","num_workers":1,"epochs":1}' | jq -r .job_id)

# Verify job was created
curl -f "http://localhost:8080/api/v1/jobs/$JOB_ID"

echo "βœ… Reproducibility test passed!"

πŸš€ Quick Start

1. Clone and Setup

git clone <repository-url>
cd TensorFleet

2. Generate gRPC Stubs

make proto

3. Run Locally with Docker Compose

# Build and start all services
make compose-up

# Or manually:
docker-compose up --build

4. Access Services

| Service | URL | Credentials |
|---------|-----|-------------|
| Frontend | http://localhost:3000 | - |
| API Gateway | http://localhost:8080 | - |
| Storage API | http://localhost:8081 | - |
| Monitoring API | http://localhost:8082 | - |
| Model Service API | http://localhost:8083 | - |
| ML Worker API | http://localhost:8000 | - |
| Grafana | http://localhost:3001 | admin/admin |
| Prometheus | http://localhost:9090 | - |
| MinIO Console | http://localhost:9001 | minioadmin/minioadmin |
| MongoDB | localhost:27017 | admin/password123 |

πŸ’» Local Development

Start Services

# Start all services in background
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
make compose-down

Submit a Training Job

# Using curl
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-User-ID: demo-user" \
  -d '{
    "model_type": "cnn",
    "dataset_path": "/data/mnist",
    "num_workers": 3,
    "epochs": 10,
    "hyperparameters": {
      "learning_rate": "0.001",
      "batch_size": "64"
    }
  }'

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "RUNNING",
  "num_tasks": 100,
  "message": "Job created with 100 tasks"
}
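The same submission body can be assembled in Python. A sketch (`build_job_payload` is a hypothetical helper; note that hyperparameter values are sent as strings, matching the curl example above):

```python
def build_job_payload(model_type, dataset_path, num_workers, epochs,
                      hyperparameters=None):
    """Assemble the job-submission body shown in the curl example above.
    Hyperparameter values are stringified, as in the README examples."""
    return {
        "model_type": model_type,
        "dataset_path": dataset_path,
        "num_workers": num_workers,
        "epochs": epochs,
        "hyperparameters": {k: str(v)
                            for k, v in (hyperparameters or {}).items()},
    }

payload = build_job_payload("cnn", "/data/mnist", 3, 10,
                            {"learning_rate": 0.001, "batch_size": 64})
```

The resulting dict can be JSON-encoded and POSTed to `/api/v1/jobs` with any HTTP client.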

Check Job Status

curl http://localhost:8080/api/v1/jobs/550e8400-e29b-41d4-a716-446655440000

View Metrics

# Dashboard metrics
curl http://localhost:8082/api/v1/dashboard

# Prometheus metrics
curl http://localhost:8082/metrics

πŸ€– MongoDB ML Training

TensorFleet now supports machine learning training with MongoDB for dataset storage and model persistence.

Quick ML Training Demo

# Run the automated ML training demo
./demo-mongodb-ml.sh

This demo will:

  1. βœ… Train 3 different ML models (RandomForest, LogisticRegression, SVM)
  2. βœ… Store trained models in MongoDB using GridFS
  3. βœ… Save model metadata (hyperparameters, metrics, version)
  4. βœ… Download and save models locally
  5. βœ… Display model statistics

Manual ML Training

1. List Available Datasets

curl http://localhost:8000/datasets | jq

Response:

{
  "datasets": [
    {
      "name": "iris",
      "description": "Iris flower dataset",
      "n_samples": 150,
      "n_features": 4,
      "target_column": "species"
    },
    {
      "name": "wine",
      "description": "Wine classification dataset",
      "n_samples": 178,
      "n_features": 13,
      "target_column": "wine_class"
    }
  ]
}

2. Train a Model

curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "my_training_job_001",
    "dataset_name": "iris",
    "algorithm": "random_forest",
    "target_column": "species",
    "model_name": "iris_rf_model",
    "hyperparameters": {
      "n_estimators": 100,
      "max_depth": 5,
      "random_state": 42
    }
  }' | jq

Response:

{
  "job_id": "my_training_job_001",
  "model_id": "507f1f77bcf86cd799439011",
  "status": "completed",
  "metrics": {
    "train_accuracy": 0.9833,
    "test_accuracy": 0.9667,
    "training_time": 0.234
  },
  "model_name": "iris_rf_model",
  "version": "v1701964800"
}

3. List Trained Models

curl "http://localhost:8083/api/v1/models?page=1&limit=10" | jq

Response:

{
  "models": [
    {
      "id": "507f1f77bcf86cd799439011",
      "name": "iris_rf_model",
      "algorithm": "random_forest",
      "metrics": {
        "test_accuracy": 0.9667,
        "train_accuracy": 0.9833
      },
      "version": "v1701964800",
      "created_at": "2025-12-07T10:30:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 10,
    "total": 1,
    "pages": 1
  }
}
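The `pagination.pages` field above is consistent with a ceiling division of `total` by `limit`; a sketch of that arithmetic (treating an empty result set as one page is an assumption):

```python
import math

def page_count(total: int, limit: int) -> int:
    """Number of pages for a paginated listing; at least 1 (assumed)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(1, math.ceil(total / limit))
```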

4. Get Model Metadata

curl http://localhost:8083/api/v1/models/<model_id> | jq

Response:

{
  "id": "507f1f77bcf86cd799439011",
  "name": "iris_rf_model",
  "algorithm": "random_forest",
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 5,
    "random_state": 42
  },
  "metrics": {
    "train_accuracy": 0.9833,
    "test_accuracy": 0.9667,
    "training_time": 0.234
  },
  "version": "v1701964800",
  "dataset_name": "iris",
  "target_column": "species",
  "features": ["sepal length", "sepal width", "petal length", "petal width"],
  "created_at": "2025-12-07T10:30:00Z"
}

5. Download a Model

# Download model file
curl http://localhost:8083/api/v1/models/<model_id>/download \
  -o my_model.pkl

# Verify download
ls -lh my_model.pkl

6. Use Downloaded Model (Python)

import pickle
import numpy as np

# Load the model
with open('my_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Make predictions
sample_data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(sample_data)
print(f"Prediction: {prediction}")

Supported ML Algorithms

| Algorithm | Description | Best For |
|-----------|-------------|----------|
| random_forest | Random Forest Classifier | Classification, robust to overfitting |
| logistic_regression | Logistic Regression | Binary/multi-class classification |
| svm | Support Vector Machine | Non-linear classification |
| decision_tree | Decision Tree Classifier | Interpretable models |

Model Hyperparameters

Random Forest:

{
  "n_estimators": 100,
  "max_depth": null,
  "min_samples_split": 2,
  "random_state": 42
}

Logistic Regression:

{
  "max_iter": 1000,
  "C": 1.0,
  "random_state": 42
}

SVM:

{
  "kernel": "rbf",
  "C": 1.0,
  "gamma": "scale",
  "random_state": 42
}

Decision Tree:

{
  "max_depth": null,
  "min_samples_split": 2,
  "random_state": 42
}
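If the ML worker falls back to these defaults when a field is omitted from the training request (an assumption based on the tables above), merging user overrides over the defaults might look like:

```python
# Default hyperparameters per algorithm, as listed above.
DEFAULTS = {
    "random_forest": {"n_estimators": 100, "max_depth": None,
                      "min_samples_split": 2, "random_state": 42},
    "logistic_regression": {"max_iter": 1000, "C": 1.0, "random_state": 42},
    "svm": {"kernel": "rbf", "C": 1.0, "gamma": "scale", "random_state": 42},
    "decision_tree": {"max_depth": None, "min_samples_split": 2,
                      "random_state": 42},
}

def resolve_hyperparameters(algorithm, overrides=None):
    """Start from the algorithm's defaults and apply user overrides."""
    if algorithm not in DEFAULTS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    params = dict(DEFAULTS[algorithm])
    params.update(overrides or {})
    return params
```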

Python Client for ML Operations

Use the provided Python client for easy interaction:

# Run complete demo
python ml_client.py demo

# List all models
python ml_client.py list

# Get statistics
python ml_client.py stats

# Download a specific model
python ml_client.py download <model_id> output.pkl

Model Storage Architecture

  • Datasets: Stored in MongoDB collections or GridFS for large files
  • Trained Models: Serialized with pickle and stored in GridFS
  • Metadata: Model information stored in MongoDB collections
    • Model name, algorithm, hyperparameters
    • Training metrics (accuracy, loss)
    • Version information
    • Dataset reference
    • Feature names
    • Training timestamp

Model Versioning

Each trained model is automatically versioned using a timestamp:

  • Format: v{unix_timestamp}
  • Example: v1701964800
  • Allows multiple versions of the same model
  • Easy rollback to previous versions
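The version format above can be produced and parsed with a couple of lines (hypothetical helpers, not part of the TensorFleet API):

```python
import time

def make_version(timestamp=None):
    """Format a model version as v{unix_timestamp}."""
    return f"v{int(time.time()) if timestamp is None else int(timestamp)}"

def version_timestamp(version):
    """Recover the Unix timestamp from a v{unix_timestamp} string."""
    if not version.startswith("v"):
        raise ValueError(f"not a timestamp version: {version}")
    return int(version[1:])
```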
