Engineer: Michael Maio
Last updated: 9/7/2025
This repo contains a working machine learning pipeline that addresses a hypothetical scenario a software company might face.
Problem: Data center power consumption is growing, causing a gradual, year-over-year increase in the hourly peak kilowatt load reported by a sensor on the local transformer.
Question: Assuming the power consumption trend remains unchanged, how long before the transformer becomes overloaded?
Solution: Build a machine learning pipeline that can process recent trend data and forecast when the transformer may eventually become overloaded, informing the necessary schedule for a preventative upgrade.
This machine learning pipeline uses:
1. Python for the scripting.
2. Docker containers to encapsulate training, promotion, and prediction jobs.
3. MLflow for model management.
4. YAML for job management.
5. GitHub Actions to trigger a pipeline deployment.
6. Terraform to create and update AzureML infrastructure from code.
7. A managed identity to keep everything secure.
Python, Docker, and MLflow allow the entire pipeline to be run locally for quick feedback on changes before deploying to the cloud. No Azure required. Terraform allows other cloud providers, such as AWS or GCP, to be swapped in as needed.
This is the starting point: a high-level view of the experiment in Azure AI’s Machine Learning Studio.
Drilling into the experiment shows a list of its jobs. Each job represents a different deployment of the pipeline that trains the model, promotes the model (if it passed testing), and uses the model to make predictions.
Drilling into the latest job reveals a list of sub-jobs and how they are wired together. Below you can see that the sub-job that trains the model passes a “trained_model” output to the job that promotes the model, which in turn passes a “promoted_model” to the job that uses the model to produce “predictions”.
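The train → promote → predict wiring described above can be sketched as three chained Python functions, a minimal local stand-in (all names and values below are made-up placeholders, not the repo's actual code; in the real pipeline this wiring is declared in the job YAML and executed by AzureML):

```python
def train():
    # Would fit the forecasting model and log it with MLflow;
    # here it just returns a placeholder model plus its test metric.
    return {"name": "trained_model", "rmse": 0.79}

def promote(trained_model):
    # Pass the model through only if it tested well (threshold is assumed).
    return trained_model if trained_model["rmse"] < 2.0 else None

def predict(promoted_model):
    # Would load the promoted model and forecast future transformer load.
    return {"predictions": "hourly kW forecast for the next 5 years"}

model = train()
promoted = promote(model)
predictions = predict(promoted) if promoted else None
```

Keeping the sub-jobs as independent, composable steps is what lets the same pipeline run either locally or in the cloud.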
You can drill into each sub-job to view all kinds of details about it. Below you can see that the first sub-job, “Train Transformer Load Model”, did the following:
- It output the model twice: once when MLflow logged it and once when passing it along to the promotion job.
- It applied some informative tags.
- It reported the “rmse” (Root Mean Squared Error) metric, indicating how well the model performed during testing.
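For reference, the “rmse” metric reported by the training sub-job can be computed like this; the hold-out data below is made up for illustration:

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical hold-out data: observed vs. forecast hourly peak load (kW).
actual = [80.0, 82.5, 81.0, 84.0]
predicted = [79.0, 83.0, 82.0, 83.5]
print(round(rmse(actual, predicted), 3))
```

In the training job, the resulting value is what gets logged to MLflow as the “rmse” metric.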
You can drill into one of the model links to get more information on the model.
And drill into its artifacts.
Moving on to the “Promote Transformer Load Model” sub-job, you can see that it output the “promoted_model”. This means the model passed testing during training: its Root Mean Squared Error was low enough for the model to be useful in making predictions.
If you view the AzureML model registry for the workspace, you can see that the promotion sub-job registered the model since it passed testing.
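The promotion decision described above amounts to a threshold check on the test metric before the model is registered. A minimal sketch, with an assumed threshold (the repo's actual threshold and registration code are not shown here):

```python
RMSE_THRESHOLD_KW = 2.0  # assumed acceptance threshold, in kilowatts

def should_promote(rmse: float, threshold: float = RMSE_THRESHOLD_KW) -> bool:
    """Gate the model: only a sufficiently accurate model moves forward."""
    return rmse < threshold

print(should_promote(0.79))  # low error: promote
print(should_promote(5.3))   # high error: do not promote
```

When the gate passes, this is the point at which the real job registers the model in the AzureML model registry and emits the “promoted_model” output.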
Moving on to the “Predict Transformer Overload” sub-job, you can see that it created the following:
1. A tag reporting that the transformer is predicted to hit its first overload at 11pm on November 26th, 2027.
2. A metric predicting that the maximum load over the entire 5-year period will be about 98 kilowatts.
3. A metric predicting that the transformer will overload more than 4,623 times in the next 5 years given current usage trends.
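One way figures like these could be produced is by extrapolating the load trend hour by hour and counting the hours that exceed the transformer's rating. A rough sketch with made-up numbers (not the repo's actual model, capacity, or trend values):

```python
HOURS_PER_YEAR = 8760
CAPACITY_KW = 95.0        # assumed transformer rating
BASE_LOAD_KW = 85.0       # assumed current hourly peak load
GROWTH_KW_PER_YEAR = 2.5  # assumed year-over-year growth trend

def forecast(hours=5 * HOURS_PER_YEAR):
    """Linear extrapolation of hourly peak load over the forecast horizon."""
    return [BASE_LOAD_KW + GROWTH_KW_PER_YEAR * (h / HOURS_PER_YEAR)
            for h in range(hours)]

loads = forecast()
overloads = [h for h, kw in enumerate(loads) if kw > CAPACITY_KW]
print(len(overloads))        # predicted overload count over 5 years
print(overloads[0])          # first hour index predicted to overload
print(round(max(loads), 1))  # predicted maximum load (kW)
```

The first overload hour maps to the tag, the maximum load and overload count map to the two metrics reported by the sub-job.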
You can also drill into the “predictions” output and see the files that the prediction job uploaded, including:
1. The predicted transformer load in kilowatts for each hour during the next 5 years.
2. The number of times the transformer is predicted to overload during that period.
3. A chart of the predicted loads.