Fault-tolerant machine learning training pipeline for solar power forecasting

The availability of compute for training machine learning (ML) models has grown exponentially. However, managing ML training costs remains important. One way to ensure low training costs is by designing effective ML training pipelines. Usage of spot instances can allow for up to 90% cost savings compared to on-demand instances [1]. However, spot instances may be interrupted at any time making fault-tolerance essential. I have therefore implemented a production-ready ML training pipeline for solar power forecasting that can be used for training on spot instances.

Features

The ML training pipeline is built to forecast solar power generation using an LSTM with synthetic data for demonstration. The entire implementation is less than 500 lines of clean, testable code. It uses Python and PyTorch for model training, and pytest for test-driven development. System design is based on a philosophy of simple interfaces and deep modules [2]. To ensure data reliability, atomic writes are used for checkpointing.

In case of an interruption during the training, a checkpoint of the model including weights, optimizer state, batch index, epoch, step and loss, are saved in a file. When the system is restarted, it automatically resumes from the previous checkpoint. This ML training pipeline therefore allows one to use spot instances for training without losing training progress.

Roadmap

Getting started

Installation

Clone the repository:

git clone https://github.com/idesmukh/fault-tolerant-ml-training
cd fault-tolerant-ml-training

Install the requirements:
```
pip install -r requirements.txt
```

Usage

Run the demo from your terminal with the following command:

python3 demo.py

Press Ctrl+C at anytime to simulate interruption, then restart to resume automatically.

Testing

python3 -m pytest

Project structure

model.py - LSTM model and synthetic data generation
checkpoint.py - Checkpointing
train.py - Training loop
demo.py - Demonstration

References

[1] Cast AI, "Reduce cloud costs with spot instances," Cast AI Blog. [Online]. Available: https://cast.ai/blog/reduce-cloud-costs-with-spot-instances/. [Accessed: Oct. 20, 2025].

[2] J. Ousterhout, A Philosophy of Software Design. Palo Alto, CA: Yaknyam Press, 2018.

License

This code is provided under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.python-version		.python-version
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
api.py		api.py
checkpoint.py		checkpoint.py
checkpoint_manager.py		checkpoint_manager.py
demo.py		demo.py
model.py		model.py
requirements.txt		requirements.txt
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fault-tolerant machine learning training pipeline for solar power forecasting

Features

Roadmap

Getting started

Installation

Usage

Testing

Project structure

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fault-tolerant machine learning training pipeline for solar power forecasting

Features

Roadmap

Getting started

Installation

Usage

Testing

Project structure

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages