This project demonstrates how deep learning can be applied to bridge the gap between computer vision and natural language processing. By combining a CNN encoder (ResNet-50) to extract meaningful visual features and an LSTM decoder to generate sequential text, the system learns to produce human-like captions for images.
The workflow starts with vocabulary building and tokenization, ensuring captions are processed into numerical form. The model is then trained using teacher forcing and optimized with cross-entropy loss, while evaluation is performed using BLEU scores to measure caption quality.
The repository is designed for both experimentation and reproducibility: it contains ready-to-use scripts for preprocessing, training, and inference, along with visualization outputs such as training loss curves, BLEU score progression, and example generated captions.
By running the included code, you can train on toy datasets for quick tests or scale up to widely used datasets such as Flickr8k, Flickr30k, or MSCOCO for more realistic captions. This makes the project a portfolio-ready showcase of multimodal AI skills, covering both computer vision and language generation.
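The encoder/decoder pairing described above can be sketched in a few lines of PyTorch. This is a simplified illustration, not a copy of `src/models.py`: the class names match the repo, but the layer layout, dimensions, and details are assumptions based on the defaults used later in this README.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Frozen ResNet-50 backbone that projects pooled features to embed_dim."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # torchvision >= 0.13 API; older versions use pretrained=True instead.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head, keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False                  # frozen backbone
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                       # (B, 3, H, W) -> (B, embed_dim)
        feats = self.backbone(images).flatten(1)
        return self.fc(feats)


class DecoderLSTM(nn.Module):
    """LSTM decoder trained with teacher forcing on the ground-truth captions."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image feature acts as the first "token",
        # followed by the embedded ground-truth caption shifted right by one.
        emb = self.dropout(self.embed(captions[:, :-1]))
        inputs = torch.cat([features.unsqueeze(1), emb], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                       # (B, T, vocab_size) logits
```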
- CNN encoder (pretrained ResNet-50, frozen backbone)
- LSTM decoder with embeddings, dropout, and teacher forcing
- Vocabulary building with NLTK tokenizer (`min_freq` configurable; see the sketch after this list)
- Cross-entropy training + Adam optimizer
- BLEU-1..4 evaluation on validation set
- Visualizations: training curves, BLEU scores
- Saved artifacts:
  - `best_captioner.pt` (trained model)
  - `vocab.json` (vocabulary)
  - `metrics.json` (BLEU scores, loss)
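For reference, vocabulary building with the NLTK tokenizer and a `min_freq` cutoff can look roughly like this. The special-token names and the `encode` helper are illustrative assumptions; the real `Vocabulary` class lives in `src/utils.py`.

```python
from collections import Counter

from nltk.tokenize import word_tokenize


def build_vocab(captions, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter()
    for caption in captions:
        counts.update(word_tokenize(caption.lower()))

    # Reserved special tokens (assumed names; check src/utils.py for the real ones).
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for token, freq in counts.items():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(caption, vocab, max_len=20):
    """Turn a caption into a fixed-length list of token ids."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in word_tokenize(caption.lower())]
    ids = [vocab["<sos>"]] + ids[: max_len - 2] + [vocab["<eos>"]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```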
 
The figures below are produced automatically in `outputs/` after training.
 
 
Example inference (replace with your own output image if available):

*(placeholder image: a blue square labeled "example1", generated by `infer.py` on `data/images/example1.jpg`)*
image-captioning-cnn-lstm/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│  ├─ captions.csv        # CSV: image_path, caption, split (train/val/test)
│  └─ images/             # image files
├─ src/
│  ├─ models.py           # EncoderCNN, DecoderLSTM
│  ├─ utils.py            # Vocabulary, dataset, BLEU, collate
│  ├─ train.py            # training loop with checkpoints
│  └─ infer.py            # inference script for generating captions
└─ outputs/
   ├─ best_captioner.pt
   ├─ vocab.json
   ├─ training_curves.png
   ├─ bleu_scores.png
   └─ metrics.json
python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/macOS:
source .venv/bin/activate
pip install -r requirements.txt
# Download tokenizer models (once)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

- Collect a dataset (e.g., Flickr8k or MSCOCO).
- Put all images under data/images/.
- Create a `data/captions.csv` file with the columns `image_path`, `caption`, `split`. Example rows:

      image_path,caption,split
      images/img1.jpg,A child in a pink dress is climbing stairs.,train
      images/img1.jpg,A little girl goes into a wooden building.,train
      images/img2.jpg,A dog is running through a grassy field.,val
- Ensure `split` contains only the values `train`, `val`, and `test` (a quick sanity check is sketched below).
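Before training, it is worth sanity-checking the CSV. The snippet below is one way to do that; it assumes `pandas` is available, which may not be part of `requirements.txt`.

```python
import pandas as pd

df = pd.read_csv("data/captions.csv")          # expected columns: image_path, caption, split
assert {"image_path", "caption", "split"} <= set(df.columns)
assert set(df["split"].unique()) <= {"train", "val", "test"}

print(df["split"].value_counts())              # rows per split
print(df.groupby("image_path")["caption"].count().describe())  # captions per image
```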
python src/train.py --captions data/captions.csv --images-root data --outdir outputs --epochs 10 --batch-size 64 --embed-dim 256 --hidden-dim 512 --min-freq 1 --max-len 20 --lr 1e-3

Outputs:
- `outputs/training_curves.png` → loss over epochs
- `outputs/bleu_scores.png` → BLEU-4 progression
- `outputs/best_captioner.pt` → best checkpoint
- `outputs/vocab.json` → vocabulary file
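Under the hood, training is teacher-forced cross-entropy, as listed in the features. A condensed sketch of one epoch is shown below; the padding mask and the choice of trainable parameters are assumptions, and `src/train.py` remains the authoritative implementation.

```python
import torch
import torch.nn as nn


def train_one_epoch(encoder, decoder, loader, optimizer, pad_idx, device="cpu"):
    """One teacher-forced pass over the training set; returns the mean loss."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)   # padded positions don't count
    encoder.train()
    decoder.train()

    total = 0.0
    for images, captions in loader:                 # captions: (B, T) token ids
        images, captions = images.to(device), captions.to(device)
        features = encoder(images)                  # (B, embed_dim)
        logits = decoder(features, captions)        # (B, T, vocab_size) via teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)
```

In this setup the optimizer would typically be `torch.optim.Adam` over the decoder plus the encoder's projection layer (the ResNet backbone stays frozen), with `lr=1e-3` to match the flag above.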
python src/infer.py --checkpoint outputs/best_captioner.pt --vocab outputs/vocab.json --image data/images/example1.jpg --max-len 20

Example output:

*(placeholder image: a blue square labeled "example1")*
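The generated caption comes from greedy decoding (hence the beam-search suggestion further down). A stripped-down version of such a loop, reusing the sketch classes from above and assuming `<sos>`/`<eos>`/`<pad>` token names, might look like this:

```python
import torch


@torch.no_grad()
def greedy_caption(encoder, decoder, image, vocab, max_len=20):
    """Generate a caption by repeatedly picking the most likely next token."""
    inv_vocab = {idx: tok for tok, idx in vocab.items()}
    encoder.eval()
    decoder.eval()

    inputs = encoder(image.unsqueeze(0)).unsqueeze(1)        # (1, 1, embed_dim)
    states, words = None, []
    for _ in range(max_len):
        hidden, states = decoder.lstm(inputs, states)
        next_id = decoder.fc(hidden.squeeze(1)).argmax(dim=-1)  # most likely token
        token = inv_vocab[next_id.item()]
        if token == "<eos>":
            break
        if token not in ("<sos>", "<pad>"):
            words.append(token)
        inputs = decoder.embed(next_id).unsqueeze(1)         # feed prediction back in
    return " ".join(words)
```

Greedy decoding is fast but tends to settle on generic phrasings, which is why beam search is listed among the improvement ideas.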
- Training & validation loss curves
- BLEU scores over epochs (computed as sketched below)
- Example generated captions
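BLEU-1 through BLEU-4 can be computed with NLTK along these lines; the corpus-level aggregation, smoothing choice, and helper name are assumptions, and the repo's actual evaluation code lives in `src/utils.py`.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu


def bleu_1_to_4(references, hypotheses):
    """references: one list of reference token lists per image; hypotheses: one token list per image."""
    smooth = SmoothingFunction().method1              # avoids zero scores on short captions
    scores = {}
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))    # uniform weights up to n-grams
        scores[f"BLEU-{n}"] = corpus_bleu(
            references, hypotheses, weights=weights, smoothing_function=smooth
        )
    return scores


# Tiny illustrative example with two validation images.
refs = [[["a", "dog", "runs", "through", "grass"]],
        [["a", "child", "climbs", "the", "stairs"]]]
hyps = [["a", "dog", "is", "running", "through", "grass"],
        ["a", "child", "climbs", "stairs"]]
print(bleu_1_to_4(refs, hyps))
```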
- Train longer (20+ epochs) for better captions
- Use a larger dataset (Flickr30k, MSCOCO)
- Try beam search decoding instead of greedy (see the sketch after this list)
- Fine-tune CNN layers for better feature extraction
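For the beam-search idea above, a compact variant can be layered on the same decoder. The sketch below keeps the `beam_size` highest-scoring partial captions at each step, scoring by summed log-probabilities with no length normalization; the token names and decoder attributes follow the earlier sketches and are assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def beam_search_caption(encoder, decoder, image, vocab, beam_size=3, max_len=20):
    """Keep the beam_size highest-scoring partial captions at each step."""
    feats = encoder(image.unsqueeze(0)).unsqueeze(1)          # (1, 1, embed_dim)
    _, states = decoder.lstm(feats, None)                     # prime the LSTM with the image
    beams = [([], 0.0, states)]                               # (token ids, log-prob sum, state)

    for _ in range(max_len):
        candidates = []
        for ids, score, states in beams:
            if ids and ids[-1] == vocab["<eos>"]:
                candidates.append((ids, score, states))       # finished caption carries over
                continue
            prev = vocab["<sos>"] if not ids else ids[-1]
            inp = decoder.embed(torch.tensor([prev])).unsqueeze(1)
            out, new_states = decoder.lstm(inp, states)
            log_probs = F.log_softmax(decoder.fc(out.squeeze(1)), dim=-1)
            top_lp, top_id = log_probs.topk(beam_size, dim=-1)
            for lp, idx in zip(top_lp[0], top_id[0]):
                candidates.append((ids + [idx.item()], score + lp.item(), new_states))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]

    inv = {i: t for t, i in vocab.items()}
    best_ids = beams[0][0]
    return " ".join(inv[i] for i in best_ids if inv[i] not in ("<sos>", "<eos>", "<pad>"))
```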