
Image Captioning with CNN + LSTM (PyTorch)

This project demonstrates how deep learning can be applied to bridge the gap between computer vision and natural language processing. By combining a CNN encoder (ResNet-50) to extract meaningful visual features and an LSTM decoder to generate sequential text, the system learns to produce human-like captions for images.
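
As a rough illustration of that pairing, a minimal encoder/decoder could look like the sketch below. The class names mirror src/models.py, but the bodies here are a simplified approximation rather than the repository's exact implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Frozen ResNet-50 backbone with the classifier replaced by a linear projection."""
    def __init__(self, embed_dim=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        for p in resnet.parameters():              # freeze the pretrained backbone
            p.requires_grad = False
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the fc head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                     # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                      # (B, embed_dim)

class DecoderLSTM(nn.Module):
    """Embeds tokens, prepends the image feature, and predicts next-token logits."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):        # captions: (B, T) token ids
        tok = self.embed(captions)                            # (B, T, embed_dim)
        x = torch.cat([img_feats.unsqueeze(1), tok], dim=1)   # image acts as step 0
        out, _ = self.lstm(x)
        return self.fc(self.drop(out))                        # (B, T+1, vocab_size)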

The workflow starts with vocabulary building and tokenization, which convert captions into sequences of token ids. The model is then trained with teacher forcing and optimized with cross-entropy loss, while evaluation uses BLEU scores to measure caption quality.
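
Using the interfaces sketched above, a single teacher-forcing update might look like the following; the real loop lives in src/train.py, and train_step with its argument layout is illustrative.

import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, criterion, images, captions):
    """One teacher-forcing update: the decoder always sees the ground-truth prefix."""
    feats = encoder(images)                        # (B, embed_dim)
    # Feed the caption without its last token; the image feature fills step 0,
    # so the decoder's output at step t is compared against captions[:, t].
    logits = decoder(feats, captions[:, :-1])      # (B, T, vocab_size)
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical wiring, assuming the <pad> token has id 0:
# criterion = nn.CrossEntropyLoss(ignore_index=0)
# optimizer = torch.optim.Adam(
#     list(decoder.parameters()) + list(encoder.fc.parameters()), lr=1e-3)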

The repository is designed for both experimentation and reproducibility: it contains ready-to-use scripts for preprocessing, training, and inference, along with visualization outputs such as training loss curves, BLEU score progression, and example generated captions.

By running the included code, you can train on toy datasets for quick tests or scale up to widely used datasets such as Flickr8k, Flickr30k, or MSCOCO for more realistic captions. This makes the project a portfolio-ready showcase of multimodal AI skills, covering both computer vision and language generation.


Features

  • CNN encoder (pretrained ResNet-50, frozen backbone)
  • LSTM decoder with embeddings, dropout, and teacher forcing
  • Vocabulary building with NLTK tokenizer (min_freq configurable; see the sketch after this list)
  • Cross-entropy training + Adam optimizer
  • BLEU-1..4 evaluation on validation set
  • Visualizations: training curves, BLEU scores
  • Saved artifacts:
    • best_captioner.pt (trained model)
    • vocab.json (vocabulary)
    • metrics.json (BLEU scores, loss)
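
The vocabulary step above can be sketched in a few lines; function names and special-token ids are illustrative, and the repository's actual Vocabulary class in src/utils.py may differ.

from collections import Counter
from nltk.tokenize import word_tokenize

def build_vocab(captions, min_freq=1):
    """Map every word that occurs at least min_freq times to an integer id."""
    counter = Counter()
    for cap in captions:
        counter.update(word_tokenize(cap.lower()))
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}   # reserved ids
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab, max_len=20):
    """Turn a caption string into a fixed-length list of token ids."""
    tokens = word_tokenize(caption.lower())[: max_len - 2]
    ids = [vocab["<start>"]] + [vocab.get(t, vocab["<unk>"]) for t in tokens] + [vocab["<end>"]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))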

📊 Figures

The figures below are produced automatically in outputs/ after training.

[Figure: outputs/training_curves.png (training loss curves)]
[Figure: outputs/bleu_scores.png (BLEU score progression)]

Example inference (replace the placeholder below with your own output if available):

[Figure: placeholder image labeled example1]

(Generated by infer.py on data/images/example1.jpg)


Project Structure

image-captioning-cnn-lstm/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│  ├─ captions.csv        # CSV: image_path, caption, split(train/val/test)
│  └─ images/             # image files
├─ src/
│  ├─ models.py           # EncoderCNN, DecoderLSTM
│  ├─ utils.py            # Vocabulary, dataset, BLEU, collate
│  ├─ train.py            # training loop with checkpoints
│  └─ infer.py            # inference script for generating captions
└─ outputs/
   ├─ best_captioner.pt
   ├─ vocab.json
   ├─ training_curves.png
   ├─ bleu_scores.png
   └─ metrics.json

Setup

python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/macOS:
source .venv/bin/activate

pip install -r requirements.txt

# Download tokenizer models (once)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Data Preparation

  1. Collect a dataset (e.g., Flickr8k or MSCOCO).
  2. Put all images under data/images/.
  3. Create a data/captions.csv file with columns:
    image_path,caption,split
    images/img1.jpg,A child in a pink dress is climbing stairs.,train
    images/img1.jpg,A little girl goes into a wooden building.,train
    images/img2.jpg,A dog is running through a grassy field.,val
    
  4. Ensure split contains the values train, val, and test (a quick validation sketch follows this list).
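
A quick sanity check of the CSV can save a failed training run; the snippet below is not part of the repository and assumes pandas is installed.

import pandas as pd

df = pd.read_csv("data/captions.csv")

# Required columns and allowed split values.
assert set(df.columns) >= {"image_path", "caption", "split"}, "missing columns"
assert set(df["split"].unique()) <= {"train", "val", "test"}, "unexpected split value"

print(df["split"].value_counts())                  # captions per split
print(df["image_path"].nunique(), "distinct images referenced")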

Train the Model

python src/train.py --captions data/captions.csv --images-root data --outdir outputs --epochs 10 --batch-size 64 --embed-dim 256 --hidden-dim 512 --min-freq 1 --max-len 20 --lr 1e-3

Outputs:

  • outputs/training_curves.png → loss over epochs
  • outputs/bleu_scores.png → BLEU-4 progression (see the BLEU sketch after this list)
  • outputs/best_captioner.pt → best checkpoint
  • outputs/vocab.json → vocabulary file
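
BLEU measures n-gram overlap between generated and reference captions. Below is a minimal sketch of BLEU-1 to BLEU-4 with NLTK's corpus_bleu; the repository computes its scores in src/utils.py, which may differ in smoothing and weighting.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per image; hypotheses: one token list per image."""
    smooth = SmoothingFunction().method1
    weights = {
        "bleu1": (1.0, 0, 0, 0),
        "bleu2": (0.5, 0.5, 0, 0),
        "bleu3": (1 / 3, 1 / 3, 1 / 3, 0),
        "bleu4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}

# Example: one image with two reference captions and one generated caption.
refs = [[["a", "dog", "runs", "in", "the", "grass"],
         ["a", "dog", "is", "running", "through", "a", "field"]]]
hyps = [["a", "dog", "runs", "through", "the", "grass"]]
print(bleu_scores(refs, hyps))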

Run Inference

python src/infer.py --checkpoint outputs/best_captioner.pt --vocab outputs/vocab.json --image data/images/example1.jpg --max-len 20

Example Output:

[Figure: placeholder image labeled example1]
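
Decoding in this setup is greedy (the Next Steps section suggests beam search as an upgrade). A minimal greedy decoder, assuming the DecoderLSTM interface sketched earlier and an id-to-word mapping idx2word, could look like:

import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, image, vocab, idx2word, max_len=20):
    """Generate a caption by repeatedly taking the most probable next token."""
    feats = encoder(image.unsqueeze(0))            # (1, embed_dim)
    tokens = [vocab["<start>"]]
    for _ in range(max_len):
        inp = torch.tensor([tokens])               # (1, t) ids generated so far
        logits = decoder(feats, inp)               # (1, t+1, vocab_size)
        next_id = logits[0, -1].argmax().item()    # most probable next token
        if next_id == vocab["<end>"]:
            break
        tokens.append(next_id)
    return " ".join(idx2word[i] for i in tokens[1:])   # drop <start>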

Results

  • Training & validation loss curves
  • BLEU scores over epochs
  • Example generated captions

Next Steps

  • Train longer (20+ epochs) for better captions
  • Use a larger dataset (Flickr30k, MSCOCO)
  • Try beam search decoding instead of greedy (a sketch follows this list)
  • Fine-tune CNN layers for better feature extraction
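
For the beam-search item above, a compact sketch of the idea (length normalization and batching omitted; names are illustrative and follow the greedy decoder sketched earlier):

import torch

@torch.no_grad()
def beam_search(encoder, decoder, image, vocab, idx2word, beam_size=3, max_len=20):
    """Keep the beam_size most probable partial captions at each step."""
    feats = encoder(image.unsqueeze(0))
    beams = [([vocab["<start>"]], 0.0)]            # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == vocab["<end>"]:       # finished captions pass through unchanged
                candidates.append((tokens, score))
                continue
            logits = decoder(feats, torch.tensor([tokens]))        # (1, t+1, vocab_size)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top = torch.topk(log_probs, beam_size)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [idx.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    best = beams[0][0]
    words = [idx2word[i] for i in best if i not in (vocab["<start>"], vocab["<end>"])]
    return " ".join(words)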
