This project demonstrates how deep learning can be applied to bridge the gap between computer vision and natural language processing. By combining a CNN encoder (ResNet-50) to extract meaningful visual features and an LSTM decoder to generate sequential text, the system learns to produce human-like captions for images.
The workflow starts with vocabulary building and tokenization, ensuring captions are processed into numerical form. The model is then trained using teacher forcing and optimized with cross-entropy loss, while evaluation is performed using BLEU scores to measure caption quality.
The repository is designed for both experimentation and reproducibility: it contains ready-to-use scripts for preprocessing, training, and inference, along with visualization outputs such as training loss curves, BLEU score progression, and example generated captions.
By running the included code, you can train on toy datasets for quick tests or scale up to widely used datasets such as Flickr8k, Flickr30k, or MSCOCO for more realistic captions. This makes the project a portfolio-ready showcase of multimodal AI skills, covering both computer vision and language generation.
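The encoder/decoder pairing described above can be sketched in a few lines of PyTorch. This is a simplified illustration, not a copy of `src/models.py`: the class names match the repo, but the layer layout, dimensions, and details are assumptions based on the defaults used later in this README.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Frozen ResNet-50 backbone that projects pooled features to embed_dim."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # torchvision >= 0.13 API; older versions use pretrained=True instead.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head, keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False                  # frozen backbone
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                       # (B, 3, H, W) -> (B, embed_dim)
        feats = self.backbone(images).flatten(1)
        return self.fc(feats)


class DecoderLSTM(nn.Module):
    """LSTM decoder trained with teacher forcing on the ground-truth captions."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image feature acts as the first "token",
        # followed by the embedded ground-truth caption shifted right by one.
        emb = self.dropout(self.embed(captions[:, :-1]))
        inputs = torch.cat([features.unsqueeze(1), emb], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                       # (B, T, vocab_size) logits
```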
- CNN encoder (pretrained ResNet-50, frozen backbone)
- LSTM decoder with embeddings, dropout, and teacher forcing
- Vocabulary building with NLTK tokenizer (`min_freq` configurable; see the sketch after this list)
- Cross-entropy training + Adam optimizer
- BLEU-1..4 evaluation on validation set
- Visualizations: training curves, BLEU scores
- Saved artifacts:
  - `best_captioner.pt` (trained model)
  - `vocab.json` (vocabulary)
  - `metrics.json` (BLEU scores, loss)
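For reference, vocabulary building with the NLTK tokenizer and a `min_freq` cutoff can look roughly like this. The special-token names and the `encode` helper are illustrative assumptions; the real `Vocabulary` class lives in `src/utils.py`.

```python
from collections import Counter

from nltk.tokenize import word_tokenize


def build_vocab(captions, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter()
    for caption in captions:
        counts.update(word_tokenize(caption.lower()))

    # Reserved special tokens (assumed names; check src/utils.py for the real ones).
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for token, freq in counts.items():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(caption, vocab, max_len=20):
    """Turn a caption into a fixed-length list of token ids."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in word_tokenize(caption.lower())]
    ids = [vocab["<sos>"]] + ids[: max_len - 2] + [vocab["<eos>"]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```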
 
The figures below are produced automatically in `outputs/` after training.
 
 
Example inference (replace with your own output image if available):

*(placeholder image: a blue square labeled "example1", generated by `infer.py` on `data/images/example1.jpg`)*
image-captioning-cnn-lstm/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│  ├─ captions.csv        # CSV: image_path, caption, split (train/val/test)
│  └─ images/             # image files
├─ src/
│  ├─ models.py           # EncoderCNN, DecoderLSTM
│  ├─ utils.py            # Vocabulary, dataset, BLEU, collate
│  ├─ train.py            # training loop with checkpoints
│  └─ infer.py            # inference script for generating captions
└─ outputs/
   ├─ best_captioner.pt
   ├─ vocab.json
   ├─ training_curves.png
   ├─ bleu_scores.png
   └─ metrics.json
python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/macOS:
source .venv/bin/activate
pip install -r requirements.txt
# Download tokenizer models (once)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

- Collect a dataset (e.g., Flickr8k or MSCOCO).
- Put all images under data/images/.
- Create a `data/captions.csv` file with the columns `image_path`, `caption`, `split`. Example rows:

      image_path,caption,split
      images/img1.jpg,A child in a pink dress is climbing stairs.,train
      images/img1.jpg,A little girl goes into a wooden building.,train
      images/img2.jpg,A dog is running through a grassy field.,val
- Ensure `split` contains only the values `train`, `val`, and `test` (a quick sanity check is sketched below).
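Before training, it is worth sanity-checking the CSV. The snippet below is one way to do that; it assumes `pandas` is available, which may not be part of `requirements.txt`.

```python
import pandas as pd

df = pd.read_csv("data/captions.csv")          # expected columns: image_path, caption, split
assert {"image_path", "caption", "split"} <= set(df.columns)
assert set(df["split"].unique()) <= {"train", "val", "test"}

print(df["split"].value_counts())              # rows per split
print(df.groupby("image_path")["caption"].count().describe())  # captions per image
```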
python src/train.py --captions data/captions.csv --images-root data --outdir outputs --epochs 10 --batch-size 64 --embed-dim 256 --hidden-dim 512 --min-freq 1 --max-len 20 --lr 1e-3

Outputs:
- `outputs/training_curves.png` → loss over epochs
- `outputs/bleu_scores.png` → BLEU-4 progression
- `outputs/best_captioner.pt` → best checkpoint
- `outputs/vocab.json` → vocabulary file
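Under the hood, training is teacher-forced cross-entropy, as listed in the features. A condensed sketch of one epoch is shown below; the padding mask and the choice of trainable parameters are assumptions, and `src/train.py` remains the authoritative implementation.

```python
import torch
import torch.nn as nn


def train_one_epoch(encoder, decoder, loader, optimizer, pad_idx, device="cpu"):
    """One teacher-forced pass over the training set; returns the mean loss."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)   # padded positions don't count
    encoder.train()
    decoder.train()

    total = 0.0
    for images, captions in loader:                 # captions: (B, T) token ids
        images, captions = images.to(device), captions.to(device)
        features = encoder(images)                  # (B, embed_dim)
        logits = decoder(features, captions)        # (B, T, vocab_size) via teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)
```

In this setup the optimizer would typically be `torch.optim.Adam` over the decoder plus the encoder's projection layer (the ResNet backbone stays frozen), with `lr=1e-3` to match the flag above.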
python src/infer.py --checkpoint outputs/best_captioner.pt --vocab outputs/vocab.json --image data/images/example1.jpg --max-len 20

Example output:

*(placeholder image: a blue square labeled "example1")*
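The generated caption comes from greedy decoding (hence the beam-search suggestion further down). A stripped-down version of such a loop, reusing the sketch classes from above and assuming `<sos>`/`<eos>`/`<pad>` token names, might look like this:

```python
import torch


@torch.no_grad()
def greedy_caption(encoder, decoder, image, vocab, max_len=20):
    """Generate a caption by repeatedly picking the most likely next token."""
    inv_vocab = {idx: tok for tok, idx in vocab.items()}
    encoder.eval()
    decoder.eval()

    inputs = encoder(image.unsqueeze(0)).unsqueeze(1)        # (1, 1, embed_dim)
    states, words = None, []
    for _ in range(max_len):
        hidden, states = decoder.lstm(inputs, states)
        next_id = decoder.fc(hidden.squeeze(1)).argmax(dim=-1)  # most likely token
        token = inv_vocab[next_id.item()]
        if token == "<eos>":
            break
        if token not in ("<sos>", "<pad>"):
            words.append(token)
        inputs = decoder.embed(next_id).unsqueeze(1)         # feed prediction back in
    return " ".join(words)
```

Greedy decoding is fast but tends to settle on generic phrasings, which is why beam search is listed among the improvement ideas.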
- Training & validation loss curves
- BLEU scores over epochs (computed as sketched below)
- Example generated captions
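BLEU-1 through BLEU-4 can be computed with NLTK along these lines; the corpus-level aggregation, smoothing choice, and helper name are assumptions, and the repo's actual evaluation code lives in `src/utils.py`.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu


def bleu_1_to_4(references, hypotheses):
    """references: one list of reference token lists per image; hypotheses: one token list per image."""
    smooth = SmoothingFunction().method1              # avoids zero scores on short captions
    scores = {}
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))    # uniform weights up to n-grams
        scores[f"BLEU-{n}"] = corpus_bleu(
            references, hypotheses, weights=weights, smoothing_function=smooth
        )
    return scores


# Tiny illustrative example with two validation images.
refs = [[["a", "dog", "runs", "through", "grass"]],
        [["a", "child", "climbs", "the", "stairs"]]]
hyps = [["a", "dog", "is", "running", "through", "grass"],
        ["a", "child", "climbs", "stairs"]]
print(bleu_1_to_4(refs, hyps))
```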
- Train longer (20+ epochs) for better captions
- Use a larger dataset (Flickr30k, MSCOCO)
- Try beam search decoding instead of greedy (see the sketch after this list)
- Fine-tune CNN layers for better feature extraction
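For the beam-search idea above, a compact variant can be layered on the same decoder. The sketch below keeps the `beam_size` highest-scoring partial captions at each step, scoring by summed log-probabilities with no length normalization; the token names and decoder attributes follow the earlier sketches and are assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def beam_search_caption(encoder, decoder, image, vocab, beam_size=3, max_len=20):
    """Keep the beam_size highest-scoring partial captions at each step."""
    feats = encoder(image.unsqueeze(0)).unsqueeze(1)          # (1, 1, embed_dim)
    _, states = decoder.lstm(feats, None)                     # prime the LSTM with the image
    beams = [([], 0.0, states)]                               # (token ids, log-prob sum, state)

    for _ in range(max_len):
        candidates = []
        for ids, score, states in beams:
            if ids and ids[-1] == vocab["<eos>"]:
                candidates.append((ids, score, states))       # finished caption carries over
                continue
            prev = vocab["<sos>"] if not ids else ids[-1]
            inp = decoder.embed(torch.tensor([prev])).unsqueeze(1)
            out, new_states = decoder.lstm(inp, states)
            log_probs = F.log_softmax(decoder.fc(out.squeeze(1)), dim=-1)
            top_lp, top_id = log_probs.topk(beam_size, dim=-1)
            for lp, idx in zip(top_lp[0], top_id[0]):
                candidates.append((ids + [idx.item()], score + lp.item(), new_states))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]

    inv = {i: t for t, i in vocab.items()}
    best_ids = beams[0][0]
    return " ".join(inv[i] for i in best_ids if inv[i] not in ("<sos>", "<eos>", "<pad>"))
```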