This repository collects lecture slides, assignments (CAs), code notebooks, reports, and reference papers used in the "Deep Generative Models" course (University of Tehran). The materials are organized to be reproducible and educational: each assignment contains an annotated Jupyter notebook, supporting code, and a report.
Course Overview
The "Deep Generative Models" (DGM) course covers advanced topics in machine learning focused on generative modeling techniques. Generative models learn the underlying distribution of data to generate new samples, enabling applications in image synthesis, anomaly detection, data augmentation, and more.
Key topics covered in the course include:
- Variational Autoencoders (VAEs): Probabilistic latent variable models for learning compressed representations and generating new data.
- Normalizing Flows: Invertible transformations that allow exact density estimation and sampling.
- Generative Adversarial Networks (GANs): Adversarial training frameworks for high-quality sample generation.
- Diffusion Models: Denoising diffusion probabilistic models for state-of-the-art image generation.
- Score-based Generative Models: Methods using score functions for sampling from complex distributions.
The course assignments (CA1-CA4) progressively build skills in implementing and evaluating these models on real datasets like CelebA, FashionMNIST, and custom image datasets.
This section provides a high-level overview of the core mathematical and conceptual foundations that unify the different generative modeling approaches covered in the course.
Generative models aim to learn the underlying data distribution $p_{\text{data}}(\mathbf{x})$ from a finite set of samples. The core tasks are:
- Density Estimation: Approximate $p(\mathbf{x})$, or learn a tractable distribution $p_\theta(\mathbf{x})$ that matches the data.
- Sampling: Generate new samples $\mathbf{x}' \sim p_\theta(\mathbf{x})$ from the learned distribution.
- Inference: Compute posterior probabilities or latent representations for downstream tasks.
Most generative models are trained by maximizing the log-likelihood:
$$\max_\theta \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\log p_\theta(\mathbf{x})]$$
This is equivalent to minimizing the KL divergence between the data and model distributions:
$$\min_\theta \; \text{KL}(p_{\text{data}}(\mathbf{x}) \,\|\, p_\theta(\mathbf{x}))$$
Many generative models introduce latent variables $\mathbf{z}$ that capture hidden structure in the data:
- Joint Distribution: $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})$
- Marginal Likelihood: $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$
- Posterior Inference: $p_\theta(\mathbf{z}|\mathbf{x}) = \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{p_\theta(\mathbf{x})}$
Exact inference in latent models is often intractable. Variational inference approximates posteriors using a recognition model:
- Variational Distribution: $q_\phi(\mathbf{z}|\mathbf{x}) \approx p_\theta(\mathbf{z}|\mathbf{x})$
- Evidence Lower Bound (ELBO): $\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x}|\mathbf{z})] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$
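The ELBO translates directly into a training loss. Below is a minimal PyTorch sketch of the negative ELBO for a Gaussian encoder and Bernoulli decoder; `recon_x`, `mu`, and `logvar` are assumed to come from a VAE's forward pass.

```python
import torch
import torch.nn.functional as F

def negative_elbo(recon_x, x, mu, logvar):
    """Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I)).

    recon_x: decoder output (Bernoulli parameters), same shape as x
    mu, logvar: parameters of the Gaussian posterior q(z|x)
    """
    # Reconstruction term: expected log-likelihood under the decoder
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal prior (closed form)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```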
Normalizing flows provide exact density estimation through invertible transformations:
- Transformation: $\mathbf{z} = f(\mathbf{x})$, where $f$ is invertible
- Density (change of variables): $p_\mathbf{x}(\mathbf{x}) = p_\mathbf{z}(f(\mathbf{x})) \left|\det \frac{\partial f}{\partial \mathbf{x}}\right|$ (see the sketch after this list)
- Composition: Complex transformations built from simple invertible layers
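To make the change-of-variables formula concrete, here is a hedged sketch using a simple elementwise affine flow; the layer and function names are illustrative, not from the course notebooks.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """z = f(x) = exp(log_scale) * x + shift, an elementwise invertible map."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = x * self.log_scale.exp() + self.shift
        # log |det df/dx| for an elementwise affine map is the sum of the log scales
        log_det = self.log_scale.sum().expand(x.shape[0])
        return z, log_det

def log_prob(flow, x, base_dist):
    # Change of variables: log p_x(x) = log p_z(f(x)) + log |det df/dx|
    z, log_det = flow(x)
    return base_dist.log_prob(z).sum(dim=1) + log_det

# Usage: evaluate the log-likelihood of a batch under a standard normal base distribution
flow = AffineFlow(dim=2)
base = torch.distributions.Normal(0.0, 1.0)
x = torch.randn(8, 2)
print(log_prob(flow, x, base).shape)  # torch.Size([8])
```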
GANs use adversarial objectives instead of explicit likelihoods:
- Generator: $G: \mathbf{z} \mapsto \mathbf{x}$, learns to fool the discriminator
- Discriminator: $D: \mathbf{x} \mapsto [0,1]$, learns to distinguish real samples from fakes
- Minimax Objective: $\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})} [\log (1 - D(G(\mathbf{z})))]$
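As a concrete illustration of the adversarial objective, here is a minimal sketch of the non-saturating BCE losses typically used in practice; `D`, `real`, and `fake` are placeholders, not the course's exact implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake):
    # D maximizes log D(x) + log(1 - D(G(z))) -> minimize BCE against labels 1 and 0
    real_logits = D(real)
    fake_logits = D(fake.detach())  # do not backprop into the generator here
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(D, fake):
    # Non-saturating variant: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
    fake_logits = D(fake)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```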
Diffusion models gradually add noise and learn to reverse the process:
- Forward Process: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$
- Reverse Process: $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$
- Training: Predict noise added at each timestep
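The noise-prediction objective in the last bullet can be written in a few lines. This is a hedged sketch of the simplified DDPM loss; `model` is assumed to be a network that takes a noisy image and a timestep and predicts the added noise.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    """Simplified DDPM objective: predict the noise added at a random timestep."""
    B = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)       # random timesteps
    noise = torch.randn_like(x0)                           # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)             # cumulative alpha_bar_t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # closed-form forward sample
    return F.mse_loss(model(x_t, t), noise)                # epsilon-prediction loss
```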
Score-based models learn the score function (gradient of log-density):
- Score Function: $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, where $p_t$ is a noise-perturbed version of the data distribution
- Score Matching: Minimize $\mathbb{E}_{p_t(\mathbf{x})} \left[ \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}, t) - \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right\|^2 \right]$
- Sampling: Use Langevin dynamics or SDEs to draw samples from the learned scores
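Sampling with a learned score can be sketched as a short Langevin loop. This is an illustrative sketch, assuming `score_net(x, sigma)` approximates the score of the data perturbed with noise level `sigma`.

```python
import torch

@torch.no_grad()
def langevin_sample(score_net, shape, sigma, step_size=1e-4, n_steps=100):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    x = torch.randn(shape)                      # start from noise
    for _ in range(n_steps):
        z = torch.randn_like(x)
        grad = score_net(x, sigma)              # estimated score at the current point
        x = x + 0.5 * step_size * grad + (step_size ** 0.5) * z
    return x
```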
Assessing generative model quality requires both quantitative and qualitative measures:
- Likelihood-based: Log-likelihood, bits-per-dimension (for flows)
- Distribution-based: Fréchet Inception Distance (FID), Kernel Inception Distance (KID)
- Sample Quality: Inception Score (IS), perceptual quality
- Diversity: Coverage, density of generated samples
- VAEs as Flow-like Models: Reparameterization connects to normalizing flows
- Diffusion as Hierarchical VAEs: Diffusion steps can be viewed as latent layers
- Score-Based and Diffusion: Score functions are central to both
- GANs and All: Adversarial training can be applied to any generative model
Understanding these unifying principles helps in choosing appropriate models for different applications and in developing new generative techniques.
Prerequisites: Strong background in deep learning (PyTorch/TensorFlow), probability theory, and optimization. Specifically:
- Probability Theory: Random variables, distributions (Gaussian, Bernoulli, Categorical), expectation, variance, Bayes' theorem, maximum likelihood estimation
- Information Theory: Entropy, cross-entropy, Kullback-Leibler divergence, mutual information
- Linear Algebra: Vector/matrix operations, eigenvalues/eigenvectors, singular value decomposition, tensor operations
- Calculus: Partial derivatives, chain rule, gradient descent, automatic differentiation
- Statistics: Hypothesis testing, confidence intervals, bias-variance tradeoff
- Supervised Learning: Classification, regression, loss functions, regularization
- Neural Networks: Feedforward networks, backpropagation, activation functions, initialization
- Convolutional Networks: Convolutional layers, pooling, receptive fields, modern architectures (ResNet, Transformer)
- Optimization: Stochastic gradient descent variants (Adam, RMSProp), learning rate scheduling, batch normalization
- Regularization: Dropout, weight decay, early stopping, data augmentation
- Python: Advanced features (decorators, context managers, multiprocessing), NumPy/Pandas proficiency
- Deep Learning Frameworks: PyTorch (tensors, autograd, nn.Module, DataLoader) or TensorFlow/Keras
- Version Control: Git basics, collaborative workflows
- Development Environment: Jupyter notebooks, IDEs (VS Code, PyCharm), command-line tools
- "Deep Learning" by Goodfellow, Bengio, Courville (Chapters 1-5, 13-20)
- "Pattern Recognition and Machine Learning" by Bishop (Chapters 1-4, 8-10)
- "Probabilistic Machine Learning" by Murphy (Chapters 1-3, 21-24)
Students without this background may find the course challenging and are encouraged to review these topics beforehand.
- Course Overview
- Repository structure and purpose
- Quick start (setup & run)
- Notebooks and assignments (CA1..CA4) — summary and status
- Deep dive: CA1 (Variational Autoencoders)
- Deep dive: CA2 (GANs & Normalizing Flows)
- Deep dive: CA3 (Diffusion and Score-based Models)
- Deep dive: CA4 (Fine-Tuning Vision-Language Models)
- Data, storage and artifact management
- Reproducibility checklist and recommended configuration
- Testing and lightweight smoke checks
- Common issues and troubleshooting
- References and further reading
- Credits and license
- `CA1_Variational_Autoencoders/` — Course Assignment 1: Variational Autoencoders
  - `code/` — Jupyter notebooks and code used for experiments (e.g., `code.ipynb`).
  - `description/` — Assignment description PDF.
  - `report/` — PDF reports and figures.
  - `images/` — Generated images and visualizations.
  - `train/` — Training datasets (CelebA subset: smile/non-smile images).
  - `README.md` — Detailed documentation for CA1.
- `CA2_GANs_Normalizing_Flows/` — Course Assignment 2: GANs and Normalizing Flows
  - `code/` — Jupyter notebooks (e.g., `CA2_DGM.ipynb`, `Q2_final_res.ipynb`).
  - `description/` — Assignment description PDF.
  - `report/` — PDF reports and figures.
  - `images/` — Generated samples and visualizations.
  - `README.md` — Detailed documentation for CA2.
- `CA3_Diffusion_Models/` — Course Assignment 3: Diffusion and Score-based Models
  - `codes/` — Jupyter notebooks (e.g., `Diffusion_Models.ipynb`, `score_based_models.ipynb`).
  - `description/` — Assignment description PDF.
  - `report/` — PDF reports and figures.
  - `images/` — Generated samples and visualizations.
  - `README.md` — Detailed documentation for CA3.
- `CA4_Vision_Language_Model/` — Course Assignment 4: Vision-Language Models
  - `code/` — Jupyter notebooks (e.g., `final_CA4_training.ipynb`, evaluation notebooks).
  - `description/` — Assignment description PDF.
  - `report/` — PDF reports and figures.
  - `images/` — Generated images and visualizations.
  - `README.md` — Detailed documentation for CA4.
- `Slides/` — Lecture slides and course material used in class.
  - `DGM_Fall_2023_Slides/` — Course lecture slides.
  - `Stanford_slides/` — Supplementary slides from Stanford's CS236 course.
- `Exams/` — Past exams and solutions.
- `Extra/` — Misc utilities, templates, or exploratory notebooks (e.g., `VAE.ipynb`, `VAE.py`).
This repository is primarily an educational resource. Notebooks are annotated for readability and (where possible) reorganized to centralize imports and configuration.
Recommended steps to set up a local, reproducible environment. We recommend using virtual environments to isolate dependencies.
- Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
- Install core dependencies (adjust the PyTorch install for your CUDA version):
```bash
pip install -U pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # Example for CUDA 11.8
pip install matplotlib numpy scipy scikit-learn jupyterlab pytorch-fid tqdm
```
- Create and activate a conda environment:
```bash
conda create -n dgm python=3.10
conda activate dgm
```
- Install dependencies:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia  # Adjust CUDA version
conda install matplotlib numpy scipy scikit-learn jupyterlab tqdm
pip install pytorch-fid
```
- For advanced notebooks (e.g., diffusion models): `pip install torchdiffeq` (for ODE solvers in score-based models).
- For visualization: `pip install seaborn plotly`.
- For reproducibility: `pip install wandb` (optional, for experiment tracking).
Notes:
- If you have a CUDA-enabled GPU, install the matching `torch`/`torchvision` binaries using the official instructions at https://pytorch.org. For CPU-only, omit CUDA-specific installs.
- `pytorch-fid` is used in CA2 for FID computation. If installation fails, consider alternatives like `clean-fid`.
- For CA3 diffusion models, ensure you have sufficient GPU memory (at least 8 GB VRAM recommended).
- Launch JupyterLab from the project root and open the notebook you'd like to run:
```bash
jupyter lab
```
- Important safety note: the notebooks in the assignment folders were edited as part of a documentation pass (imports consolidated, configuration cells added). The editorial pass did not execute the notebooks. Before running long training jobs, review the `Setup and Configuration` cell in each notebook and run the smoke tests described below.
This section summarizes the primary assignments and their current status in the repository.
- CA1 (folder: `CA1_Variational_Autoencoders/`)
  - Focus: Variational Autoencoders (VAEs) and experiments exploring latent structure.
  - Key files: `code/code.ipynb`, `report/DGM_CA1_final_EN.pdf`, `README.md`.
  - Datasets: CelebA subset (smiling/non-smiling classification task) stored in `train/smile/` and `train/non_smile/`.
  - Status: Fully improved, with a comprehensive README explaining VAE concepts in depth; the notebook was reorganized (imports consolidated, configuration cell added, explanatory Markdown blocks added) and an overview cell was inserted for educational clarity. The accompanying PDF report summary was synthesized in the CA1 README due to extraction limitations.
- CA2 (folder: `CA2_GANs_Normalizing_Flows/`)
  - Focus: Normalizing flows (RealNVP) and GANs (DCGAN-style) applied to FashionMNIST. Includes OOD detection experiments (MNIST, KMNIST) and FID evaluation for GANs.
  - Key files: `code/CA2_DGM.ipynb`, `code/Q2_final_res.ipynb`, `README.md`.
  - Datasets: FashionMNIST, MNIST, KMNIST (downloaded via torchvision).
  - Status: Fully improved, with a comprehensive README explaining GAN and normalizing flow concepts in depth; the notebook was reorganized (imports consolidated, configuration cell added, explanatory Markdown blocks added) and an overview cell was inserted for educational clarity.
- CA3 (folder: `CA3_Diffusion_Models/`)
  - Focus: Diffusion models and score-based generative models.
  - Key files: `codes/Diffusion_Models.ipynb`, `codes/score_based_models.ipynb`, `report/DGM_CA3_EN_final.pdf`, `README.md`.
  - Datasets: Likely image datasets such as CIFAR-10 or custom datasets for diffusion processes.
  - Status: Fully improved, with a comprehensive README explaining diffusion and score-based model concepts in depth; the notebooks were reorganized (imports consolidated, configuration cells added, explanatory Markdown blocks added) and overview cells were inserted for educational clarity.
- CA4 (folder: `CA4_Vision_Language_Model/`)
  - Focus: Fine-tuning Vision-Language Models (Paligemma) on the CLEVR dataset using PEFT/LoRA techniques.
  - Key files: `code/final_CA4_training.ipynb`, `code/eval_p1/final_CA4_results1.ipynb`, `code/eval_p2/final_CA4_results2.ipynb`, `README.md`.
  - Datasets: CLEVR (Compositional Language and Elementary Visual Reasoning).
  - Status: Fully improved, with a comprehensive README explaining Vision-Language Models, PEFT, and LoRA in depth; the notebook was reorganized (imports consolidated, configuration cell added, explanatory Markdown blocks added) and an overview cell was inserted for educational clarity. Google Drive dependencies were removed, evaluation metrics were fixed, and model loading was switched to Hugging Face.
Other folders (Slides/, Extra/, Exams/) contain lecture materials, relevant readings, and supporting documents.
CA1 introduces Variational Autoencoders (VAEs), a cornerstone of generative modeling that combines variational inference with autoencoder architectures.
Variational Autoencoders (VAEs) are generative models that learn to encode input data into a low-dimensional latent space and decode it back to reconstruct the original data. Unlike traditional autoencoders, VAEs learn a probabilistic latent representation, allowing them to generate new samples by sampling from the learned distribution.
- Encoder (Inference Network): Maps input data (e.g., images) to parameters of a latent distribution, typically a Gaussian with mean μ and variance σ².
- Reparameterization Trick: Enables gradient flow through stochastic sampling by expressing z = μ + σ * ε, where ε ~ N(0,1).
- Decoder (Generative Network): Takes latent samples and reconstructs the original data distribution.
- Evidence Lower Bound (ELBO): The loss function combines reconstruction loss (how well the decoder reconstructs inputs) and KL divergence (regularization term that encourages the latent distribution to be close to a standard normal).
- Latent Space Properties: VAEs can learn disentangled representations where different dimensions correspond to interpretable factors of variation.
VAEs balance reconstruction fidelity with latent space regularization, making them useful for tasks like image generation, anomaly detection, and representation learning.
Key components in CA1_Variational_Autoencoders/code/code.ipynb:
- VAE Architecture: Encoder (inference network) that maps images to latent distributions, decoder (generative network) that reconstructs images from latent samples.
- Reparameterization Trick: Enables backpropagation through stochastic sampling.
- Loss Function: Combination of reconstruction loss (e.g., MSE or BCE) and KL divergence regularization.
- Latent Space Analysis: Visualization of latent representations, interpolation, and clustering.
- Experiments: Training on CelebA dataset for smiling/non-smiling classification in latent space.
Why run CA1?
- Understand the trade-off between reconstruction quality and latent regularization.
- Explore disentangled representations and their applications in downstream tasks.
- Compare VAEs with other generative models introduced later in the course.
Files of interest in `CA1_Variational_Autoencoders/`:
- `code/code.ipynb` — the annotated notebook (imports consolidated and configuration cell added).
- `README.md` — detailed documentation with a synthesized report summary.
- `train/` — CelebA subset (smile/non-smile) for training.
- `report/` — PDF reports including `DGM_CA1_final_EN.pdf`.
- `images/` — generated visualizations and outputs.
High-level suggested execution order:
- Review the configuration cell for hyperparameters (latent_dim, learning_rate, etc.).
- Load and preprocess the CelebA dataset.
- Train the VAE and monitor reconstruction quality and KL divergence.
- Analyze latent space: visualize embeddings, perform interpolations, evaluate classification performance.
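For the interpolation step above, decoding convex combinations of two encoded latents is enough. A minimal sketch, assuming the notebook's `vae` exposes `encode` (returning mean and log-variance) and `decode` methods (names are illustrative):

```python
import torch

@torch.no_grad()
def interpolate(vae, x1, x2, n_steps=8):
    """Decode points along the straight line between the latent means of x1 and x2."""
    mu1, _ = vae.encode(x1.unsqueeze(0))
    mu2, _ = vae.encode(x2.unsqueeze(0))
    alphas = torch.linspace(0.0, 1.0, n_steps).view(-1, 1)
    z = (1 - alphas) * mu1 + alphas * mu2    # linear interpolation in latent space
    return vae.decode(z)                     # batch of interpolated reconstructions
```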
CA2 is both pedagogical and experimental. It demonstrates two complementary approaches to deep generative modeling:
Normalizing Flows are generative models that learn invertible transformations to map a simple base distribution (like a standard normal) to a complex data distribution. They provide exact likelihood computation and can be trained via maximum likelihood.
- RealNVP (Real-valued Non-Volume Preserving): Uses coupling layers that split input dimensions and transform one half conditioned on the other using scale and shift functions.
- Invertibility: The transformation must be invertible to compute both forward (data to latent) and inverse (latent to data) mappings.
- Log-Determinant of Jacobian: Tracks volume changes during transformation for exact density estimation.
- Advantages: Exact density evaluation enables likelihood-based evaluation and out-of-distribution detection.
Generative Adversarial Networks (GANs) consist of two neural networks trained simultaneously: a generator that creates fake data and a discriminator that distinguishes real from fake. They learn through adversarial training without requiring explicit density estimation.
- Generator: Learns to map random noise to realistic data samples.
- Discriminator: Learns to classify real vs. generated samples.
- Adversarial Loss: Generator minimizes the probability of discriminator correctly identifying fakes, while discriminator maximizes classification accuracy.
- DCGAN: Uses convolutional architectures with batch normalization and specific activation functions for stable training.
- Evaluation Challenges: Lack of explicit likelihood makes evaluation tricky; metrics like FID (Fréchet Inception Distance) compare distributions in feature space.
These approaches complement each other: flows provide mathematical rigor and exact evaluation, while GANs excel at generating high-quality samples.
- RealNVP (normalizing flows): an explicit density model trained by maximum likelihood. The notebook contains:
  - Implementation of coupling layers and RealNVP stacking.
  - Training using negative log-likelihood (NLL).
  - Computation of log-likelihoods for in-distribution and out-of-distribution (OOD) datasets (MNIST, KMNIST).
  - Visualization of generated samples via the inverse mapping.
- GAN (DCGAN-style): an adversarial generator trained to produce realistic fashion images. The notebook contains:
  - DCGAN-style `Generator` and `Discriminator` classes implemented in PyTorch.
  - A training loop alternating generator and discriminator updates.
  - Fixed noise vectors to produce consistent image grids for visual progress.
  - FID evaluation using `pytorch-fid`, computed per epoch.
Why run CA2?
- RealNVP gives explicit densities and allows for direct OOD detection experiments based on log-likelihood.
- Training RealNVP in a learned latent space (via an encoder-decoder) reduces dimensionality and speeds up flow training.
- GAN training provides qualitative sample generation and a complementary evaluation via FID.
Files of interest in `CA2_GANs_Normalizing_Flows/`:
- `code/CA2_DGM.ipynb` — the annotated notebook (imports consolidated and a configuration cell added).
- `code/Q2_final_res.ipynb` — additional results and experiments.
- `README.md` — localized instructions, reproducibility notes, and quick-start steps.
- `report/` — PDF reports including `DGM_CA2_final_EN.pdf`.
- `images/` — generated samples and training progress visualizations.
High-level suggested execution order (no code is run by the editor):
- Edit the top `Setup and Configuration` cell to set `device`, `latent_dim`, `batch_size`, `epochs`, and `image_size`.
- Run the data preparation cells to download datasets and build DataLoaders.
- Train and evaluate RealNVP (or train RealNVP on learned latent representations after training the encoder-decoder).
- Train the GAN and observe per-epoch outputs and FID metrics.
CA3 explores cutting-edge generative modeling techniques: Denoising Diffusion Probabilistic Models (DDPM) and Score-based Generative Models.
Denoising Diffusion Probabilistic Models (DDPM) are generative models that learn to reverse a gradual noising process. They consist of two processes:
- Forward Process (Diffusion): Gradually adds Gaussian noise to data over T timesteps, following a variance schedule β₁ to β_T.
- Reverse Process (Denoising): Learns to remove noise step-by-step using a neural network (typically a U-Net) that predicts noise at each timestep.
- Training Objective: Simplified loss that predicts the added noise, enabling stable training.
- Sampling: Iterative denoising starting from pure noise to generate new samples (see the sketch after this list).
- DDIM: Denoising Diffusion Implicit Models provide faster sampling by taking larger steps while maintaining quality.
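To connect the bullets above, here is a hedged sketch of a DDPM reverse (ancestral) sampling loop in the standard parameterization; `eps_model` is assumed to be a trained noise-prediction network, and `betas` a precomputed variance schedule.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """Iterative denoising from pure noise using the DDPM reverse-process mean."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                          # predicted noise
        coef = (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()           # posterior mean mu_theta
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                   # using sigma_t^2 = beta_t
    return x
```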
Score-based Generative Models learn the score function (gradient of the log-density) of the data distribution. They can generate samples using stochastic processes:
- Score Function: ∇_x log p(x), the gradient pointing toward higher probability regions.
- Score Matching: Objective to learn the score function by matching it to the true score.
- Langevin Dynamics: MCMC sampling using ∇_x log p(x) to move toward data distribution.
- Annealed Langevin Dynamics: Multi-scale sampling with different noise levels for efficiency.
- Connection to Diffusion: Score-based models are related to diffusion through the concept of time-reversal.
These models represent the current state-of-the-art in generative modeling, offering superior sample quality compared to earlier approaches like VAEs and GANs.
Key components in CA3_Diffusion_Models/codes/:
- Diffusion Models: Forward process (adding noise) and reverse process (denoising) for generating high-quality images.
- Score-based Models: Learning the score function (gradient of log-density) for sampling via Langevin dynamics or ODE solvers.
- Training Objectives: Simplified loss for diffusion, score matching for score-based models.
- Sampling: Iterative denoising or stochastic differential equations (SDEs) for generation.
Why run CA3?
- Experience state-of-the-art image generation quality.
- Understand the connection between diffusion, score-based models, and energy-based models.
- Compare with earlier models (VAEs, GANs, Flows) in terms of sample quality and training stability.
Files of interest in `CA3_Diffusion_Models/`:
- `codes/Diffusion_Models.ipynb` — implementation of DDPM.
- `codes/score_based_models.ipynb` — score-based generative modeling.
- `report/DGM_CA3_EN_final.pdf` — detailed report on experiments and results.
- `README.md` — comprehensive documentation for CA3.
- `images/` — generated samples and visualizations from diffusion and score-based models.
High-level suggested execution order:
- Start with diffusion models: implement forward/reverse processes, train on a dataset like CIFAR-10.
- Experiment with different noise schedules and sampling steps.
- For score-based models: train the score network and sample using annealed Langevin dynamics.
CA4 explores advanced applications of deep generative models in vision-language tasks, specifically fine-tuning Google's Paligemma Vision-Language Model (VLM) on the CLEVR dataset using Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA).
Vision-Language Models (VLMs) are multi-modal models that can process both visual and textual information simultaneously. They typically consist of:
- Vision Encoder: Processes images into visual features (e.g., using Vision Transformers or CNNs).
- Text Encoder/Decoder: Handles text input/output, often based on large language models.
- Cross-Modal Fusion: Mechanisms to combine visual and textual representations for joint understanding.
Parameter-Efficient Fine-Tuning (PEFT) addresses the challenge of adapting large pre-trained models without updating all parameters:
- Low-Rank Adaptation (LoRA): Adds trainable low-rank matrices to frozen pre-trained weights, significantly reducing trainable parameters.
- Benefits: Faster training, lower memory usage, prevention of catastrophic forgetting, easier deployment.
- How it Works: For a weight matrix W, LoRA adds W + ΔW where ΔW = A×B, with A and B being low-rank matrices.
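To make the low-rank update concrete, here is a minimal sketch (not the PEFT library's internal implementation) of a LoRA-adapted linear layer: the base weight is frozen and only the two small projection matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update of rank r."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)                # start with a zero update
        self.scaling = alpha / r

    def forward(self, x):
        # W x + scaling * (B (A x)); only A and B receive gradients
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```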
Fine-Tuning VLMs involves adapting general-purpose models to specific tasks:
- Task-Specific Adaptation: Training on domain-specific data to improve performance on targeted applications.
- Instruction Tuning: Teaching models to follow natural language instructions for vision-language tasks.
- Evaluation: Using metrics like ROUGE for text generation quality and task-specific accuracy.
CLEVR Dataset is designed for evaluating visual reasoning:
- Synthetic Scenes: Rendered images with multiple objects having various attributes (color, shape, size, position).
- Complex Questions: Require counting, comparison, spatial reasoning, and logical operations.
- Ground Truth Answers: Enables precise evaluation of reasoning capabilities.
This assignment bridges traditional generative modeling with modern multi-modal AI, showing how generative techniques extend beyond image synthesis to language and reasoning tasks.
Key components in CA4_Vision_Language_Model/code/final_CA4_training.ipynb:
- Vision-Language Model: Paligemma-3B, a state-of-the-art VLM for understanding images and answering questions.
- Dataset: CLEVR (Compositional Language and Elementary Visual Reasoning), featuring synthetic scenes with multiple objects and complex questions.
- PEFT with LoRA: Efficient fine-tuning by adapting only low-rank matrices, reducing computational requirements (a configuration sketch follows this list).
- Quantization: 8-bit quantization for memory efficiency during training.
- Evaluation: ROUGE metrics to assess answer quality and model performance.
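For the PEFT component listed above, a configuration along these lines is typical. Treat this as a hedged sketch rather than the notebook's exact settings: the model id and target module names are assumptions and should be checked against the actual Paligemma checkpoint.

```python
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Model id is an assumption; use the checkpoint referenced in the notebook
model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")

lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update
    lora_alpha=16,            # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections; verify for Paligemma
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically a small fraction of all parameters
```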
Why run CA4?
- Learn to adapt large pre-trained models for specific tasks without full fine-tuning.
- Understand vision-language integration and multi-modal generative modeling.
- Experience real-world application of generative techniques in AI assistants and chatbots.
- Compare PEFT approaches with traditional full fine-tuning in terms of efficiency and performance.
Files of interest in `CA4_Vision_Language_Model/`:
- `code/final_CA4_training.ipynb` — complete implementation of Paligemma fine-tuning on CLEVR with LoRA.
- `code/eval_p1/final_CA4_results1.ipynb` — evaluation notebook for part 1.
- `code/eval_p2/final_CA4_results2.ipynb` — evaluation notebook for part 2.
- `description/DGM_HW4.pdf` — assignment description and requirements.
- `report/` — student analysis and experimental results.
- `README.md` — comprehensive documentation for CA4.
High-level suggested execution order:
- Set up environment and install dependencies (transformers, PEFT, etc.).
- Configure model and LoRA parameters.
- Load and preprocess CLEVR dataset subset.
- Fine-tune Paligemma with LoRA on visual question answering.
- Evaluate using ROUGE metrics and qualitative sample analysis.
- Save the fine-tuned model for inference and deployment.
Proper data and artifact management is crucial for reproducible experiments in generative modeling.
- Download and Caching: Datasets are downloaded by `torchvision` into `./data/` by default. To avoid re-downloads and manage storage:
  - Set the `torchvision` data directory: `export TORCH_HOME=./data` before running notebooks.
  - For large datasets like CelebA, consider using a shared cache directory if multiple users will run experiments.
- Preprocessing: Ensure consistent preprocessing pipelines across experiments (e.g., resize, normalization, data augmentation).
- Custom Datasets: For CA1's CelebA subset, the `CA1_Variational_Autoencoders/train/` folder contains pre-split smile/non-smile images. Verify integrity and consider backing it up.
- Saving Models: Save PyTorch state_dicts (`.pth` files) for generators, discriminators, VAEs, flows, etc.
  - Example: `torch.save(generator.state_dict(), 'generator_epoch_50.pth')`
- Run Metadata: Save a `run_info.json` for each experiment including hyperparameters, random seed, Git commit, and timestamps.
- Generated Samples: Save image grids or sample batches as PNG/JPG for qualitative evaluation.
- Logs and Metrics: Use TensorBoard or Weights & Biases for tracking losses, FID scores, etc.
Organize outputs like this:
```
experiments/
├── run_2023_10_01_vae_baseline/
│   ├── checkpoints/
│   │   ├── vae_epoch_10.pth
│   │   └── vae_final.pth
│   ├── samples/
│   │   ├── reconstructions.png
│   │   └── latent_interpolations.png
│   ├── logs/
│   │   └── tensorboard_logs/
│   └── run_info.json
└── run_2023_10_02_gan_fid/
    ├── ...
```
- Use Git LFS for large checkpoints or datasets if committing to repo.
- For FID evaluation, keep `real_images/` fixed: create a reproducible reference set (e.g., 2048 images sampled from the training set with a fixed RNG) and reuse it across runs (see the sketch below).
- Monitor disk usage: generative models can produce many images; clean up intermediate results.
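A small sketch of building such a fixed FID reference set with a seeded generator; the dataset, output path, and set size are placeholders.

```python
import os
import torch
from torchvision import datasets, transforms, utils

def build_fid_reference(root="./data", out_dir="real_images", n=2048, seed=0):
    """Sample a fixed, reproducible subset of training images and save them as PNGs."""
    ds = datasets.FashionMNIST(root, train=True, download=True,
                               transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)           # fixed RNG for the subset
    idx = torch.randperm(len(ds), generator=g)[:n]
    os.makedirs(out_dir, exist_ok=True)
    for i, j in enumerate(idx.tolist()):
        img, _ = ds[j]
        utils.save_image(img, f"{out_dir}/{i:05d}.png")
```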
Reproducibility is essential in machine learning research. Follow these steps for reliable, comparable results.
- Virtual Environments: Always use isolated environments (venv, conda) with pinned versions.
- Package Versions: Create a `requirements.txt` or `environment.yml` and commit it.
  - Example `requirements.txt`:
```
torch==2.0.1
torchvision==0.15.2
numpy==1.24.3
matplotlib==3.7.1
pytorch-fid==0.10.1
```
- Python Version: Specify and use consistent Python versions (e.g., 3.10).
- Seeds: Set seeds for all sources of randomness at the start of each notebook's configuration cell.
  - Example:
```python
import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
- Deterministic Operations: For PyTorch, set `torch.backends.cudnn.deterministic = True` to ensure reproducible convolutions.
- Git Commits: Record the exact commit hash for each experiment.
  - Example: `git rev-parse --short HEAD`
- Code Snapshots: Consider tagging releases or creating branches for major experiments.
- Metadata Logging: Save a comprehensive `run_info.json` for each run.
  - Suggested schema:
```json
{
  "experiment_name": "vae_baseline",
  "commit": "abc123d",
  "timestamp": "2023-10-01T12:00:00Z",
  "seed": 42,
  "hyperparameters": {
    "latent_dim": 128,
    "lr": 1e-3,
    "batch_size": 64,
    "epochs": 50
  },
  "model_config": {
    "encoder_layers": [784, 512, 256, 128],
    "decoder_layers": [128, 256, 512, 784]
  },
  "dataset": "CelebA_smile",
  "notes": "Baseline VAE with KL annealing"
}
```
- Metrics and Logs: Log losses, evaluation metrics (FID, IS, etc.), and qualitative samples.
- GPU/CPU: Note the hardware used; results may vary between CPU/GPU or different GPU models.
- Memory: Ensure sufficient RAM/VRAM; document batch sizes that fit your hardware.
- Fixed Splits: Use fixed train/val/test splits with seeded random splits.
- Preprocessing: Apply identical preprocessing to all data (e.g., same normalization stats).
By following this checklist, experiments should be reproducible across different machines and time.
Before committing to long training runs (which can take hours or days), perform these quick checks to catch issues early.
- VAE (CA1):
  - Assert the encoder outputs mean/logvar with correct shapes: `assert mu.shape == (batch_size, latent_dim)`
  - Assert the decoder reconstructs to the original image shape.
  - Test reparameterization: sample z and verify gradients flow.
- RealNVP (CA2):
  - Assert the forward pass returns `(z, log_det_jacobian)` with correct shapes.
  - Assert the inverse maps z back to x: `torch.allclose(x, inverse(z), atol=1e-5)`
  - Check that `log_det_jacobian` is finite and reasonable.
- GAN (CA2):
  - Assert the generator output shape: `(batch_size, channels, height, width)`
  - Assert the discriminator output: one scalar per image.
  - Test with fixed noise: verify consistent outputs.
- Diffusion/Score-based (CA3):
  - Assert noise addition/removal preserves shapes.
  - Verify score function gradients are finite.
- Set small parameters: `batch_size=16`, `epochs=1`, `latent_dim=10`, `N=128` samples.
- Run the training loop and check:
  - Losses decrease (not NaN/inf).
  - No runtime exceptions.
  - Checkpoints save/load correctly.
  - Generated samples look plausible (not all black/white).
- FID (CA2):
  - Compute on small sets (200 real vs. 200 generated).
  - Expect noisy values, but verify the end-to-end pipeline works.
  - Verify preprocessing: images resized to 299x299, normalized to [-1,1] for training and then converted to [0,1] for Inception.
- Log-Likelihood (CA2):
  - Compute on a small batch; check that values are negative and finite.
- Reconstruction/Generation Quality:
  - Visual inspection: save and view sample grids.
  - Quantitative: PSNR/SSIM for reconstructions, diversity metrics for generations.
Consider adding unit tests using pytest:
```python
import torch
# Assumes the notebook's VAE class has been exported to an importable module.
from vae import VAE

def test_vae_forward():
    vae = VAE(latent_dim=10)
    x = torch.randn(4, 3, 64, 64)
    recon, mu, logvar = vae(x)
    assert recon.shape == x.shape
    assert mu.shape == (4, 10)
```
Run tests with `pytest tests/` (create a `tests/` directory with test files).
This section covers frequent problems encountered when running the notebooks and suggested fixes.
- PyTorch/CUDA Mismatch: Ensure the PyTorch version matches your CUDA toolkit. Use `nvidia-smi` to check the CUDA version, then install matching PyTorch from https://pytorch.org.
- pytorch-fid Errors: If `calculate_fid_given_paths` fails, try `pip install clean-fid` and use `clean_fid.compute_fid` instead. Ensure images are in the [0,1] range and RGB.
- Missing Dependencies: For diffusion models, install `torchdiffeq` for ODE integration. If ODE solvers fail, fall back to a simpler Euler discretization.
- CUDA Out of Memory: Reduce `batch_size` (e.g., from 64 to 16), or use gradient accumulation. Enable mixed precision: `scaler = torch.cuda.amp.GradScaler()` (see the sketch after this list).
- NaN Losses: Check for exploding gradients; add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. Verify input normalization.
- Shape Mismatches: Double-check tensor shapes in forward passes. Use `print(x.shape)` liberally during debugging.
- Slow Training: Profile with `torch.profiler` or `cProfile`. Ensure data loading is not the bottleneck.
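Putting the OOM and NaN tips together, here is a hedged sketch of a training step with mixed precision and gradient clipping; `model`, `optimizer`, and `loss_fn` are placeholders, not the notebooks' exact training loops.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, x):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():            # run forward pass and loss in mixed precision
        loss = loss_fn(model(x), x)
    scaler.scale(loss).backward()              # scaled backward pass to avoid underflow
    scaler.unscale_(optimizer)                 # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```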
- VAE (CA1): If KL divergence explodes, anneal it: multiply by a coefficient that increases from 0 to 1 over epochs.
- RealNVP (CA2): If log_det_jacobian is NaN, check for zero determinants in affine transformations; add small epsilon to denominators.
- GAN (CA2): Mode collapse: monitor diversity in generated samples. If discriminator overpowers, adjust learning rates or use WGAN-GP.
- Diffusion (CA3): If sampling fails, reduce noise schedule steps or use DDIM for faster sampling.
- FID Too High/Low: Ensure real and generated images are preprocessed identically. For FashionMNIST, resize to 64x64, normalize to [-1,1], then for FID convert to [0,1] and resize to 299x299.
- Log-Likelihood Negative Infinity: Clamp log probabilities to avoid -inf; add `log_prob = torch.clamp(log_prob, min=-1e10)`.
- Dataset Download Fails: Check your internet connection; CelebA may require a manual download due to licensing. Use `wget` or a browser to download the files and place them in `./data/`.
- Corrupted Images: Verify dataset integrity; torchvision may re-download if files are missing.
- Kernel Crashes: Restart kernel; check for infinite loops or memory leaks.
- Import Errors: Ensure all packages are installed in the active environment. Use `conda list` or `pip list` to verify.
- Notebook Not Saving: Check disk space; try saving as a .py file and converting back.
- Use `torch.compile` (PyTorch 2.0+) for speedups: `model = torch.compile(model)`.
- For multi-GPU, use `torch.nn.DataParallel` or DDP.
- Profile memory: `torch.cuda.memory_summary()`.
If issues persist, check GitHub issues for similar problems or post with full error traceback and environment details.
This section lists key papers, books, and resources related to the course topics.
- VAEs:
- Kingma, D.P. and Welling, M. "Auto-Encoding Variational Bayes." ICLR 2014.
- Rezende, D.J., Mohamed, S. and Wierstra, D. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML 2014.
- Normalizing Flows:
- Dinh, L., Sohl-Dickstein, J. and Bengio, S. "Density estimation using Real NVP." ICLR 2017.
- Kingma, D.P. and Dhariwal, P. "Glow: Generative Flow with Invertible 1x1 Convolutions." NeurIPS 2018.
- Papamakarios, G., et al. "Normalizing Flows for Probabilistic Modeling and Inference." JMLR 2021.
- GANs:
- Goodfellow, I., et al. "Generative Adversarial Nets." NeurIPS 2014.
- Radford, A., Metz, L. and Chintala, S. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." ICLR 2016.
- Gulrajani, I., et al. "Improved Training of Wasserstein GANs." NeurIPS 2017.
- Diffusion Models:
- Sohl-Dickstein, J., et al. "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML 2015.
- Ho, J., Jain, A. and Abbeel, P. "Denoising Diffusion Probabilistic Models." NeurIPS 2020.
- Song, Y., et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021.
- Dhariwal, P. and Nichol, A. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021.
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (Chapter 20 on Generative Models).
- "Probabilistic Machine Learning: An Introduction" by Kevin P. Murphy.
- Lilian Weng's blog: "What are Diffusion Models?" (lilianweng.github.io/posts/2021-07-11-diffusion-models/)
- PyTorch tutorials on VAEs, GANs: https://pytorch.org/tutorials/
- Hugging Face Diffusers library: https://huggingface.co/docs/diffusers/index
- OpenAI's improved DDPM: https://github.com/openai/improved-diffusion
- CelebA: Liu, Z., et al. "Deep Learning Face Attributes in the Wild." ICCV 2015.
- FashionMNIST: Xiao, H., et al. "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms." arXiv 2017.
- CIFAR-10/100: Krizhevsky, A. "Learning Multiple Layers of Features from Tiny Images." 2009.
For the latest research, check arXiv, NeurIPS, ICML, ICLR proceedings.
- Slides/: Lecture slides from the course, including:
  - `DGM_Fall_2023_Slides/`: Course lecture slides with annotated versions and PDFs on topics like Mean-Field VI, Normalizing Flows, VAEs, and Diffusion Models.
  - `Stanford_slides/`: Supplementary slides from Stanford's CS236 (Deep Generative Models) course.
- Exams/: Past midterm and final exams with solutions, useful for review and practice.
- Extra/: Miscellaneous resources including:
  - `VAE.ipynb` and `VAE.py`: Additional VAE implementations.
  - Research papers on bidirectional VAEs, D-separation, etc.
  - `homework_template/`: Homework templates and utility scripts.
If you encounter issues with the course materials:
- Check the Troubleshooting Section: Common problems and solutions are documented above.
- Review Prerequisites: Ensure you meet the mathematical and programming requirements.
- GitHub Issues: Post detailed bug reports or questions in the repository issues, including:
- Full error traceback
- Your environment (Python version, PyTorch version, OS)
- Steps to reproduce
- Expected vs. actual behavior
- Course Discussion: For conceptual questions, refer to the lecture slides or contact the course instructor.
- Community Resources: Check PyTorch forums, Stack Overflow, or arXiv for related research questions.
When posting issues, provide minimal reproducible examples and avoid sharing sensitive data.
The field of deep generative models is rapidly evolving. Based on this course, consider exploring:
- Hierarchical VAEs: Multi-scale latent representations
- Flow-based VAEs: Combining variational inference with normalizing flows
- Energy-based Models: Unnormalized probabilistic models with contrastive divergence
- Molecular Design: Generating novel drug compounds
- Art and Creativity: AI-assisted content creation
- Anomaly Detection: Identifying outliers in high-dimensional data
- Data Augmentation: Synthetic data generation for limited datasets
- Optimal Transport: Using Wasserstein distances in generative training
- Neural ODEs: Continuous-time generative processes
- Self-Supervised Learning: Learning representations without explicit labels
- Bias and Fairness: Ensuring generated data doesn't perpetuate societal biases
- Deepfakes Detection: Developing methods to identify synthetic media
- Privacy: Balancing generative capabilities with data protection
This course provides a solid foundation for contributing to these exciting research directions.
This repository contains course materials for the Deep Generative Models course. The code and notebooks are intended for educational and research use. If you reuse code or figures derived from these materials in publications or public projects, please credit the course author and repository.