A tabular data generation framework that integrates SimCLR-style contrastive loss into CTGAN to generate realistic, privacy-preserving credit card transaction data. By enhancing the discriminator with an auxiliary contrastive task, ContraCTGAN better preserves inter-feature dependencies and marginal distributions in highly imbalanced datasets.
This project enhances the standard CTGAN architecture by introducing a contrastive branch to the discriminator:
- Data Preprocessing — Mode-specific normalization for continuous columns and conditional sampling for discrete columns.
- Embedding — Map tabular inputs (real + condition vector) into a latent space via an MLP Embedder.
- Contrastive Augmentation — Generate lightly noised copies of real batches to create positive pairs.
- Auxiliary Loss Integration — Optimize the discriminator with both the WGAN-GP loss and the NT-Xent (SimCLR) loss; see the sketch after this list.
- Generator Training — Train the generator to minimize the refined discriminator's feedback, resulting in higher fidelity synthetic data.
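A minimal sketch of the contrastive branch described above. The embedder architecture, noise scale, and names such as `nt_xent_loss`, `embedder`, and `real_cat` are illustrative assumptions, not the exact code in `Scripts/contrastive_ctgan.py`:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2N, d)
    sim = (z @ z.t()) / temperature                     # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))          # drop self-similarity
    # The positive of sample i is its noised twin at index i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# In the discriminator step, two lightly noised views of the real batch
# (concatenated with the condition vector) are embedded and pulled together:
#   view1 = embedder(real_cat + noise_scale * torch.randn_like(real_cat))
#   view2 = embedder(real_cat + noise_scale * torch.randn_like(real_cat))
#   loss_d = wgan_gp_loss + contrastive_lambda * nt_xent_loss(view1, view2, contrastive_temperature)
```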
The model minimizes the following objective:
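$$\mathcal{L}_D \;=\; \mathcal{L}_{\text{WGAN-GP}} \;+\; \lambda\,\mathcal{L}_{\text{NT-Xent}}$$

where λ (`contrastive_lambda`) weights the auxiliary contrastive term and τ (`contrastive_temperature`) scales the pairwise similarities inside the NT-Xent loss. The generator is then trained against this refined critic as in standard WGAN-GP.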
- Python 3.10+
- PyTorch 2.6+
- CUDA 12.1+ (for GPU acceleration)
Tested on: NVIDIA RTX 3090 (24GB) • Ryzen 7 7800X3D • 32GB RAM
ContraCTGAN/
├── Datasets/ # Dataset directory (not tracked)
├── Models/ # Trained model weights (LFS tracked)
├── Notebooks/
│ └── synthetic_evaluation.ipynb # Fidelity + utility evaluation
├── Scripts/
│ ├── contrastive_ctgan.py # Main training script (single run)
│ ├── contrastive_ctganF.py # Hyperparameter sweep script
│ └── utils/
├── SyntheticDatasets/ # Generated output CSVs
├── requirements.txt # Python dependencies
└── README.md
- Clone the repository
git clone https://github.com/johnamit/ContraCTGAN.git
cd ContraCTGAN
- Install dependencies. Using a virtual environment (conda or venv) is recommended.
pip install -r requirements.txt
- Prepare your dataset
Download the split Credit Card Fraud Detection dataset and place it in the `Datasets/` folder:
Datasets/
├── creditcard_train.csv # 80% stratified split
└── creditcard_test.csv # 20% stratified split
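If you are starting from the original single-file `creditcard.csv` from Kaggle, an equivalent 80/20 stratified split can be produced along these lines (the label column name `Class` follows the Kaggle release and is an assumption here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")

# Stratify on the fraud label so both splits keep the same class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["Class"], random_state=42
)

train_df.to_csv("Datasets/creditcard_train.csv", index=False)
test_df.to_csv("Datasets/creditcard_test.csv", index=False)
```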
To run a training sweep or default training session:
python Scripts/contrastive_ctganF.py
This will:
- Load `creditcard_train.csv`.
- Train the ContraCTGAN model.
- Save weights to `Models/` and synthetic samples to `SyntheticDatasets/`.
You can modify the training parameters within the script constructors. Key arguments include:
| Argument | Type | Default | Description |
|---|---|---|---|
| `contrastive_lambda` | float | 0.5 | Weight of the NT-Xent contrastive loss |
| `contrastive_temperature` | float | 0.5 | Temperature parameter for contrastive scaling |
| `epochs` | int | 100 | Number of training epochs |
| `batch_size` | int | 500 | Batch size for training |
| `use_amp` | bool | True | Enable Automatic Mixed Precision |
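For illustration only, a run configured inside `Scripts/contrastive_ctgan.py` could look roughly like the snippet below. The class name `ContrastiveCTGAN` and the `fit`/`sample` calls are assumptions about the script's internals; only the keyword arguments come from the table above.

```python
# Hypothetical sketch; argument names come from the table above, the class/API does not.
model = ContrastiveCTGAN(
    epochs=100,
    batch_size=500,
    contrastive_lambda=0.5,
    contrastive_temperature=0.5,
    use_amp=True,
)
model.fit(train_df)                      # train on creditcard_train.csv loaded as a DataFrame
synthetic = model.sample(len(train_df))  # draw synthetic rows after training
```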
Comparison of ContraCTGAN variants against baseline CTGAN and TVAE on the Credit Card Fraud dataset.
- Fidelity: Measured by Wasserstein Distance (WSD), Jensen–Shannon Divergence (JSD), and L2 Pearson correlation distance.
- Utility: Measured by Accuracy, AUC, and F1 score of an XGBoost classifier trained on the synthetic data and tested on real data.
| Model | WSD ↓ | JSD ↓ | L2 Pearson ↓ | Accuracy ↑ | AUC ↑ | F1 ↑ |
|---|---|---|---|---|---|---|
| Real Data | – | – | – | 0.9996 | 0.9777 | 0.8990 |
| Baseline CTGAN | 167.66 | 0.19825 | 8.19410 | 0.9792 | 0.9762 | 0.1392 |
| Baseline TVAE | 240.79 | 0.16331 | NaN | 0.9982 | 0.5000 | 0.0000 |
| ContraCTGAN (λ=0.5, τ=0.5) | 109.73 | 0.19842 | 7.73692 | 0.9919 | 0.9673 | 0.2883 |
| ContraCTGAN (λ=0.5, τ=0.1) | 122.52 | 0.20566 | 8.18397 | 0.9894 | 0.9774 | 0.2418 |
| ContraCTGAN (λ=0.2, τ=1.0) | 187.77 | 0.20333 | 7.92103 | 0.9931 | 0.9808 | 0.3160 |
| ContraCTGAN (λ=0.8, τ=0.8) | 81.34 | 0.19956 | 8.56982 | 0.9884 | 0.9690 | 0.2259 |
Key Findings:
- Utility Boost: The ContraCTGAN configuration with λ = 0.5, τ = 0.5 more than doubles the F1 score of the baseline CTGAN (0.2883 vs 0.1392), indicating significantly better capture of the minority fraud class.
- Fidelity: ContraCTGAN generally reduces Wasserstein Distance and L2 Pearson distance, preserving marginal distributions and correlations better than the baseline.
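A rough sketch of the evaluation protocol described above. It assumes the fraud label column is `Class`, reads the L2 Pearson distance as the Frobenius norm between real and synthetic correlation matrices, and averages per-column metrics; these readings and the synthetic file name are assumptions, not the exact notebook code.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from xgboost import XGBClassifier

real = pd.read_csv("Datasets/creditcard_test.csv")
synth = pd.read_csv("SyntheticDatasets/contractgan_samples.csv")  # hypothetical file name
features = [c for c in real.columns if c != "Class"]

def column_jsd(a, b, bins=50):
    # Histogram both columns over a shared range, then take the JS distance
    # (scipy returns the distance, i.e. the square root of the divergence).
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p + 1e-12, q + 1e-12)

# Fidelity: per-column Wasserstein distance, per-column JSD, and correlation distance.
wsd = np.mean([wasserstein_distance(real[c], synth[c]) for c in features])
jsd = np.mean([column_jsd(real[c], synth[c]) for c in features])
l2_pearson = np.linalg.norm(real[features].corr() - synth[features].corr())

# Utility (train on synthetic, test on real): fit XGBoost on synthetic data,
# evaluate on the real hold-out split.
clf = XGBClassifier(n_estimators=300, eval_metric="logloss")
clf.fit(synth[features], synth["Class"])
pred = clf.predict(real[features])
proba = clf.predict_proba(real[features])[:, 1]

print("WSD:", wsd, "JSD:", jsd, "L2 Pearson:", l2_pearson)
print("Acc:", accuracy_score(real["Class"], pred),
      "AUC:", roc_auc_score(real["Class"], proba),
      "F1:", f1_score(real["Class"], pred))
```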
If you use this code in your research, please cite:
@misc{john2025contractgan,
title = {ContraCTGAN: Enhancing CTGAN’s Tabular Data Generation with Contrastive Loss Integration},
author = {Amit John},
year = {2025},
note = {GitHub repository: https://github.com/johnamit/ContraCTGAN}
}
- Code: MIT License
- Dataset: Open Database License (ODbL) 1.0
