GitHub - johnamit/contra-ctgan: A CTGAN variant with SimCLR-style NT-Xent contrastive loss for better synthetic credit-card fraud data. Evaluated on both data fidelity and utility via an XGBoost classifier.

A tabular data generation framework that integrates SimCLR-style contrastive loss into CTGAN to generate realistic, privacy-preserving credit card transaction data. By enhancing the discriminator with an auxiliary contrastive task, ContraCTGAN better preserves inter-feature dependencies and marginal distributions in highly imbalanced datasets.

Overview

This project enhances the standard CTGAN architecture by introducing a contrastive branch to the discriminator:

Data Preprocessing: Mode-specific normalization for continuous columns and conditional sampling for discrete columns.
Embedding: Map tabular inputs (real + condition vector) into a latent space via an MLP Embedder.
Contrastive Augmentation: Generate lightly noised copies of real batches to create positive pairs.
Auxiliary Loss Integration: Optimize the discriminator using both WGAN-GP loss and NT-Xent (SimCLR) loss.
Generator Training: Train the generator against the refined discriminator's feedback, which is intended to yield higher-fidelity synthetic data.
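The contrastive augmentation and NT-Xent steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `add_noise`, `nt_xent`, and the noise scale `sigma` are hypothetical names and values (the text above does not give the noise level), and plain lists stand in for the outputs of the MLP Embedder.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def add_noise(batch, sigma, rng):
    """Return a lightly noised copy of a batch (sigma is an assumed noise scale)."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in batch]


def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss over 2N embeddings, where (z_a[i], z_b[i]) are positive pairs."""
    z = z_a + z_b
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        # log-sum-exp over all other samples, computed stably
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Toy embeddings standing in for Embedder outputs of a real batch.
rng = random.Random(0)
real = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = add_noise(real, sigma=0.01, rng=rng)
print("NT-Xent on aligned pairs:", nt_xent(real, noisy))
```

The loss is low when each row is closest to its own noised copy and rises when the pairs are mismatched, which is the signal the auxiliary task feeds back into the discriminator.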

The discriminator minimizes the following total objective:

$$\mathcal{L}_D^{\text{total}} = \mathcal{L}_D^{\text{WGAN-GP}} + \lambda_{\text{contrastive}} \cdot \mathcal{L}_{\text{NT-Xent}}$$
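For reference, the two terms are usually written as below. These are the standard forms from the WGAN-GP and SimCLR literature, given here on the assumption that the repository follows them; the symbols $\lambda_{\text{GP}}$, $\hat{x}$, $z_i$, $\tau$, and $N$ are not defined in the original text. Here $\hat{x}$ is a random interpolate between a real and a generated sample, $z_i$ are embeddings of the $2N$ samples in a batch of $N$ positive pairs, $\mathrm{sim}$ is cosine similarity, and $\tau$ is the temperature.

$$\mathcal{L}_D^{\text{WGAN-GP}} = \mathbb{E}_{\tilde{x} \sim p_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim p_r}\big[D(x)\big] + \lambda_{\text{GP}} \, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]$$

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}, \qquad \mathcal{L}_{\text{NT-Xent}} = \frac{1}{2N} \sum_{k=1}^{N} \big[\ell_{2k-1,2k} + \ell_{2k,2k-1}\big]$$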

Prerequisites

Python 3.10+
PyTorch 2.6+
CUDA 12.1+ (for GPU acceleration)

Tested on: NVIDIA RTX 3090 (24GB) • Ryzen 7 7800X3D • 32GB RAM

Project Structure

ContraCTGAN/
├── Datasets/                   # Dataset directory (not tracked)
├── Models/                     # Trained model weights (LFS tracked)
├── Notebooks/
│   └── synthetic_evaluation.ipynb  # Fidelity + utility evaluation
├── Scripts/
│   ├── contrastive_ctgan.py    # Main training script (single run)
│   ├── contrastive_ctganF.py   # Hyperparameter sweep script
│   └── utils/
├── SyntheticDatasets/          # Generated output CSVs
├── requirements.txt            # Python dependencies
└── README.md

Installation

Clone the repository

git clone https://github.com/johnamit/ContraCTGAN.git
cd ContraCTGAN

Install dependencies. Using a virtual environment (conda or venv) is recommended.

pip install -r requirements.txt

Prepare your dataset. Download the split Credit Card Fraud Detection dataset and place it in the Datasets folder:

Datasets/
├── creditcard_train.csv    # 80% stratified split
└── creditcard_test.csv     # 20% stratified split
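If only an unsplit CSV is at hand, an 80/20 stratified split can be produced along the following lines. This is a sketch using only the standard library; the function names, the fixed seed, and the `Class` label column are assumptions, and the repository may have created its split differently (for example with scikit-learn).

```python
import csv
import random
from collections import defaultdict


def stratified_split(rows, label_key="Class", test_fraction=0.2, seed=42):
    """Split rows into (train, test), sampling the test share within each class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def split_csv(src, train_path, test_path, **kwargs):
    """Read src, split it with stratified_split, and write the two CSV files."""
    with open(src, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    train, test = stratified_split(rows, **kwargs)
    for path, part in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part)
```

Sampling within each class keeps the rare fraud rows represented in both files, which a purely random split of a highly imbalanced dataset does not guarantee.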

Usage

Training

Default Configuration

To launch the hyperparameter sweep script with its default configuration (for a single training run, use Scripts/contrastive_ctgan.py instead):

python Scripts/contrastive_ctganF.py

This will:

Load creditcard_train.csv.
Train the ContraCTGAN model.
Save weights to Models/ and synthetic samples to SyntheticDatasets/.

Configuration Parameters

Training parameters are set through the constructor arguments inside the scripts. Key arguments include:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `contrastive_lambda` | float | `0.5` | Weight of the NT-Xent contrastive loss |
| `contrastive_temperature` | float | `0.5` | Temperature parameter for contrastive scaling |
| `epochs` | int | `100` | Number of training epochs |
| `batch_size` | int | `500` | Batch size for training |
| `use_amp` | bool | `True` | Enable Automatic Mixed Precision |
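One way to keep these arguments together and to enumerate sweep combinations is sketched below. `TrainConfig` and `sweep` are hypothetical helpers, not classes or functions from the repository; only the field names and defaults come from the table above, and the grid values in the usage lines are illustrative.

```python
from dataclasses import dataclass, asdict
from itertools import product


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container mirroring the documented constructor arguments."""
    contrastive_lambda: float = 0.5
    contrastive_temperature: float = 0.5
    epochs: int = 100
    batch_size: int = 500
    use_amp: bool = True

    def __post_init__(self):
        if self.contrastive_lambda < 0:
            raise ValueError("contrastive_lambda must be non-negative")
        if self.contrastive_temperature <= 0:
            raise ValueError("contrastive_temperature must be positive")
        if self.epochs < 1 or self.batch_size < 1:
            raise ValueError("epochs and batch_size must be at least 1")


def sweep(lambdas, temperatures):
    """Build one config per (lambda, temperature) pair, other fields at defaults."""
    return [
        TrainConfig(contrastive_lambda=lam, contrastive_temperature=temp)
        for lam, temp in product(lambdas, temperatures)
    ]


# Illustrative grid; these values are not taken from the sweep script.
for config in sweep([0.2, 0.5], [0.1, 0.5, 1.0]):
    print(asdict(config))
```

The resulting dictionaries could be unpacked into whatever constructor the training script exposes, assuming its keyword names match the table.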

Results

Fidelity and Utility Metrics

Comparison of ContraCTGAN variants against baseline CTGAN and TVAE on the Credit Card Fraud dataset.

Fidelity: Measured by Wasserstein Distance (WSD), Jensen–Shannon Divergence (JSD), and L2 Pearson correlation distance.
Utility: Measured by Accuracy, AUC, and F1 score of an XGBoost classifier trained on the synthetic data and tested on real data.
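The three fidelity measures can be sketched per column (or per column pair) as follows. This is an illustration in plain Python with hypothetical function names; how the evaluation notebook bins values for JSD or aggregates per-column results into the single figures reported is not stated here, so those steps are left out.

```python
import math


def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-sized samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0 to 1) of two discrete distributions."""
    def kl(u, v):
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_l2(real_cols, synth_cols):
    """L2 (Frobenius) norm of the difference between two correlation matrices."""
    k = len(real_cols)
    total = 0.0
    for i in range(k):
        for j in range(k):
            diff = pearson(real_cols[i], real_cols[j]) - pearson(synth_cols[i], synth_cols[j])
            total += diff * diff
    return math.sqrt(total)


print(wasserstein_1d([0, 1, 2], [1, 2, 3]))   # shifted sample
print(js_divergence([0.5, 0.5], [0.9, 0.1]))  # two histograms over the same bins
```

Lower values mean the synthetic data sits closer to the real data on each measure, matching the ↓ arrows in the results table.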

| Model | WSD ↓ | JSD ↓ | L2 Pearson ↓ | Accuracy ↑ | AUC ↑ | F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Real Data | – | – | – | 0.9996 | 0.9777 | 0.8990 |
| Baseline CTGAN | 167.66 | 0.19825 | 8.19410 | 0.9792 | 0.9762 | 0.1392 |
| Baseline TVAE | 240.79 | 0.16331 | NaN | 0.9982 | 0.5000 | 0.0000 |
| ContraCTGAN (=0.5, =0.5) | 109.73 | 0.19842 | 7.73692 | 0.9919 | 0.9673 | 0.2883 |
| ContraCTGAN (=0.5, =0.1) | 122.52 | 0.20566 | 8.18397 | 0.9894 | 0.9774 | 0.2418 |
| ContraCTGAN (=0.2, =1.0) | 187.77 | 0.20333 | 7.92103 | 0.9931 | 0.9808 | 0.3160 |
| ContraCTGAN (=0.8, =0.8) | 81.34 | 0.19956 | 8.56982 | 0.9884 | 0.9690 | 0.2259 |
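A row that combines high accuracy with an F1 of 0, as Baseline TVAE does, is consistent with a classifier that almost never predicts the fraud class. The sketch below shows the effect on made-up labels (17 frauds in 10,000 rows, chosen for illustration and not taken from the repository's test set); `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 marks the fraud class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Illustrative labels only: 17 frauds among 10,000 transactions.
y_true = [1] * 17 + [0] * 9983
always_negative = [0] * 10000  # a classifier that never flags fraud

accuracy, f1 = classification_metrics(y_true, always_negative)
print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")  # accuracy=0.9983, f1=0.0000
```

This is why F1 is the more informative utility column for this dataset: accuracy is dominated by the majority class.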

Key Findings:

Utility Boost: The ContraCTGAN configuration with both contrastive hyperparameters set to 0.5 more than doubles the F1 score of the baseline CTGAN (0.2883 vs 0.1392), indicating better capture of the minority fraud class.
Fidelity: ContraCTGAN reduces Wasserstein Distance and L2 Pearson distance relative to the baseline CTGAN in three of the four configurations, indicating better-preserved marginal distributions and correlations; JSD is marginally higher than the baseline in all four.

Citation

If you use this code in your research, please cite:

@misc{john2025contractgan,
  title  = {ContraCTGAN: Enhancing CTGAN’s Tabular Data Generation with Contrastive Loss Integration},
  author = {Amit John},
  year   = {2025},
  note   = {GitHub repository: https://github.com/johnamit/ContraCTGAN}
}

License

Code: MIT License
Dataset: Open Database License (ODbL) 1.0
