Consistently Getting ~0.5 AUROC on Full Kaggle Criteo Dataset with DLRM (torchrec) #395

@ali-fani-sd

Description

Hi,

I am encountering a significant issue while trying to replicate expected DLRM performance on the Kaggle Criteo dataset using torchrec.

I have downloaded approximately 45 million records and am training a DLRM model on roughly 36 million of these records, with the remainder used for validation and testing.

Despite training on this large dataset with the configuration specified below, I am consistently observing very poor AUROC scores on both the test and validation sets:

AUROC (Test): 0.4998681843280792
AUROC (Validation): 0.4998681843280792
These scores are essentially at the level of random guessing and indicate that the model is not learning effectively.
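As a quick sanity check on my side (this snippet is my own illustration, not part of the training code), an AUROC of ~0.5 is exactly what uninformative predictions produce. A minimal rank-based AUROC implementation (the `auroc` helper below is hypothetical) applied to random scores confirms this:

```python
import random

def auroc(labels, scores):
    # AUROC = probability that a randomly chosen positive example
    # is scored above a randomly chosen negative example
    # (ties count as half a win).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
labels = [random.randint(0, 1) for _ in range(2000)]
scores = [random.random() for _ in labels]  # uninformative predictions
print(round(auroc(labels, scores), 3))      # expected to be close to 0.5
```

So the reported 0.4998 is consistent with the model's outputs carrying no signal at all, rather than with a metric bug.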

Here is the command I am using to run the training job:

torchx run -s local_cwd dist.ddp -j 1x8 \
  --script torchrec_dlrm/dlrm_main.py \
  -- \
  --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
  --pin_memory \
  --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
  --mmap_mode \
  --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
  --learning_rate 1.0 \
  --dataset_name criteo_kaggle \
  --embedding_dim 128 \
  --dense_arch_layer_sizes 512,256,128 \
  --over_arch_layer_sizes 1024,1024,512,256,1
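For reference on the per-rank batch size arithmetic: with `-j 1x8` there are 8 local workers, so each rank gets the global batch divided by the world size. The concrete numbers below are illustrative, not from my actual run:

```shell
# Hypothetical values; -j 1x8 means 8 local workers, so WORLD_SIZE=8.
GLOBAL_BATCH_SIZE=16384
WORLD_SIZE=8
# Each rank receives an equal slice of the global batch.
echo $((GLOBAL_BATCH_SIZE / WORLD_SIZE))   # prints 2048
```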

I would appreciate any insights into why the model fails to learn on the full dataset, especially given that a much higher AUROC is expected on this benchmark.
