Description
Hi,
I am encountering a significant issue while trying to reproduce the expected DLRM performance on the Kaggle Criteo dataset using torchrec.
I have downloaded approximately 45 million records and am training a DLRM model on roughly 36 million of these records, with the remainder used for validation and testing.
Despite training on this large dataset with the configuration specified below, I am consistently observing very poor AUROC scores on both the test and validation sets:
AUROC (Test): 0.4998681843280792
AUROC (Validation): 0.4998681843280792
These scores are essentially at the level of random guessing and indicate that the model is not learning effectively.
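An AUROC that is identical (to many decimal places) on both splits usually points at constant predictions or a data problem rather than ordinary underfitting. As a first sanity check, one can inspect the label base rate in the preprocessed files. Below is a minimal sketch; the `*labels*.npy` naming is an assumption based on torchrec's Criteo preprocessing scripts and may differ in your layout:

```python
# Sanity-check the preprocessed labels: a positive rate of 0.0 or 1.0
# in any split would explain a degenerate AUROC of ~0.5.
import glob
import numpy as np

# Same directory that is passed via --in_memory_binary_criteo_path.
PREPROCESSED_DATASET = "/path/to/preprocessed"  # placeholder path

for path in sorted(glob.glob(f"{PREPROCESSED_DATASET}/*labels*.npy")):
    labels = np.load(path)
    print(path, "rows:", labels.shape[0], "positive rate:", float(labels.mean()))
```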
Here is the command I am using to run the training job:
```sh
torchx run -s local_cwd dist.ddp -j 1x8 \
    --script torchrec_dlrm/dlrm_main.py \
    -- \
    --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
    --pin_memory \
    --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
    --mmap_mode \
    --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
    --learning_rate 1.0 \
    --dataset_name criteo_kaggle \
    --embedding_dim 128 \
    --dense_arch_layer_sizes 512,256,128 \
    --over_arch_layer_sizes 1024,1024,512,256,1
```
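For completeness, the Criteo dataset has 26 categorical features (C1 through C26), so the `--num_embeddings_per_feature` list above should contain exactly 26 table sizes; a quick check:

```python
# Verify the embedding-table size list covers all 26 Criteo
# categorical features (C1..C26).
sizes = [int(s) for s in (
    "40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,"
    "3067956,405282,10,2209,11938,155,4,976,14,40000000,"
    "40000000,40000000,590152,12973,108,36"
).split(",")]
assert len(sizes) == 26, f"expected 26 sizes, got {len(sizes)}"
```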
I would appreciate any insights or suggestions on why the model fails to learn on the full dataset, especially since a much higher AUROC is expected on this benchmark.