Merged
Changes from 2 commits
33 changes: 29 additions & 4 deletions RecommenderSystems/dlrm/README.md
@@ -1,5 +1,5 @@
# DLRM
[DLRM](https://arxiv.org/pdf/1906.00091.pdf) is a deep learning-based recommendation model that exploits categorical data model for CTR recommendation. Its model structure is as follows. Based on this structure, this project uses OneFlow distributed deep learning framework to realize training the model in graph mode and eager mode respectively on Criteo data set.
[DLRM](https://arxiv.org/pdf/1906.00091.pdf) is a deep learning-based recommendation model that exploits categorical data for CTR recommendation. Its model structure is as follows. Based on this structure, this project uses the OneFlow distributed deep learning framework to train the model in graph mode on the Criteo data set.
![image](https://user-images.githubusercontent.com/63446546/158937131-1a057659-0d49-4bfb-aee2-5568e605fa01.png)

## Directory description
@@ -14,16 +14,41 @@
## Arguments description
|Argument Name|Argument Explanation|Default Value|
|-----|---|------|
|batch_size|the data batch size in one step training|16384|
|data_dir|the data file directory|None|
|use_fusedmlp|use fused MLP or not||
|embedding_vec_size||128|
|bottom_mlp||512,256,128|
|top_mlp||1024,1024,512,256|
|disable_interaction_padding|disable interaction padding or not||
|interaction_itself|interaction itself or not||
|model_load_dir|model loading directory||
|model_save_dir|model saving directory|./checkpoint|
|save_initial_model|save initial model parameters or not.||
|save_model_after_each_eval|save model after each eval||
|not_eval_after_training|skip eval after training or not||

Review comment (Contributor): remove this

|data_dir|the data file directory|/dataset/dlrm_parquet|
|eval_batchs|<0: whole val ds, 0: do not val, >0: number of eval batches|-1|

Review comment (Contributor): eval_batches

|eval_batch_size||55296|
|eval_batch_size_per_proc||None|

Review comment (Contributor): remove this

|eval_interval||10000|
|batch_size|the data batch size in one step training|55296|

Review comment (Contributor): train_batch_size

|batch_size_per_proc||None|

Review comment (Contributor): remove this

|learning_rate|learning rate|24|
|warmup_batches||2750|
|decay_batches||27772|
|decay_start||49315|
|max_iter|maximum number of training iterations (batches)|75000|
|loss_print_every_n_iter|print the training loss and validate the model every n training batches|100|
|column_size_array|column_size_array||
|persistent_path|path for persistent kv store||
|store_type|||
|device_memory_budget_mb_per_rank||8192|

Review comment (Contributor): cache_memory_budget_mb_per_rank

|use_fp16|Run model with amp||

Review comment (Contributor): amp

|loss_scale_policy|static or dynamic|static|
|model_load_dir|model loading directory|None|
|model_save_dir|model saving directory|None|
|loss_print_interval|print the training loss and validate the model every n training batches|1000|
|save_initial_model|save the initial parameters of the model or not|False|

- [ ] TODO: other parameters
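
As an illustration of how a few rows of the table above could map onto a command line, here is a minimal sketch using Python's standard `argparse`. It is not taken from the project's training script: `build_parser` and `num_eval_batches` are hypothetical names, the argument names and defaults are copied from the table as written (before the renames suggested in review), and clamping a positive `eval_batchs` to the size of the validation set is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a few rows of the arguments table."""
    parser = argparse.ArgumentParser(description="DLRM argument sketch")
    parser.add_argument("--data_dir", type=str, default="/dataset/dlrm_parquet",
                        help="the data file directory")
    parser.add_argument("--eval_batchs", type=int, default=-1,
                        help="<0: whole val ds, 0: do not val, >0: number of eval batches")
    parser.add_argument("--eval_batch_size", type=int, default=55296)
    parser.add_argument("--eval_interval", type=int, default=10000)
    parser.add_argument("--batch_size", type=int, default=55296,
                        help="the data batch size in one step training")
    parser.add_argument("--learning_rate", type=float, default=24.0)
    parser.add_argument("--max_iter", type=int, default=75000)
    return parser


def num_eval_batches(eval_batchs: int, batches_in_val_set: int) -> int:
    """Interpret eval_batchs as the table describes it.

    Negative: evaluate on the whole validation set.
    Zero: skip evaluation.
    Positive: evaluate on that many batches (clamped to the set size,
    which is an assumption of this sketch).
    """
    if eval_batchs < 0:
        return batches_in_val_set
    return min(eval_batchs, batches_in_val_set)


if __name__ == "__main__":
    args = build_parser().parse_args()
    # 40 is a made-up validation set size, used only for demonstration.
    print(num_eval_batches(args.eval_batchs, batches_in_val_set=40))
```

Keeping the `eval_batchs` interpretation in one small function makes the three cases from the table easy to check in isolation.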

## Prepare running
### Environment