Hello,
Thank you for your great work.
I am working on the retrieval task on Flickr30k with the CLIP model. I noticed that in Table 10 of the UPop paper, the FLOPs of the uncompressed model are reported as 395.7 GFLOPs. However, this value appears to also include the FLOPs of the momentum models `visual_m` and `transformer_m`. Since these models are discarded once training finishes, shouldn't the actual inference FLOPs be roughly half of the reported value?
Note that the output of `print_params_and_flops('retrieval_clip', model, device, config)` also includes the parameters and FLOPs of the momentum models.
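To illustrate the point, here is a minimal sketch of how one could count parameters while excluding the momentum copies. The helper `count_params_excluding_momentum` and the `_m`-suffix filtering rule are my own illustrative assumptions (based on the `visual_m` / `transformer_m` naming), not code from the UPop repository:

```python
import torch.nn as nn

MOMENTUM_SUFFIX = "_m"  # momentum copies: visual_m, transformer_m, etc.

def count_params_excluding_momentum(model):
    """Return (total_params, params_without_momentum_copies)."""
    total = sum(p.numel() for p in model.parameters())
    kept = sum(
        p.numel()
        for name, p in model.named_parameters()
        # drop any parameter living under a submodule whose name ends in "_m"
        if not any(part.endswith(MOMENTUM_SUFFIX) for part in name.split("."))
    )
    return total, kept

# Toy example: a main branch plus its momentum copy of identical size.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(4, 4)
        self.visual_m = nn.Linear(4, 4)

total, kept = count_params_excluding_momentum(Toy())
# With identical momentum copies, kept is exactly half of total.
```

The same filtering idea would apply to a FLOPs counter: profile only the submodules that survive into inference, which for identical momentum copies should yield about half the reported figure.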