Hello, I evaluated the fine-tuned model provided by the authors (MODEL_DIR="model/trace-ft-youcook2"), but my results are much closer to the TRACE-UNI baseline than to the values reported in the paper. Specifically, I get SODA_c_2: 2.3, F1_Score: 18.5, and CIDER: 7.5, whereas the paper reports SODA_c_2: 6.7, F1_Score: 31.8, and CIDER: 35.5.
I ran the evaluation script (trace/eval/eval.sh) as instructed. Are there any specific parameters or settings required for evaluating the fine-tuned model that I might have overlooked?
Any guidance would be greatly appreciated.