I have reproduced the B1 training, but the evaluation results are still incorrect. Could you release the recommended eval commands?