Why not compute consistency on the raw features or predictions directly?

Hi All,

Thanks for the nice work. 

I have a question regarding Figure 1. Why do you compute the consistency loss after sharpening the predictions? Why not minimize a form of KL divergence on the model features or raw predictions instead? Did you observe that the sharpened form leads to better training, or was there another rationale?
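
To make the question concrete, here is a minimal sketch of the two options as I understand them. It assumes that "sharpening" means temperature sharpening (raising each probability to the power 1/T and renormalizing), which may not match what Figure 1 actually does. The function names, the logits, and the value T=0.5 are my own placeholders, not taken from the paper or the code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sharpen(probs, T=0.5):
    # ASSUMPTION: temperature sharpening, p_i^(1/T) renormalized.
    # T < 1 makes the distribution more peaked.
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions given as lists
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical logits for two augmented views of the same unlabeled input
p_a = softmax([2.0, 1.0, 0.1])
p_b = softmax([1.5, 1.2, 0.3])

# Option 1: consistency directly on the raw predictions
loss_raw = kl_divergence(p_a, p_b)

# Option 2: consistency against a sharpened target
loss_sharp = kl_divergence(sharpen(p_a), p_b)

print(loss_raw, loss_sharp)
```

In this toy case the sharpened target is more peaked than the raw prediction, so Option 2 pulls the other view toward a more confident distribution. I am wondering whether that effect is the motivation.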

Thanks!