In medusa_model_legacy.py, each Medusa head is only responsible for producing new hidden states; the medusa logits are still computed by reusing the base_model's lm_head.
Here is the code (medusa_model_legacy.py, lines 203 to 206 at commit e2a5d20):

```python
for i in range(self.medusa):
    mhidden_states = self.medusa_head[i](hidden_states)
    mlogits = self.base_model.lm_head(mhidden_states)
    medusa_logits.append(mlogits)
```
However, in the newer medusa_model.py (and medusa_model_new.py), this has changed: each Medusa head now has its own "lm_head" (a Linear layer with in_features = hidden_size, out_features = vocab_size), as shown in the code below (medusa_model.py, lines 111 to 119 at e2a5d20):
```python
self.medusa_head = nn.ModuleList(
    [
        nn.Sequential(
            *([ResBlock(self.hidden_size)] * medusa_num_layers),
            nn.Linear(self.hidden_size, self.vocab_size, bias=False),
        )
        for _ in range(medusa_num_heads)
    ]
)
```
The inference code is (medusa_model.py, lines 215 to 218 at e2a5d20):
```python
medusa_logits = []
# TODO: Consider parallelizing this loop for efficiency?
for i in range(self.medusa):
    medusa_logits.append(self.medusa_head[i](hidden_states))
```
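To make the architectural difference concrete, here is a minimal standalone sketch of both variants with toy dimensions. The `ResBlock` below is a stand-in I wrote for illustration (the repo's real `ResBlock` differs); the point is only where the hidden-to-vocab projection lives in each design:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 16, 32
medusa_num_heads, medusa_num_layers = 2, 1

class ResBlock(nn.Module):
    # Toy residual block, standing in for the repo's ResBlock.
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return x + torch.relu(self.linear(x))

# Shared projection, playing the role of base_model.lm_head.
shared_lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Legacy variant: each head emits hidden states only;
# logits come from the shared lm_head.
legacy_heads = nn.ModuleList(
    [nn.Sequential(*([ResBlock(hidden_size)] * medusa_num_layers))
     for _ in range(medusa_num_heads)]
)

# New variant: each head ends in its own hidden_size -> vocab_size Linear.
new_heads = nn.ModuleList(
    [nn.Sequential(
        *([ResBlock(hidden_size)] * medusa_num_layers),
        nn.Linear(hidden_size, vocab_size, bias=False),
     )
     for _ in range(medusa_num_heads)]
)

hidden_states = torch.randn(1, 4, hidden_size)  # (batch, seq_len, hidden)
legacy_logits = [shared_lm_head(h(hidden_states)) for h in legacy_heads]
new_logits = [h(hidden_states) for h in new_heads]
# Both yield one (1, 4, vocab_size) logits tensor per head.
```

In both cases the per-head output has the same shape; the difference is whether the final vocab projection is shared with the base model (legacy) or trained separately per head (new), which changes the parameter count and what the training objective can adjust.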
This is very confusing, especially since the README.md documents both the legacy and the new training methods. Which of the two actually reflects the performance reported in the paper?
Thank you very much for your work; I look forward to your reply, or to discussion from anyone.