Describe the bug
I have been trying to use lm-evaluation-harness with gpt-neox/eval.py. AFAIK the other request types besides generate_until work fine. With generate_until, I run into the following assertion failure (in the position embedding module) after a couple of examples have been processed:
`assert seq_len <= self.max_seq_len`
In my testing, the model is about to generate (say) token 48. I have verified that token_index_to_generate in gpt-neox/megatron/text_generation_utils.py is in fact 48. But somehow RotaryEmbedding is asked to produce an embedding for position 1025, which is beyond the model's maximum sequence length.
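To make the arithmetic concrete, here is a minimal standalone sketch (not the actual gpt-neox code, and the numbers are illustrative) of the failure mode I suspect: a position offset carried over from a previous prompt instead of being reset, so even a short prompt ends up asking the rotary embedding for positions past max_seq_len.

```python
# Toy reproduction of the suspected failure mode; FakeRotary and the numbers
# (1024, 1000, 48) are made up for illustration only.
MAX_SEQ_LEN = 1024


class FakeRotary:
    """Stand-in for RotaryEmbedding; only the length check matters here."""

    def __init__(self, max_seq_len):
        self.max_seq_len = max_seq_len

    def forward(self, seq_len):
        # Mirrors the check that fails for me.
        assert seq_len <= self.max_seq_len, f"{seq_len} > {self.max_seq_len}"


rotary = FakeRotary(MAX_SEQ_LEN)

# Prompt 1: generation runs close to the context limit.
position = 0
for _ in range(1000):
    position += 1
    rotary.forward(position)

# Prompt 2: if cached state is NOT cleared, positions keep counting up from
# 1000 even though token_index_to_generate restarts at a small value (48 in
# my run), so the check trips after a couple dozen tokens.
try:
    for _ in range(48):
        position += 1  # a cleared cache would restart this near the prompt length
        rotary.forward(position)
except AssertionError as err:
    print("assertion tripped:", err)  # e.g. "1025 > 1024"
```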
To Reproduce
I will fill in reproducible configs later. Currently, I'm using a model with a custom config (but trained with gpt-neox) and evaluating on a QA dataset (where eval-harness uses generate_until).
Proposed solution
I suspect the issue is caused by a missing clear_cache() between batches of data. Adding model.module.clear_cache() at the start of gpt-neox/megatron/text_generation_utils.py:stream_tokens seems to fix it on my side.
I am unsure whether this is the right place for the call and whether it is a complete fix. The same clear_cache operation seems to be invoked in generate_samples_interactive but not in generate_samples_from_prompt. A sketch of the change is below.
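For reference, a rough sketch of where I added the call (the signature here is abridged — the real stream_tokens takes more arguments — and only the inserted line is shown):

```python
# megatron/text_generation_utils.py (abridged sketch, not a full patch)
def stream_tokens(neox_args, model, context_tokens, **kwargs):
    # Reset any cached inference state carried over from the previous prompt,
    # mirroring what generate_samples_interactive already does.
    model.module.clear_cache()
    # ... rest of stream_tokens unchanged ...
```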
Environment (please complete the following information):
- GPUs: 1x A6000
- Configs: https://github.com/aflah02/gpt-neox/blob/olmo-support/configs/hubble/1_1B.yml