Skip to content

Conversation

@ArthurZucker
Copy link
Collaborator

No description provided.

@ArthurZucker ArthurZucker changed the title Add form sequence for decode stream Add with_sequence for decode stream Jan 21, 2025
@njhill
Copy link

njhill commented Jan 21, 2025

Thank you for this @ArthurZucker!

What do you think about having a version of step that can take a sequence tokens? That could be used for prefilling and also for incrementing the stream with chunks of tokens when needed?

I'm also thinking through how this would be used in practice. For very long prompts, we ideally don't want to decode the whole thing since we would typically already have just tokenized the text prompt. But we need the last couple of prompt tokens to ensure we can continue the prompt text cleanly such that the concatenation of the first streamed string with the original prompt is exactly equal to all of the tokens being decoded together.

Perhaps that's up to the user of the API to sort out, but it might be nice for the prefilled tokens to be excluded from the subsequent step output (or at least have the option for that).

@ArthurZucker
Copy link
Collaborator Author

For sure! I am actually a lot less familiar than you about the actual use-cases! Super thankful for the feedback!
Indeed makes senses that you don't want it all. Was wondering if this is also compatible with batches in general or not, as each sample needs a stream with the current implementation

otherwise we consider is as a string pattern. For example `pattern="|"`
means you want to split on `|` (imagine a csv file for example), while
`pattern=tokenizers.Regex("1|2")` means you split on either '1' or '2'.
`patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.
`pattern=tokenizer.Regex("1|2")` means you split on either '1' or '2'.

fn with_sequence(&mut self, sequence_ids: Vec<u32>) {
self.ids = sequence_ids;
self.prefix_index = self.ids.len();
self.prefix = "".to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're trashing the prefix here, I'm not sure how that'll play with the prefix spaces. I think this will effectively reset the start which might be odd.

Have you thought about simply creating a new DecodeStream from a vec of ids (making it extra clear about the intent?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not and it does make sense, will try that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants