Add `with_sequence` for decode stream #1725

ArthurZucker · 2025-01-21T10:37:35Z

No description provided.

njhill · 2025-01-21T16:38:23Z

Thank you for this @ArthurZucker!

What do you think about having a version of `step` that can take a sequence of tokens? That could be used for prefilling and also for incrementing the stream with chunks of tokens when needed.

I'm also thinking through how this would be used in practice. For very long prompts, we ideally don't want to decode the whole thing, since we would typically have just tokenized the text prompt. But we need the last couple of prompt tokens to continue the prompt text cleanly, so that the original prompt concatenated with the streamed strings is exactly equal to the result of decoding all of the tokens together.

Perhaps that's up to the user of the API to sort out, but it might be nice for the prefilled tokens to be excluded from the subsequent step output (or at least have the option for that).
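The constraint described above can be made concrete with a toy model. This is a minimal sketch, not the `tokenizers` implementation: `VOCAB`, `decode`, `ToyDecodeStream` and its `prefill_ids` argument are hypothetical names, and the "drop one leading space" rule only stands in for a decoder whose output depends on position. It also assumes the prefilled tokens end on a complete character.

```python
# Toy byte-level vocabulary (hypothetical); ids 4 and 5 together encode "é".
VOCAB = {0: b"Hello", 1: b" wor", 2: b"ld", 3: b" caf", 4: b"\xc3", 5: b"\xa9"}
REPLACEMENT = "\ufffd"  # produced when a multi-byte character is still incomplete


def decode(ids):
    """Toy decoder: join the token bytes, then drop one leading space.

    The leading-space rule makes the output depend on context, which is
    why a stream started without any prompt tokens loses a space.
    """
    text = b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")
    return text[1:] if text.startswith(" ") else text


class ToyDecodeStream:
    def __init__(self, prefill_ids=()):
        # Prefilled tokens provide context; their text is stored as the
        # prefix so it is excluded from what step() returns.
        self.ids = list(prefill_ids)
        self.prefix = decode(self.ids)
        self.prefix_index = 0

    def step(self, token_id):
        """Return the newly decoded text, or None if nothing is ready yet."""
        self.ids.append(token_id)
        text = decode(self.ids)
        if len(text) <= len(self.prefix) or text.endswith(REPLACEMENT):
            return None  # wait for more tokens
        new_text = text[len(self.prefix):]
        # Keep only the most recent tokens as context for the next step.
        new_prefix_index = len(self.ids) - self.prefix_index
        self.ids = self.ids[self.prefix_index:]
        self.prefix = decode(self.ids)
        self.prefix_index = new_prefix_index
        return new_text


def stream_text(stream, token_ids):
    """Feed tokens one by one and join the chunks that step() emits."""
    chunks = (stream.step(token_id) for token_id in token_ids)
    return "".join(chunk for chunk in chunks if chunk is not None)


prompt_ids, generated_ids = [0], [1, 2, 3, 4, 5]
full_text = decode(prompt_ids + generated_ids)

# Prefill with only the last prompt token; the prompt text itself is reused.
with_prefill = decode(prompt_ids) + stream_text(
    ToyDecodeStream(prompt_ids[-1:]), generated_ids
)
# Start the stream cold, with no prompt context at all.
without_prefill = decode(prompt_ids) + stream_text(ToyDecodeStream(), generated_ids)
```

In this toy setup, `with_prefill` equals `full_text`, while `without_prefill` loses the space between the prompt and the first generated token. How many trailing prompt tokens a real decoder needs as context is not something this sketch can answer.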

ArthurZucker · 2025-01-22T11:30:50Z

For sure! I am actually a lot less familiar than you with the actual use cases! Super thankful for the feedback!
Indeed, it makes sense that you don't want it all. I was wondering whether this is also compatible with batches in general, as each sample needs its own stream with the current implementation.

alvarobartt · 2025-02-11T09:30:36Z

bindings/python/py_src/tokenizers/pre_tokenizers/__init__.pyi

            otherwise we consider is as a string pattern. For example `pattern="|"`
            means you want to split on `|` (imagine a csv file for example), while
-            `pattern=tokenizers.Regex("1|2")` means you split on either '1' or '2'.
+            `patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.


Suggested change

-            `patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.
+            `pattern=tokenizer.Regex("1|2")` means you split on either '1' or '2'.

Narsil · 2025-06-16T13:03:48Z

bindings/python/src/decoders.rs

+    fn with_sequence(&mut self, sequence_ids: Vec<u32>) {
+        self.ids = sequence_ids;
+        self.prefix_index = self.ids.len();
+        self.prefix = "".to_string();


You're trashing the prefix here, I'm not sure how that'll play with the prefix spaces. I think this will effectively reset the start which might be odd.

Have you thought about simply creating a new DecodeStream from a vec of ids (making it extra clear about the intent?)

ArthurZucker · 2025-08-27

I had not, and it does make sense. I will try that.
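The difference between the two approaches in this thread can be illustrated with a toy model. This is a sketch under stated assumptions, not the library's code: `VOCAB`, `decode`, `ToyDecodeStream` and the `from_ids` constructor are hypothetical, `with_sequence` mirrors only the three assignments visible in the diff above, and `step` is an assumed algorithm that the page does not show.

```python
# Toy vocabulary and decoder (hypothetical stand-ins for a real tokenizer).
VOCAB = {0: b"Hello", 1: b" wor", 2: b"ld"}


def decode(ids):
    """Join the token bytes, then drop one leading space (toy prefix-space rule)."""
    text = b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")
    return text[1:] if text.startswith(" ") else text


class ToyDecodeStream:
    def __init__(self):
        self.ids = []
        self.prefix = ""
        self.prefix_index = 0

    @classmethod
    def from_ids(cls, ids):
        """Hypothetical constructor: build a new stream from existing ids."""
        stream = cls()
        stream.ids = list(ids)
        stream.prefix = decode(stream.ids)  # the prefix matches the ids
        return stream

    def with_sequence(self, sequence_ids):
        """Mirrors the assignments in the diff: the prefix is reset to empty."""
        self.ids = list(sequence_ids)
        self.prefix_index = len(self.ids)
        self.prefix = ""

    def step(self, token_id):
        """Assumed step logic: emit whatever extends the stored prefix."""
        self.ids.append(token_id)
        text = decode(self.ids)
        if len(text) <= len(self.prefix) or text.endswith("\ufffd"):
            return None
        new_text = text[len(self.prefix):]
        new_prefix_index = len(self.ids) - self.prefix_index
        self.ids = self.ids[self.prefix_index:]
        self.prefix = decode(self.ids)
        self.prefix_index = new_prefix_index
        return new_text


# In-place reset, as in the diff: the empty prefix no longer matches the ids.
reset_stream = ToyDecodeStream()
reset_stream.with_sequence([0])
reset_first = reset_stream.step(1)

# New stream built from the ids: the prefix is consistent from the start.
fresh_stream = ToyDecodeStream.from_ids([0])
fresh_first = fresh_stream.step(1)
```

In this toy model the in-place reset makes the first `step` return the prefilled text again (`reset_first` is `"Hello wor"`), while the stream built from ids returns only the continuation (`fresh_first` is `" wor"`). Whether the real `DecodeStream` behaves the same way depends on its actual `step` implementation.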

nits

b7947d1

ArthurZucker mentioned this pull request Jan 21, 2025

Decode stream python #1678

Merged

ArthurZucker added 6 commits January 21, 2025 11:44

with

2ce721b

update

46be059

update

69206e2

zut

24d1068

& bad

3e19357

stub

d1a7c66

ArthurZucker changed the title ~~Add form sequence for decode stream~~ Add with_sequence for decode stream Jan 21, 2025

alvarobartt reviewed Feb 11, 2025

Narsil reviewed Jun 16, 2025

ArthurZucker closed this Aug 29, 2025
