Conversation

@michaelgiba

I saw the comment in the byte_pair_merge code about improving the time complexity and thought it sounded interesting. I'm interested to see how tiktoken's overall throughput changes with this change. I'll work on benchmarking it when I get a chance.

@hauntsaninja
Collaborator

Thanks for looking into this! I'd played around with it too a long time back and found the constant factor was dominant in practice. Curious to see the benchmarking results here; we could also consider conditioning on piece length.

One quick note about the tests: I haven't gotten around to open sourcing most of tiktoken's tests. So apologies if that made your life harder.

@michaelgiba
Author

michaelgiba commented Feb 8, 2023 via email

@hauntsaninja
Collaborator

There is some more internal benchmarking, but it looks a lot like https://github.com/openai/tiktoken/blob/main/scripts/benchmark.py

The main thing is just finding a documents: list[str] to feed to it. I used some internal-only datasets for this purpose, but maybe openwebtext or something would work. 100MB of data should be plenty for a meaningful benchmark (at low thread count).
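For concreteness, a minimal harness along the lines of that script could look like the sketch below. This is just a sketch: load_documents and the glob pattern are placeholder assumptions for whatever corpus you pick, and the timing uses tiktoken's encode_ordinary_batch, which accepts a num_threads argument.

import glob
import time

import tiktoken

def load_documents(pattern: str, max_bytes: int = 100 * 1024 * 1024) -> list[str]:
    # Read plain-text files matching `pattern` until roughly max_bytes is collected.
    documents: list[str] = []
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        documents.append(text)
        total += len(text.encode("utf-8"))
        if total >= max_bytes:
            break
    return documents

def benchmark(documents: list[str], num_threads: int) -> float:
    # Return encoding throughput in bytes per second at the given thread count.
    enc = tiktoken.get_encoding("gpt2")
    enc.encode("warmup")  # force lazy initialization outside the timed region
    num_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    start = time.perf_counter_ns()
    enc.encode_ordinary_batch(documents, num_threads=num_threads)
    end = time.perf_counter_ns()
    return num_bytes / (end - start) * 1e9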

Piece lengths are typically very small (about word length). That said, there are some degenerate cases. The best case scenario for heap merges is probably something like:

import base64, random

def base64_noise_documents(n_docs: int):
    # Fixed seed so the generated corpus is reproducible across runs.
    rand = random.Random(217)
    documents = []
    for _ in range(n_docs):
        # Base64-encoded random bytes contain no whitespace, so the regex
        # splitter yields much longer pieces than typical natural-language text.
        documents.append(base64.b64encode(rand.randbytes(rand.randint(100, 10_000))).decode())
    return documents

@michaelgiba
Author

OK, I had a chance to run some simple tests. I ran with 1, 3, 5, 7, 9, and 11 threads on both the original version and the heap version, against documents created using your random b64 example above and against 100 document samples from this Ubuntu dialog dataset I found: https://github.com/rkadlec/ubuntu-ranking-dataset-creator

[Screenshot of benchmark results, 2023-02-08]

Surprisingly, the performance looks decent. I'll try to run something more rigorous when I have free time.
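A sketch of that sweep, assuming the hypothetical benchmark helper sketched earlier and the base64_noise_documents generator above (run it once per installed build, i.e. once on the original version and once on the heap branch, and compare the printed throughputs):

documents = base64_noise_documents(100)  # or 100 samples from the Ubuntu dialog dataset
for num_threads in (1, 3, 5, 7, 9, 11):
    throughput = benchmark(documents, num_threads)
    print(f"{num_threads} threads: {throughput / 1e6:.1f} MB/s")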

@michaelgiba
Author

michaelgiba commented Feb 13, 2023

I haven't gotten around to more rigorous testing, but I'm closing this out in favor of #31, which looks great!

@hauntsaninja
Collaborator

Thanks for experimenting with it though! :-)
