Skip to content

Conversation

@l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Feb 12, 2024

After porting the optimizations ending in #77 back to tiktoken, this is a synchronization PR based on the reviews there.

Especially openai/tiktoken#245, which revealed that the legacy encodings have a severe backtracking problem - though it seems that Java's engine can handle it properly, since adding the possessives made it a tiny bit slower - but at least the regex is in sync with tiktoken (note that cl100k is unaffected by the regex changes, only the tokenCount > 2 change is related):

Before:

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.268 ± 0.050   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.075 ± 0.025   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.072 ± 0.028   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.087 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.131 ± 0.093   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  3.802 ± 0.025   s/op

After - a bit slower for some reason, but at least it's synchronized with tiktoken:

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.231 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.101 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.055 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.440 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.451 ± 0.085   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  4.086 ± 0.019   s/op

Made this a draft since the tiktoken PRs aren't fully finished yet, so I expect to have a few more changes here as well.

ranks.set(nextIndex, DUMMY_RANK);

length--;
if (length < 3) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro-optimization, let's simplify

Map<String, Integer> specialTokens
) {
Pattern regex = compileRegex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", false);
Pattern regex = compileRegex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}++| ?\\p{N}++| ?[^\\s\\p{L}\\p{N}]++|\\s+(?!\\S)|\\s++", false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except for \\s+(?!\\S) all groups should be possessive, no need to step back on match failure

@l0rinc l0rinc marked this pull request as ready for review February 13, 2024 13:33
@l0rinc l0rinc requested a review from tox-p as a code owner February 13, 2024 13:33
@l0rinc l0rinc marked this pull request as draft February 13, 2024 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant