Adding support for new bot-output RTVI Message: #2899
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,6 +16,82 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |
| services that subclass `TTSService` can indicate whether the text in the | ||
| `TTSTextFrame`s they push already contain any necessary inter-frame spaces. | ||
|
|
||
| - Introduced new `AggregatedTextFrame` type to support representing a best effort of | ||
| the perceived LLM output whether or not it is processed by the TTS. This new frame | ||
| type includes the field `aggregated_by` to represent the conceptual format by which | ||
| the given text is aggregated. `TTSTextFrame`s now inherit from `AggregatedTextFrame`. | ||
| With this inheritance, an observer can watch for `AggregatedTextFrame`s to accumulate | ||
| the perceived output and determine whether or not the text was spoken based on if that | ||
| frame is also a `TTSTextFrame`. (See bullet below on new `bot-output` which takes | ||
| advantage of this) | ||
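A minimal sketch of the observer idea described above, using stand-in dataclasses rather than the real frame classes. The field set and the `collect_output` helper are assumptions for illustration; only the inheritance relationship and the `aggregated_by` field come from the changelog entry.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the frame types named above; the real classes
# carry more fields than shown here.
@dataclass
class AggregatedTextFrame:
    text: str
    aggregated_by: str


@dataclass
class TTSTextFrame(AggregatedTextFrame):
    pass


def collect_output(frames):
    """Return (text, aggregated_by, spoken) for every aggregated frame."""
    collected = []
    for frame in frames:
        if isinstance(frame, AggregatedTextFrame):
            # A frame counts as spoken only if it is also a TTSTextFrame.
            spoken = isinstance(frame, TTSTextFrame)
            collected.append((frame.text, frame.aggregated_by, spoken))
    return collected


frames = [
    TTSTextFrame(text="Your card ends in", aggregated_by="sentence"),
    AggregatedTextFrame(text="4242", aggregated_by="credit_card"),
]
output = collect_output(frames)
```

Because `TTSTextFrame` is a subclass, a single `isinstance` check against `AggregatedTextFrame` catches both spoken and unspoken text.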
|
|
||
| - Introduced `LLMTextProcessor`: A new processor meant to allow customization for how | ||
| `LLMTextFrame`s should be aggregated and considered. Its purpose is to turn | ||
| `LLMTextFrame`s into `AggregatedTextFrame`s. By default, a `TTSService` will still | ||
| aggregate `LLMTextFrame`s by sentence for the service to consume. However, if you | ||
| wish to override how the LLM text is aggregated, you should no longer override the | ||
| TTS's internal aggregator, but instead insert this processor between your LLM and | ||
| TTS in the pipeline. | ||
|
|
||
| - New `bot-output` RTVI message to represent what the bot actually "says". | ||
| - The `RTVIObserver` now emits `bot-output` messages based on the new `AggregatedTextFrame`s | ||
| (`bot-tts-text` and `bot-llm-text` are still supported and generated, but `bot-transcription` is | ||
| now deprecated in favor of this new, more thorough message). | ||
| - The new `RTVIBotOutputMessage` includes the fields: | ||
| - `spoken`: A boolean indicating whether the text was spoken by TTS | ||
| - `aggregated_by`: A string representing how the text was aggregated ("sentence", "word", | ||
|
Contributor
Actually the literal "custom" or "<my custom aggregation>", like in
Contributor
Ah, got it. It's the latter. So maybe here use "my custom aggregation", like you do elsewhere to indicate that the string is the developer-provided aggregation type string
Contributor
Author
✅ Done
||
| "my custom aggregation") | ||
| - Introduced new fields to `RTVIObserver` to support the new `bot-output` messaging: | ||
| - `bot_output_enabled`: Defaults to `True`. Set to `False` to disable `bot-output` messages. | ||
| - `skip_aggregator_types`: Defaults to `None`. Set to a list of strings that match | ||
| aggregation types that should not be included in `bot-output` messages. (Ex. `credit_card`) | ||
| - Introduced new methods, `add_text_transformer()` and `remove_text_transformer()`, to `RTVIObserver` to support providing (and subsequently removing) | ||
| callbacks for various types of aggregations (or all aggregations with `*`) that can modify the | ||
| text before it is sent as a `bot-output` or `bot-tts-text` message. (For example, masking a credit | ||
| card number or inserting extra detail the client might want that the context doesn't need.) | ||
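The following sketch illustrates how skipped aggregation types and text transformers could combine when building a `bot-output` payload. It is not the `RTVIObserver` implementation: the `BotOutputSketch` class, the `(text, aggregation_type) -> text` callback shape, the order in which transformers run, and the `text` key in the payload are all assumptions. Only the method names, the `*` wildcard, and the `spoken` and `aggregated_by` fields come from the entry above.

```python
from typing import Callable, Dict, List, Optional

# Assumed callback shape for this sketch: (text, aggregation_type) -> text.
TextTransformer = Callable[[str, str], str]


class BotOutputSketch:
    """Hypothetical stand-in for the observer's bot-output bookkeeping."""

    def __init__(self, skip_aggregator_types: Optional[List[str]] = None):
        self._skip = set(skip_aggregator_types or [])
        self._transformers: Dict[str, List[TextTransformer]] = {}

    def add_text_transformer(self, fn: TextTransformer, aggregation_type: str = "*"):
        self._transformers.setdefault(aggregation_type, []).append(fn)

    def remove_text_transformer(self, fn: TextTransformer, aggregation_type: str = "*"):
        self._transformers.get(aggregation_type, []).remove(fn)

    def build(self, text: str, aggregated_by: str, spoken: bool) -> Optional[dict]:
        if aggregated_by in self._skip:
            return None  # this aggregation type is excluded entirely
        # Type-specific transformers run first, then the "*" catch-all ones.
        fns = self._transformers.get(aggregated_by, []) + self._transformers.get("*", [])
        for fn in fns:
            text = fn(text, aggregated_by)
        return {"text": text, "spoken": spoken, "aggregated_by": aggregated_by}


def mask_card(text: str, _aggregation_type: str) -> str:
    # Keep only the last four characters visible.
    return "*" * (len(text) - 4) + text[-4:]


sketch = BotOutputSketch(skip_aggregator_types=["internal_note"])
sketch.add_text_transformer(mask_card, "credit_card")
masked = sketch.build("4242424242424242", "credit_card", spoken=False)
skipped = sketch.build("do not show", "internal_note", spoken=False)
```

Here `masked` carries the obscured card number while `skipped` is `None`, so the client never receives the excluded aggregation type.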
|
|
||
| - Updated the base aggregator type: | ||
| - Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and | ||
| a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom | ||
| aggregation") | ||
| - **BREAKING**: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`). | ||
| To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text` | ||
| - **BREAKING**: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]` | ||
| (instead of `Optional[str]`). To update: | ||
| ``` | ||
| aggregation = myAggregator.aggregate(text) | ||
| if aggregation: | ||
|     print(f"successfully aggregated text: {aggregation.text}")  # instead of {aggregation} | ||
| ``` | ||
| - `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to | ||
| produce/consume `Aggregation` objects. | ||
|
|
||
| - Augmented the `PatternPairAggregator`: | ||
| - Introduced a new, preferred method, `add_pattern`, which supports a new option for treating a | ||
| match as a separate aggregation returned from `aggregate()`. It replaces the now-deprecated | ||
| `add_pattern_pair` method and takes a `MatchAction` in place of the `remove_match` field. | ||
| - `MatchAction` enum: `REMOVE`, `KEEP`, `AGGREGATE`, allowing customization for how | ||
| a match should be handled. | ||
| - `REMOVE`: The text along with its delimiters will be removed from the streaming text. | ||
| Sentence aggregation will continue on as if this text did not exist. | ||
| - `KEEP`: The delimiters will be removed, but the content between them will be kept. | ||
| Sentence aggregation will continue on with the internal text included. | ||
| - `AGGREGATE`: The delimiters will be removed and the content between will be treated | ||
| as a separate aggregation. Any text before the start of the pattern will be | ||
| returned early, whether or not a complete sentence was found. Then the pattern | ||
| will be returned. Then the aggregation will continue on sentence matching after | ||
| the closing delimiter is found. The content between the delimiters is not | ||
| aggregated by sentence. It is aggregated as one single block of text. | ||
| - `PatternMatch` now extends `Aggregation` and provides richer info to handlers. | ||
| - **BREAKING**: The `PatternMatch` type returned to handlers registered via `on_pattern_match` | ||
| has been updated to subclass from the new `Aggregation` type, which means that `content` | ||
| has been replaced with `text` and `pattern_id` has been replaced with `type`: | ||
| ``` | ||
| async def on_match_tag(match: PatternMatch): | ||
|     pattern = match.type  # instead of match.pattern_id | ||
|     text = match.text  # instead of match.content | ||
| ``` | ||
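To make the three `MatchAction` behaviors concrete, here is a simplified, non-streaming sketch that applies each action to a complete string. It is not the `PatternPairAggregator` code: the enum values, the `apply_action` and `sentences` helpers, the naive sentence splitter, and the `"sentence"` label on the surrounding text are assumptions made for illustration.

```python
import re
from enum import Enum


class MatchAction(Enum):  # stand-in enum; member values are arbitrary here
    REMOVE = "remove"
    KEEP = "keep"
    AGGREGATE = "aggregate"


def sentences(text):
    """Naive sentence split that also collapses runs of whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    return [(part, "sentence") for part in parts if part]


def apply_action(text, start, end, pattern_type, action):
    """Handle the first start/end pair in a complete (non-streaming) string."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    match = pattern.search(text)
    if match is None:
        return sentences(text)
    before = text[: match.start()]
    content = match.group(1)
    after = text[match.end():]
    if action is MatchAction.REMOVE:
        return sentences(before + after)
    if action is MatchAction.KEEP:
        return sentences(before + content + after)
    # AGGREGATE: flush the text before the match early, emit the match as its
    # own block, then resume sentence matching after the closing delimiter.
    result = []
    if before.strip():
        result.append((before.strip(), "sentence"))
    result.append((content, pattern_type))
    return result + sentences(after)


text = "Spell it: <spell>A1B2</spell> Thanks."
removed = apply_action(text, "<spell>", "</spell>", "spell", MatchAction.REMOVE)
kept = apply_action(text, "<spell>", "</spell>", "spell", MatchAction.KEEP)
separate = apply_action(text, "<spell>", "</spell>", "spell", MatchAction.AGGREGATE)
```

With `AGGREGATE`, the incomplete sentence "Spell it:" is returned early, followed by "A1B2" typed as `spell`, then "Thanks." as an ordinary sentence.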
|
|
||
| ### Changed | ||
|
|
||
| - Updated all STT and TTS services to use consistent error handling pattern with | ||
|
|
@@ -33,11 +109,42 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |
| - Updated language mappings for the Google and Gemini TTS services to match | ||
| official documentation. | ||
|
|
||
| - Added a new `append_to_context` field to `TextFrame` that indicates whether the frame's | ||
| text should be added to the LLM context (by the LLM assistant aggregator). It | ||
| defaults to `True`. | ||
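As a rough illustration of the flag's effect, the sketch below filters frames before joining their text for the context. The `TextFrame` stand-in and the `context_text` helper are hypothetical; the real assistant aggregator does considerably more than this.

```python
from dataclasses import dataclass


@dataclass
class TextFrame:  # stand-in with only the fields this sketch needs
    text: str
    append_to_context: bool = True


def context_text(frames):
    """Join only the text that is flagged for inclusion in the LLM context."""
    return " ".join(frame.text for frame in frames if frame.append_to_context)


frames = [
    TextFrame("Your card ends in"),
    TextFrame("4242", append_to_context=False),
    TextFrame("Anything else?"),
]
assistant_message = context_text(frames)
```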
|
|
||
| - TTS flow respects aggregation metadata | ||
| - `TTSService` accepts a new `skip_aggregator_types` argument to avoid speaking certain | ||
| aggregation types (now determined/returned by the aggregator) | ||
| - TTS services push `AggregatedTextFrame` in addition to `TTSTextFrame`s when either an | ||
| aggregation occurs that should not be spoken or when the TTS service supports word-by-word | ||
| timestamping. In the latter case, the `TTSService` preliminarily generates an | ||
| `AggregatedTextFrame`, aggregated by sentence to generate the full sentence content as early | ||
| as possible. | ||
| - Introduced new methods, `add_text_transformer()` and `remove_text_transformer()`: | ||
| These let you provide (and subsequently remove) callbacks to the TTS that transform text based on | ||
| its aggregation type before the text is sent to the underlying TTS service. This makes it | ||
| possible to do things like introduce TTS-specific tags for spelling or emotion or change the | ||
| pronunciation of something on the fly. | ||
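A small sketch of the TTS-side flow described above, written as a plain function instead of the real `TTSService`. The `prepare_for_tts` helper, the callback signature, and the `<spell>` tag are made up for illustration; real TTS providers each define their own markup.

```python
def prepare_for_tts(text, aggregation_type, skip_aggregator_types, transformers):
    """Return the text to synthesize, or None if this type is not spoken."""
    if aggregation_type in skip_aggregator_types:
        return None
    # Type-specific callbacks run first, then any registered under "*".
    callbacks = transformers.get(aggregation_type, []) + transformers.get("*", [])
    for callback in callbacks:
        text = callback(text, aggregation_type)
    return text


def spell_out(text, _aggregation_type):
    # "<spell>" is a hypothetical tag, not a real provider's syntax.
    return f"<spell>{text}</spell>"


transformers = {"confirmation_code": [spell_out]}
to_speak = prepare_for_tts("A1B2", "confirmation_code", ["credit_card"], transformers)
not_spoken = prepare_for_tts("4242 4242", "credit_card", ["credit_card"], transformers)
```

In this sketch a skipped type yields `None`, which corresponds to the case where the service would push an unspoken `AggregatedTextFrame` instead of a `TTSTextFrame`.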
|
|
||
| ### Deprecated | ||
|
|
||
| - The `api_key` parameter in `GeminiTTSService` is deprecated. Use | ||
| `credentials` or `credentials_path` instead for Google Cloud authentication. | ||
|
|
||
| - The RTVI `bot-transcription` event is deprecated in favor of the new `bot-output` | ||
| message, which is the canonical representation of bot output (spoken or not). The code | ||
| still emits a transcription message for backwards compatibility during the transition. | ||
|
|
||
| - The TTS constructor field `text_aggregator` is deprecated in favor of the new | ||
| `LLMTextProcessor`. `TTSService`s still have an internal aggregator to support the default | ||
| behavior, but if you want to override the aggregation behavior, you should use the new | ||
| processor. | ||
|
|
||
| - Deprecated `add_pattern_pair` in the `PatternPairAggregator`, which takes a `pattern_id` | ||
| and `remove_match` field, in favor of the new `add_pattern` method, which takes a `type` and an | ||
| `action`. | ||
|
|
||
| ### Fixed | ||
|
|
||
| - Fixed subtle issue of assistant context messages ending up with double spaces | ||
|
|
||
It would be nice to rebase with the latest changes from main so that we can more easily analyze only what you are including here, just to make sure we’re not missing anything.
i rebased yesterday just before pushing and fixing the CHANGELOG. Is it already way out of date?
oh i see. there is stuff mixed in below that's not mine. i must have botched the rebase. the conflicts in this file were a bear.
Interesting, it is showing a couple of things that I believe are not related with this PR. For example:
But those things I believe are already more than one week old.
yeah. i think the rebase went bad after the release. i just re-rebased and cleaned it up. I've got to get this in because these rebases are rough.
✅ Done