feat(channel): echo voice audio transcription feedback#1214
afjcjsbx merged 15 commits into sipeed:main
Conversation

I tested Telegram and Discord (references in the Evidence section). I can't test the other channels because I don't have a way to run them.

Thanks, it's evening here now. I'll try to test the channels I can tomorrow during the day.

no problem, thanks!
…-transcription (Conflicts: pkg/channels/telegram/telegram.go)
also conducted tests with Slack
Then I carefully considered the changes to the placeholder in this PR; perhaps we could make the modifications using a lighter approach:

```go
// base.go, HandleMessage: only send the placeholder immediately
// when the incoming message carries no audio.
hasAudio := hasAudioMedia(media)
if !hasAudio {
	if pc, ok := c.owner.(PlaceholderCapable); ok {
		if phID, err := pc.SendPlaceholder(ctx, chatID); err == nil && phID != "" {
			c.placeholderRecorder.RecordPlaceholder(c.name, chatID, phID)
		}
	}
}
```

Then, after …
yes, it's a very good idea to delay the placeholder; I didn't think of it. I'll try to implement it and see how it performs, thank you!

ok, I restored the previous logic; the placeholder is now delayed for voice messages. Could you have a look? 🙏
pkg/agent/loop.go (Outdated)

```go
sendCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()

err := ch.Send(sendCtx, bus.OutboundMessage{
```
Perhaps it would be better to find a way to call Manager.sendWithRetry()?
Directly using ch.Send() lacks rate limiting, message splitting, and other features.
However, sendWithRetry() is not exposed externally, which could make calling it awkward.
Since sendWithRetry() is a private method on Manager, we can't call it directly from loop.go. The natural solution is to expose a public SendMessage() on Manager that routes through the worker pipeline (rate limiting, message splitting, retry).
However, the worker pipeline is async (messages are enqueued and processed by a background goroutine), which means the transcription feedback and the "Thinking…" placeholder can arrive out of order: the placeholder may be sent before the transcription feedback has actually been delivered.
The original code used a direct ch.Send() precisely to guarantee ordering: the feedback is sent synchronously, and only then does the flow continue to SendPlaceholder().
Two possible approaches:
- Make SendMessage() synchronous: call sendWithRetry() inline in the caller's goroutine (using the worker's rate limiter) instead of enqueuing. This preserves ordering while gaining retry/splitting.
- Keep SendMessage() async but add a signaling mechanism (e.g. a done channel) so the caller can wait until the message has actually been sent before proceeding to the placeholder.
What do you think?
I think both are fine 😄 I might prefer option 1
implemented, PTAL 🙏
…-transcription (Conflicts: pkg/channels/telegram/telegram.go, pkg/config/config.go, pkg/config/defaults.go)
Sorry! I was a bit late with the approval 🥲

no problem, thank you!
…anscription feat(channel): echo voice audio transcription feedback
📝 Description
This PR introduces a major UX improvement for voice message handling by echoing the transcription back to the user before the agent starts processing the response.
New Workflow for Voice Messages:
To achieve a natural conversational flow, I had to refactor when the "Thinking... 💭" placeholder is generated. Previously, the placeholder was triggered automatically as soon as the base channel received the message. Because audio transcription takes a few seconds, the transcription echo was being sent after the placeholder had already been created.
Since the placeholder is eventually edited to become the final LLM response, the old logic resulted in an inverted and confusing chat history:
❌ [User Audio] -> [Final Agent Response] -> [Transcript Echo]

By decoupling the placeholder generation from the channel's HandleMessage and moving it into the AgentLoop (triggered explicitly after the transcription and right before the LLM call), the UX is now perfectly sequential:

✅ [User Audio] -> [Transcript Echo] -> [Thinking... / Final Agent Response]

Key Changes:
- Added a voice.echo_transcription toggle in config.json.
- Extended bus.OutboundMessage with ReplyToMessageID, SkipPlaceholder, and TriggerPlaceholder to give the agent more granular control over message delivery.
- Removed the automatic placeholder trigger from channels/base.go.
- Updated channels/manager.go to intercept TriggerPlaceholder and spawn the "Thinking" message on demand.
- Updated preSend to respect SkipPlaceholder, ensuring the transcription feedback creates a new message rather than consuming the placeholder.
- Added a sendTranscriptionFeedback method and moved the placeholder trigger right before the LLM iteration.
- Implemented ReplyToMessageID using telego.ReplyParameters.
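The extended message struct and the SkipPlaceholder behavior could look roughly like this. The field names follow the PR description; the struct layout and the `preSend` decision logic shown here are an illustrative guess, not the PR's actual implementation.

```go
package main

import "fmt"

// OutboundMessage sketches the extended bus.OutboundMessage from the PR
// description; only the three new fields are taken from it.
type OutboundMessage struct {
	ChatID             string
	Text               string
	ReplyToMessageID   string // reply to the original voice message
	SkipPlaceholder    bool   // don't consume the "Thinking..." placeholder
	TriggerPlaceholder bool   // ask the manager to create the placeholder now
}

// preSend decides whether an outgoing message edits the existing
// placeholder or is delivered as a brand-new message. Transcription
// feedback sets SkipPlaceholder, so it never consumes the placeholder.
func preSend(msg OutboundMessage, placeholderID string) (editID string, newMessage bool) {
	if msg.SkipPlaceholder || placeholderID == "" {
		return "", true // send as a fresh message
	}
	return placeholderID, false // edit the placeholder in place
}

func main() {
	feedback := OutboundMessage{ChatID: "c1", Text: "transcript", SkipPlaceholder: true}
	id, fresh := preSend(feedback, "ph-42")
	fmt.Printf("%q %v\n", id, fresh) // prints: "" true
}
```

With this split, the transcript echo appears as its own message, and the placeholder remains free to be edited into the final LLM response.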
🤖 AI Code Generation
🔗 Related Issue
📚 Technical Context (Skip for Docs)
🧪 Test Environment
📸 Evidence (Optional)
Example of the new interaction flow:
Telegram:

Discord:

Slack:

☑️ Checklist