feat(channel): echo voice audio transcription feedback #1214

Merged
afjcjsbx merged 15 commits into sipeed:main from afjcjsbx:feat/echo-voice-audio-transcription
Mar 11, 2026

@afjcjsbx
Collaborator

@afjcjsbx afjcjsbx commented Mar 7, 2026

📝 Description

This PR introduces a major UX improvement for voice message handling by echoing the transcription back to the user before the agent starts processing the response.

New Workflow for Voice Messages:

  1. The user sends a voice message.
  2. The agent transcribes the audio in the background.
  3. The agent sends a new message (as a direct reply to the user's voice message) containing the transcribed text. If no voice is detected, it sends a specific alert.
  4. Immediately after, the agent triggers the "Thinking... 💭" placeholder.
  5. Once the LLM finishes, the placeholder is edited and replaced by the final response.

To achieve a natural conversational flow, I had to refactor when the "Thinking... 💭" placeholder is generated. Previously, the placeholder was triggered automatically as soon as the base channel received the message. Because audio transcription takes a few seconds, the transcription echo was being sent after the placeholder had already been created.
Since the placeholder is eventually edited to become the final LLM response, the old logic resulted in an inverted and confusing chat history:
[User Audio] -> [Final Agent Response] -> [Transcript Echo]

By decoupling the placeholder generation from the channel's HandleMessage and moving it into the AgentLoop (triggered explicitly after the transcription and right before the LLM call), the UX is now perfectly sequential:
[User Audio] -> [Transcript Echo] -> [Thinking... / Final Agent Response]
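
The two orderings above can be reproduced with a small simulation. This is only a sketch: `chatLog`, `oldFlow`, and `newFlow` are hypothetical names invented for illustration and are not part of the project's code. It assumes a placeholder is a regular message that is later edited in place.

```go
package main

import "fmt"

// chatLog is a hypothetical stand-in for a chat channel. It records
// messages in the order the user would see them.
type chatLog struct {
	messages []string
}

// send appends a new message and returns its index as a message ID.
func (c *chatLog) send(text string) int {
	c.messages = append(c.messages, text)
	return len(c.messages) - 1
}

// edit replaces the text of an existing message in place.
func (c *chatLog) edit(id int, text string) {
	c.messages[id] = text
}

// oldFlow creates the placeholder on receipt, before transcription ends,
// so the echo lands after the message that becomes the final response.
func oldFlow(c *chatLog, transcript, answer string) {
	c.send("[User Audio]")
	ph := c.send("Thinking... 💭")
	c.send("[Transcript Echo] " + transcript)
	c.edit(ph, answer)
}

// newFlow sends the echo first and creates the placeholder right before
// the (simulated) LLM call, so the history reads in order.
func newFlow(c *chatLog, transcript, answer string) {
	c.send("[User Audio]")
	c.send("[Transcript Echo] " + transcript)
	ph := c.send("Thinking... 💭")
	c.edit(ph, answer)
}

func main() {
	before, after := &chatLog{}, &chatLog{}
	oldFlow(before, "what time is it", "It is noon.")
	newFlow(after, "what time is it", "It is noon.")
	fmt.Println("old:", before.messages)
	fmt.Println("new:", after.messages)
}
```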

Key Changes:

  • Configuration: Added voice.echo_transcription toggle in config.json.
  • Bus System: Extended bus.OutboundMessage with ReplyToMessageID, SkipPlaceholder, and TriggerPlaceholder to give the agent more granular control over message delivery.
  • Channel Manager:
    • Disabled automatic placeholder creation upon message receipt in channels/base.go.
    • Updated channels/manager.go to intercept TriggerPlaceholder and spawn the "Thinking" message on demand.
    • Updated preSend to respect SkipPlaceholder, ensuring the transcription feedback creates a new message rather than consuming the placeholder.
  • Agent Loop: Extracted the transcription feedback logic into a clean sendTranscriptionFeedback method and moved the placeholder trigger right before the LLM iteration.
  • Telegram Channel: Implemented native support for ReplyToMessageID using telego.ReplyParameters.
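
As a rough illustration of how the three new fields could steer delivery, here is a minimal sketch. Only the field names ReplyToMessageID, SkipPlaceholder, and TriggerPlaceholder come from the description above; the field types, the other struct fields, the `deliveryAction` type, and the `route` function are assumptions made for this example and may not match the actual code.

```go
package main

import "fmt"

// OutboundMessage mirrors the three fields named in the description.
// Field types and the ChatID/Content fields are assumptions.
type OutboundMessage struct {
	ChatID             string
	Content            string
	ReplyToMessageID   string
	SkipPlaceholder    bool
	TriggerPlaceholder bool
}

// deliveryAction is a hypothetical label for what the manager does
// with a message.
type deliveryAction string

const (
	actionSpawnPlaceholder deliveryAction = "spawn-placeholder"
	actionSendNew          deliveryAction = "send-new"
	actionEditPlaceholder  deliveryAction = "edit-placeholder"
)

// route decides how a message is delivered, given whether a placeholder
// is currently recorded for the chat.
func route(msg OutboundMessage, hasPlaceholder bool) deliveryAction {
	switch {
	case msg.TriggerPlaceholder:
		// Explicit request from the agent loop to show "Thinking...".
		return actionSpawnPlaceholder
	case msg.SkipPlaceholder || !hasPlaceholder:
		// Transcription feedback must not consume the placeholder.
		return actionSendNew
	default:
		// The final response replaces the placeholder text.
		return actionEditPlaceholder
	}
}

func main() {
	echo := OutboundMessage{
		ChatID:           "42",
		Content:          "Transcript: hello",
		ReplyToMessageID: "1001",
		SkipPlaceholder:  true,
	}
	trigger := OutboundMessage{ChatID: "42", TriggerPlaceholder: true}
	final := OutboundMessage{ChatID: "42", Content: "Hi there!"}

	fmt.Println(route(echo, false))
	fmt.Println(route(trigger, false))
	fmt.Println(route(final, true))
}
```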

🗣️ Type of Change

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 📖 Documentation update
  • ⚡ Code refactoring (no functional changes, no api changes)

🤖 AI Code Generation

  • 🤖 Fully AI-generated (100% AI, 0% Human)
  • 🛠️ Mostly AI-generated (AI draft, Human verified/modified)
  • 👨‍💻 Mostly Human-written (Human lead, AI assisted or none)

🔗 Related Issue

📚 Technical Context (Skip for Docs)

  • Reference URL:
  • Reasoning: Previously, there was a chronological race condition where the "Thinking..." placeholder was spawned immediately, and the transcription text either overwrote it or appeared out of order. By decoupling the placeholder generation from the message receipt phase and moving it to the LLM pre-execution phase, we achieve a perfectly sequential and intuitive UX for voice interactions.

🧪 Test Environment

  • Hardware: MacBook M1
  • OS: macOS
  • Model/Provider: All
  • Channels: All Telegram

📸 Evidence (Optional)

Example of the new interaction flow:

Telegram: [image]

Discord: [image]

Slack: [image]

☑️ Checklist

  • My code/docs follow the style of this project.
  • I have performed a self-review of my own changes.
  • I have updated the documentation accordingly.

@afjcjsbx afjcjsbx requested review from alexhoshina, imguoguo and mengzhuo and removed request for mengzhuo March 7, 2026 14:58
@alexhoshina
Collaborator

make lint plz.
If convenient, could you also test the impact on other channels?

@afjcjsbx
Collaborator Author

afjcjsbx commented Mar 7, 2026

I tested Telegram and Discord (see the Evidence section). I'm not able to test the other channels.

@sipeed-bot sipeed-bot bot added type: enhancement New feature or request domain: channel domain: agent go Pull requests that update go code labels Mar 7, 2026
@alexhoshina
Collaborator

Thanks, it's evening here now. I'll try to test the channels I can test tomorrow during the day.
The main branch just had a merge, so you might need to resolve the conflicts. Thank you very much.

@afjcjsbx
Collaborator Author

afjcjsbx commented Mar 7, 2026

no problem, thanks!

@afjcjsbx
Collaborator Author

afjcjsbx commented Mar 7, 2026

I also tested with Slack.

@alexhoshina
Collaborator

alexhoshina commented Mar 8, 2026

I've carefully considered the Placeholder changes in this PR; perhaps we could make the modification with a lighter approach.
Delay the Placeholder only when the message is a voice message, and keep the original behavior unchanged for all other messages. For example:

// base.go HandleMessage
hasAudio := hasAudioMedia(media)
if !hasAudio {
    if pc, ok := c.owner.(PlaceholderCapable); ok {
        if phID, err := pc.SendPlaceholder(ctx, chatID); err == nil && phID != "" {
            c.placeholderRecorder.RecordPlaceholder(c.name, chatID, phID)
        }
    }
}

Then, once AgentLoop.transcribeAudioInMessage completes, send the Placeholder manually.
This preserves the semantics of OutboundMessage and keeps the change small, so its impact is easier to control.
What do you think?
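
The snippet above relies on a hasAudioMedia helper whose body is not shown in this thread. A minimal sketch of what such a check might look like follows. It assumes media is a slice of file paths or URLs and that audio is recognized by file extension; both are assumptions for illustration, as are the extension list and the `shouldSendPlaceholderNow` wrapper.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// audioExts lists the extensions treated as audio. The set is an
// assumption for this sketch, not the project's actual list.
var audioExts = map[string]bool{
	".ogg": true,
	".oga": true,
	".mp3": true,
	".wav": true,
	".m4a": true,
}

// hasAudioMedia reports whether any media entry looks like an audio file.
// It assumes each entry is a path or URL ending in a file extension.
func hasAudioMedia(media []string) bool {
	for _, m := range media {
		if audioExts[strings.ToLower(filepath.Ext(m))] {
			return true
		}
	}
	return false
}

// shouldSendPlaceholderNow captures the proposed rule: send the
// placeholder on receipt unless the message carries audio, in which
// case it is deferred until transcription has finished.
func shouldSendPlaceholderNow(media []string) bool {
	return !hasAudioMedia(media)
}

func main() {
	fmt.Println(shouldSendPlaceholderNow(nil))
	fmt.Println(shouldSendPlaceholderNow([]string{"photo.jpg"}))
	fmt.Println(shouldSendPlaceholderNow([]string{"voice.OGG"}))
}
```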

@afjcjsbx
Copy link
Collaborator Author

afjcjsbx commented Mar 8, 2026

Yes, delaying the placeholder is a very good idea; I hadn't thought of it. I'll try to implement it and see how it performs. Thank you.

@afjcjsbx
Copy link
Collaborator Author

afjcjsbx commented Mar 8, 2026

OK, I restored the previous logic; the placeholder is now delayed only for voice messages. Could you have a look? 🙏

sendCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()

err := ch.Send(sendCtx, bus.OutboundMessage{
Collaborator

Perhaps it would be better to find a way to call Manager.sendWithRetry()?
Directly using ch.Send() lacks rate limiting, message splitting, and other features.

Collaborator

However, sendWithRetry() is not exposed externally, which might cause some obstacles in calling it.

Collaborator Author

Since sendWithRetry() is a private method of Manager, we can't call it directly from loop.go. The natural solution is to expose a public SendMessage() on the Manager that routes through the worker pipeline (rate limiting, message splitting, retry).

However, the worker pipeline is async (messages are enqueued and processed by a background goroutine), which means the transcription feedback and the "Thinking…" placeholder can arrive out of order: the placeholder may be sent before the transcription feedback has actually been delivered.

The original code used a direct ch.Send() precisely to guarantee ordering: the feedback is sent synchronously, and only then does the flow continue to SendPlaceholder().

Two possible approaches:

  1. Make SendMessage() synchronous: call sendWithRetry() inline in the caller's goroutine (using the worker's rate limiter) instead of enqueuing. This preserves ordering while gaining retry/splitting.
  2. Keep SendMessage() async but add a signaling mechanism (e.g. a done channel) so the caller can wait until the message has actually been sent before proceeding to the placeholder.

What do you think?

Collaborator

I think both are fine 😄 I might prefer option 1

Collaborator Author

implemented, PTAL 🙏

@afjcjsbx afjcjsbx requested a review from alexhoshina March 9, 2026 09:48
afjcjsbx added 3 commits March 9, 2026 11:38
…-transcription

# Conflicts:
#	pkg/channels/telegram/telegram.go
#	pkg/config/config.go
#	pkg/config/defaults.go
@afjcjsbx afjcjsbx requested a review from huaaudio March 10, 2026 23:17
@alexhoshina
Collaborator

Sorry! I was a bit late with the approval🥲

@afjcjsbx
Collaborator Author

no problem, thank you!

@afjcjsbx afjcjsbx merged commit 30584f0 into sipeed:main Mar 11, 2026
4 checks passed
fishtrees pushed a commit to fishtrees/picoclaw that referenced this pull request Mar 12, 2026
…anscription

feat(channel): echo voice audio transcription feedback
dj-oyu pushed a commit to dj-oyu/picoclaw that referenced this pull request Mar 14, 2026
…anscription

feat(channel): echo voice audio transcription feedback
dj-oyu pushed a commit to dj-oyu/picoclaw that referenced this pull request Mar 16, 2026
…anscription

feat(channel): echo voice audio transcription feedback