fix(feishu): invalidate cached token on auth error to enable retry recovery #1318

Merged
yinwm merged 1 commit into sipeed:main from Vast-Stars:bugfix/feishu_token_refresh
Mar 18, 2026

Conversation

@Vast-Stars
Contributor

Summary

  • The Lark SDK v3's built-in token retry loop does not clear stale tokens from cache when the server returns error 99991663 (tenant_access_token invalid), causing all API calls to fail until the token naturally expires (~2 hours)
  • Implement a custom tokenCache (implementing larkcore.Cache) with an InvalidateAll() method, injected via lark.WithTokenCache()
  • On any API response with code 99991663, invalidate the cache so the next application-level retry fetches a fresh token

Changes

  • Added tokenCache struct with Get/Set/InvalidateAll methods
  • Wired custom cache into lark.NewClient via WithTokenCache()
  • Added invalidateTokenOnAuthError helper called in all API methods: sendCard, EditMessage, SendPlaceholder, ReactToMessage, fetchBotOpenID, sendImage, sendFile, downloadResource
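A minimal sketch of the cache described above, not the PR's actual code. The local `Cache` interface mirrors what `larkcore.Cache` is understood to look like (an assumption to verify against the SDK version in use), and `entry`, `newTokenCache`, and `runDemo` are hypothetical names added for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Cache mirrors the assumed shape of larkcore.Cache; verify against the SDK.
type Cache interface {
	Set(ctx context.Context, key, value string, ttl time.Duration) error
	Get(ctx context.Context, key string) (string, error)
}

// entry pairs a cached value with its absolute expiry time (hypothetical type).
type entry struct {
	value     string
	expiresAt time.Time
}

// tokenCache is an in-memory cache that can be wiped on demand.
type tokenCache struct {
	mu    sync.RWMutex
	store map[string]entry
}

func newTokenCache() *tokenCache {
	return &tokenCache{store: make(map[string]entry)}
}

// Set stores value under key until ttl elapses.
func (c *tokenCache) Set(_ context.Context, key, value string, ttl time.Duration) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[key] = entry{value: value, expiresAt: time.Now().Add(ttl)}
	return nil
}

// Get returns "" for missing or expired entries so the caller fetches a fresh token.
func (c *tokenCache) Get(_ context.Context, key string) (string, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.store[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", nil
	}
	return e.value, nil
}

// InvalidateAll drops every cached token. The review mentions clear(c.store)
// (Go 1.21+); replacing the map has the same effect on older toolchains.
func (c *tokenCache) InvalidateAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store = make(map[string]entry)
}

// Compile-time check that tokenCache satisfies the assumed interface.
var _ Cache = (*tokenCache)(nil)

// runDemo returns the value read before and after invalidation.
func runDemo() (before, after string) {
	ctx := context.Background()
	c := newTokenCache()
	_ = c.Set(ctx, "tenant_access_token", "t-stale", 2*time.Hour)
	before, _ = c.Get(ctx, "tenant_access_token")
	c.InvalidateAll()
	after, _ = c.Get(ctx, "tenant_access_token")
	return before, after
}

func main() {
	before, after := runDemo()
	fmt.Printf("before=%q after=%q\n", before, after)
}
```

In the PR this cache is passed to `lark.NewClient` via `lark.WithTokenCache()`. That wiring is left out here so the sketch runs without the SDK.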

Test plan

  • make build compiles successfully
  • go test ./pkg/channels/feishu/ all pass
  • Deploy and verify Feishu bot message sending recovers automatically after token expiration

@sipeed-bot added the labels "type: bug", "domain: channel", and "go" on Mar 10, 2026

@nikolasdehor nikolasdehor left a comment

Solid fix for a real operational issue. The Lark SDK v3's built-in token retry not clearing stale tokens from cache is a well-known pain point, and this custom tokenCache implementation is the correct workaround.

Review notes:

  1. Cache implementation is clean -- tokenCache correctly implements larkcore.Cache with Get/Set, and the InvalidateAll method uses clear(c.store) (Go 1.21+) for a clean wipe. The mutex usage is correct: RLock for reads, Lock for writes.

  2. Expiry check in Get -- returning empty string for expired entries is correct per the larkcore.Cache contract. The SDK will then request a new token.

  3. Comprehensive coverage -- all API call sites that check resp.Success() now also call invalidateTokenOnAuthError(resp.Code). I count: sendCard, EditMessage, SendPlaceholder, ReactToMessage, fetchBotOpenID, sendImage (upload + send), sendFile (upload + send), downloadResource. That looks complete.

  4. Idempotent invalidation -- calling InvalidateAll() multiple times in quick succession (e.g., if multiple API calls fail simultaneously) is safe since it just clears the map.

  5. Note on stripMentionPlaceholders signature change -- this PR also includes the mention handling changes (adding botOpenID parameter, sender identity metadata). These seem shared with PR #1319 and #1283. The changes look correct, but having them in multiple PRs creates merge conflict risk. Coordination between these PRs would be good.

LGTM on the token cache fix.
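Review notes 3 and 4 can be illustrated with a rough sketch, assuming a helper that takes the cache and the response code as plain arguments. The PR's real `invalidateTokenOnAuthError` may have a different signature. `invalidator`, `mapCache`, and `runConcurrentFailures` are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// errCodeTenantTokenInvalid is the error code named in this PR.
const errCodeTenantTokenInvalid = 99991663

// invalidator is the minimal surface the helper needs (hypothetical name).
type invalidator interface{ InvalidateAll() }

// mapCache is a simplified stand-in for the PR's tokenCache.
type mapCache struct {
	mu    sync.Mutex
	store map[string]string
}

// InvalidateAll replaces the map; repeated calls are harmless.
func (c *mapCache) InvalidateAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store = make(map[string]string)
}

// size reports how many entries are currently cached.
func (c *mapCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.store)
}

// invalidateTokenOnAuthError wipes the cache only for the stale-token code
// and reports whether it did so.
func invalidateTokenOnAuthError(c invalidator, code int) bool {
	if code != errCodeTenantTokenInvalid {
		return false
	}
	c.InvalidateAll()
	return true
}

// runConcurrentFailures simulates n API calls failing at once with the given
// code and returns the number of cache entries left afterwards.
func runConcurrentFailures(n, code int) int {
	c := &mapCache{store: map[string]string{"tenant_access_token": "stale"}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			invalidateTokenOnAuthError(c, code)
		}()
	}
	wg.Wait()
	return c.size()
}

func main() {
	fmt.Println(runConcurrentFailures(8, errCodeTenantTokenInvalid)) // stale token wiped
	fmt.Println(runConcurrentFailures(8, 0))                         // other codes leave the cache alone
}
```

Because invalidation only empties the map, several goroutines hitting the same auth error at once end in the same state as a single call.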

@Vast-Stars Vast-Stars requested a review from alexhoshina March 11, 2026 06:43
@alexhoshina
Collaborator

make lint plz

Comment on lines +439 to +446
// Prepend sender identity so the LLM knows who sent the message.
if sender != nil && sender.SenderId != nil {
	openID := ""
	if sender.SenderId.OpenId != nil {
		openID = *sender.SenderId.OpenId
	}
	if openID != "" {
		content = fmt.Sprintf("[sender: open_id=%s] %s", openID, content)
Collaborator

This causes all messages to start with [sender: open_id=xxxx], which prevents the command system from working, as the command system only recognizes messages that begin with / or !.
Original message: /help
Will be processed as: [sender: open_id=ou_xxx] /help
handleCommand cannot recognize it as a command, so the message will be sent to the agent as ordinary chat text and not executed as a command.
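The problem described above can be reproduced with a small sketch. `isCommand` is a hypothetical stand-in for the prefix check that `handleCommand` is said to perform, and `withSenderPrefix` reuses the format string from the reviewed lines. Neither is the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// isCommand is a hypothetical stand-in for the check described in the review:
// only messages that begin with "/" or "!" count as commands.
func isCommand(content string) bool {
	return strings.HasPrefix(content, "/") || strings.HasPrefix(content, "!")
}

// withSenderPrefix applies the same format string as the reviewed diff.
func withSenderPrefix(openID, content string) string {
	return fmt.Sprintf("[sender: open_id=%s] %s", openID, content)
}

func main() {
	raw := "/help"
	prefixed := withSenderPrefix("ou_xxx", raw)

	fmt.Println(isCommand(raw))      // true
	fmt.Println(isCommand(prefixed)) // false: the message now starts with "["
}
```

One possible way to keep both behaviors, not something this PR implements, would be to run command detection on the raw text before any metadata is prepended.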

Comment on lines +416 to +419
// Replace mention placeholders for all chat types: bot mentions are stripped, others become @Name(open_id:xxx)
if len(message.Mentions) > 0 {
	knownBotID, _ := c.botOpenID.Load().(string)
	content = stripMentionPlaceholders(content, message.Mentions, knownBotID)
Collaborator

For example:
Group message: @Alice /help
Will be processed as: @Alice(open_id:...) /help
so the command is not recognized.

Contributor Author

Sorry, the old commits were mixed with other changes. Now the commit history is clean.

…covery

The Lark SDK v3's built-in token retry loop does not clear stale tokens
from cache when the server returns error 99991663 (tenant_access_token
invalid), causing all API calls to fail until the token naturally
expires (~2 hours).

- Add tokenCache struct (implementing larkcore.Cache) with
  Get/Set/InvalidateAll methods and proper expired-entry cleanup
- Wire custom cache into lark.NewClient via WithTokenCache()
- Add invalidateTokenOnAuthError helper called in all API methods
@Vast-Stars force-pushed the bugfix/feishu_token_refresh branch from 3cf644d to 5a2b34f on March 12, 2026 06:26
xuwei-xy pushed a commit to xuwei-xy/picoclaw that referenced this pull request Mar 14, 2026
@yinwm yinwm merged commit 3e9b7ce into sipeed:main Mar 18, 2026
4 checks passed
j0904 pushed a commit to j0904/picoclaw that referenced this pull request Mar 22, 2026
@sipeed-bot

sipeed-bot bot commented Mar 25, 2026

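@Vast-Stars Good catch on the issue where Feishu calls could not recover by retrying once the cached token had become invalid, and the custom tokenCache approach is clean. This bug could previously leave API calls stuck for two hours, so the fix is very timely!

We are setting up the PicoClaw Dev Group on Discord to make it easier for contributors to talk to each other. If you are interested, send an email to support@sipeed.com with the subject [Join PicoClaw Dev Group] + your GitHub account, and we will send you a Discord invite link!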

renato0307 pushed a commit to renato0307/picoclaw that referenced this pull request Mar 26, 2026
4 participants