Skip to content

Conversation

@KevinOConnor
Copy link
Collaborator

A few changes to the host C code that transmits messages to the mcus:

  • This slightly reduces the time that the new sq->transmit_requests.lock is held (from PR Serialqueue lock rework #7055).
  • The existing code wasn't moving messages from the upcoming queues to the ready queues if the host stopped getting acks from the mcu. This wasn't intended and it complicates debugging. Change the code to keep releasing messages to the ready queues even if we're blocked waiting for a response from the mcu. Thus, during debugging, this will show the ready_bytes continuing to increase during a communication stall.

@nefelim4ag - fyi.

-Kevin

@KevinOConnor KevinOConnor force-pushed the work-serialqueue-20251018 branch from 24c691e to 9916459 Compare October 23, 2025 19:18
nefelim4ag and others added 3 commits October 31, 2025 11:31
Move the definition of the struct so that it is more clear that it has
its own locking.

Signed-off-by: Kevin O'Connor <[email protected]>
@KevinOConnor KevinOConnor force-pushed the work-serialqueue-20251018 branch from 9916459 to 1bccedc Compare October 31, 2025 15:57
@nefelim4ag
Copy link
Collaborator

nefelim4ag commented Nov 3, 2025

From my small testing under motan dump, it seems that f1dc589 can cause random MCU lost communication.
It is hard to reproduce.

My not-clean master branch, but maybe it would help.
Log is from the fresh boot.
There is reverted 1bccedc.
But it seems that with this commit, it is slightly easier to reproduce.
klippy.log
It might be related that it is the USB MCU.
And I'm unable to reproduce when both commits are reverted.

From the log, it seems to me like the serialqueue thread just stopped transmitting messages.
Alas, everything seems fine to me.

Thanks.

There's no reason to attempt to handle multiple buffer transmissions
in a single command_event() call.  Handle the transmit case outside of
the command building loop.

If data is transmitted, then get a new timestamp from the pollreactor
and retry before sleeping.

Signed-off-by: Kevin O'Connor <[email protected]>
Move the upcoming queue movement logic to a new function.

Signed-off-by: Kevin O'Connor <[email protected]>
Maintain the next needed wakeup time for entries in the pending
queues.  This avoids needing to walk the upcoming queues when it is
known that nothing is ready to be released.

Signed-off-by: Kevin O'Connor <[email protected]>
Only hold the lock in the serialqueue thread when moving messages to
the ready queue and when setting the next need_kick_clock.

Only set the need_kick_clock just prior to sleeping.

Signed-off-by: Kevin O'Connor <[email protected]>
Keep moving messages from the pending queues to the ready queues even
if it's not currently valid to transmit messages to the mcu.  This
improves the statistics when debugging.

Signed-off-by: Kevin O'Connor <[email protected]>
@KevinOConnor KevinOConnor force-pushed the work-serialqueue-20251018 branch from 1bccedc to 03c6b6c Compare November 3, 2025 23:05
@KevinOConnor
Copy link
Collaborator Author

Thanks for testing and reporting your findings. I did find an error in that commit (sq->transmit_requests.min_release_clock could be read without the lock in check_upcoming_queues()). I have fixed that error and rebased this PR. However, I'm not sure that the above error could cause the symptoms you described. (I think it could cause the error on a 32bit machine, but your log seems to indicate you were running a 64bit OS.)

Is there a test I could run locally that would likely result in the issue?

-Kevin

@nefelim4ag
Copy link
Collaborator

nefelim4ag commented Nov 3, 2025

I literally did simple: ./scripts/motan/data_logger.py ~/printer_data/comms/klippy.sock /tmp/tmp
And it was triggered for the first time when I was trying to check the bus latency numbers under load.
(So, I got to the idea that maybe my code is bad).

Then it happens with just enabling the dump, without any additional steps from my side.
And only then I start to revert things and well, search for what the cause is.
So, my current reproduction procedure is to run dump and wait, sometimes do something (query stuff, enable heater, run ping/pong things), just to trigger it faster.

TMC dump would create a lot of query/response requests

It should be able to run indefinitely. I would try to reproduce with updated code.

@nefelim4ag
Copy link
Collaborator

I'm unable to reproduce anything after the fix.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants