feat: Long timeout on actions to detect runaway Futures/CompletionStages #2383

Merged

johanandren merged 4 commits into main from wip-long-async-action-timeout

Nov 17, 2025

Contributor

johanandren commented Nov 14, 2025

If an action returns an async reply or effect that never completes, for example in a consumer, this change makes sure the command is logged as an error and retried, so that it hopefully gets unstuck or at least shows up in the logs.

Sample log output:

15:44:52.699 ERROR k.javasdk.impl.action.ActionsImpl - Failure during handling of command customer.action.CustomerByName.ProcessCustomerCreated
java.util.concurrent.TimeoutException: Command to action [CustomerByName] method [ProcessCustomerCreated], subject: [vip], sequence: [1] did not complete within 20 seconds
        at kalix.javasdk.impl.action.ActionsImpl.timeoutErrorFor(ActionsImpl.scala:193)
        at kalix.javasdk.impl.action.ActionsImpl.$anonfun$effectToResponse$2(ActionsImpl.scala:150)
        at akka.pattern.FutureTimeoutSupport.liftedTree1$1(FutureTimeoutSupport.scala:50)
        at akka.pattern.FutureTimeoutSupport.$anonfun$after$1(FutureTimeoutSupport.scala:50)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:473)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:60)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:511)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1450)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2019)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
15:44:52.716 WARN  kalix.javasdk.impl.DiscoveryImpl - Warning reported from Kalix system: KLX-00433 Eventing in service [customer.action.CustomerByName] is failing, will be retried.


          feat: Long timeout on actions to detect runaway Futures/CompletionStages

github-actions bot added java-sdk-protobuf kalix-runtime labels

aludwiko approved these changes

patriknw approved these changes

Contributor

patriknw left a comment

LGTM, with a small suggestion

sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala Outdated

patriknw reviewed

sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala Outdated

+                        .firstCompletedOf(
+                          Seq(
+                            futureEffect,
+                            akka.pattern.after(actionTimeout) {

Contributor

patriknw Nov 14, 2025

Btw, is there a risk that we add many such timer tasks to the scheduler and that they are not done (removed) until after 1 hour? At 1000 per second that would be 3.6 million.

Contributor Author

johanandren Nov 14, 2025

Yeah, good point. Maybe we need something smarter here; one such timeout future per minute, or maybe even fewer, would be enough.

Contributor Author

johanandren Nov 17, 2025

Added something that should create fewer timers and share them in 0b510bd
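
The scheduler concern raised in this thread can be illustrated with a minimal sketch. It is not the code from this PR: a JDK `ScheduledThreadPoolExecutor` stands in for the Akka scheduler, and `timeoutAfter`, `withNaiveTimeout` and `pendingTimersAfter` are hypothetical names. Each request races its effect against a timer that is never cancelled, so every timer stays queued until the timeout elapses even though the effect completed immediately.

```scala
import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit, TimeoutException}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object NaiveTimeoutRace {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical helper: a future that fails after `timeout`. The scheduled task is never cancelled.
  def timeoutAfter[T](timeout: FiniteDuration, scheduler: ScheduledThreadPoolExecutor): Future[T] = {
    val promise = Promise[T]()
    scheduler.schedule(
      new Runnable {
        def run(): Unit = {
          promise.tryFailure(new TimeoutException(s"did not complete within ${timeout.toCoarsest}"))
          ()
        }
      },
      timeout.toMillis,
      TimeUnit.MILLISECONDS)
    promise.future
  }

  // Race the effect against the timer; whichever completes first wins.
  def withNaiveTimeout[T](
      effect: Future[T],
      timeout: FiniteDuration,
      scheduler: ScheduledThreadPoolExecutor): Future[T] =
    Future.firstCompletedOf(Seq(effect, timeoutAfter[T](timeout, scheduler)))

  // Handles `requests` already-completed effects and reports how many timer tasks are still queued.
  def pendingTimersAfter(requests: Int, timeout: FiniteDuration = 1.hour): Int = {
    val scheduler = new ScheduledThreadPoolExecutor(1)
    try {
      val results = (1 to requests).map(i => withNaiveTimeout(Future.successful(i), timeout, scheduler))
      results.foreach(Await.result(_, 5.seconds))
      scheduler.getQueue.size
    } finally scheduler.shutdownNow()
  }
}
```

In this sketch one timer is left behind per request, which matches the estimate in the review comment: 1000 requests per second over a 3600 second timeout gives 3.6 million queued tasks.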


          re-use timers to not flood the scheduler

0b510bd

johanandren mentioned this pull request

feat: Timeout to detect runaway async consumer effects akka/akka-sdk#1150

Draft


          re-use the future as well

4e23ee4

patriknw reviewed

sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala Outdated


		private val actionTimeout = system.settings.config.getDuration("kalix.action.timeout").toScala

		@volatile private var previousTimeout: Option[(Instant, Future[Done])] = None

Contributor

patriknw Nov 17, 2025

So we create one instance of ActionsImpl per projection instance and keep that instance, rather than creating a new instance of ActionsImpl for each request?

Contributor Author

johanandren Nov 17, 2025

ActionsImpl is the actual gRPC service implementation, so it's even one for all actions in the same Kalix SDK service.

sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala

+                    }
+                  new TimeoutException(
+                    s"Command to action [${service.actionClass.getOrElse(service.serviceName)}] method [${command.name}]${additionalDetails} did not complete within ${actionTimeout.toCoarsest}")
+                }

Contributor

patriknw Nov 17, 2025

Just thinking whether there is an easier way. It's pretty cheap to add and cancel short-lived scheduler tasks, so I wonder if we could keep the previous Cancellable instead and always cancel it when scheduling a new one, since we know that it was handled when processing the next message.

Contributor Author

johanandren Nov 17, 2025

Even simpler: we can close over it as well, and cancel on completion of the other future. I'll do that instead.

Contributor Author

johanandren Nov 17, 2025

Rewritten like that in 009faf1
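
A minimal sketch of the approach agreed on in this thread, closing over the timer and cancelling it when the effect completes. It is an illustration under assumptions, not the code in 009faf1: it uses a JDK `ScheduledExecutorService` instead of the Akka scheduler, and `withTimeout` is a hypothetical name.

```scala
import java.util.concurrent.{ScheduledExecutorService, ScheduledFuture, TimeUnit, TimeoutException}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object CancellingTimeout {
  // Hypothetical helper: fails the result if `effect` has not completed within `timeout`,
  // and cancels the timer as soon as `effect` completes (successfully or not).
  def withTimeout[T](effect: Future[T], timeout: FiniteDuration, scheduler: ScheduledExecutorService)(
      implicit ec: ExecutionContext): Future[T] = {
    val result = Promise[T]()

    // One timer per request, kept in a local value so the completion callback can close over it.
    val timeoutTask: ScheduledFuture[_] = scheduler.schedule(
      new Runnable {
        def run(): Unit = {
          result.tryFailure(new TimeoutException(s"did not complete within ${timeout.toCoarsest}"))
          ()
        }
      },
      timeout.toMillis,
      TimeUnit.MILLISECONDS)

    effect.onComplete { outcome =>
      // Cancel first, then publish the outcome, so the timer is gone once the result is visible.
      timeoutTask.cancel(false)
      result.tryComplete(outcome)
    }
    result.future
  }
}
```

With a JDK `ScheduledThreadPoolExecutor`, cancelled tasks are only dropped from the queue immediately if `setRemoveOnCancelPolicy(true)` has been set; that detail is specific to this stand-in and says nothing about how the Akka scheduler behaves.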


          A lot simpler, cancelling each on success

009faf1

johanandren commented

sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala

                           effectToResponse(service, command, withSurroundingSideEffects, messageCodec)
                         }
                         .recover { case NonFatal(ex) =>
+                          timeoutCancellable.cancel()

Contributor Author

johanandren Nov 17, 2025

Note: it is safe to cancel an already cancelled Cancellable.
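
The same property holds for the JDK's `ScheduledFuture`, which the short sketch below uses as a stand-in for the Akka `Cancellable` (an assumption made only to keep the example free of dependencies; `cancelTwice` is a hypothetical name). A second `cancel` call does not throw and the task simply remains cancelled, so calling it from both the success and the failure path is harmless.

```scala
import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit}

object IdempotentCancel {
  // Cancels the same scheduled task twice and reports whether it ended up cancelled.
  def cancelTwice(): Boolean = {
    val scheduler = new ScheduledThreadPoolExecutor(1)
    try {
      val task = scheduler.schedule(new Runnable { def run(): Unit = () }, 1, TimeUnit.HOURS)
      task.cancel(false) // first cancel, e.g. when the effect completed
      task.cancel(false) // second cancel, e.g. in a recover block; no exception is thrown
      task.isCancelled
    } finally scheduler.shutdownNow()
  }
}
```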

patriknw approved these changes

Contributor

patriknw left a comment

LGTM

johanandren merged commit 2dea101 into main

56 checks passed

johanandren deleted the wip-long-async-action-timeout branch

November 17, 2025 14:29

Labels

java-sdk-protobuf kalix-runtime