feat: Long timeout on actions to detect runaway Futures/CompletionStages#2383
feat: Long timeout on actions to detect runaway Futures/CompletionStages#2383johanandren merged 4 commits intomainfrom
Conversation
patriknw
left a comment
There was a problem hiding this comment.
LGTM, with a small suggestion
sdk/java-sdk-protobuf/src/main/scala/kalix/javasdk/impl/action/ActionsImpl.scala
Outdated
Show resolved
Hide resolved
| .firstCompletedOf( | ||
| Seq( | ||
| futureEffect, | ||
| akka.pattern.after(actionTimeout) { |
There was a problem hiding this comment.
btw, can there be a risk that we add many such timer tasks to the scheduler and they will not be done (removed) until after 1 hour (1000 per second would be 3.6 million)
There was a problem hiding this comment.
Yeah, good point. Maybe we need something smarter here, would be enough with one such timeout future per minute or maybe even fewer.
There was a problem hiding this comment.
Added something that should create fewer timers and share them in 0b510bd
|
|
||
| private val actionTimeout = system.settings.config.getDuration("kalix.action.timeout").toScala | ||
|
|
||
| @volatile private var previousTimeout: Option[(Instant, Future[Done])] = None |
There was a problem hiding this comment.
so we create one instance of ActionsImpl per projection instance and keep that instance? not creating new instances of ActionsImpl for each request?
There was a problem hiding this comment.
ActionsImpl is the actual gRPC service implementation, so it's even one for all actions in the same Kalix SDK service.
| } | ||
| new TimeoutException( | ||
| s"Command to action [${service.actionClass.getOrElse(service.serviceName)}] method [${command.name}]${additionalDetails} did not complete within ${actionTimeout.toCoarsest}") | ||
| } |
There was a problem hiding this comment.
just thinking if there is an easier way? it's pretty cheap to add and cancel short lived scheduler tasks, so wonder if we could keep the previous Cancellable instead and when scheduling new we always cancel previous, since we know that it was handled when processing next message.
There was a problem hiding this comment.
Even simpler: we can close over it as well, and cancel on completion of the other future. I'll do that instead.
| effectToResponse(service, command, withSurroundingSideEffects, messageCodec) | ||
| } | ||
| .recover { case NonFatal(ex) => | ||
| timeoutCancellable.cancel() |
There was a problem hiding this comment.
Note: safe to cancel an already cancelled
In case an async reply or effect is returned that never completes, for example in a consumer, this will make sure it is logged as an error and retried, so that it hopefully is unstuck or at least shows up in logs.
Sample log output: