This is probably a re-post of #92, but is there a deeper reason why the failed state is considered unrecoverable? It's not a big issue at the moment, but sometimes the server takes longer than normal to start (e.g., when downloading models), and it would be nice to be able to recover from that easily while still keeping a reasonable healthCheckTimeout.