Thank you very much for your efforts in this work.
I noticed that you already support DeepSpeed. However, for inference workloads, DeepSpeed-MII reportedly offers better performance, as claimed in this repo: https://github.com/microsoft/DeepSpeed-MII.
I am curious how it would perform here. How hard would it be to support this backend?