The model to consider.
https://huggingface.co/xai-org/grok-1
With int8 quantization, this model can be hosted on 8 GPUs with 80GB memory each, i.e. one node of H100 or A100. After a high-level look at the code, I see that xAI implemented the model architecture in JAX, and the code couples the architecture with implementation details such as int8 quantization and sharding across GPUs.
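As a rough sanity check on the 8x80GB claim, here is a back-of-envelope memory estimate (a sketch only: it assumes the published ~314B parameter count and weight-only int8; activation and KV-cache overhead are not modeled):

```python
# Back-of-envelope memory estimate for hosting Grok-1 in int8 on one node.
# Assumes ~314B total parameters and 8 x 80 GiB GPUs (H100/A100).
num_params = 314e9          # total parameters, dense layers + MoE experts
bytes_per_param = 1         # weight-only int8 quantization
GiB = 1024 ** 3

weight_mem_gib = num_params * bytes_per_param / GiB   # ~292 GiB of weights
total_mem_gib = 8 * 80                                 # 640 GiB aggregate GPU memory

# Whatever remains after weights must hold activations and the KV cache.
headroom_gib = total_mem_gib - weight_mem_gib
print(f"weights ~= {weight_mem_gib:.0f} GiB, headroom ~= {headroom_gib:.0f} GiB across 8 GPUs")
```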
I saw a Twitter post about the tricky differences between Gemma's implementations. So I wonder whether someone familiar with JAX is planning to port Grok-1 to PyTorch and validate it, so that it can be integrated into vLLM with additional optimizations for the MoE architecture.
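A large part of such a port is mechanical weight conversion (the forward pass still has to be re-implemented and validated separately). A minimal sketch of the conversion step, assuming a generic JAX parameter pytree rather than the actual Grok-1 checkpoint layout, could look like this:

```python
# Minimal sketch: flatten a JAX parameter pytree into a PyTorch state dict.
# The key naming produced by keystr() is a placeholder; a real port would map
# names onto the target PyTorch module structure.
import numpy as np
import torch
import jax

def jax_tree_to_torch_state_dict(params) -> dict:
    """Flatten a JAX parameter pytree into a flat dict of torch tensors."""
    flat, _ = jax.tree_util.tree_flatten_with_path(params)
    state_dict = {}
    for path, leaf in flat:
        name = jax.tree_util.keystr(path)              # e.g. "['layers_0']['wq']"
        # Move to host memory, then copy into a torch tensor.
        state_dict[name] = torch.from_numpy(np.asarray(leaf)).clone()
    return state_dict
```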
The closest model vllm already supports.
Mixtral 8x7B.
What's your difficulty of supporting the model you want?
- Its source code is in JAX, not PyTorch.
- It requires quantization; otherwise it won't fit on most GPU setups, including a single node of H100/A100. Here I assume CPU offloading is not under consideration, to avoid a notable impact on efficiency.
- Its MoE component requires additional optimization for inference efficiency (see the sketch after this list).
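To illustrate the last point, below is roughly the naive top-k expert routing a Mixtral-style MoE layer performs per token; for good inference efficiency vLLM would want this dispatch fused into optimized kernels rather than looped over experts. The hidden size, expert count, and top-2 choice follow Mixtral-style conventions and are not confirmed Grok-1 settings.

```python
# Naive top-k MoE routing in PyTorch -- the part that benefits from fused kernels.
import torch
import torch.nn.functional as F

def naive_moe_forward(x, gate, experts, top_k=2):
    """x: [num_tokens, hidden_dim]; gate: nn.Linear -> [num_tokens, num_experts];
    experts: list of per-expert feed-forward modules."""
    logits = gate(x)                                        # router scores per token
    weights, chosen = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = torch.where(chosen == e)          # tokens routed to expert e
        if token_idx.numel() == 0:
            continue
        out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
    return out
```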