Add vllm support #202
Conversation
@meanderingstream AMAZING WORK!!! Thanks so much Scott!
Pull Request Test Coverage Report for Build 5c2674cd99687ab3f3fd03b17cb48c3983e38f8b-PR-202

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
Is there something I need to do to fix the failing CI? The log looks like it was an infrastructure issue.

@meanderingstream there's a small test flakiness issue - don't worry about it
@mikehostetler It looks like this PR is going to be dead code. The reason I wrote the code this way is that when there are two or more vLLM models running on the same node, they run on different ports, something like localhost:8000 and localhost:8001. I think the key decision is whether the LLMDB.Model will get a base_url like the code in this PR, or whether we are just going to have an LLMDB configuration with multiple entries, one for each port, something like local1: ...., base_url: localhost:8000 ... If you decide to add base_url to Model, then it will probably look like the following:

```elixir
%LLMDB.Model{
  id: "gpt-4o-mini",
  provider: :openai,
  name: "GPT-4o mini",
  family: "gpt-4o",
  limits: %{context: 128_000, output: 16_384},
  cost: %{input: 0.15, output: 0.60},
  base_url: "http://localhost:8002",
  capabilities: %{
    chat: true,
    tools: %{enabled: true, streaming: true},
    json: %{native: true, schema: true},
    streaming: %{text: true, tool_calls: true}
  },
  tags: [],
  deprecated?: false,
  aliases: [],
  extra: %{}
}
```

If so, then ReqLLM.Provider.Options#inject_base_url_from_registry/3 would need to first use the model.base_url and then continue with the existing code. Something like this is shown in the Options#inject_base_url_from_registry of this PR. Note: I didn't see that Options#effective_base_url did anything useful. When I tried to make the changes in effective_base_url and then call effective_base_url in the inject_base...., I broke a whole bunch of model tests. I'm not sure why there are two methods with different implementations. Based on your follow-up comments, I would be happy to make these changes in the two libraries in order to get vLLM support to work. Please let me know if you want me to create the two PRs.
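As a rough sketch of that precedence (the module and function below are illustrative, not the PR's actual implementation; only the "model base_url first, then the existing lookup" idea comes from the discussion above):

```elixir
defmodule BaseUrlPrecedenceSketch do
  @moduledoc false

  # Hypothetical helper: prefer a base_url carried by the model itself,
  # otherwise fall back to whatever the existing registry/provider lookup
  # would have produced.
  def resolve_base_url(%{base_url: url}, _registry_default) when is_binary(url) and url != "" do
    url
  end

  def resolve_base_url(_model, registry_default), do: registry_default
end

# Usage with the struct shown above (other fields elided):
# BaseUrlPrecedenceSketch.resolve_base_url(
#   %{base_url: "http://localhost:8002"},
#   "https://api.openai.com/v1"
# )
# #=> "http://localhost:8002"
```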
@meanderingstream Even with the conflicts, we can probably salvage this PR - the changes to integrate LLMDB were pretty straightforward. I am happy to help with the transition. Regarding adding
Pull Request
Adds a vLLM provider that supports self-hosted models.
Because the models are self-hosted, the user must configure and start a vLLM service before making ReqLLM calls against the service URL. In ReqLLM there are some differences between publicly hosted models and self-hosted models: self-hosted models need a way to configure the location of the provider's base_url. Since the model service is not publicly hosted, the ReqLLM Catalog capability was added to support run-time configuration of providers like vLLM.
However, vLLM doesn't host multiple models on the same port, so this PR also adds a Model base_url field that overrides the Provider base_url (and port) when setting up the HTTP request.
The capability was tested against locally hosted models. The following small models were used to verify chat generation, chat streaming, and chat with an image: HuggingFaceTB/SmolLM2-135M-Instruct and HuggingFaceTB/SmolVLM-256M-Instruct.
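For context, a hedged sketch of how such a model might be exercised once configured (the "vllm:smollm2-135m" spec is a hypothetical example, and generate_text is assumed to be ReqLLM's text-generation entry point; this is not taken from the PR's tests):

```elixir
# Assumes a locally running vLLM service and a catalog entry for the
# hypothetical "vllm:smollm2-135m" model spec.
{:ok, response} = ReqLLM.generate_text("vllm:smollm2-135m", "Say hello in one sentence.")
IO.inspect(response)
```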
By default, vLLM uses the OPENAI_API_KEY environment variable. A value must be present, but by default it isn't validated.
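A minimal sketch of this, assuming a placeholder value is sufficient (the key string below is arbitrary):

```elixir
# Placeholder value: by default vLLM only requires that the variable is set,
# not that it is a valid OpenAI key.
System.put_env("OPENAI_API_KEY", "local-vllm-placeholder")
```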
For ReqLLM users who want to try other models, I've used the following approach to deploy local models.
Set up vLLM:
I have direct experience running models on a Linux server with a small, 10 GB VRAM RTX 3080 GPU. In addition to installing vLLM as a Python program, it is possible to deploy vLLM as a Docker service; there is also a Docker configuration for CPU serving in the vLLM codebase. Please refer to the vLLM documentation for further details on how to set up a service.
I found it useful to specify the --served-model-name model_name_you_choose parameter. The model_name_you_choose becomes the model_id in a ReqLLM catalog configuration.
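For illustration (the names below are hypothetical, and the "vllm:" spec prefix assumes the provider id added by this PR):

```elixir
# Hypothetical mapping: if vLLM was started with --served-model-name smollm2-135m,
# then "smollm2-135m" is the model_id used on the ReqLLM side, typically as the
# model part of a "provider:model" spec string.
served_model_name = "smollm2-135m"
model_spec = "vllm:" <> served_model_name
IO.puts(model_spec)
# => vllm:smollm2-135m
```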
Set up ReqLLM:
In a client Elixir application, configure the models in the environment configuration files, e.g. dev.exs.
Here is an example configuration:
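A sketch of what such a configuration might look like; the config keys below are assumptions for illustration, not the PR's actual schema (the field names follow the LLMDB.Model struct shown earlier):

```elixir
# config/dev.exs -- illustrative only
import Config

config :req_llm,
  vllm_models: [
    %{
      id: "smollm2-135m",                # the --served-model-name value
      provider: :vllm,
      base_url: "http://localhost:8000"  # where the local vLLM service listens
    }
  ]
```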
Testing a local vLLM model using a local clone of the ReqLLM project is currently a little tricky. From the project root folder, the lib/examples/scripts or iex -S mix will work when using the following steps. The catalog_allow.exs needs to have the locally hosted models listed in the vllm_models configuration.
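For illustration, that might look something like the following (hypothetical entries; the actual file format in the repository may differ):

```elixir
# catalog_allow.exs -- hypothetical sketch: list the locally served model specs
# under the vllm_models configuration referenced above.
[
  vllm_models: [
    "vllm:smollm2-135m",
    "vllm:smolvlm-256m"
  ]
]
```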
Additional notes:
I've found that vllm_test.exs is a little flaky when running locally. It might be because the test models are defined and then excluded in the model JSON files. Running the tests multiple times will demonstrate that they pass.
Type of Contribution
Checklist
- mix test
- mix quality

If Provider Changes
- mix mc "provider:*" --record
- mix mc "provider:*"

Model Compatibility Output:
Related Issues
Closes #