Merged

36 commits
533dfa2
simple cli
simon-mo Apr 18, 2024
fa90277
Merge branch 'main' of github.com:vllm-project/vllm
simon-mo Apr 18, 2024
b6f06fa
fix sorting
simon-mo Apr 18, 2024
3c09138
change to positional
simon-mo Apr 18, 2024
01b0fef
fix isort
simon-mo Apr 18, 2024
8d13d0a
changed pos arg name
EthanqX May 25, 2024
e4004e9
started adding complete subparser
EthanqX May 25, 2024
d9606e4
draft complete cli endpoint
EthanqX May 27, 2024
60d58cb
finished complete cli endpoint
EthanqX May 28, 2024
dd031b5
added chat cli endpoint
EthanqX May 29, 2024
fdea667
small fixes
EthanqX May 30, 2024
1979d18
used openai sdk
EthanqX Jun 5, 2024
73ed451
small fix
EthanqX Jun 5, 2024
5aa70b6
adjusted imports
EthanqX Jun 5, 2024
0aff304
handled system prompt
EthanqX Jun 5, 2024
1e4e891
fixed url
EthanqX Jun 5, 2024
1c617b9
Merge branch 'main' of github.com:vllm-project/vllm into new-cli
simon-mo Jun 5, 2024
5c8250b
revert docs changes (shadow launching)
simon-mo Jun 5, 2024
09103b6
refactor code
simon-mo Jun 5, 2024
ae60142
format
simon-mo Jun 5, 2024
807d97f
revert format
simon-mo Jun 5, 2024
09aa92f
fix multiline
simon-mo Jun 5, 2024
f9dde03
Merge branch 'vllm-project:main' into new-cli
EthanqX Jun 5, 2024
00f84dd
removed buffer from complete
EthanqX Jun 6, 2024
6f60716
Merge branch 'main' of github.com:vllm-project/vllm into new-cli
simon-mo Jun 11, 2024
cbd8d8e
wrapper method for old docs
EthanqX Jun 11, 2024
310f473
Merge remote-tracking branch 'origin/main' into new-cli
EthanqX Jun 24, 2024
4913116
support reuse of llm engine to run server
EthanqX Jun 24, 2024
edef04f
arg parser for test utils
EthanqX Jun 26, 2024
9e19be7
Merge 'origin/main' into new-cli
EthanqX Jul 1, 2024
563ec6d
format
EthanqX Jul 1, 2024
824b5d9
format
EthanqX Jul 2, 2024
3dd1b75
delete check for model flag in serve
EthanqX Jul 12, 2024
e93d59a
Merge branch 'main' of github.com:vllm-project/vllm into new-cli
EthanqX Jul 12, 2024
53b6d1e
use FlexibleArgumentParser
EthanqX Jul 13, 2024
8cf2257
isort
EthanqX Jul 13, 2024
4 changes: 2 additions & 2 deletions benchmarks/benchmark_serving.py
@@ -2,8 +2,8 @@

On the server side, run one of the following commands:
vLLM OpenAI API server
python -m vllm.entrypoints.openai.api_server \
--model <your_model> --swap-space 16 \
vllm serve <your_model> \
--swap-space 16 \
--disable-log-requests

(TGI backend)
6 changes: 2 additions & 4 deletions docs/source/getting_started/quickstart.rst
@@ -73,15 +73,13 @@ Start the server:

.. code-block:: console

$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m
$ vllm serve facebook/opt-125m

By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:

.. code-block:: console

$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m \
$ vllm serve facebook/opt-125m \
$ --chat-template ./examples/template_chatml.jinja

This server can be queried in the same format as OpenAI API. For example, list the models:
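
As a hedged illustration of the collapsed "list the models" example that follows in the quickstart, a query against a locally running `vllm serve facebook/opt-125m` using the OpenAI Python client might look like the sketch below; the host, port, and placeholder API key are assumptions, not part of this diff.

```python
from openai import OpenAI

# Assumes the quickstart server is listening on the default host/port and was
# started without --api-key, so the placeholder key is never checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models served by this vLLM instance (should include facebook/opt-125m).
for model in client.models.list():
    print(model.id)
```
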
2 changes: 1 addition & 1 deletion docs/source/models/adding_model.rst
@@ -110,7 +110,7 @@ Just add the following lines in your code:
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)

If you are running api server with `python -m vllm.entrypoints.openai.api_server args`, you can wrap the entrypoint with the following code:
If you are running api server with `vllm serve args`, you can wrap the entrypoint with the following code:

.. code-block:: python

3 changes: 1 addition & 2 deletions docs/source/models/lora.rst
@@ -58,8 +58,7 @@ LoRA adapted models can also be served with the Open-AI compatible vLLM server.

.. code-block:: bash

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/

3 changes: 1 addition & 2 deletions docs/source/serving/distributed_serving.rst
@@ -21,8 +21,7 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh

.. code-block:: console

$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ vllm serve facebook/opt-13b \
$ --tensor-parallel-size 4

To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
5 changes: 2 additions & 3 deletions docs/source/serving/openai_compatible_server.md
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat

You can start the server using Python, or using [Docker](deploying_with_docker.rst):
```bash
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --dtype auto --api-key token-abc123
vllm serve mistralai/Mistral-7B-Instruct-v0.2 --dtype auto --api-key token-abc123
```

To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
@@ -95,8 +95,7 @@ template, or the template in string form. Without a chat template, the server wi
and all chat requests will error.

```bash
python -m vllm.entrypoints.openai.api_server \
--model ... \
vllm serve ... \
--chat-template ./path-to-chat-template.jinja
```

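
The doc above notes the server can be called with the official OpenAI Python client or any other HTTP client. Below is a minimal sketch against the `vllm serve mistralai/Mistral-7B-Instruct-v0.2 ... --api-key token-abc123` command shown in this hunk; the base URL and the prompt text are assumptions.

```python
from openai import OpenAI

# The api_key matches the --api-key value passed to `vllm serve` above;
# the base URL assumes the default host and port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```
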
5 changes: 5 additions & 0 deletions setup.py
@@ -410,4 +410,9 @@ def _read_requirements(filename: str) -> List[str]:
},
cmdclass={"build_ext": cmake_build_ext} if not _is_neuron() else {},
package_data=package_data,
entry_points={
"console_scripts": [
"vllm=vllm.scripts:main",
],
},
)
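
This `console_scripts` entry is what exposes the bare `vllm` command (e.g. `vllm serve ...`) after installation. The `vllm/scripts.py` module itself is not shown in this diff, so the sketch below is only an approximation of how such an entry point could wire a `serve` subcommand to `run_server` and `make_arg_parser` from the files changed here; the subcommand layout, helper names, and the `args.model = args.model_tag` mapping are assumptions.

```python
# Hypothetical sketch of a vllm/scripts.py-style entry point; NOT the actual
# file from this PR, only an illustration of the wiring.
import argparse

from vllm.entrypoints.openai.api_server import run_server
from vllm.entrypoints.openai.cli_args import make_arg_parser


def _serve(args: argparse.Namespace) -> None:
    # Assumption: the positional model tag also feeds the engine's model
    # argument (the diff shows served_model_names = [args.model_tag]).
    args.model = args.model_tag
    run_server(args)


def main() -> None:
    parser = argparse.ArgumentParser(description="vLLM CLI")
    subparsers = parser.add_subparsers(required=True)

    serve_parser = subparsers.add_parser(
        "serve", help="Start the OpenAI-compatible vLLM API server.")
    serve_parser.add_argument("model_tag", type=str,
                              help="The model name or path to serve.")
    serve_parser = make_arg_parser(serve_parser)
    serve_parser.set_defaults(dispatch_function=_serve)

    args = parser.parse_args()
    args.dispatch_function(args)


if __name__ == "__main__":
    main()
```
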
3 changes: 1 addition & 2 deletions tests/entrypoints/test_openai_server.py
@@ -82,7 +82,7 @@ def __init__(self, args):
env = os.environ.copy()
env["PYTHONUNBUFFERED"] = "1"
self.proc = subprocess.Popen(
["python3", "-m", "vllm.entrypoints.openai.api_server"] + args,
["vllm", "serve"] + args,
env=env,
stdout=sys.stdout,
stderr=sys.stderr,
@@ -123,7 +123,6 @@ def zephyr_lora_files():
def server(zephyr_lora_files):
ray.init()
server_runner = ServerRunner.remote([
"--model",
MODEL_NAME,
# use half precision for speed and memory savings in CI environment
"--dtype",
65 changes: 41 additions & 24 deletions vllm/entrypoints/openai/api_server.py
@@ -1,3 +1,4 @@
import argparse
import asyncio
import importlib
import inspect
@@ -7,7 +8,7 @@

import fastapi
import uvicorn
from fastapi import Request
from fastapi import APIRouter, Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
@@ -24,8 +25,11 @@
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext


TIMEOUT_KEEP_ALIVE = 5 # seconds

engine: AsyncLLMEngine = None
engine_args: AsyncEngineArgs = None
openai_serving_chat: OpenAIServingChat = None
openai_serving_completion: OpenAIServingCompletion = None
logger = init_logger(__name__)
@@ -45,45 +49,33 @@ async def _force_log():
yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
parser = make_arg_parser()
return parser.parse_args()

router = APIRouter()

# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

router.mount("/metrics", metrics_app)

@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
err = openai_serving_chat.create_error_response(message=str(exc))
return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
@router.get("/health")
async def health() -> Response:
"""Health check."""
await openai_serving_chat.engine.check_health()
return Response(status_code=200)


@app.get("/v1/models")
@router.get("/v1/models")
async def show_available_models():
models = await openai_serving_chat.show_available_models()
return JSONResponse(content=models.model_dump())


@app.get("/version")
@router.get("/version")
async def show_version():
ver = {"version": vllm.__version__}
return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
@router.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
raw_request: Request):
generator = await openai_serving_chat.create_chat_completion(
@@ -98,7 +90,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
@router.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
generator = await openai_serving_completion.create_completion(
request, raw_request)
@@ -112,8 +104,10 @@ async def create_completion(request: CompletionRequest, raw_request: Request):
return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
args = parse_args()
def build_app(args):
app = fastapi.FastAPI(lifespan=lifespan)
app.include_router(router)
app.root_path = args.root_path

app.add_middleware(
CORSMiddleware,
@@ -123,6 +117,12 @@ async def create_completion(request: CompletionRequest, raw_request: Request):
allow_headers=args.allowed_headers,
)

@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
err = openai_serving_chat.create_error_response(message=str(exc))
return JSONResponse(err.model_dump(),
status_code=HTTPStatus.BAD_REQUEST)

if token := os.environ.get("VLLM_API_KEY") or args.api_key:

@app.middleware("http")
@@ -146,13 +146,21 @@ async def authentication(request: Request, call_next):
raise ValueError(f"Invalid middleware {middleware}. "
f"Must be a function or a class.")

return app


def run_server(args):

prashantgupta24 (Contributor) commented on Jun 12, 2024:

Is it possible to make engine an optional arg to this function?

Suggested change:

    -def run_server(args):
    +def run_server(args, llm_engine=None):

This can help external applications reuse the llm engine and attach other API interfaces (like grpc) to the same llm engine. To be used with the other suggestion of changing line 204 to:

    engine = (llm_engine
              if llm_engine is not None else AsyncLLMEngine.from_engine_args(
                  engine_args, usage_context=UsageContext.OPENAI_API_SERVER))

A contributor replied:

+1, this would be useful.

app = build_app(args)

logger.info(f"vLLM API server version {vllm.__version__}")
logger.info(f"args: {args}")

if args.served_model_name is not None:
served_model_names = args.served_model_name
else:
served_model_names = [args.model]
served_model_names = [args.model_tag]

global engine_args, engine, openai_serving_chat, openai_serving_completion
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(
engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
Expand All @@ -163,7 +171,6 @@ async def authentication(request: Request, call_next):
openai_serving_completion = OpenAIServingCompletion(
engine, served_model_names, args.lora_modules)

app.root_path = args.root_path
uvicorn.run(app,
host=args.host,
port=args.port,
Expand All @@ -173,3 +180,13 @@ async def authentication(request: Request, call_next):
ssl_certfile=args.ssl_certfile,
ssl_ca_certs=args.ssl_ca_certs,
ssl_cert_reqs=args.ssl_cert_reqs)


if __name__ == "__main__":
# NOTE(simon):
# This section should be in sync with vllm/scripts.py for CLI entrypoints.

A collaborator commented:

Is this note true? They seem to be different? (Also, in this case, should we have a common main method to share?)

A collaborator replied:

In sync in their usage of make_arg_parser.

parser = argparse.ArgumentParser(
description="vLLM OpenAI-Compatible RESTful API server.")
parser = make_arg_parser(parser)
args = parser.parse_args()
run_server(args)
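
Following up on the review suggestion above to accept an optional `llm_engine` in `run_server` (and the "support reuse of llm engine to run server" commit in this PR's history), here is a hedged sketch of how an external application might share one engine between this API server and another front end. The `llm_engine=` keyword is the reviewer's proposal, not necessarily the merged API.

```python
# Sketch only: assumes run_server accepts an optional llm_engine parameter as
# proposed in the review thread; this may not match the merged signature.
import argparse

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.api_server import run_server
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.usage.usage_lib import UsageContext

parser = argparse.ArgumentParser(
    description="vLLM OpenAI-Compatible RESTful API server.")
parser = make_arg_parser(parser)
args = parser.parse_args()

# Build the engine once so another interface (e.g. a gRPC front end) can
# attach to the same engine that the OpenAI server uses.
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(
    engine_args, usage_context=UsageContext.OPENAI_API_SERVER)

# Hypothetical keyword argument from the reviewer's suggestion.
run_server(args, llm_engine=engine)
```
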
5 changes: 2 additions & 3 deletions vllm/entrypoints/openai/cli_args.py
@@ -22,9 +22,8 @@ def __call__(self, parser, namespace, values, option_string=None):
setattr(namespace, self.dest, lora_list)


def make_arg_parser():
parser = argparse.ArgumentParser(
description="vLLM OpenAI-Compatible RESTful API server.")
def make_arg_parser(
parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
parser.add_argument("--host", type=str, default=None, help="host name")
parser.add_argument("--port", type=int, default=8000, help="port number")
parser.add_argument(