Your current environment
Docker image: vllm/vllm-openai:v0.5.0.post1
Running as part of a Docker Compose stack; the relevant sections of my docker-compose.yaml are below. This is part of a multi-model deployment, with other vLLM-based text generation/chat models already running successfully behind a Traefik reverse proxy. To test the extra startup flags that LLaVA 1.6 requires, I split it out into its own service (the third service in the file). I have included the .env file entries as well.
###docker-compose.yaml###
services:
  reverseproxy:
    image: ${PROXY_IMAGE}
    container_name: reverseproxy
    # Enables the web UI and tells Traefik to listen to docker
    command: --api.insecure=true --providers.docker --api.dashboard=true
    ports:
      # The HTTP port
      - "80:80"
      # The Web UI (enabled by --api.insecure=true)
      - "8080:8080"
    volumes:
      # So that Traefik can listen to the Docker events
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - llm-net

  ## Current best solution for chat/text generation models
  ## Change GPU device_ids if necessary
  vllm-server:
    depends_on:
      - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_MODEL_ID}:/vllm-workspace/${VLLM_MODEL_ID}
    command: ["--model", "${VLLM_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-server"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-server.rule=PathPrefix(`/vllm-server`)
      - traefik.http.routers.vllm-server.middlewares=vllm-server-stripprefix
      - traefik.http.middlewares.vllm-server-stripprefix.stripprefix.prefixes=/vllm-server
      - traefik.http.services.vllm-server.loadbalancer.server.port=8000
    networks:
      - llm-net
    # ports:
    #   - 8000:8000

  ## Testing llava serving with vllm
  ## Change GPU device_ids if necessary
  vllm-llava-server:
    depends_on:
      - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-llava-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_IMAGE_MODEL_ID}:/vllm-workspace/${VLLM_IMAGE_MODEL_ID}
    command: ["--model", "${VLLM_IMAGE_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-llava-server",
              "--image-input-type", "pixel_values", "--image-token-id", "32000", "--image-input-shape", "1,3,336,336", "--image-feature-size", "576",
              "--chat-template", "template_llava.jinja"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-llava-server.rule=PathPrefix(`/vllm-llava-server`)
      - traefik.http.routers.vllm-llava-server.middlewares=vllm-llava-server-stripprefix
      - traefik.http.middlewares.vllm-llava-server-stripprefix.stripprefix.prefixes=/vllm-llava-server
      - traefik.http.services.vllm-llava-server.loadbalancer.server.port=8000
    networks:
      - llm-net
    # ports:
    #   - 8000:8000
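For context, this routing already works for the text-generation service: Traefik matches the /vllm-server path prefix, strips it, and forwards to port 8000 inside the container. From the Docker host, something along these lines returns the model list (the hostname is illustrative and depends on where the stack runs):

# Query the OpenAI-compatible /v1/models endpoint through the reverse proxy;
# Traefik strips the /vllm-server prefix before forwarding to the container.
curl http://localhost/vllm-server/v1/models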
###.env file###
MODEL_VOL=/home/<intermediate_paths>/models
VLLM_MODEL_ID=Meta-Llama-3-8B-Instruct
VLLM_IMAGE_MODEL_ID=llava-v1.6-mistral-7b-hf
PROXY_IMAGE=traefik
VLLM_IMAGE=vllm/vllm-openai:v0.5.0.post1
VLLM_IMAGE_MODEL_ID points to a clone of the Hugging Face repository https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf (with template_llava.jinja added; rough preparation steps are sketched after the listing), which has the following directory structure:
###llava-v1.6-mistral-7b-hf directory structure###
config.json
generation_config.json
.git
.gitattributes
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
preprocessor_config.json
README.md
special_tokens_map.json
template_llava.jinja
tokenizer_config.json
tokenizer.json
tokenizer.model
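For reproducibility, this is roughly how the model directory was prepared. I am assuming git-lfs is installed and that template_llava.jinja is the copy shipped in vLLM's examples/ directory:

# Clone the model repository (the large *.safetensors files come via git-lfs)
git lfs install
git clone https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf
# Copy in the chat template referenced by --chat-template in the compose file
cp template_llava.jinja llava-v1.6-mistral-7b-hf/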
🐛 Describe the bug
On starting the service with docker compose --env-file .env.llava up reverseproxy vllm-llava-server, it appears to go through the usual startup but then throws a ValueError; see below for the full text and STDOUT. I have included all of the startup values that appear to be required when instantiating a new LLM object in https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py (a roughly equivalent standalone docker run command is sketched after the log). Am I missing something from my command entry in the docker-compose.yaml?
vllm-llava-server | INFO 06-26 18:28:25 api_server.py:177] vLLM API server version 0.5.0.post1
vllm-llava-server | INFO 06-26 18:28:25 api_server.py:178] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path='/vllm-llava-server', middleware=[], model='llava-v1.6-mistral-7b-hf', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.75, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=576, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
vllm-llava-server | INFO 06-26 18:28:25 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='llava-v1.6-mistral-7b-hf', speculative_config=None, tokenizer='llava-v1.6-mistral-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=llava-v1.6-mistral-7b-hf)
vllm-llava-server | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server | INFO 06-26 18:29:15 model_runner.py:160] Loading model weights took 14.1020 GB
vllm-llava-server | [rank0]: Traceback (most recent call last):
vllm-llava-server | [rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
vllm-llava-server | [rank0]: return _run_code(code, main_globals, None,
vllm-llava-server | [rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
vllm-llava-server | [rank0]: exec(code, run_globals)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
vllm-llava-server | [rank0]: engine = AsyncLLMEngine.from_engine_args(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
vllm-llava-server | [rank0]: engine = cls(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
vllm-llava-server | [rank0]: self.engine = self._init_engine(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
vllm-llava-server | [rank0]: return engine_class(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236, in __init__
vllm-llava-server | [rank0]: self._initialize_kv_caches()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
vllm-llava-server | [rank0]: self.model_executor.determine_num_available_blocks())
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
vllm-llava-server | [rank0]: return self.driver_worker.determine_num_available_blocks()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
vllm-llava-server | [rank0]: self.model_runner.profile_run()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844, in profile_run
vllm-llava-server | [rank0]: self.execute_model(seqs, kv_caches)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
vllm-llava-server | [rank0]: hidden_states = model_executable(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
vllm-llava-server | [rank0]: return self._call_impl(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
vllm-llava-server | [rank0]: return forward_call(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 383, in forward
vllm-llava-server | [rank0]: image_input = self._parse_and_validate_image_input(**kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 196, in _parse_and_validate_image_input
vllm-llava-server | [rank0]: raise ValueError("Incorrect type of image sizes. "
vllm-llava-server | [rank0]: ValueError: Incorrect type of image sizes. Got type: <class 'NoneType'>
vllm-llava-server exited with code 0
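In case it helps with reproduction, I believe the vllm-llava-server service above boils down to roughly the following standalone invocation (same image, flags, and mount, minus --root-path since there is no proxy in front; ${MODEL_VOL} is as in the .env file and the GPU selection syntax may need adjusting for your Docker version):

# Run the same image outside Compose with the identical engine flags;
# the image's entrypoint is the OpenAI-compatible API server, so the
# arguments after the image name are passed straight to it.
docker run --rm --gpus '"device=0"' -p 8000:8000 \
  -v "${MODEL_VOL}/llava-v1.6-mistral-7b-hf:/vllm-workspace/llava-v1.6-mistral-7b-hf" \
  vllm/vllm-openai:v0.5.0.post1 \
  --model llava-v1.6-mistral-7b-hf \
  --gpu-memory-utilization 0.75 \
  --host 0.0.0.0 \
  --image-input-type pixel_values \
  --image-token-id 32000 \
  --image-input-shape 1,3,336,336 \
  --image-feature-size 576 \
  --chat-template template_llava.jinja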