Merged
Changes from 34 commits
Commits
41 commits
9b0ce65
[FEATURE] Enables offline /score for embedding models
gmarinho2 Jan 21, 2025
74cd2dd
Merge branch 'upstream_main'
gmarinho2 Jan 21, 2025
590ab4d
Changes variable name and uses tuple instead of list.
gmarinho2 Jan 21, 2025
f219024
Separates scoring logic and makes small ajustments
gmarinho2 Jan 22, 2025
47211e5
Separates scoring logic and makes small ajustments
gmarinho2 Jan 22, 2025
1a89033
Moves scoring functions declaration to llm class
gmarinho2 Jan 22, 2025
db7919b
Completes embedding_score function signature
gmarinho2 Jan 23, 2025
41b11be
Passes new parameters to self.encode() in embeddind_score
gmarinho2 Jan 23, 2025
b57e01a
Adds type annotations for the parameters
gmarinho2 Jan 23, 2025
4956eae
Minor adjustments
gmarinho2 Jan 23, 2025
f8b8d8c
Minor adjustments in embedding_score
gmarinho2 Jan 23, 2025
cd835de
trigger ci
gmarinho2 Jan 24, 2025
0e95ead
trigger ci
gmarinho2 Jan 24, 2025
bba2ea6
Merge branch 'vllm-project:main' into main
gmarinho2 Feb 4, 2025
635b8e8
Merge branch 'vllm-project:main' into main
gmarinho2 Feb 5, 2025
7851b44
first implementation of embedding scores via api
gmarinho2 Feb 5, 2025
12383b2
second version of api scoring
gmarinho2 Feb 5, 2025
12ef932
makes separate functions for cross-encoder score and embedding score
gmarinho2 Feb 6, 2025
945280f
fixes pre-commit errors
gmarinho2 Feb 7, 2025
053bdbb
adapts union sintax to python 3.9
gmarinho2 Feb 7, 2025
cfbc4b9
fixes alternating response bug
gmarinho2 Feb 7, 2025
a9b3a0d
fixes pre-commit errors
gmarinho2 Feb 7, 2025
4bcc9d5
Refactorings
maxdebayser Feb 10, 2025
366ab62
fixing type errors
gmarinho2 Feb 11, 2025
bcf20df
fix typing errors
maxdebayser Feb 11, 2025
9eaf4fc
remove assert
gmarinho2 Feb 12, 2025
f68bfaf
Merge branch 'main' into scoring-openai
gmarinho2 Feb 12, 2025
700603b
fix error type
maxdebayser Feb 12, 2025
409ad05
Add unit tests for the scoring API with embedding models
maxdebayser Feb 13, 2025
7e9478a
adds documentation
gmarinho2 Feb 13, 2025
056f1be
Refactor /rerank to reuse code from /score
maxdebayser Feb 13, 2025
6a6e45b
Merge branch 'max-scoring-openai' into scoring-openai
gmarinho2 Feb 13, 2025
5c0495d
fixes union syntax
gmarinho2 Feb 13, 2025
7b2891b
adds documentation
gmarinho2 Feb 13, 2025
6218cc3
fixing mypy errors and refactoring
gmarinho2 Feb 18, 2025
5b54495
Puts embedding score code in score_utils to avoid duplicated code
gmarinho2 Feb 18, 2025
17a2960
changes variable name for clarity
gmarinho2 Feb 18, 2025
2a082f3
refactoring serving_score
gmarinho2 Feb 18, 2025
077cbae
factor out common code
maxdebayser Feb 18, 2025
1fa73a8
remove extra code lines
maxdebayser Feb 18, 2025
c9a7240
Merge branch 'vllm-project:main' into scoring-openai
gmarinho2 Feb 20, 2025
3 changes: 1 addition & 2 deletions docs/source/models/pooling_models.md
@@ -108,8 +108,7 @@ A code example can be found here: <gh-file:examples/offline_inference/classifica
 ### `LLM.score`

 The {class}`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
-It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
-These types of models serve as rerankers between candidate query-document pairs in RAG systems.
+It is designed for embedding models and cross encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

 :::{note}
 vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
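
For readers unfamiliar with the embedding path mentioned in this hunk, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the zero-vector check are illustrative assumptions, not part of vLLM; the PR's tests use `torch.nn.functional.cosine_similarity` for the same quantity.

```python
import math
from collections.abc import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Illustrative helper only; it is not vLLM's implementation.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as an error instead of returning 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Note that cosine similarity ranges over [-1, 1], so an embedding-based score is not guaranteed to be non-negative.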
10 changes: 5 additions & 5 deletions docs/source/serving/openai_compatible_server.md
@@ -49,7 +49,7 @@ In addition, we have the following custom APIs:
 - [Pooling API](#pooling-api) (`/pooling`)
   - Applicable to all [pooling models](../models/pooling_models.md).
 - [Score API](#score-api) (`/score`)
-  - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+  - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
 - [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
   - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
   - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -320,10 +320,10 @@ Code example: <gh-file:examples/online_serving/openai_pooling_client.py>

 ### Score API

-Our Score API applies a cross-encoder model to predict scores for sentence pairs.
+Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs. When using an embedding model the score corresponds to the cosine similarity between each embedding pair.
 Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.

-You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
+You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

 Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
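
The tests in this PR exercise three input shapes for `text_1` and `text_2`: string with string, string with list, and list with list. The sketch below shows one way those inputs could be expanded into sentence pairs before scoring. `make_text_pairs` is a hypothetical helper written for illustration; broadcasting a single-element list and raising on mismatched lengths are assumptions, not confirmed server behavior.

```python
from typing import Union


def make_text_pairs(
    text_1: Union[str, list[str]],
    text_2: Union[str, list[str]],
) -> list[tuple[str, str]]:
    """Expand score inputs into (text_1, text_2) pairs.

    Hypothetical helper for illustration; not vLLM's implementation.
    """
    left = [text_1] if isinstance(text_1, str) else list(text_1)
    right = [text_2] if isinstance(text_2, str) else list(text_2)
    if len(left) == 1:
        # 1:N case: reuse the single text_1 for every text_2.
        left = left * len(right)
    if len(left) != len(right):
        # Assumption: mismatched list lengths are rejected.
        raise ValueError("text_1 and text_2 must have the same length")
    return list(zip(left, right))
```

With an embedding model, each resulting pair would then be embedded and compared by cosine similarity; with a cross-encoder, each pair is scored jointly by the model.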

@@ -483,11 +483,11 @@ The following extra parameters are supported:

 ### Re-rank API

-Our Re-rank API applies a cross-encoder model to predict relevant scores between a single query, and
+Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevant scores between a single query, and
 each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences, on
 a scale of 0 to 1.

-You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
+You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

 The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
 `score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
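
As a rough illustration of what re-ranking adds on top of scoring, the sketch below sorts documents by their per-pair scores and optionally keeps the best `top_n`. The `RankedDocument` class and `rank_documents` function are hypothetical names, and descending order by score is an assumption based on the Jina and Cohere style of API; only the `relevance_score` and `top_n` names come from the tests in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankedDocument:
    index: int  # position of the document in the request
    text: str
    relevance_score: float


def rank_documents(
    documents: list[str],
    scores: list[float],
    top_n: Optional[int] = None,
) -> list[RankedDocument]:
    """Order documents by descending score (illustrative sketch only)."""
    if len(documents) != len(scores):
        raise ValueError("need exactly one score per document")
    ranked = sorted(
        (RankedDocument(i, doc, score)
         for i, (doc, score) in enumerate(zip(documents, scores))),
        key=lambda result: result.relevance_score,
        reverse=True,
    )
    # Keep everything when top_n is not given.
    return ranked if top_n is None else ranked[:top_n]
```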
8 changes: 3 additions & 5 deletions tests/entrypoints/openai/test_rerank.py
@@ -8,17 +8,17 @@
 from ...utils import RemoteOpenAIServer

 MODEL_NAME = "BAAI/bge-reranker-base"
+DTYPE = "half"


 @pytest.fixture(scope="module")
 def server():
-    args = ["--enforce-eager", "--max-model-len", "100"]
+    args = ["--enforce-eager", "--max-model-len", "100", "--dtype", DTYPE]

     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
         yield remote_server


-@pytest.mark.asyncio
 @pytest.mark.parametrize("model_name", [MODEL_NAME])
 def test_rerank_texts(server: RemoteOpenAIServer, model_name: str):
     query = "What is the capital of France?"
@@ -42,7 +42,6 @@ def test_rerank_texts(server: RemoteOpenAIServer, model_name: str):
     assert rerank.results[1].relevance_score <= 0.01


-@pytest.mark.asyncio
 @pytest.mark.parametrize("model_name", [MODEL_NAME])
 def test_top_n(server: RemoteOpenAIServer, model_name: str):
     query = "What is the capital of France?"
@@ -68,7 +67,6 @@ def test_top_n(server: RemoteOpenAIServer, model_name: str):
     assert rerank.results[1].relevance_score <= 0.01


-@pytest.mark.asyncio
 @pytest.mark.parametrize("model_name", [MODEL_NAME])
 def test_rerank_max_model_len(server: RemoteOpenAIServer, model_name: str):

@@ -86,4 +84,4 @@ def test_rerank_max_model_len(server: RemoteOpenAIServer, model_name: str):
     assert rerank_response.status_code == 400
     # Assert just a small fragments of the response
     assert "Please reduce the length of the input." in \
-        rerank_response.text
+        rerank_response.text
284 changes: 173 additions & 111 deletions tests/entrypoints/openai/test_score.py
@@ -1,123 +1,185 @@
 # SPDX-License-Identifier: Apache-2.0

+import math
+from typing import Any
+
 import pytest
 import requests
+import torch.nn.functional as F
+from torch import tensor

 from vllm.entrypoints.openai.protocol import ScoreResponse

 from ...utils import RemoteOpenAIServer

-MODEL_NAME = "BAAI/bge-reranker-v2-m3"
-
-
-@pytest.fixture(scope="module")
-def server():
-    args = ["--enforce-eager", "--max-model-len", "100"]
-
-    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
+MODELS = [
+    {
+        "name": "BAAI/bge-reranker-v2-m3",
+        "is_cross_encoder": True
+    },
+    {
+        "name": "BAAI/bge-base-en-v1.5",
+        "is_cross_encoder": False
+    },
+]
+DTYPE = "half"
+
+
+def run_transformers(hf_model, model, text_pairs):
+    if model["is_cross_encoder"]:
+        return hf_model.predict(text_pairs).tolist()
+    else:
+        hf_embeddings = [
+            hf_model.encode(text_pair) for text_pair in text_pairs
+        ]
+        return [
+            F.cosine_similarity(tensor(pair[0]), tensor(pair[1]), dim=0)
+            for pair in hf_embeddings
+        ]
+
+
+@pytest.fixture(scope="class", params=MODELS)
+def model(request):
+    yield request.param
+
+
+@pytest.fixture(scope="class")
+def server(model: dict[str, Any]):
+    args = ["--enforce-eager", "--max-model-len", "100", "--dtype", DTYPE]
+
+    with RemoteOpenAIServer(model["name"], args) as remote_server:
         yield remote_server


-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-def test_text_1_str_text_2_list(server: RemoteOpenAIServer, model_name: str):
-    text_1 = "What is the capital of France?"
-    text_2 = [
-        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
-    ]
-
-    score_response = requests.post(server.url_for("score"),
-                                   json={
-                                       "model": model_name,
-                                       "text_1": text_1,
-                                       "text_2": text_2,
-                                   })
-    score_response.raise_for_status()
-    score = ScoreResponse.model_validate(score_response.json())
-
-    assert score.id is not None
-    assert score.data is not None
-    assert len(score.data) == 2
-    assert score.data[0].score <= 0.01
-    assert score.data[1].score >= 0.9
-
-
-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-def test_text_1_list_text_2_list(server: RemoteOpenAIServer, model_name: str):
-    text_1 = [
-        "What is the capital of the United States?",
-        "What is the capital of France?"
-    ]
-    text_2 = [
-        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
-    ]
-
-    score_response = requests.post(server.url_for("score"),
-                                   json={
-                                       "model": model_name,
-                                       "text_1": text_1,
-                                       "text_2": text_2,
-                                   })
-    score_response.raise_for_status()
-    score = ScoreResponse.model_validate(score_response.json())
-
-    assert score.id is not None
-    assert score.data is not None
-    assert len(score.data) == 2
-    assert score.data[0].score <= 0.01
-    assert score.data[1].score >= 0.9
-
-
-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-def test_text_1_str_text_2_str(server: RemoteOpenAIServer, model_name: str):
-    text_1 = "What is the capital of France?"
-    text_2 = "The capital of France is Paris."
-
-    score_response = requests.post(server.url_for("score"),
-                                   json={
-                                       "model": model_name,
-                                       "text_1": text_1,
-                                       "text_2": text_2,
-                                   })
-    score_response.raise_for_status()
-    score = ScoreResponse.model_validate(score_response.json())
-
-    assert score.id is not None
-    assert score.data is not None
-    assert len(score.data) == 1
-    assert score.data[0].score >= 0.9
-
-
-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-def test_score_max_model_len(server: RemoteOpenAIServer, model_name: str):
-
-    text_1 = "What is the capital of France?" * 20
-    text_2 = [
-        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
-    ]
-
-    score_response = requests.post(server.url_for("score"),
-                                   json={
-                                       "model": model_name,
-                                       "text_1": text_1,
-                                       "text_2": text_2,
-                                   })
-    assert score_response.status_code == 400
-    # Assert just a small fragments of the response
-    assert "Please reduce the length of the input." in \
-        score_response.text
-
-    # Test truncation
-    score_response = requests.post(server.url_for("score"),
-                                   json={
-                                       "model": model_name,
-                                       "text_1": text_1,
-                                       "text_2": text_2,
-                                       "truncate_prompt_tokens": 101
-                                   })
-    assert score_response.status_code == 400
-    assert "Please, select a smaller truncation size." in \
-        score_response.text
+@pytest.fixture(scope="class")
+def runner(model: dict[str, Any], hf_runner):
+    kwargs = {
+        "dtype": DTYPE,
+        "is_cross_encoder" if model["is_cross_encoder"]\
+            else "is_sentence_transformer": True
+    }
+
+    with hf_runner(model["name"], **kwargs) as hf_model:
+        yield hf_model
+
+
+class TestModel:
+
+    def test_text_1_str_text_2_list(self, server: RemoteOpenAIServer,
+                                    model: dict[str, Any], runner):
+        text_1 = "What is the capital of France?"
+        text_2 = [
+            "The capital of Brazil is Brasilia.",
+            "The capital of France is Paris."
+        ]
+
+        score_response = requests.post(server.url_for("score"),
+                                       json={
+                                           "model": model["name"],
+                                           "text_1": text_1,
+                                           "text_2": text_2,
+                                       })
+        score_response.raise_for_status()
+        score = ScoreResponse.model_validate(score_response.json())
+
+        assert score.id is not None
+        assert score.data is not None
+        assert len(score.data) == 2
+
+        vllm_outputs = [d.score for d in score.data]
+
+        text_pairs = [[text_1, text_2[0]], [text_1, text_2[1]]]
+        hf_outputs = run_transformers(runner, model, text_pairs)
+
+        for i in range(len(vllm_outputs)):
+            assert math.isclose(hf_outputs[i], vllm_outputs[i], rel_tol=0.01)
+
+    def test_text_1_list_text_2_list(self, server: RemoteOpenAIServer,
+                                     model: dict[str, Any], runner):
+        text_1 = [
+            "What is the capital of the United States?",
+            "What is the capital of France?"
+        ]
+        text_2 = [
+            "The capital of Brazil is Brasilia.",
+            "The capital of France is Paris."
+        ]
+
+        score_response = requests.post(server.url_for("score"),
+                                       json={
+                                           "model": model["name"],
+                                           "text_1": text_1,
+                                           "text_2": text_2,
+                                       })
+        score_response.raise_for_status()
+        score = ScoreResponse.model_validate(score_response.json())
+
+        assert score.id is not None
+        assert score.data is not None
+        assert len(score.data) == 2
+
+        vllm_outputs = [d.score for d in score.data]
+
+        text_pairs = [[text_1[0], text_2[0]], [text_1[1], text_2[1]]]
+        hf_outputs = run_transformers(runner, model, text_pairs)
+
+        for i in range(len(vllm_outputs)):
+            assert math.isclose(hf_outputs[i], vllm_outputs[i], rel_tol=0.01)
+
+    def test_text_1_str_text_2_str(self, server: RemoteOpenAIServer,
+                                   model: dict[str, Any], runner):
+        text_1 = "What is the capital of France?"
+        text_2 = "The capital of France is Paris."
+
+        score_response = requests.post(server.url_for("score"),
+                                       json={
+                                           "model": model["name"],
+                                           "text_1": text_1,
+                                           "text_2": text_2,
+                                       })
+        score_response.raise_for_status()
+        score = ScoreResponse.model_validate(score_response.json())
+
+        assert score.id is not None
+        assert score.data is not None
+        assert len(score.data) == 1
+
+        vllm_outputs = [d.score for d in score.data]
+
+        text_pairs = [[text_1, text_2]]
+        hf_outputs = run_transformers(runner, model, text_pairs)
+
+        for i in range(len(vllm_outputs)):
+            assert math.isclose(hf_outputs[i], vllm_outputs[i], rel_tol=0.01)
+
+    def test_score_max_model_len(self, server: RemoteOpenAIServer,
+                                 model: dict[str, Any]):
+
+        text_1 = "What is the capital of France?" * 20
+        text_2 = [
+            "The capital of Brazil is Brasilia.",
+            "The capital of France is Paris."
+        ]
+
+        score_response = requests.post(server.url_for("score"),
+                                       json={
+                                           "model": model["name"],
+                                           "text_1": text_1,
+                                           "text_2": text_2,
+                                       })
+        assert score_response.status_code == 400
+        # Assert just a small fragments of the response
+        assert "Please reduce the length of the input." in \
+            score_response.text
+
+        # Test truncation
+        score_response = requests.post(server.url_for("score"),
+                                       json={
+                                           "model": model["name"],
+                                           "text_1": text_1,
+                                           "text_2": text_2,
+                                           "truncate_prompt_tokens": 101
+                                       })
+        assert score_response.status_code == 400
+        assert "Please, select a smaller truncation size." in \
+            score_response.text
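
The new tests compare vLLM scores against the Hugging Face reference with `math.isclose(..., rel_tol=0.01)`. A small sketch of that comparison as a reusable helper follows; `scores_match` is a hypothetical name and the `abs_tol` parameter is an addition for illustration, not something the PR uses. It is worth keeping in mind that a purely relative tolerance rejects any nonzero difference when one value is exactly zero, which could matter for cross-encoder scores close to 0.

```python
import math
from collections.abc import Sequence


def scores_match(
    expected: Sequence[float],
    actual: Sequence[float],
    rel_tol: float = 0.01,
    abs_tol: float = 0.0,
) -> bool:
    """Return True when both score lists agree element-wise within tolerance.

    Hypothetical helper for illustration; the PR inlines the comparison.
    """
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )
```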