Added embedding service#33

Merged
SearchSavior merged 2 commits into SearchSavior:1.0.6 from mwrothbe:1.0.6Emb
Oct 13, 2025

Conversation

@mwrothbe
Contributor

@mwrothbe mwrothbe commented Oct 9, 2025

This enables the use of OpenArc to provide embeddings for RAG pipelines, and possibly other uses. It provides an OpenAI-compatible interface that's confirmed to work with the C# OpenAIClient library. Not all OpenAI inputs are implemented (user and encoding_format). The API also accepts an optional PreTrainedTokenizer object that is passed into the model, with the goal of using the API with various models that have different requirements. Only Qwen3-Embedding-0.6B has been tested. The embeddings generator is in the optimum domain.

async def generate_embeddings(self, tok_config: TokenizerConfig) -> AsyncIterator[Union[Dict[str, Any], str]]:

# Tokenize the input texts
batch_dict = self.encoder_tokenizer(
Owner

Please add some comments in the 'Conversation' about what tasks these cover and how they can be used together. It looks like text similarity.

Contributor Author

You mean, in the Issues thread give a quick description on how I plan to use the embedding api?

Owner

In this PR; it's cleaner for future reference. Describe what tasks you intend to enable and what the scope is. It seems like you want maximum access to the low-level options one would normally configure in a script, to leverage serving, which has really been the goal of OpenArc from the beginning, so that works quite well.

Contributor Author

Updated

batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def generate_type(self, tok_config: TokenizerConfig):
Owner

generate_type is used to keep the API abstraction intact for the stream bool behavior. Since we don't stream, you can probably call generate_embeddings directly, unless you think this affects task coverage.

Contributor Author

Yes, I'll short circuit it. I was just hacking my way through and trying not to deviate too much from a working example until I figured out which direction was up.

Contributor Author

Looks like the infer_emb function is already pointing to generate_embeddings. I'll just remove generate_type.

Owner

@SearchSavior SearchSavior Oct 10, 2025

What if we make it pooling_type and maybe add other pooling strategies later? I was just reading about mean pooling vs. last-token pooling and it sounds interesting, possibly easy to implement; I don't fully understand how it works yet. Maybe use pooling_type as a knob like generate_type to keep the async serving abstraction intact, so we configure pooling_type from a request header. Since the generate_embeddings call probably won't change much, we make it easy to tinker with more techniques in the future. What do you think?
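A minimal sketch of the two pooling strategies being discussed, using plain Python lists so the arithmetic is explicit (this is illustrative, not the PR's code): last-token pooling is what the `last_hidden_states[torch.arange(...), sequence_lengths]` line above implements, and a pooling_type knob could select between strategies like these.

```python
def mean_pool(hidden, mask):
    """Average the hidden states of non-padded positions in each sequence."""
    pooled = []
    for states, m in zip(hidden, mask):
        kept = [s for s, keep in zip(states, m) if keep]
        dim = len(kept[0])
        pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
    return pooled

def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padded (real) token."""
    pooled = []
    for states, m in zip(hidden, mask):
        last = sum(m) - 1  # index of last real token, assuming right padding
        pooled.append(states[last])
    return pooled

# Batch of 2 sequences, hidden size 2; second sequence has one padded slot.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

print(mean_pool(hidden, mask))        # [[3.0, 4.0], [3.0, 3.0]]
print(last_token_pool(hidden, mask))  # [[5.0, 6.0], [4.0, 4.0]]
```

Mean pooling averages every real token's vector; last-token pooling trusts the final token to summarize the sequence, which is what the Qwen3 embedding example does.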

from pydantic import BaseModel, Field

# converted from https://huggingface.co/docs/transformers/main_classes/tokenizer
class TokenizerConfig(BaseModel):
Owner

Must be more descriptive; most generative models use some kind of tokenizer.

Optimum_EmbConfig scopes engine, task and intention

Contributor Author

Well, I guess it's PreTrainedTokenizerConfig based on the huggingface page I referenced. Does that work?

Owner

EMBTokenizerConfig seems good

@SearchSavior
Owner

@mwrothbe Will let you know once first review pass is done.

Great work so far. Very excited to merge this. Blown away you decided to contribute, and I deeply appreciate your work.

A quick note- I remember that Qwen3 embed takes an instruction on top of the text for embedding task. Have you tried this, and how is it implemented?

Please join our discord!!

@mwrothbe
Contributor Author

mwrothbe commented Oct 10, 2025

A quick note- I remember that Qwen3 embed takes an instruction on top of the text for embedding task. Have you tried this, and how is it implemented?

Ya, so I'm not sure I really get this yet. I saw that on https://huggingface.co/Qwen/Qwen3-Embedding-0.6B the example includes a prompt on queries, but not on documents. I guessed this means that without a prompt, the model returns a "pure" vector representation of the text, and if a prompt is included, it will attempt to provide a document-like representation instead of a direct representation, which would likely be more of a question than a statement. I seem to be able to get good results (on an extremely small dataset) using dot-product similarity without adding prompts to the queries. I might try adding prompts on the query and see if that improves retrieval scores at all. Anyway, again assuming I understand this correctly, the user can just add whatever text prompt they want to the text submitted, so I don't think we need to add any kind of special API capability to support this model.
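For reference, a sketch of the query-side instruction format from the Qwen3-Embedding model card: queries get an "Instruct: ...\nQuery: ..." prefix while documents are embedded as-is, and similarity is a dot product over the resulting vectors. The vectors below are made up, purely to show the scoring step.

```python
def detailed_instruct(task: str, query: str) -> str:
    # Query format from the Qwen3-Embedding model card; documents get no prefix.
    return f"Instruct: {task}\nQuery: {query}"

def dot(a, b):
    # Dot-product similarity; equals cosine similarity if vectors are normalized.
    return sum(x * y for x, y in zip(a, b))

task = "Given a web search query, retrieve relevant passages that answer the query"
query_text = detailed_instruct(task, "What is the capital of China?")
print(query_text)

# Hypothetical embedding vectors, just to illustrate scoring.
query_vec = [0.6, 0.8]
doc_vec = [0.8, 0.6]
print(dot(query_vec, doc_vec))  # ~0.96
```

Since the instruction is just prepended text, a client can apply it before calling the API, which matches the comment above that no special API capability is needed.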

@mwrothbe
Contributor Author

OK, I updated the PR with the bugfixes and comments from above. Turns out changes reflect in this PR and I don't have to create a new one (still learning git.) The bug fix was a change in the main.py embeddings route to return an array of embedding objects instead of an array of an array, in order to comply with the OpenAI response definition.
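The shape fix described above can be sketched as follows. This is a minimal, illustrative builder (the function name is mine, not the PR's) for the OpenAI embeddings response, where each vector gets its own `{"object": "embedding", ...}` item under "data" instead of a bare nested list.

```python
def build_embeddings_response(vectors, model: str, prompt_tokens: int) -> dict:
    """Wrap raw vectors in the OpenAI /v1/embeddings response shape."""
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(vectors)
        ],
        "model": model,
        "usage": {"prompt_tokens": prompt_tokens, "total_tokens": prompt_tokens},
    }

resp = build_embeddings_response([[0.1, 0.2], [0.3, 0.4]], "Qwen-Embed", 12)
print(resp["data"][0]["object"])  # embedding
print(len(resp["data"]))          # 2
```

Returning an array of arrays instead of these embedding objects is exactly what breaks strict OpenAI clients, since they index `response.data[i].embedding`.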

Owner

@SearchSavior SearchSavior left a comment

will test this once I get home from work but it looks great!


try:

tok_config = TokenizerConfig(
Owner

Maybe add all the options from TokenizerConfig (or whatever it gets named) here, and make them all optional with empty defaults so we can control them through requests without breaking the OpenAI API format. In practice this means you can use the OpenAI Python library to interface, meet its minimum requirements, then load up the request body with other options to control tokenization behavior on every request. Maybe add tokenizerconfig as an object in the request body of /v1/embeddings to keep the OpenAI interface easier to read and tokenizerconfig extendable. The other OpenAI endpoints will probably be handled this way as we add more features like sampling etc. Let me know what you think.
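A sketch of the idea in this comment, with illustrative names rather than the PR's actual ones: keep the standard OpenAI fields of /v1/embeddings, and treat any extra keys in the request body as tokenizer options, so an unmodified OpenAI client can pass them via extra_body without breaking the interface.

```python
# Standard OpenAI /v1/embeddings request fields.
OPENAI_FIELDS = {"model", "input", "dimensions", "encoding_format", "user"}

def split_request(body: dict):
    """Separate OpenAI-standard fields from extra tokenizer options."""
    openai_part = {k: v for k, v in body.items() if k in OPENAI_FIELDS}
    tokenizer_part = {k: v for k, v in body.items() if k not in OPENAI_FIELDS}
    return openai_part, tokenizer_part

body = {
    "model": "Qwen-Embed",
    "input": "Hello, this is a test embedding",
    "dimensions": 1024,
    "padding": True,     # tokenizer option smuggled in via extra_body
    "truncation": True,  # likewise
}
openai_part, tok_part = split_request(body)
print(sorted(tok_part))  # ['padding', 'truncation']
```

Nesting the tokenizer options under a single object key (e.g. a "tokenizer_config" field), as suggested above, would make the split trivial and the schema self-documenting.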

from pydantic import BaseModel, Field

# converted from https://huggingface.co/docs/transformers/main_classes/tokenizer
class TokenizerConfig(BaseModel):
Owner

EMBTokenizerConfig seems good

encoding_format: Optional[str] = "float"  # not implemented
user: Optional[str] = None  # not implemented
# end of openai api
config: Optional[PreTrainedTokenizerConfig] = None
Owner

(dont make this change) but would your approach in EmbeddingRequest work here?

config_kwargs = {

Yours looks much cleaner since there are a ton of sampling settings to keep organized. Nice.

@SearchSavior
Owner

OK, I updated the PR with the bugfixes and comments from above. Turns out changes reflect in this PR and I don't have to create a new one (still learning git.) The bug fix was a change in the main.py embeddings route to return an array of embedding objects instead of an array of an array, in order to comply with the OpenAI response definition.

Yes, I am also still learning git. If it makes you feel better, I have been worried about clicking merge pull request by accident since you opened this, lmao. Good catch.

@SearchSavior
Owner

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customizations aside from your bug.

@mwrothbe
Contributor Author

mwrothbe commented Oct 10, 2025

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customizations aside from your bug.

Well, I'm really not a python guy, so I'm probably not the best person to critique the code base. With that said, I was able to figure things out (with some trial and error and print markers) without being a python guy, so that's a good sign. With this in mind, some of my code may be really rudimentary and may need to be refactored by someone who knows what they are doing.

@SearchSavior
Owner

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customization aside from your bug.

Well, I'm really not a python guy, so I'm probably not the best person to critique the code base. With that said, I was able to figure things out (with some trial and error and print markers) without being a python guy, so that's a good sign. With this in mind, some of my code may be really rudimentary and may need to be refactored by someone who knows what they are doing.

Appreciate this! Knowing someone else has been poking around code I look at almost everyday was instructive. Thanks for the feedback :)

@SearchSavior
Owner

SearchSavior commented Oct 11, 2025

@mwrothbe OK, so I took your example from #32

import os
from openai import OpenAI

def main():
    api_key = os.getenv("OPENARC_API_KEY")

    base_url = "http://localhost:8000/v1"

    client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )

    text = "Hello, this is a test embedding"

    config = {
        "padding": True,
        "truncation": True,
        "return_tensors": "pt"   
    }
    
    response = client.embeddings.create(
        model="Qwen-Embed",
        input=text,
        dimensions=1024,
        extra_body=config, # entrypoint for extra options. Otherwise .create() complains.
    )

    embedding = response.data[0].embedding
    print(f"Embedding vector: {embedding}")
    print(f"Vector length: {len(embedding)}")

if __name__ == "__main__":
    main()

This works on CPU but fails with GPU A770.

Had some errors related to input formatting

RuntimeError: ("Exception from src/inference/src/cpp/infer_request.cpp:223:\nCheck 'args[i].index < data.inputs.size() && data.inputs[args[i].index]' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:91:\nThe allocated input memory is necessary to set kernel arguments.\n\n",

Got unexpected inputs: expecting the following inputs {'attention_mask', 'position_ids', 'input_ids'} but got {'attention_mask', 'input_ids'}.")
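One way a missing input like position_ids is commonly constructed, sketched here in plain Python (this is not a confirmed fix for the GPU error above, just what the expected input usually contains): a cumulative count over the attention mask, so padded positions don't advance the counter. Some exported models list position_ids as an explicit input instead of deriving it internally.

```python
def build_position_ids(attention_mask):
    """Derive position_ids from an attention mask; padded slots pinned to 0."""
    position_ids = []
    for row in attention_mask:
        pos, counter = [], 0
        for m in row:
            pos.append(counter if m else 0)
            if m:
                counter += 1
        position_ids.append(pos)
    return position_ids

mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(build_position_ids(mask))  # [[0, 1, 2, 0], [0, 1, 0, 0]]
```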

So I tried the qwen embedding notebook and it works on GPU there.

But loading the model on the server gives

working...
❌ error: 500
Response: {"detail":"Failed to load model: Model loading failed:
The library name could not be automatically inferred. If using 
the command-line, please provide the argument --library 
{transformers,diffusers,timm,sentence_transformers}. Example: 
`--library diffusers`."}

I have been driving myself absolutely insane all afternoon on this lol.

@SearchSavior
Owner

@mwrothbe I have been trying to break the PretrainedTokenizer approach all day to see if there were some defaults being overridden somewhere, or unseen automatic work of AutoTokenizer being interfered with by PretrainedTokenizer options. Changing typing in the pydantic model from default=None did not work, and neither did making tok_config match the Qwen example without using pydantic at all. Hopefully it worked for you on GPU and it's something in my environment.

@mwrothbe
Contributor Author

Huh. Seems to be working for me with the GPU setting: "openarc add --model-name qwen3_emb --model-path ~/Models/qwen3_emb/ --engine optimum --model-type emb --device GPU". I also have an A770. Although I plan to just use CPU for this, as it's plenty fast (possibly using AMX, but I don't know how to confirm that), and I want to save GPU resources for other things.

On the last update I had to make a couple of edits to pyproject.toml, as uv couldn't seem to find the versions listed. Not sure if that matters; the edits are highlighted in the Files Changed section of this PR. I'm also using the latest optimum-intel: uv pip install -U "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel". I don't think I updated anything else.

@mwrothbe
Contributor Author

In case you are curious, on my system sending the same ~400 token input for embedding multiple times, seems to take ~45ms using GPU and ~200ms on CPU. So it does appear that GPU is actually working on my end.
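A sketch of how per-request latencies like the ~45 ms / ~200 ms above can be measured: wall-clock repeated calls with time.perf_counter and average. The embed() function here is a stand-in for the actual embeddings request, not the PR's code.

```python
import time

def embed(text: str):
    # Stand-in for a real embeddings call (e.g. client.embeddings.create).
    return [0.0] * 1024

def time_request(text: str, runs: int = 5) -> float:
    """Return mean wall-clock seconds per embed() call over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        embed(text)
    return (time.perf_counter() - start) / runs

mean_s = time_request("same ~400 token input", runs=5)
print(f"{mean_s * 1000:.3f} ms per request")
```

Repeating the same input, as done above, also warms any caches, so the steady-state number is what gets reported.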

@SearchSavior
Owner

TBH that makes more sense. Yes, OpenVINO does use AMX. My hardware doesn't support AMX, so I didn't consider it lol. I also would use GPU for other things... I guess I spent most of today learning more about what this code does and the Qwen Embedding paper.

Have you tried other models than Qwen-Embedding?

@SearchSavior
Owner

@mwrothbe any other changes you want to make before merge? :)

@mwrothbe
Contributor Author

Nope. I'm good with this PR if you are.

@SearchSavior SearchSavior merged commit 8a3d1c6 into SearchSavior:1.0.6 Oct 13, 2025