Added embedding service#33

Merged
SearchSavior merged 2 commits into SearchSavior:1.0.6 from mwrothbe:1.0.6Emb
Oct 13, 2025

Conversation

@mwrothbe
Contributor

@mwrothbe mwrothbe commented Oct 9, 2025

This enables the use of OpenArc to provide embeddings for RAG pipelines, and possibly other uses. It provides an OpenAI-compatible interface that's confirmed to work with the C# OpenAIClient library. Not all OpenAI inputs are implemented (user and encoding_format). The API also accepts an optional PreTrainedTokenizer object that is passed into the model, with the goal of using the API with various models that have different requirements. Only Qwen3-Embedding-0.6B has been tested. The embeddings generator is in the optimum domain.

async def generate_embeddings(self, tok_config: TokenizerConfig) -> AsyncIterator[Union[Dict[str, Any], str]]:

# Tokenize the input texts
batch_dict = self.encoder_tokenizer(
Owner

Please add some comments in the 'Conversation' about what tasks these cover and how they can be used together. It looks like text similarity.

Contributor Author

You mean, in the Issues thread give a quick description on how I plan to use the embedding api?

Owner

In this PR; it's cleaner for future reference. Describe what tasks you intend to enable and what the scope is. It seems like you want maximum access to the low-level options one would normally configure in a script, to leverage serving, which has really been the goal of OpenArc from the beginning, so that works quite well.

Contributor Author

Updated

batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def generate_type(self, tok_config: TokenizerConfig):
Owner

generate_type is used to keep the API abstraction intact for the stream bool behavior. Since we don't stream, you can probably call generate_embeddings directly, unless you think this affects task coverage.

Contributor Author

Yes, I'll short circuit it. I was just hacking my way through and trying not to deviate too much from a working example until I figured out which direction was up.

Contributor Author

Looks like the infer_emb function is already pointing to generate_embeddings. I'll just remove generate_type.

Owner

@SearchSavior SearchSavior Oct 10, 2025

What if we make it pooling_type and maybe add other pooling strategies later? I was just reading about mean pooling vs. last-token pooling and it sounds interesting, possibly easy to implement; I don't fully understand how it works yet. Maybe use pooling_type as a knob like generate_type to keep the async serving abstraction intact, so we configure pooling_type from a request header. Since the generate_embeddings call probably won't change much, we make it easy to tinker with more techniques in the future. What do you think?
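A minimal sketch of the two pooling strategies being discussed, using plain Python lists so the arithmetic is explicit (this is illustrative, not the PR's code): last-token pooling is what the `last_hidden_states[torch.arange(...), sequence_lengths]` line above implements, and a pooling_type knob could select between strategies like these.

```python
def mean_pool(hidden, mask):
    """Average the hidden states of non-padded positions in each sequence."""
    pooled = []
    for states, m in zip(hidden, mask):
        kept = [s for s, keep in zip(states, m) if keep]
        dim = len(kept[0])
        pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
    return pooled

def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padded (real) token."""
    pooled = []
    for states, m in zip(hidden, mask):
        last = sum(m) - 1  # index of last real token, assuming right padding
        pooled.append(states[last])
    return pooled

# Batch of 2 sequences, hidden size 2; second sequence has one padded slot.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

print(mean_pool(hidden, mask))        # [[3.0, 4.0], [3.0, 3.0]]
print(last_token_pool(hidden, mask))  # [[5.0, 6.0], [4.0, 4.0]]
```

Mean pooling averages every real token's vector; last-token pooling trusts the final token to summarize the sequence, which is what the Qwen3 embedding example does.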

from pydantic import BaseModel, Field

# converted from https://huggingface.co/docs/transformers/main_classes/tokenizer
class TokenizerConfig(BaseModel):
Owner

Must be more descriptive; most generative models use some kind of tokenizer.

Optimum_EmbConfig scopes engine, task and intention

Contributor Author

Well, I guess it's PreTrainedTokenizerConfig based on the huggingface page I referenced. Does that work?

Owner

EMBTokenizerConfig seems good

@SearchSavior
Owner

@mwrothbe Will let you know once first review pass is done.

Great work so far. Very excited to merge this. Blown away you decided to contribute, and I deeply appreciate your work.

A quick note- I remember that Qwen3 embed takes an instruction on top of the text for embedding task. Have you tried this, and how is it implemented?

Please join our discord!!

@mwrothbe
Contributor Author

mwrothbe commented Oct 10, 2025

A quick note- I remember that Qwen3 embed takes an instruction on top of the text for embedding task. Have you tried this, and how is it implemented?

Ya, so I'm not sure I really get this yet. I saw that on https://huggingface.co/Qwen/Qwen3-Embedding-0.6B the example includes a prompt on queries, but not on documents. I guessed this means that without a prompt, the model returns a "pure" vector representation of the text, and if a prompt is included, it will attempt to provide a document-like representation instead of a direct representation, which would likely be more of a question than a statement. I seem to be able to get good results (on an extremely small dataset) using dot-product similarity without adding prompts to the queries. I might try adding prompts on the query and see if that improves retrieval scores at all. Anyway, again assuming I understand this correctly, the user can just add whatever text prompt they want to the text submitted, so I don't think we need to add any kind of special API capability to support this model.
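For reference, a sketch of the query-side instruction format from the Qwen3-Embedding model card: queries get an "Instruct: ...\nQuery: ..." prefix while documents are embedded as-is, and similarity is a dot product over the resulting vectors. The vectors below are made up, purely to show the scoring step.

```python
def detailed_instruct(task: str, query: str) -> str:
    # Query format from the Qwen3-Embedding model card; documents get no prefix.
    return f"Instruct: {task}\nQuery: {query}"

def dot(a, b):
    # Dot-product similarity; equals cosine similarity if vectors are normalized.
    return sum(x * y for x, y in zip(a, b))

task = "Given a web search query, retrieve relevant passages that answer the query"
query_text = detailed_instruct(task, "What is the capital of China?")
print(query_text)

# Hypothetical embedding vectors, just to illustrate scoring.
query_vec = [0.6, 0.8]
doc_vec = [0.8, 0.6]
print(dot(query_vec, doc_vec))  # ~0.96
```

Since the instruction is just prepended text, a client can apply it before calling the API, which matches the comment above that no special API capability is needed.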

@mwrothbe
Contributor Author

OK, I updated the PR with the bugfixes and comments from above. Turns out changes reflect in this PR and I don't have to create a new one (still learning git.) The bug fix was a change in the main.py embeddings route to return an array of embedding objects instead of an array of an array, in order to comply with the OpenAI response definition.
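The shape fix described above can be sketched as follows. This is a minimal, illustrative builder (the function name is mine, not the PR's) for the OpenAI embeddings response, where each vector gets its own `{"object": "embedding", ...}` item under "data" instead of a bare nested list.

```python
def build_embeddings_response(vectors, model: str, prompt_tokens: int) -> dict:
    """Wrap raw vectors in the OpenAI /v1/embeddings response shape."""
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(vectors)
        ],
        "model": model,
        "usage": {"prompt_tokens": prompt_tokens, "total_tokens": prompt_tokens},
    }

resp = build_embeddings_response([[0.1, 0.2], [0.3, 0.4]], "Qwen-Embed", 12)
print(resp["data"][0]["object"])  # embedding
print(len(resp["data"]))          # 2
```

Returning an array of arrays instead of these embedding objects is exactly what breaks strict OpenAI clients, since they index `response.data[i].embedding`.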

Owner

@SearchSavior SearchSavior left a comment

will test this once I get home from work but it looks great!


try:

tok_config = TokenizerConfig(
Owner

Maybe add all the options from TokenizerConfig (or whatever it gets named) here, and make them all optional with empty defaults so we can control them through requests without breaking the OpenAI API format. In practice this means you can use the OpenAI Python library to interface, meet its minimum requirements, then load up the request body with other options to control tokenization behavior on every request. Maybe add tokenizerconfig as an object in the request body of /v1/embeddings to keep the OpenAI interface easier to read and tokenizerconfig extendable. The other OpenAI endpoints will probably be handled this way as we add more features like sampling etc. Let me know what you think.
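A sketch of the idea in this comment, with illustrative names rather than the PR's actual ones: keep the standard OpenAI fields of /v1/embeddings, and treat any extra keys in the request body as tokenizer options, so an unmodified OpenAI client can pass them via extra_body without breaking the interface.

```python
# Standard OpenAI /v1/embeddings request fields.
OPENAI_FIELDS = {"model", "input", "dimensions", "encoding_format", "user"}

def split_request(body: dict):
    """Separate OpenAI-standard fields from extra tokenizer options."""
    openai_part = {k: v for k, v in body.items() if k in OPENAI_FIELDS}
    tokenizer_part = {k: v for k, v in body.items() if k not in OPENAI_FIELDS}
    return openai_part, tokenizer_part

body = {
    "model": "Qwen-Embed",
    "input": "Hello, this is a test embedding",
    "dimensions": 1024,
    "padding": True,     # tokenizer option smuggled in via extra_body
    "truncation": True,  # likewise
}
openai_part, tok_part = split_request(body)
print(sorted(tok_part))  # ['padding', 'truncation']
```

Nesting the tokenizer options under a single object key (e.g. a "tokenizer_config" field), as suggested above, would make the split trivial and the schema self-documenting.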

from pydantic import BaseModel, Field

# converted from https://huggingface.co/docs/transformers/main_classes/tokenizer
class TokenizerConfig(BaseModel):
Owner

EMBTokenizerConfig seems good

encoding_format: Optional[str] = "float"  # not implemented
user: Optional[str] = None  # not implemented
# end of openai api
config: Optional[PreTrainedTokenizerConfig] = None
Owner

(dont make this change) but would your approach in EmbeddingRequest work here?

config_kwargs = {

Yours looks much cleaner since there are a ton of sampling settings to keep organized. Nice.

@SearchSavior
Owner

OK, I updated the PR with the bugfixes and comments from above. Turns out changes reflect in this PR and I don't have to create a new one (still learning git.) The bug fix was a change in the main.py embeddings route to return an array of embedding objects instead of an array of an array, in order to comply with the OpenAI response definition.

Yes, I am also still learning git. If it makes you feel better, I have been worried about clicking merge pull request by accident since you opened this, lmao. Good catch.

@SearchSavior
Owner

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customizations aside from your bug.

@mwrothbe
Contributor Author

mwrothbe commented Oct 10, 2025

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customizations aside from your bug.

Well, I'm really not a python guy, so I'm probably not the best person to critique the code base. With that said, I was able to figure things out (with some trial and error and print markers) without being a python guy, so that's a good sign. With this in mind, some of my code may be really rudimentary and may need to be refactored by someone who knows what they are doing.

@SearchSavior
Owner

@mwrothbe if you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customization aside from your bug.

Well, I'm really not a python guy, so I'm probably not the best person to critique the code base. With that said, I was able to figure things out (with some trial and error and print markers) without being a python guy, so that's a good sign. With this in mind, some of my code may be really rudimentary and may need to be refactored by someone who knows what they are doing.

Appreciate this! Knowing someone else has been poking around code I look at almost everyday was instructive. Thanks for the feedback :)

@SearchSavior
Owner

SearchSavior commented Oct 11, 2025

@mwrothbe OK, so I took your example from #32

import os
from openai import OpenAI

def main():
    api_key = os.getenv("OPENARC_API_KEY")

    base_url = "http://localhost:8000/v1"

    client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )

    text = "Hello, this is a test embedding"

    config = {
        "padding": True,
        "truncation": True,
        "return_tensors": "pt"   
    }
    
    response = client.embeddings.create(
        model="Qwen-Embed",
        input=text,
        dimensions=1024,
        extra_body=config, # entrypoint for extra options. Otherwise .create() complains.
    )

    embedding = response.data[0].embedding
    print(f"Embedding vector: {embedding}")
    print(f"Vector length: {len(embedding)}")

if __name__ == "__main__":
    main()

This works on CPU but fails with GPU A770.

Had some errors related to input formatting

RuntimeError: ("Exception from src/inference/src/cpp/infer_request.cpp:223:\nCheck 'args[i].index < data.inputs.size() && data.inputs[args[i].index]' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:91:\nThe allocated input memory is necessary to set kernel arguments.\n\n",

Got unexpected inputs: expecting the following inputs {'attention_mask', 'position_ids', 'input_ids'} but got {'attention_mask', 'input_ids'}.")
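One way a missing input like position_ids is commonly constructed, sketched here in plain Python (this is not a confirmed fix for the GPU error above, just what the expected input usually contains): a cumulative count over the attention mask, so padded positions don't advance the counter. Some exported models list position_ids as an explicit input instead of deriving it internally.

```python
def build_position_ids(attention_mask):
    """Derive position_ids from an attention mask; padded slots pinned to 0."""
    position_ids = []
    for row in attention_mask:
        pos, counter = [], 0
        for m in row:
            pos.append(counter if m else 0)
            if m:
                counter += 1
        position_ids.append(pos)
    return position_ids

mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(build_position_ids(mask))  # [[0, 1, 2, 0], [0, 1, 0, 0]]
```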

So I tried the qwen embedding notebook and it works on GPU there.

But loading the model on the server gives

working...
❌ error: 500
Response: {"detail":"Failed to load model: Model loading failed:
The library name could not be automatically inferred. If using 
the command-line, please provide the argument --library 
{transformers,diffusers,timm,sentence_transformers}. Example: 
`--library diffusers`."}

I have been driving myself absolutely insane all afternoon on this lol.

@SearchSavior
Owner

@mwrothbe I have been trying to break the PretrainedTokenizer approach all day to see if there were some defaults being overridden somewhere, or unseen automatic work of AutoTokenizer being interfered with by PretrainedTokenizer options. Changing typing in the pydantic model from default=None did not work, and neither did making tok_config match the Qwen example without using pydantic at all. Hopefully it worked for you on GPU and it's something in my environment.

@mwrothbe
Contributor Author

Huh. Seems to be working for me with the GPU setting: "openarc add --model-name qwen3_emb --model-path ~/Models/qwen3_emb/ --engine optimum --model-type emb --device GPU". I also have an A770. Although I plan to just use CPU for this, as it's plenty fast (possibly using AMX, but I don't know how to confirm that), and I want to save GPU resources for other things.

On the last update I had to make a couple of edits to pyproject.toml, as uv couldn't seem to find the versions listed. Not sure if that matters; the edits are highlighted in the Files Changed section of this PR. I'm also using the latest optimum-intel: uv pip install -U "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel". I don't think I updated anything else.

@mwrothbe
Contributor Author

In case you are curious, on my system sending the same ~400 token input for embedding multiple times, seems to take ~45ms using GPU and ~200ms on CPU. So it does appear that GPU is actually working on my end.
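A sketch of how per-request latencies like the ~45 ms / ~200 ms above can be measured: wall-clock repeated calls with time.perf_counter and average. The embed() function here is a stand-in for the actual embeddings request, not the PR's code.

```python
import time

def embed(text: str):
    # Stand-in for a real embeddings call (e.g. client.embeddings.create).
    return [0.0] * 1024

def time_request(text: str, runs: int = 5) -> float:
    """Return mean wall-clock seconds per embed() call over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        embed(text)
    return (time.perf_counter() - start) / runs

mean_s = time_request("same ~400 token input", runs=5)
print(f"{mean_s * 1000:.3f} ms per request")
```

Repeating the same input, as done above, also warms any caches, so the steady-state number is what gets reported.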

@SearchSavior
Owner

TBH that makes more sense. Yes, OpenVINO does use AMX. My hardware doesn't support AMX, so I didn't consider it lol. I also would use GPU for other things... I guess I spent most of today learning more about what this code does and the Qwen Embedding paper.

Have you tried other models than Qwen-Embedding?

@SearchSavior
Owner

@mwrothbe any other changes you want to make before merge? :)

@mwrothbe
Contributor Author

Nope. I'm good with this PR if you are.

@SearchSavior SearchSavior merged commit 8a3d1c6 into SearchSavior:1.0.6 Oct 13, 2025