Added embedding service #33
Conversation
src/engine/optimum/optimum_emb.py
Outdated
async def generate_embeddings(self, tok_config: TokenizerConfig) -> AsyncIterator[Union[Dict[str, Any], str]]:

    # Tokenize the input texts
    batch_dict = self.encoder_tokenizer(
Please add some comments in the conversation about what tasks these cover and how they can be used together. It looks like text similarity.
You mean, in the Issues thread give a quick description on how I plan to use the embedding api?
In this PR; it's cleaner for future reference. Describe what tasks you intend to enable and what the scope is. It seems like you want maximum access to the low-level options one would normally configure in a script, to leverage serving, which really has been the goal of OpenArc from the beginning, so that works quite well.
src/engine/optimum/optimum_emb.py
Outdated
batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def generate_type(self, tok_config: TokenizerConfig):
generate_type is used to keep the API abstraction intact for the stream bool behavior. Since we don't stream, you can probably call generate_embeddings directly, unless you think this affects task coverage.
Yes, I'll short circuit it. I was just hacking my way through and trying not to deviate too much from a working example until I figured out which direction was up.
Looks like the infer_emb function is already pointing to generate_embeddings. I'll just remove generate_type.
What if we make it pooling_type and maybe add other pooling strategies later? I was just reading about mean pooling vs. last-token pooling and it sounds interesting, possibly easy to implement. I don't fully understand how it works yet. Maybe use pooling_type as a knob like generate_type to keep the async serving abstraction intact, so we configure pooling_type from a request header. Since the generate_embeddings call probably won't change much, that makes it easy to tinker with more techniques in the future. What do you think?
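For reference, a rough sketch of what a pooling_type knob could dispatch to, assuming a transformers-style last_hidden_state of shape (batch, seq_len, hidden) and an attention_mask of shape (batch, seq_len); all names here are illustrative, not the actual OpenArc API:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def last_token_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Take the hidden state of each sequence's final non-padding token,
    # as in the snippet above using torch.arange + sequence_lengths.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch, sequence_lengths]

# pooling_type could come from the request and select a strategy here.
POOLERS = {"mean": mean_pool, "last_token": last_token_pool}

def pool(pooling_type: str, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return POOLERS[pooling_type](hidden, mask)
```

A registry like this keeps generate_embeddings unchanged while new strategies are added as entries.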
src/server/models/optimum.py
Outdated
from pydantic import BaseModel, Field

# converted from https://huggingface.co/docs/transformers/main_classes/tokenizer
class TokenizerConfig(BaseModel):
This must be more descriptive; most generative models use some kind of tokenizer. Optimum_EmbConfig scopes engine, task, and intention.
Well, I guess it's PreTrainedTokenizerConfig based on the huggingface page I referenced. Does that work?
EMBTokenizerConfig seems good
@mwrothbe Will let you know once the first review pass is done. Great work so far. Very excited to merge this... blown away you decided to contribute, and I deeply appreciate your work. A quick note: I remember that Qwen3 Embedding takes an instruction on top of the text for the embedding task. Have you tried this, and how is it implemented? Please join our discord!!
Ya, so I'm not sure I really get this yet. I saw that on https://huggingface.co/Qwen/Qwen3-Embedding-0.6B the example includes a prompt on queries, but not on documents. I took this to mean that without a prompt, the model returns a "pure" vector representation of the text, and if a prompt is included, it will attempt to provide a document-like representation instead of a direct representation, which would likely be more of a question than a statement. I seem to be able to get good results (on an extremely small dataset) using dot-product similarity without adding prompts to the queries. I might try adding prompts on the query side and see if that improves retrieval scores at all. Anyway, again assuming I understand this correctly, the user can just add whatever text prompt they want to the text submitted, so I don't think we need to add any kind of special API capability to support this model.
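If it helps, here is a sketch of the client-side prompting described above. The instruction template follows the example on the Qwen3-Embedding-0.6B model card; the task description and texts are made up for illustration.

```python
# Client-side instruction prompting for an embedding model like Qwen3-Embedding:
# queries get an instruction prefix, documents are embedded as-is, so the
# server needs no special support.

def format_query(task: str, query: str) -> str:
    # Template from the Qwen3-Embedding model card example (illustrative).
    return f"Instruct: {task}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
texts = [
    format_query(task, "What is the capital of France?"),  # query, with prompt
    "Paris is the capital and largest city of France.",    # document, no prompt
]
```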
OK, I updated the PR with the bugfixes and comments from above. Turns out changes reflect in this PR and I don't have to create a new one (still learning git). The bug fix was a change in the main.py embeddings route to return an array of embedding objects instead of an array of an array, in order to comply with the OpenAI response definition.
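For reference, a minimal sketch of the response shape that fix targets, assuming the OpenAI embeddings format; the helper name and the zeroed usage counts are illustrative, not OpenArc's actual code:

```python
# OpenAI-style /v1/embeddings response: "data" is a flat list of embedding
# objects (one per input), not a list of lists.
from typing import Dict, List

def build_embeddings_response(vectors: List[List[float]], model: str) -> Dict:
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(vectors)
        ],
        "model": model,
        "usage": {"prompt_tokens": 0, "total_tokens": 0},  # placeholder counts
    }
```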
SearchSavior
left a comment
Will test this once I get home from work, but it looks great!
src/server/main.py
Outdated
try:

    tok_config = TokenizerConfig(
Maybe add all the options from TokenizerConfig (or whatever it gets named) here, and make them all optional with empty defaults so we can control them through requests without breaking the OpenAI API format. In practice this means you can use the OpenAI Python library to interface: meet its minimum requirements, then load up the request body with other options to control tokenization behavior on every request. Maybe add the tokenizer config as an object in the /v1/embeddings request body to keep the OpenAI interface easier to read and the tokenizer config extendable. The other OpenAI endpoints will probably be handled this way as we add more features like sampling, etc. Lmk what you think.
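A sketch of what that nested request model could look like, assuming Pydantic and keeping the standard OpenAI fields intact; the extension field names and tokenizer options shown are assumptions, not the final schema:

```python
# /v1/embeddings request body with an optional nested tokenizer config.
# Standard OpenAI clients can omit the extension entirely.
from typing import List, Optional, Union
from pydantic import BaseModel

class EMBTokenizerConfig(BaseModel):
    padding: Optional[Union[bool, str]] = True
    truncation: Optional[bool] = True
    max_length: Optional[int] = None

class EmbeddingRequest(BaseModel):
    model: str
    input: Union[str, List[str]]
    encoding_format: Optional[str] = "float"  # OpenAI field
    user: Optional[str] = None                # OpenAI field
    tokenizer_config: Optional[EMBTokenizerConfig] = None  # OpenArc extension
```

With the OpenAI Python client, the extra object could be passed through the request body (e.g. via extra_body) without breaking validation on the client side.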
encoding_format: Optional[str] = "float"  # not implemented
user: Optional[str] = None  # not implemented
# end of openai api
config: Optional[PreTrainedTokenizerConfig] = None
(Don't make this change, but) would your approach in EmbeddingRequest work here?
Line 184 in 9577c28
Yours looks much cleaner, since there are a ton of sampling settings to keep organized. Nice.
Yes, I am also still learning git. If it makes you feel better, I have been worried about clicking merge pull request by accident since you opened it lmao. Good catch.
@mwrothbe If you are up to it, I would love some feedback on the codebase. Were things easy to understand? Like you said in the issue, most changes have been customizations aside from your bug.
Well, I'm really not a Python guy, so I'm probably not the best person to critique the codebase. With that said, I was able to figure things out (with some trial and error and print markers) without being a Python guy, so that's a good sign. With this in mind, some of my code may be really rudimentary and may need to be refactored by someone who knows what they are doing.

Appreciate this! Knowing someone else has been poking around code I look at almost every day was instructive. Thanks for the feedback :)
@mwrothbe OK, so I took your example from #32. This works on CPU but fails with the GPU (A770); I had some errors related to input formatting. So I tried the qwen embedding notebook and it works on GPU there. But loading the model on the server gives an error. I have been driving myself absolutely insane all afternoon on this lol.
@mwrothbe I have been trying to break the
Huh. Seems to be working for me with the GPU setting: "openarc add --model-name qwen3_emb --model-path ~/Models/qwen3_emb/ --engine optimum --model-type emb --device GPU". I also have an A770. Although I plan to just use CPU for this, as it's plenty fast (possibly using AMX, but I don't know how to confirm that) and I want to save GPU resources for other things. On the last update I had to make a couple of edits to pyproject.toml, as uv couldn't seem to find the versions listed. Not sure if that matters. Edits are highlighted in the Files Changed section of this PR. I'm also using the latest optimum-intel: uv pip install -U "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel". I don't think I updated anything else.
In case you are curious, on my system sending the same ~400 token input for embedding multiple times seems to take ~45ms using GPU and ~200ms on CPU. So it does appear that GPU is actually working on my end.
TBH that makes more sense. Yes, OpenVINO does use AMX. My hardware doesn't support AMX, so I did not consider it lol. I also would use the GPU for other things... I guess I spent most of today learning more about what this code does and the Qwen Embedding paper. Have you tried other models than Qwen-Embedding?
@mwrothbe Any other changes you want to make before merge? :)
Nope. I'm good with this PR if you are.
This enables the use of OpenArc to provide embeddings for RAG pipelines, and possibly other uses. It provides an OpenAI interface that's confirmed to work with the C# OpenAIClient library. Not all OpenAI inputs are implemented (user and encoding_format). The API also accepts an optional PreTrainedTokenizer config object that is passed to the model, with the goal of using the API with various models that have different requirements. Only Qwen3-Embedding-0.6B has been tested. The embeddings generator is in the optimum domain.