Add constructor in OnnxBertBiEncoder #38
Conversation
Signed-off-by: Jorge Bescos Gascon <[email protected]>
362fb8a to 1cb69a9
Sorry, wrong button 🫤
public OnnxBertBiEncoder(OrtEnvironment environment, OrtSession session, HuggingFaceTokenizer tokenizer, PoolingMode poolingMode) {
I am not sure about exposing HuggingFaceTokenizer... Do you have other options in mind?
It could be just a mock.
It is also more specific, because InputStream is too generic. HuggingFaceTokenizer can also be easily instantiated with dependency injection, for example CDI or Spring. Instantiating an InputStream is trickier, because it has to be closed.
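For illustration, a minimal sketch of the two wiring styles being compared, assuming DJL's HuggingFaceTokenizer API and made-up file paths (not part of the PR):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Paths;

public class TokenizerWiring {

    // A tokenizer is easy to expose as a DI bean (CDI @Produces, Spring @Bean, ...),
    // because the container can manage its lifecycle.
    public static HuggingFaceTokenizer tokenizer() throws IOException {
        return HuggingFaceTokenizer.newInstance(Paths.get("/models/bert/tokenizer.json"));
    }

    // An InputStream always has an owner that must close it, so it is harder to
    // hand around as a managed dependency.
    public static byte[] readTokenizerConfig() throws IOException {
        try (InputStream in = TokenizerWiring.class.getResourceAsStream("/bert/tokenizer.json")) {
            if (in == null) {
                throw new IOException("tokenizer.json not found on the classpath");
            }
            return in.readAllBytes();
        }
    }
}
```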
What do you mean by "mock"?
For unit testing, you can mock HuggingFaceTokenizer to provide different behaviors (throwing exceptions and so on).
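As a rough illustration of that idea (not part of the PR), a unit test could stub HuggingFaceTokenizer with Mockito; whether this works directly depends on the class not being final, otherwise Mockito's inline mock maker is needed:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import org.junit.jupiter.api.Test;

class OnnxBertBiEncoderTest {

    @Test
    void tokenizerFailureSurfacesAsException() {
        // Stubbed tokenizer that simulates a failure instead of loading a real dictionary.
        HuggingFaceTokenizer tokenizer = mock(HuggingFaceTokenizer.class);
        when(tokenizer.encode("hello")).thenThrow(new IllegalStateException("boom"));

        // The mock could now be passed to the proposed constructor to exercise error handling.
    }
}
```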
I see. I meant that I would try to avoid exposing the HuggingFaceTokenizer type from the LC4j API, as this class belongs to another library (DJL), which we might stop using in the future.
You are right about that point; however, if you switch to a different tokenizer, it is possible that it will not require that InputStream.
I can keep the InputStream if you want. I don't know many details of the internals, but normally this is solved with a Tokenizer interface, letting the user provide the implementation.
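A minimal sketch of that idea, with invented names (Tokenizer, DjlTokenizer) that are not part of LC4j or the PR, assuming DJL's Encoding.getIds() API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction: only what the encoder needs from a tokenizer.
public interface Tokenizer {

    List<Long> tokenize(String text);
}

// One possible adapter over DJL, kept out of the public API surface.
final class DjlTokenizer implements Tokenizer {

    private final ai.djl.huggingface.tokenizers.HuggingFaceTokenizer delegate;

    DjlTokenizer(ai.djl.huggingface.tokenizers.HuggingFaceTokenizer delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<Long> tokenize(String text) {
        long[] ids = delegate.encode(text).getIds();
        return Arrays.stream(ids).boxed().toList();
    }
}
```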
however, if you switch to a different tokenizer, it is possible that it will not require that InputStream.
What do you mean? InputStream is versatile because it allows reading the tokenizer dictionary from a JAR, the filesystem, or anywhere else.
I can keep the InputStream if you want.
That would be great, thanks. Otherwise we could keep HuggingFaceTokenizer with a big warning message, but I will not promise any backward compatibility; breaking changes will most probably happen at some point.
I don't know many details of the internals, but normally this is solved with a Tokenizer interface, letting the user provide the implementation.
I was considering wrapping HuggingFaceTokenizer into a more generic interface (e.g. Tokenizer), but lots of internal details would leak into the interface. TBH I am not happy with the current implementation (it has to use tokenizer.buildSentence()); this should be re-implemented. The API of this interface would therefore change, which would defeat the whole purpose of the interface in the first place...
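To illustrate the "JAR or filesystem or anywhere else" point, a quick sketch with made-up paths and resource names (not from the PR):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TokenizerSources {

    // Read the tokenizer dictionary packaged inside the application JAR.
    static InputStream fromClasspath() {
        return TokenizerSources.class.getResourceAsStream("/bert/tokenizer.json");
    }

    // Read the same dictionary from the filesystem.
    static InputStream fromFile() throws IOException {
        return Files.newInputStream(Paths.get("/models/bert/tokenizer.json"));
    }
}
```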
What do you mean? InputStream is versatile because it allows reading the tokenizer dictionary from a JAR, the filesystem, or anywhere else.
InputStream is more versatile, I agree. But what I wanted to say is that maybe in the future you will replace the tokenizer with another one that requires two configuration files instead of one, or that does not require a configuration file at all. In those cases the constructor will become tricky to manage.
In any case, from Helidon's point of view it is fine to use the InputStream, so I am going to update the PR with that.
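A hypothetical sketch of that configuration-file concern; the class name and overloads are invented for illustration and are not the actual LC4j API:

```java
import java.io.InputStream;

public class EncoderSketch {

    // Today: one model stream plus one tokenizer stream.
    public EncoderSketch(InputStream model, InputStream tokenizer) {
        // ...
    }

    // A future tokenizer needing two configuration files would force another overload...
    public EncoderSketch(InputStream model, InputStream tokenizerConfig, InputStream vocabulary) {
        // ...
    }

    // ...and one needing no configuration file at all would force yet another.
    public EncoderSketch(InputStream model) {
        // ...
    }
}
```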
Signed-off-by: Jorge Bescos Gascon <[email protected]>
langchain4j left a comment
@jbescos thank you!
I am working on integrating ONNX BERT embeddings in Helidon, and the current constructor of OnnxBertBiEncoder does not fit very well, because it creates a new session and tokenizer every time.
Would you mind adding a new constructor where these values are passed in?
Thank you
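For context, a sketch of how the constructor proposed in this PR could be used so the session and tokenizer are created once and reused; the file paths and the PoolingMode.MEAN value are placeholders, not taken from the PR:

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.io.IOException;
import java.nio.file.Paths;

public class SharedResourcesExample {

    public static void main(String[] args) throws IOException, OrtException {
        // Heavyweight resources created once, e.g. at application startup or as DI beans.
        OrtEnvironment environment = OrtEnvironment.getEnvironment();
        OrtSession session = environment.createSession("/models/bert/model.onnx");
        HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance(Paths.get("/models/bert/tokenizer.json"));

        // Reused by the encoder instead of being re-created on every instantiation.
        OnnxBertBiEncoder encoder =
                new OnnxBertBiEncoder(environment, session, tokenizer, PoolingMode.MEAN);
    }
}
```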