
Conversation

@jbescos (Contributor) commented Sep 4, 2024

I am working on integrating the ONNX BERT embedding model into Helidon, and the current constructor of OnnxBertBiEncoder does not fit very well, because it creates a new session and tokenizer every time.

Would you mind adding a new constructor where these values are passed in?

Thank you
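For context, a minimal sketch of what the requested constructor would enable: the heavyweight objects are built once and reused. The file paths, the MEAN pooling mode, and the commented-out constructor call are assumptions for illustration, not code from this PR.

```java
// Sketch only: assumed model and tokenizer paths.
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.nio.file.Paths;

public class OnnxBertSetup {

    public static void main(String[] args) throws Exception {
        // Build the heavyweight objects once, e.g. in a Helidon service or CDI producer...
        OrtEnvironment environment = OrtEnvironment.getEnvironment();
        OrtSession session = environment.createSession("model.onnx");
        HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(Paths.get("tokenizer.json"));

        // ...and reuse them for every encoder, instead of OnnxBertBiEncoder re-creating a
        // session and tokenizer from InputStreams each time (hypothetical call, see this PR):
        // new OnnxBertBiEncoder(environment, session, tokenizer, PoolingMode.MEAN);
    }
}
```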

Signed-off-by: Jorge Bescos Gascon <[email protected]>
@jbescos force-pushed the OnnxBertBiEncoderFix branch from 362fb8a to 1cb69a9 on September 4, 2024 06:53
@langchain4j closed this Sep 4, 2024
@langchain4j reopened this Sep 4, 2024
@langchain4j (Collaborator):

Sorry, wrong button 🫤

Review thread on the added constructor:

    public OnnxBertBiEncoder(OrtEnvironment environment, OrtSession session, HuggingFaceTokenizer tokenizer, PoolingMode poolingMode) {
@langchain4j (Collaborator):

I am not sure about exposing HuggingFaceTokenizer... Do you have other options in mind?

@jbescos (Contributor, author) Sep 4, 2024:

It could be just a mock.

It is also more specific, because InputStream is too generic. Also, HuggingFaceTokenizer can easily be instantiated with dependency injection, like CDI or Spring, while instantiating an InputStream is tricky because it has to be closed.
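For illustration, a hypothetical Jakarta CDI producer along those lines; the tokenizer path and class names are assumed, and nothing here is langchain4j API:

```java
// Hypothetical sketch: the container creates and caches the tokenizer so that consumers can
// simply @Inject it; an InputStream would instead have to be opened and closed per use.
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Produces;

import java.io.IOException;
import java.nio.file.Paths;

@ApplicationScoped
public class TokenizerProducer {

    @Produces
    @ApplicationScoped
    HuggingFaceTokenizer tokenizer() {
        try {
            // Assumed location of the tokenizer dictionary.
            return HuggingFaceTokenizer.newInstance(Paths.get("/models/tokenizer.json"));
        } catch (IOException e) {
            throw new IllegalStateException("Could not load tokenizer", e);
        }
    }
}
```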

@langchain4j (Collaborator):

What do you mean by "mock"?

@jbescos (Contributor, author):

For unit testing, you can mock HuggingFaceTokenizer to provide different behaviors (throwing exceptions and so on).
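For example, a hedged test sketch assuming Mockito and JUnit 5 on the test classpath; the encoder calls are commented out because the constructor under discussion is the subject of this PR, and mockito-inline may be needed if HuggingFaceTokenizer is final:

```java
// Sketch: a mocked tokenizer lets a test exercise failure paths without a real model.
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import org.junit.jupiter.api.Test;

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

class OnnxBertBiEncoderTest {

    @Test
    void tokenizerFailureIsSurfaced() {
        HuggingFaceTokenizer tokenizer = mock(HuggingFaceTokenizer.class);
        when(tokenizer.encode("some text")).thenThrow(new IllegalStateException("boom"));

        // Hypothetical usage of the proposed constructor:
        // OnnxBertBiEncoder encoder = new OnnxBertBiEncoder(environment, session, tokenizer, poolingMode);
        // assertThrows(IllegalStateException.class, () -> encoder.embed("some text"));
    }
}
```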

@langchain4j (Collaborator):

I see. I meant that I would try to avoid exposing the HuggingFaceTokenizer type in the LC4j API, as this class belongs to another library (DJL), which we might want to stop using in the future.

@jbescos (Contributor, author):

You are right about that point. However, if you want to use a different tokenizer in the future, it is possible that it will not require that InputStream.

I can keep the InputStream if you want. I don't know many details of the internals, but normally this is solved with a Tokenizer interface, letting the user provide the implementation.
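For illustration only, the kind of abstraction being suggested; the interface name and method are hypothetical and not part of langchain4j:

```java
// Hypothetical sketch of a user-provided tokenizer abstraction; not a langchain4j API.
// A real abstraction would have to expose token ids, attention masks, etc., which is the
// leakage concern raised in the reply below.
public interface BertTokenizer {
    long[] tokenize(String text);
}
```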

@langchain4j (Collaborator):

> however, if you want to use a different tokenizer in the future, it is possible that it will not require that InputStream.

What do you mean? InputStream is versatile because it allows reading the tokenizer dictionary from a JAR, from the filesystem, or from anywhere else.

> I can keep the InputStream if you want.

That would be great, thanks. Otherwise we could keep HuggingFaceTokenizer with a big warning message, but I will not promise any backward compatibility; breaking changes will most probably happen at some point.

> normally this is solved with a Tokenizer interface, letting the user provide the implementation.

I was considering wrapping HuggingFaceTokenizer in a more generic interface (e.g. Tokenizer), but lots of internal details would leak into it. TBH I am not happy with the current implementation (the fact that it has to use tokenizer.buildSentence()); it should be re-implemented. The API of such an interface would therefore change, which would defeat the whole purpose of having the interface in the first place...
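A small sketch of that versatility; the resource and file names are assumptions, and the commented-out calls only stand in for however the encoder consumes the streams:

```java
// Sketch: the same tokenizer dictionary can be opened from the classpath (inside a JAR)
// or from the filesystem; either way the encoder just receives an InputStream.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TokenizerSource {

    public static void main(String[] args) throws Exception {
        // From a classpath resource bundled in a JAR:
        try (InputStream fromJar = TokenizerSource.class.getResourceAsStream("/tokenizer.json")) {
            // new OnnxBertBiEncoder(modelStream, fromJar, poolingMode);  // hypothetical call
        }

        // From the filesystem:
        try (InputStream fromFile = Files.newInputStream(Paths.get("/models/tokenizer.json"))) {
            // new OnnxBertBiEncoder(modelStream, fromFile, poolingMode); // hypothetical call
        }
    }
}
```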

@jbescos (Contributor, author):

> What do you mean? InputStream is versatile because it allows reading the tokenizer dictionary from a JAR, from the filesystem, or from anywhere else.

InputStream is more versatile, I agree. What I wanted to say is that maybe in the future you will replace the tokenizer with another one that requires two configuration files instead of one, or none at all. In those cases a constructor taking a single InputStream becomes tricky to manage.

In any case, from the Helidon point of view it is fine to use the InputStream, so I am going to update the PR with that.

@langchain4j (Collaborator) left a comment:

@jbescos thank you!

@langchain4j merged commit 62046f5 into langchain4j:main on Sep 5, 2024