Description
Hi Docling Development Team,
Requested feature
I would like to request a way to unload models from GPU memory, or to reuse shared models if multiple pipelines use the same models.
I realized that creating pipelines with different configurations leads to duplicated allocations of the models those pipelines share. I confirmed this by evaluating docling-serve and inspecting the implementation of DoclingConverterManager in docling-jobkit.
For example, when two jobs are submitted with the Code Enhancement feature on and off respectively, Docling allocates two identical OCR models on the GPU.
I am developing a web application that runs docling as a long-running service, and I want to keep GPU memory usage low because it is scarce. The behaviour above multiplies GPU memory usage, so the GPU runs out of memory quickly when requests with different pipeline configurations are sent to the service, which is a very common workload.
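The duplication can be illustrated with a plain-Python stand-in (the class names below are illustrative only, not docling's actual classes): each pipeline instance loads its own copy of a model even when another pipeline already holds an identical one.

```python
class FakeOcrModel:
    """Stand-in for a GPU-resident OCR model."""
    instances = 0

    def __init__(self) -> None:
        FakeOcrModel.instances += 1  # each construction = one GPU allocation


class Pipeline:
    """Stand-in for a converter pipeline built from a configuration."""

    def __init__(self, code_enrichment: bool) -> None:
        self.code_enrichment = code_enrichment
        self.ocr = FakeOcrModel()  # every pipeline loads its own model


# Two configurations that differ only in an unrelated flag
# still allocate two identical OCR models.
Pipeline(code_enrichment=True)
Pipeline(code_enrichment=False)
print(FakeOcrModel.instances)  # → 2
```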
Alternatives
- Adding a model manager in `docling` which ensures allocated models are singletons and can be reused.
  - This alternative is my preference because of its simplicity and room for future extension, although I understand it may require architectural changes to the code base.
- Adding an `unload()` method to the converter, so the application can release resources on demand.
- Using a new Python process to wrap the execution per request.
  - This option consumes too much time and is considered inefficient.
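The first alternative could be sketched roughly as follows. This is a minimal illustration under my own assumptions, not docling's actual API: models are cached in a process-wide manager keyed by configuration, so pipelines with overlapping configurations share one instance, and `unload()` covers the second alternative as well.

```python
from threading import Lock
from typing import Any, Callable, Dict, Hashable


class ModelManager:
    """Hypothetical process-wide cache handing out one shared model per key."""

    def __init__(self) -> None:
        self._models: Dict[Hashable, Any] = {}
        self._lock = Lock()

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        # The first caller pays the load cost; later callers with the
        # same key reuse the cached instance instead of re-allocating.
        with self._lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]

    def unload(self, key: Hashable) -> None:
        # Drop our reference so the backing GPU memory can be reclaimed
        # (a real implementation would also free framework-level caches).
        with self._lock:
            self._models.pop(key, None)


manager = ModelManager()
a = manager.get("ocr-model", lambda: object())
b = manager.get("ocr-model", lambda: object())
assert a is b  # one allocation, shared by both "pipelines"
```

The key would have to capture everything that affects model weights (model name, device, precision), so that genuinely different models never collide.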