Description
Hi Docling Development Team,
Requested feature
I would like to request a way to unload models from GPU memory, or to reuse shared models if multiple pipelines use the same models.
I realized that creating pipelines with different configurations leads to duplicated allocations of the models those pipelines share. I confirmed this by evaluating docling-serve and inspecting the implementation of DoclingConverterManager in docling-jobkit.
For example, when two jobs are submitted with the Code Enhancement feature on and off respectively, Docling allocates two identical OCR models on the GPU.
I am developing a web application that runs docling as a long-running service, and I want to keep GPU memory usage low because it is scarce. The behaviour above multiplies GPU memory usage, so the GPU runs out of memory quickly when requests with different pipeline configurations are sent to the service, which is a very common workload.
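The duplication can be illustrated with a plain-Python stand-in (the class names below are illustrative only, not docling's actual classes): each pipeline instance loads its own copy of a model even when another pipeline already holds an identical one.

```python
class FakeOcrModel:
    """Stand-in for a GPU-resident OCR model."""
    instances = 0

    def __init__(self) -> None:
        FakeOcrModel.instances += 1  # each construction = one GPU allocation


class Pipeline:
    """Stand-in for a converter pipeline built from a configuration."""

    def __init__(self, code_enrichment: bool) -> None:
        self.code_enrichment = code_enrichment
        self.ocr = FakeOcrModel()  # every pipeline loads its own model


# Two configurations that differ only in an unrelated flag
# still allocate two identical OCR models.
Pipeline(code_enrichment=True)
Pipeline(code_enrichment=False)
print(FakeOcrModel.instances)  # → 2
```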
Alternatives
- Adding a model manager in `docling` which ensures allocated models are singletons and can be reused.
  - This alternative is my preference because of its simplicity and room for future extension, although I understand it may require architectural changes to the code base.
- Adding an `unload()` method to the converter, so the application can release resources on demand.
- Using a new Python process to wrap the execution per request.
  - This option consumes too much time and is considered inefficient.
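The first alternative could be sketched roughly as follows. This is a minimal illustration under my own assumptions, not docling's actual API: models are cached in a process-wide manager keyed by configuration, so pipelines with overlapping configurations share one instance, and `unload()` covers the second alternative as well.

```python
from threading import Lock
from typing import Any, Callable, Dict, Hashable


class ModelManager:
    """Hypothetical process-wide cache handing out one shared model per key."""

    def __init__(self) -> None:
        self._models: Dict[Hashable, Any] = {}
        self._lock = Lock()

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        # The first caller pays the load cost; later callers with the
        # same key reuse the cached instance instead of re-allocating.
        with self._lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]

    def unload(self, key: Hashable) -> None:
        # Drop our reference so the backing GPU memory can be reclaimed
        # (a real implementation would also free framework-level caches).
        with self._lock:
            self._models.pop(key, None)


manager = ModelManager()
a = manager.get("ocr-model", lambda: object())
b = manager.get("ocr-model", lambda: object())
assert a is b  # one allocation, shared by both "pipelines"
```

The key would have to capture everything that affects model weights (model name, device, precision), so that genuinely different models never collide.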