Reuse / Unloading Models in GPU Memory #3061

@khwong-c

Description

Hi Docling Development Team,

Requested feature

I would like to request a way to unload models from GPU memory, or to reuse shared models if multiple pipelines use the same models.

I realized that creating pipelines with different configurations leads to duplicated allocation of the models those pipelines use. I confirmed this by evaluating docling-serve and inspecting the implementation of DoclingConverterManager in docling-jobkit. For example, when two jobs are submitted with the code-enrichment feature toggled on and off respectively, Docling allocates two identical OCR models on the GPU.

I am developing a web application that runs docling as a long-running service, and I want to minimize GPU memory usage since it is scarce. However, the behaviour above multiplies GPU memory usage: the GPU runs out of memory quickly once requests with different pipeline configurations reach the service, which is a very common workload.

Similar issues: #2954, #2788

Alternatives

  • Adding a model manager to docling that ensures allocated models are singletons and can be reused.
    • This is my preferred alternative because of its simplicity and room for future extension, though I understand it may require architectural changes to the code base.
  • Adding an unload() method to the converter, so the application can release the resources on demand.
  • Wrapping each request's execution in a new Python process.
    • This option adds too much overhead and is considered inefficient.
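To make the first two alternatives concrete, here is a minimal sketch of what a process-wide model manager could look like: models are cached and handed out keyed by their identifying configuration, and an unload() method lets the application reclaim GPU memory on demand. All names here (ModelManager, get_model, unload) are hypothetical illustrations, not part of docling's current API.

```python
import threading
from typing import Any, Callable, Hashable


class ModelManager:
    """Caches loaded models so identical configurations share one instance."""

    def __init__(self) -> None:
        self._models: dict[Hashable, Any] = {}
        self._lock = threading.Lock()

    def get_model(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached model for `key`, loading it only on first use."""
        with self._lock:
            model = self._models.get(key)
            if model is None:
                model = loader()  # e.g. load OCR weights onto the GPU
                self._models[key] = model
            return model

    def unload(self, key: Hashable) -> None:
        """Drop the cached model so its memory can be reclaimed."""
        with self._lock:
            self._models.pop(key, None)
        # For torch-backed models, one would also release the CUDA cache
        # here (torch.cuda.empty_cache()) after the last reference is gone.


# Two pipelines that differ only in unrelated options reuse the same model.
manager = ModelManager()
ocr_a = manager.get_model(("easyocr", "en"), lambda: object())
ocr_b = manager.get_model(("easyocr", "en"), lambda: object())
print(ocr_a is ocr_b)  # → True: one allocation serves both pipelines
```

In this design the cache key would be derived from the model's own options only (model name, language, device), so pipelines that differ in unrelated settings such as code enrichment still map to the same entry.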

Metadata

    Labels

    enhancement (New feature or request)
