
Conversation

AlexsanderHamir (Collaborator) commented Nov 20, 2025

This PR is not meant to be merged; changes will be cherry-picked from here and merged into main incrementally.

Title

Reduce memory cost of importing the completion function

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement (see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🧹 Refactoring

Context

Our current import strategy pulls in large portions of the codebase—even when only a single function is needed. Many modules perform heavy work at import time or bring in sizable dependencies, so importing the completion function triggers unnecessary initialization and memory allocation.

While this PR reduces the overhead for the completion function, it doesn’t fully resolve the underlying issue. A broader cleanup of our import structure is required for a complete fix.

Changes

  • Lazy-loaded the heaviest libraries identified in the memory profile during completion import.

Memory Differences

Before

[Screenshots: memory profile before lazy loading, Nov 19, 2025]

After

[Screenshots: memory profile after lazy loading, Nov 19, 2025]

This change removes 67 MB of memory consumption at import time, reducing memory usage when importing the LiteLLM completion function from 200 MB to 140 MB.
Later commits bring this down to 20 MB, but something is still being triggered that causes memory to spike.
Lazy-load most functions and response types from utils.py to avoid loading
tiktoken and other heavy dependencies at import time. This significantly
reduces memory usage when importing completion from litellm.

Changes:
- Made utils functions (exception_type, get_litellm_params, ModelResponse, etc.)
  lazy-loaded via __getattr__
- Made ALL_LITELLM_RESPONSE_TYPES lazy-loaded
- Fixed circular imports by updating files to import directly from litellm.utils
  or litellm.types.utils instead of from litellm
- Kept client decorator as immediate import since it's used at function
  definition time

Only client is now imported immediately from utils.py; all other utils
functions and response types are loaded on-demand when accessed.
Lazy-load tiktoken and default_encoding from litellm_core_utils to avoid
loading these heavy dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made tiktoken imports lazy-loaded in utils.py, main.py, and token_counter.py
- Made default_encoding lazy-loaded in token_counter.py and utils.py
- Made get_modified_max_tokens lazy-loaded in utils.py (only used internally)
- Made encoding attribute lazy-loaded via __getattr__ in __init__.py
- Removed top-level tiktoken and Encoding imports that were loading at module level

tiktoken and default_encoding are now only loaded when token counting or
encoding functions are actually called, not when importing completion.
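The tiktoken deferral follows the same idea: the import cost is paid the first time a token-counting function runs, not when the package is imported. A minimal sketch; the stdlib module `colorsys` stands in for tiktoken so the example runs anywhere, and the function names are illustrative:

```python
# Defer a heavy dependency until first use; cache the imported module.
import functools
import importlib

@functools.lru_cache(maxsize=1)
def _get_tokenizer_module():
    # In LiteLLM this would be `import tiktoken`; `colorsys` is a stand-in.
    # The import happens on the first call, not at module import time.
    return importlib.import_module("colorsys")

def default_encoding():
    # Callers go through this accessor instead of a module-level global,
    # so importing the package never pulls the tokenizer in.
    return _get_tokenizer_module()
```

`functools.lru_cache` gives the "import once, reuse forever" behavior without any hand-rolled sentinel variable.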
Refactor repetitive lazy import and caching code into reusable helper
functions to improve code maintainability and readability.

Changes:
- Added _lazy_import_and_cache() generic helper for lazy importing with caching
- Added _lazy_import_from() convenience wrapper for common import pattern
- Replaced 4 repetitive code blocks with simple function calls
- Maintains same performance: imports cached after first access, zero
  overhead on subsequent calls

The helper functions eliminate code duplication while preserving the
performance benefits of cached lazy loading.
- Remove eager import of AsyncHTTPHandler and HTTPHandler from __init__.py
- Make module_level_aclient and module_level_client lazy-loaded via __getattr__
- HTTP handler clients are now instantiated on first access, not at import time
- Reduces memory footprint when importing completion from litellm
Lazy-load Cache, DualCache, RedisCache, and InMemoryCache from caching.caching
to avoid loading these dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made Cache, DualCache, RedisCache, and InMemoryCache lazy-loaded via __getattr__ in __init__.py
- Removed top-level caching class imports that were loading at module level
- Updated cache type annotation to use forward reference string to avoid runtime import
- Caching classes are now only loaded when actually accessed, not when importing completion

Performance:
- First access: 0.001-0.008ms (negligible latency)
- Cached access: 0.000ms (no latency penalty)
- Classes are cached in globals() after first access to avoid repeated import overhead

This follows the same pattern as HTTP handlers lazy loading and avoids latency
issues by caching imported classes after first access.

vercel bot commented Nov 20, 2025

The latest updates on your projects.

Project litellm: Deployment Error, Preview Error, updated Nov 24, 2025 7:04pm (UTC)

1. Grouped lazy imports into the same functions.
2. Avoided importing more than one library when only one name was accessed.
…e_index_from_tool_calls to reduce import-time memory cost
- Convert most types.utils imports to lazy loading via __getattr__
- Add _lazy_import_types_utils function for on-demand imports
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed for module-level initialization)
- Add TYPE_CHECKING imports for type annotations (CredentialItem, BudgetConfig, etc.)
- Significantly reduces import cascade and memory usage at import time
- Make provider_list and priority_reservation_settings lazy-loaded via __getattr__
- Lazy load types.proxy.management_endpoints.ui_sso imports (DefaultTeamSSOParams, LiteLLM_UpperboundKeyGenerateParams)
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed by other modules)
- Remove non-essential comments
- Significantly reduces import-time memory usage
- Make KeyManagementSystem fully lazy-loaded via __getattr__
- Make KeyManagementSettings lazy-loadable via __getattr__
- Keep KeyManagementSettings as direct import (needed for _key_management_settings initialization during import)
- Add TYPE_CHECKING imports for type annotations
- Significantly reduces import-time memory usage
- Move client import from line 1053 to right before main.py import (line 1328)
- This delays loading utils.py (which imports tiktoken) until after most other imports
- client cannot be fully lazy-loaded because main.py needs it at import time for @client decorator
- Reduces memory footprint during early import phase
AlexsanderHamir force-pushed the litellm_memory_import_issue branch from afc07ed to b03746b on November 22, 2025 20:15
- Remove direct import of BytezChatConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading bytez transformation module until BytezChatConfig is accessed
- main.py still works (imports directly), utils.py works (accesses via litellm.BytezChatConfig)
- Remove direct import of CustomLLM from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading custom_llm module until CustomLLM is accessed
- images/main.py still works (imports directly from source)
- Proxy examples still work (access via litellm.CustomLLM)
- Remove direct import of AmazonConverseConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading converse_transformation module until AmazonConverseConfig is accessed
- common_utils.py still works (accesses via litellm.AmazonConverseConfig())
- invoke_handler.py still works (imports directly from source)
…cale, Perplexity, WatsonX, GithubCopilot, and VLLM configs

- Group chat configs (HostedVLLMChatConfig, LlamafileChatConfig, LiteLLMProxyChatConfig, DeepSeekChatConfig, LMStudioChatConfig, NscaleConfig, PerplexityChatConfig, IBMWatsonXChatConfig, GithubCopilotConfig) in _lazy_import_small_provider_chat_configs
- Group transformation configs (VLLMConfig, IBMWatsonXAIConfig, LmStudioEmbeddingConfig, IBMWatsonXEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add GithubCopilotResponsesAPIConfig to _lazy_import_azure_responses_configs
- Add all configs to TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…OCI, Morph, LambdaAI, Hyperbolic, VercelAIGateway, OVHCloud, Lemonade, and Snowflake configs

- Group chat configs (NebiusConfig, WandbConfig, DashScopeChatConfig, MoonshotChatConfig, DockerModelRunnerChatConfig, V0ChatConfig, OCIChatConfig, MorphChatConfig, LambdaAIChatConfig, HyperbolicChatConfig, VercelAIGatewayConfig, OVHCloudChatConfig, LemonadeChatConfig) in _lazy_import_small_provider_chat_configs
- Group embedding configs (OVHCloudEmbeddingConfig, CometAPIEmbeddingConfig, SnowflakeEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add all configs to TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…m in utils.py

- Move BaseFilesConfig import to TYPE_CHECKING block
- Move AllowedModelRegion and KeyManagementSystem imports to TYPE_CHECKING block
- Update type annotations to use string annotations for lazy-loaded types
- Reduces import-time memory cost for these utility types
- Add _lazy_import_main_functions helper in _lazy_imports.py
- Dynamically imports requested attributes from main module on demand
- Enables lazy loading of completion, acompletion, embedding, and other main functions
- Remove from .main import * to enable lazy loading of main functions
- Add direct imports for functions needed during module initialization:
  - get_secret, get_secret_str, get_secret_bool (from secret_managers.main)
  - ModelResponse (from types.utils)
  - token_counter, print_verbose (from utils)
  - CustomStreamWrapper (from litellm_core_utils.streaming_handler)
- These are required for other modules that import from litellm at module level
- Add lazy loading handler in __getattr__ that uses _lazy_import_main_functions
- Enables lazy loading of completion, acompletion, embedding, and other main functions
- Functions are only loaded when accessed, reducing import-time memory cost
- Move anthropic_tokenizer.json loading from module import time to first use
- Create _get_claude_json_str() helper function that loads and caches the tokenizer JSON
- Update _return_huggingface_tokenizer() to use the lazy-loaded function
- Fix type annotation to use proper syntax instead of deprecated type comment
- This defers loading the tokenizer file until it's actually needed for older Anthropic models
- Optimize _lazy_import_main_functions to check if module already loaded
- Lazy load get_llm_provider in __init__.py to reduce import-time memory cost
- Fix circular import by lazy-loading get_llm_provider in pattern_match_deployments and realtime_api
- Add shared get_cached_llm_provider() helper for hot-path performance optimization
- Defer model_cost map loading until first access via __getattr__
- Make add_known_models() lazy - called when model_cost is first accessed
- Add _get_model_cost() helper for cached lazy loading
- Reduces import-time memory by avoiding cost map download/parsing at import
- Defer batches module import until first function access via __getattr__
- Add _lazy_import_batches_functions with fast path optimization
- Bulk cache all public batch functions on first access to avoid repeated __getattr__ calls
- Add fast path check to skip bulk caching if already done
…ort-time memory cost

- Move imports inside TYPE_CHECKING block for type-only imports
- Use string literals in type annotations to defer type evaluation
- Reduces import-time memory by deferring datadog types module load
…-time memory cost

- Remove direct imports from __init__.py
- Add TritonGenerateConfig and TritonInferConfig to _lazy_import_triton_configs handler
- Update __getattr__ to handle these configs via lazy loading
- Remove direct import from __init__.py
- Add GeminiModelInfo to __getattr__ for lazy loading
- Follows same pattern as XAIModelInfo and other model info classes
- Remove direct import from __init__.py
- Add _lazy_import_assistants_functions handler with bulk caching
- Add all 18 assistants functions to __getattr__ for lazy loading
- Follows same pattern as batches.main with performance optimizations
- Remove direct import from __init__.py
- Add OpenAIImageVariationConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
…ry cost

- Remove direct import from __init__.py
- Add DeepgramAudioTranscriptionConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Remove direct import from __init__.py
- Add TopazModelInfo to __getattr__ for lazy loading
- Follows same pattern as other model info classes
- Remove direct import from __init__.py
- Add TopazImageVariationConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Remove direct import from __init__.py
- Add OpenAIResponsesAPIConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Make DualCache import lazy in custom_logger.py using TYPE_CHECKING
- Use string annotation for DualCache type hint to avoid runtime import
- Breaks circular dependency: custom_logger -> caching -> gcs_cache -> gcs_bucket_base -> custom_batch_logger -> custom_logger
- Resolves ImportError when importing litellm
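The cycle-breaking technique above relies on two standard pieces: a `typing.TYPE_CHECKING` guard (the import only runs under a type checker, never at runtime) and a string annotation (a forward reference that is never resolved at runtime). A minimal sketch; `decimal.Decimal` stands in for DualCache and the class is illustrative:

```python
# Break a circular import: the type-only import is skipped at runtime,
# and the string annotation never needs the name to exist.
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only type checkers execute this; at runtime it is skipped, so the
    # import cycle (custom_logger -> caching -> ... -> custom_logger) is cut.
    from decimal import Decimal  # stands in for DualCache

class CustomLogger:
    def __init__(self, cache: Optional["Decimal"] = None):
        # "Decimal" is a forward reference (a plain string), so no runtime
        # import of the caching module is triggered.
        self.cache = cache
```

Static analysis still sees the real type, while the interpreter never touches the module that would close the cycle.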