
Conversation

AlexsanderHamir (Collaborator) commented Nov 20, 2025

This PR is not meant to be merged; changes will be cherry-picked from here and merged into main incrementally.

Title

Reduce memory cost of importing the completion function

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement (see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🧹 Refactoring

Context

Our current import strategy pulls in large portions of the codebase—even when only a single function is needed. Many modules perform heavy work at import time or bring in sizable dependencies, so importing the completion function triggers unnecessary initialization and memory allocation.

While this PR reduces the overhead for the completion function, it doesn’t fully resolve the underlying issue. A broader cleanup of our import structure is required for a complete fix.

Changes

  • Lazy-loaded the heaviest libraries identified in the memory profile during completion import.

Memory Differences

Before

[Screenshots: memory profile before lazy loading, Nov 19, 2025]

After

[Screenshots: memory profile after lazy loading, Nov 19, 2025]

This change removes 67 MB of memory consumption at import time, reducing memory usage when importing the LiteLLM completion function from 200 MB to 140 MB.
Later commits bring this down to 20 MB, but something is still being triggered that causes memory to spike.
Lazy-load most functions and response types from utils.py to avoid loading
tiktoken and other heavy dependencies at import time. This significantly
reduces memory usage when importing completion from litellm.

Changes:
- Made utils functions (exception_type, get_litellm_params, ModelResponse, etc.)
  lazy-loaded via __getattr__
- Made ALL_LITELLM_RESPONSE_TYPES lazy-loaded
- Fixed circular imports by updating files to import directly from litellm.utils
  or litellm.types.utils instead of from litellm
- Kept client decorator as immediate import since it's used at function
  definition time

Only client is now imported immediately from utils.py; all other utils
functions and response types are loaded on-demand when accessed.
Lazy-load tiktoken and default_encoding from litellm_core_utils to avoid
loading these heavy dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made tiktoken imports lazy-loaded in utils.py, main.py, and token_counter.py
- Made default_encoding lazy-loaded in token_counter.py and utils.py
- Made get_modified_max_tokens lazy-loaded in utils.py (only used internally)
- Made encoding attribute lazy-loaded via __getattr__ in __init__.py
- Removed top-level tiktoken and Encoding imports that were loading at module level

tiktoken and default_encoding are now only loaded when token counting or
encoding functions are actually called, not when importing completion.
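The tiktoken deferral follows the same idea: the import cost is paid the first time a token-counting function runs, not when the package is imported. A minimal sketch; the stdlib module `colorsys` stands in for tiktoken so the example runs anywhere, and the function names are illustrative:

```python
# Defer a heavy dependency until first use; cache the imported module.
import functools
import importlib

@functools.lru_cache(maxsize=1)
def _get_tokenizer_module():
    # In LiteLLM this would be `import tiktoken`; `colorsys` is a stand-in.
    # The import happens on the first call, not at module import time.
    return importlib.import_module("colorsys")

def default_encoding():
    # Callers go through this accessor instead of a module-level global,
    # so importing the package never pulls the tokenizer in.
    return _get_tokenizer_module()
```

`functools.lru_cache` gives the "import once, reuse forever" behavior without any hand-rolled sentinel variable.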
Refactor repetitive lazy import and caching code into reusable helper
functions to improve code maintainability and readability.

Changes:
- Added _lazy_import_and_cache() generic helper for lazy importing with caching
- Added _lazy_import_from() convenience wrapper for common import pattern
- Replaced 4 repetitive code blocks with simple function calls
- Maintains same performance: imports cached after first access, zero
  overhead on subsequent calls

The helper functions eliminate code duplication while preserving the
performance benefits of cached lazy loading.
- Remove eager import of AsyncHTTPHandler and HTTPHandler from __init__.py
- Make module_level_aclient and module_level_client lazy-loaded via __getattr__
- HTTP handler clients are now instantiated on first access, not at import time
- Reduces memory footprint when importing completion from litellm
Lazy-load Cache, DualCache, RedisCache, and InMemoryCache from caching.caching
to avoid loading these dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made Cache, DualCache, RedisCache, and InMemoryCache lazy-loaded via __getattr__ in __init__.py
- Removed top-level caching class imports that were loading at module level
- Updated cache type annotation to use forward reference string to avoid runtime import
- Caching classes are now only loaded when actually accessed, not when importing completion

Performance:
- First access: 0.001-0.008ms (negligible latency)
- Cached access: 0.000ms (no latency penalty)
- Classes are cached in globals() after first access to avoid repeated import overhead

This follows the same pattern as HTTP handlers lazy loading and avoids latency
issues by caching imported classes after first access.

vercel bot commented Nov 20, 2025

The latest updates on your projects.

Project litellm: Deployment Error, Preview Error, updated Nov 24, 2025 7:04pm (UTC)

1. Grouped lazy imports into the same functions.
2. Avoided importing more than one library when only one name was accessed.
…e_index_from_tool_calls to reduce import-time memory cost
- Convert most types.utils imports to lazy loading via __getattr__
- Add _lazy_import_types_utils function for on-demand imports
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed for module-level initialization)
- Add TYPE_CHECKING imports for type annotations (CredentialItem, BudgetConfig, etc.)
- Significantly reduces import cascade and memory usage at import time
- Make provider_list and priority_reservation_settings lazy-loaded via __getattr__
- Lazy load types.proxy.management_endpoints.ui_sso imports (DefaultTeamSSOParams, LiteLLM_UpperboundKeyGenerateParams)
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed by other modules)
- Remove non-essential comments
- Significantly reduces import-time memory usage
- Make KeyManagementSystem fully lazy-loaded via __getattr__
- Make KeyManagementSettings lazy-loadable via __getattr__
- Keep KeyManagementSettings as direct import (needed for _key_management_settings initialization during import)
- Add TYPE_CHECKING imports for type annotations
- Significantly reduces import-time memory usage
- Move client import from line 1053 to right before main.py import (line 1328)
- This delays loading utils.py (which imports tiktoken) until after most other imports
- client cannot be fully lazy-loaded because main.py needs it at import time for @client decorator
- Reduces memory footprint during early import phase
AlexsanderHamir force-pushed the litellm_memory_import_issue branch from afc07ed to b03746b on November 22, 2025 20:15
- Remove direct import of BytezChatConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading bytez transformation module until BytezChatConfig is accessed
- main.py still works (imports directly), utils.py works (accesses via litellm.BytezChatConfig)
- Remove direct import of CustomLLM from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading custom_llm module until CustomLLM is accessed
- images/main.py still works (imports directly from source)
- Proxy examples still work (access via litellm.CustomLLM)
- Remove direct import of AmazonConverseConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading converse_transformation module until AmazonConverseConfig is accessed
- common_utils.py still works (accesses via litellm.AmazonConverseConfig())
- invoke_handler.py still works (imports directly from source)
…cale, Perplexity, WatsonX, GithubCopilot, and VLLM configs

- Group chat configs (HostedVLLMChatConfig, LlamafileChatConfig, LiteLLMProxyChatConfig, DeepSeekChatConfig, LMStudioChatConfig, NscaleConfig, PerplexityChatConfig, IBMWatsonXChatConfig, GithubCopilotConfig) in _lazy_import_small_provider_chat_configs
- Group transformation configs (VLLMConfig, IBMWatsonXAIConfig, LmStudioEmbeddingConfig, IBMWatsonXEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add GithubCopilotResponsesAPIConfig to _lazy_import_azure_responses_configs
- Add all configs to TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…OCI, Morph, LambdaAI, Hyperbolic, VercelAIGateway, OVHCloud, Lemonade, and Snowflake configs

- Group chat configs (NebiusConfig, WandbConfig, DashScopeChatConfig, MoonshotChatConfig, DockerModelRunnerChatConfig, V0ChatConfig, OCIChatConfig, MorphChatConfig, LambdaAIChatConfig, HyperbolicChatConfig, VercelAIGatewayConfig, OVHCloudChatConfig, LemonadeChatConfig) in _lazy_import_small_provider_chat_configs
- Group embedding configs (OVHCloudEmbeddingConfig, CometAPIEmbeddingConfig, SnowflakeEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add all configs to TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…m in utils.py

- Move BaseFilesConfig import to TYPE_CHECKING block
- Move AllowedModelRegion and KeyManagementSystem imports to TYPE_CHECKING block
- Update type annotations to use string annotations for lazy-loaded types
- Reduces import-time memory cost for these utility types
- Add _lazy_import_main_functions helper in _lazy_imports.py
- Dynamically imports requested attributes from main module on demand
- Enables lazy loading of completion, acompletion, embedding, and other main functions
- Remove from .main import * to enable lazy loading of main functions
- Add direct imports for functions needed during module initialization:
  - get_secret, get_secret_str, get_secret_bool (from secret_managers.main)
  - ModelResponse (from types.utils)
  - token_counter, print_verbose (from utils)
  - CustomStreamWrapper (from litellm_core_utils.streaming_handler)
- These are required for other modules that import from litellm at module level
- Add lazy loading handler in __getattr__ that uses _lazy_import_main_functions
- Enables lazy loading of completion, acompletion, embedding, and other main functions
- Functions are only loaded when accessed, reducing import-time memory cost
- Move anthropic_tokenizer.json loading from module import time to first use
- Create _get_claude_json_str() helper function that loads and caches the tokenizer JSON
- Update _return_huggingface_tokenizer() to use the lazy-loaded function
- Fix type annotation to use proper syntax instead of deprecated type comment
- This defers loading the tokenizer file until it's actually needed for older Anthropic models
- Optimize _lazy_import_main_functions to check if module already loaded
- Lazy load get_llm_provider in __init__.py to reduce import-time memory cost
- Fix circular import by lazy-loading get_llm_provider in pattern_match_deployments and realtime_api
- Add shared get_cached_llm_provider() helper for hot-path performance optimization
- Defer model_cost map loading until first access via __getattr__
- Make add_known_models() lazy - called when model_cost is first accessed
- Add _get_model_cost() helper for cached lazy loading
- Reduces import-time memory by avoiding cost map download/parsing at import
- Defer batches module import until first function access via __getattr__
- Add _lazy_import_batches_functions with fast path optimization
- Bulk cache all public batch functions on first access to avoid repeated __getattr__ calls
- Add fast path check to skip bulk caching if already done
…ort-time memory cost

- Move imports inside TYPE_CHECKING block for type-only imports
- Use string literals in type annotations to defer type evaluation
- Reduces import-time memory by deferring datadog types module load
…-time memory cost

- Remove direct imports from __init__.py
- Add TritonGenerateConfig and TritonInferConfig to _lazy_import_triton_configs handler
- Update __getattr__ to handle these configs via lazy loading
- Remove direct import from __init__.py
- Add GeminiModelInfo to __getattr__ for lazy loading
- Follows same pattern as XAIModelInfo and other model info classes
- Remove direct import from __init__.py
- Add _lazy_import_assistants_functions handler with bulk caching
- Add all 18 assistants functions to __getattr__ for lazy loading
- Follows same pattern as batches.main with performance optimizations
- Remove direct import from __init__.py
- Add OpenAIImageVariationConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
…ry cost

- Remove direct import from __init__.py
- Add DeepgramAudioTranscriptionConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Remove direct import from __init__.py
- Add TopazModelInfo to __getattr__ for lazy loading
- Follows same pattern as other model info classes
- Remove direct import from __init__.py
- Add TopazImageVariationConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Remove direct import from __init__.py
- Add OpenAIResponsesAPIConfig to __getattr__ for lazy loading
- Follows same pattern as other config classes
- Make DualCache import lazy in custom_logger.py using TYPE_CHECKING
- Use string annotation for DualCache type hint to avoid runtime import
- Breaks circular dependency: custom_logger -> caching -> gcs_cache -> gcs_bucket_base -> custom_batch_logger -> custom_logger
- Resolves ImportError when importing litellm
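The cycle-breaking technique above relies on two standard pieces: a `typing.TYPE_CHECKING` guard (the import only runs under a type checker, never at runtime) and a string annotation (a forward reference that is never resolved at runtime). A minimal sketch; `decimal.Decimal` stands in for DualCache and the class is illustrative:

```python
# Break a circular import: the type-only import is skipped at runtime,
# and the string annotation never needs the name to exist.
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only type checkers execute this; at runtime it is skipped, so the
    # import cycle (custom_logger -> caching -> ... -> custom_logger) is cut.
    from decimal import Decimal  # stands in for DualCache

class CustomLogger:
    def __init__(self, cache: Optional["Decimal"] = None):
        # "Decimal" is a forward reference (a plain string), so no runtime
        # import of the caching module is triggered.
        self.cache = cache
```

Static analysis still sees the real type, while the interpreter never touches the module that would close the cycle.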