Reduce memory cost of importing the completion function #16860
Open: AlexsanderHamir wants to merge 136 commits into main from litellm_memory_import_issue
+2,335 −441
Conversation
This change removes 67 MB of memory consumption at import time.
This reduces memory usage when importing the LiteLLM completion function from 200 MB to 140 MB.
We are now down to 20 MB, but something is still being triggered that causes memory to spike.
Lazy-load most functions and response types from utils.py to avoid loading tiktoken and other heavy dependencies at import time. This significantly reduces memory usage when importing completion from litellm.

Changes:
- Made utils functions (exception_type, get_litellm_params, ModelResponse, etc.) lazy-loaded via __getattr__
- Made ALL_LITELLM_RESPONSE_TYPES lazy-loaded
- Fixed circular imports by updating files to import directly from litellm.utils or litellm.types.utils instead of from litellm
- Kept the client decorator as an immediate import since it is used at function definition time

Only client is now imported immediately from utils.py; all other utils functions and response types are loaded on demand when accessed.
Lazy-load tiktoken and default_encoding from litellm_core_utils to avoid loading these heavy dependencies at import time. This further reduces memory usage when importing completion from litellm.

Changes:
- Made tiktoken imports lazy-loaded in utils.py, main.py, and token_counter.py
- Made default_encoding lazy-loaded in token_counter.py and utils.py
- Made get_modified_max_tokens lazy-loaded in utils.py (only used internally)
- Made the encoding attribute lazy-loaded via __getattr__ in __init__.py
- Removed top-level tiktoken and Encoding imports that were loading at module level

tiktoken and default_encoding are now only loaded when token counting or encoding functions are actually called, not when importing completion.
Refactor repetitive lazy-import and caching code into reusable helper functions to improve maintainability and readability.

Changes:
- Added a generic _lazy_import_and_cache() helper for lazy importing with caching
- Added a _lazy_import_from() convenience wrapper for the common import pattern
- Replaced 4 repetitive code blocks with simple function calls
- Maintains the same performance: imports are cached after first access, with zero overhead on subsequent calls

The helper functions eliminate code duplication while preserving the performance benefits of cached lazy loading.
- Remove eager import of AsyncHTTPHandler and HTTPHandler from __init__.py
- Make module_level_aclient and module_level_client lazy-loaded via __getattr__
- HTTP handler clients are now instantiated on first access, not at import time
- Reduces memory footprint when importing completion from litellm
Lazy-load Cache, DualCache, RedisCache, and InMemoryCache from caching.caching to avoid loading these dependencies at import time. This further reduces memory usage when importing completion from litellm.

Changes:
- Made Cache, DualCache, RedisCache, and InMemoryCache lazy-loaded via __getattr__ in __init__.py
- Removed top-level caching class imports that were loading at module level
- Updated the cache type annotation to use a forward-reference string to avoid a runtime import
- Caching classes are now only loaded when actually accessed, not when importing completion

Performance:
- First access: 0.001-0.008 ms (negligible latency)
- Cached access: 0.000 ms (no latency penalty)
- Classes are cached in globals() after first access to avoid repeated import overhead

This follows the same pattern as the HTTP handler lazy loading and avoids latency issues by caching imported classes after first access.
1. Grouped lazy imports into the same functions.
2. Avoided importing more than one library when only one of its names was requested.
…e_index_from_tool_calls to reduce import-time memory cost
- Convert most types.utils imports to lazy loading via __getattr__
- Add _lazy_import_types_utils function for on-demand imports
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed for module-level initialization)
- Add TYPE_CHECKING imports for type annotations (CredentialItem, BudgetConfig, etc.)
- Significantly reduces the import cascade and memory usage at import time

- Make provider_list and priority_reservation_settings lazy-loaded via __getattr__
- Lazy-load types.proxy.management_endpoints.ui_sso imports (DefaultTeamSSOParams, LiteLLM_UpperboundKeyGenerateParams)
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed by other modules)
- Remove non-essential comments
- Significantly reduces import-time memory usage

- Make KeyManagementSystem fully lazy-loaded via __getattr__
- Make KeyManagementSettings lazy-loadable via __getattr__
- Keep KeyManagementSettings as a direct import (needed for _key_management_settings initialization during import)
- Add TYPE_CHECKING imports for type annotations
- Significantly reduces import-time memory usage

- Move the client import from line 1053 to right before the main.py import (line 1328)
- This delays loading utils.py (which imports tiktoken) until after most other imports
- client cannot be fully lazy-loaded because main.py needs it at import time for the @client decorator
- Reduces memory footprint during the early import phase
Force-pushed from afc07ed to b03746b
- Remove direct import of BytezChatConfig from early in __init__.py
- Add lazy loading via the __getattr__ pattern
- Delays loading the bytez transformation module until BytezChatConfig is accessed
- main.py still works (imports directly); utils.py still works (accesses via litellm.BytezChatConfig)

- Remove direct import of CustomLLM from early in __init__.py
- Add lazy loading via the __getattr__ pattern
- Delays loading the custom_llm module until CustomLLM is accessed
- images/main.py still works (imports directly from source)
- Proxy examples still work (access via litellm.CustomLLM)

- Remove direct import of AmazonConverseConfig from early in __init__.py
- Add lazy loading via the __getattr__ pattern
- Delays loading the converse_transformation module until AmazonConverseConfig is accessed
- common_utils.py still works (accesses via litellm.AmazonConverseConfig())
- invoke_handler.py still works (imports directly from source)
…cale, Perplexity, WatsonX, GithubCopilot, and VLLM configs
- Group chat configs (HostedVLLMChatConfig, LlamafileChatConfig, LiteLLMProxyChatConfig, DeepSeekChatConfig, LMStudioChatConfig, NscaleConfig, PerplexityChatConfig, IBMWatsonXChatConfig, GithubCopilotConfig) in _lazy_import_small_provider_chat_configs
- Group transformation configs (VLLMConfig, IBMWatsonXAIConfig, LmStudioEmbeddingConfig, IBMWatsonXEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add GithubCopilotResponsesAPIConfig to _lazy_import_azure_responses_configs
- Add all configs to the TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…OCI, Morph, LambdaAI, Hyperbolic, VercelAIGateway, OVHCloud, Lemonade, and Snowflake configs
- Group chat configs (NebiusConfig, WandbConfig, DashScopeChatConfig, MoonshotChatConfig, DockerModelRunnerChatConfig, V0ChatConfig, OCIChatConfig, MorphChatConfig, LambdaAIChatConfig, HyperbolicChatConfig, VercelAIGatewayConfig, OVHCloudChatConfig, LemonadeChatConfig) in _lazy_import_small_provider_chat_configs
- Group embedding configs (OVHCloudEmbeddingConfig, CometAPIEmbeddingConfig, SnowflakeEmbeddingConfig) in _lazy_import_misc_transformation_configs
- Add all configs to the TYPE_CHECKING block for type annotations
- Remove direct imports from __init__.py
- Preserves lazy loading to reduce import-time memory cost
…m in utils.py
- Move the BaseFilesConfig import to the TYPE_CHECKING block
- Move the AllowedModelRegion and KeyManagementSystem imports to the TYPE_CHECKING block
- Update type annotations to use string annotations for lazy-loaded types
- Reduces import-time memory cost for these utility types

- Add _lazy_import_main_functions helper in _lazy_imports.py
- Dynamically imports requested attributes from the main module on demand
- Enables lazy loading of completion, acompletion, embedding, and other main functions

- Remove from .main import * to enable lazy loading of main functions
- Add direct imports for functions needed during module initialization:
  - get_secret, get_secret_str, get_secret_bool (from secret_managers.main)
  - ModelResponse (from types.utils)
  - token_counter, print_verbose (from utils)
  - CustomStreamWrapper (from litellm_core_utils.streaming_handler)
- These are required for other modules that import from litellm at module level

- Add a lazy-loading handler in __getattr__ that uses _lazy_import_main_functions
- Enables lazy loading of completion, acompletion, embedding, and other main functions
- Functions are only loaded when accessed, reducing import-time memory cost

- Move anthropic_tokenizer.json loading from module import time to first use
- Create _get_claude_json_str() helper function that loads and caches the tokenizer JSON
- Update _return_huggingface_tokenizer() to use the lazy-loaded function
- Fix type annotation to use proper syntax instead of a deprecated type comment
- This defers loading the tokenizer file until it is actually needed for older Anthropic models
- Optimize _lazy_import_main_functions to check whether the module is already loaded
- Lazy-load get_llm_provider in __init__.py to reduce import-time memory cost
- Fix circular import by lazy-loading get_llm_provider in pattern_match_deployments and realtime_api
- Add a shared get_cached_llm_provider() helper for hot-path performance optimization

- Defer model_cost map loading until first access via __getattr__
- Make add_known_models() lazy - called when model_cost is first accessed
- Add a _get_model_cost() helper for cached lazy loading
- Reduces import-time memory by avoiding the cost-map download/parsing at import
- Defer batches module import until first function access via __getattr__
- Add _lazy_import_batches_functions with a fast-path optimization
- Bulk-cache all public batch functions on first access to avoid repeated __getattr__ calls
- Add a fast-path check to skip bulk caching if already done
…ort-time memory cost
- Move imports inside the TYPE_CHECKING block for type-only imports
- Use string literals in type annotations to defer type evaluation
- Reduces import-time memory by deferring the datadog types module load

…-time memory cost
- Remove direct imports from __init__.py
- Add TritonGenerateConfig and TritonInferConfig to the _lazy_import_triton_configs handler
- Update __getattr__ to handle these configs via lazy loading

- Remove direct import from __init__.py
- Add GeminiModelInfo to __getattr__ for lazy loading
- Follows the same pattern as XAIModelInfo and other model info classes

- Remove direct import from __init__.py
- Add _lazy_import_assistants_functions handler with bulk caching
- Add all 18 assistants functions to __getattr__ for lazy loading
- Follows the same pattern as batches.main, with performance optimizations

- Remove direct import from __init__.py
- Add OpenAIImageVariationConfig to __getattr__ for lazy loading
- Follows the same pattern as other config classes

…ry cost
- Remove direct import from __init__.py
- Add DeepgramAudioTranscriptionConfig to __getattr__ for lazy loading
- Follows the same pattern as other config classes

- Remove direct import from __init__.py
- Add TopazModelInfo to __getattr__ for lazy loading
- Follows the same pattern as other model info classes

- Remove direct import from __init__.py
- Add TopazImageVariationConfig to __getattr__ for lazy loading
- Follows the same pattern as other config classes

- Remove direct import from __init__.py
- Add OpenAIResponsesAPIConfig to __getattr__ for lazy loading
- Follows the same pattern as other config classes

- Make the DualCache import lazy in custom_logger.py using TYPE_CHECKING
- Use a string annotation for the DualCache type hint to avoid a runtime import
- Breaks circular dependency: custom_logger -> caching -> gcs_cache -> gcs_bucket_base -> custom_batch_logger -> custom_logger
- Resolves an ImportError when importing litellm
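The TYPE_CHECKING trick used to break the cycle looks roughly like this sketch (the method name set_cache is hypothetical; only the DualCache import path and the CustomLogger class come from the commit):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only, never executed at runtime, so the
    # custom_logger -> caching -> ... -> custom_logger cycle is not triggered.
    from litellm.caching.caching import DualCache


class CustomLogger:
    def set_cache(self, cache: "DualCache") -> None:  # hypothetical method
        # The string annotation means DualCache need not be importable
        # when this module loads.
        self._cache = cache
```

Type checkers still see the real DualCache type, while the runtime import graph no longer contains the edge that caused the ImportError.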
Title
Reduce memory cost of importing the completion function
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/litellm/ directory - adding at least 1 test is a hard requirement (see details); run make test-unit

Type
🧹 Refactoring
Context
Our current import strategy pulls in large portions of the codebase—even when only a single function is needed. Many modules perform heavy work at import time or bring in sizable dependencies, so importing the completion function triggers unnecessary initialization and memory allocation.
While this PR reduces the overhead for the completion function, it doesn’t fully resolve the underlying issue. A broader cleanup of our import structure is required for a complete fix.
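One quick way to sanity-check import cost locally is Python's tracemalloc module; it only tracks Python-level allocations (not all of a C extension's RSS), so treat the numbers as a lower bound. The stdlib import below is a stand-in so the snippet runs anywhere; substituting litellm's completion import is the intended use:

```python
import tracemalloc

tracemalloc.start()
import json  # stand-in; in practice: from litellm import completion
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"traced allocations: ~{current / 1e6:.2f} MB (peak ~{peak / 1e6:.2f} MB)")
```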
Changes
Memory Differences
Before
After