| layout | default |
|---|---|
| title | LiteLLM Tutorial - Chapter 3: Completion API |
| nav_order | 3 |
| has_children | false |
| parent | LiteLLM Tutorial |
Welcome to Chapter 3: Completion API. In this part of LiteLLM Tutorial: Unified LLM Gateway and Routing Layer, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Master text and chat completions with advanced parameters, formatting, and multi-turn conversations.
The completion API is the core of LiteLLM. This chapter covers how to craft effective prompts, use advanced parameters, and handle different types of completions across all providers.
The standard chat completion format:
```python
import litellm

response = litellm.completion(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

print(response.choices[0].message.content)
```

Understanding conversation roles:
- system: Sets the AI's behavior and context
- user: Human messages
- assistant: AI responses (can include previous responses for context)
```python
messages = [
    {
        "role": "system",
        "content": "You are an expert Python programmer. Provide clear, well-commented code examples."
    },
    {
        "role": "user",
        "content": "Write a function to calculate fibonacci numbers recursively."
    }
]

response = litellm.completion(model="gpt-4", messages=messages)
```
Control model behavior with parameters:

```python
response = litellm.completion(
    model="gpt-4",
    messages=messages,
    max_tokens=500,          # Maximum response length
    temperature=0.7,         # Randomness (0.0-2.0)
    top_p=0.9,               # Nucleus sampling
    frequency_penalty=0.0,   # Reduce repetition (-2.0 to 2.0)
    presence_penalty=0.0,    # Encourage new topics (-2.0 to 2.0)
    stop=["\n\n", "###"],    # Stop sequences
    n=1,                     # Number of completions to generate
    logit_bias={},           # Bias token probabilities
)
```

| Parameter | Range | Description | Use Case |
|---|---|---|---|
| `temperature` | 0.0-2.0 | Randomness in output | Creative writing (high), code (low) |
| `top_p` | 0.0-1.0 | Nucleus sampling | Alternative to temperature |
| `max_tokens` | 1+ | Maximum response length | Control costs and length |
| `frequency_penalty` | -2.0 to 2.0 | Reduce token repetition | Avoid loops in text |
| `presence_penalty` | -2.0 to 2.0 | Encourage new topics | Diverse responses |
| `stop` | strings | Stop generation at these strings | Structured outputs |
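One way to apply the table is to keep task-oriented presets and splat them into the call. The preset names and values below are illustrative choices, not anything LiteLLM prescribes:

```python
# Hypothetical task presets -- tune the values for your own workloads.
PRESETS = {
    "code":     {"temperature": 0.2, "top_p": 1.0, "max_tokens": 800},
    "factual":  {"temperature": 0.1, "top_p": 1.0, "max_tokens": 400},
    "creative": {"temperature": 0.9, "top_p": 0.95, "max_tokens": 600},
}

response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    **PRESETS["creative"],
)
```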
Maintain context across multiple exchanges:
```python
conversation = [
    {"role": "system", "content": "You are a helpful coding tutor."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
]

# First response
response1 = litellm.completion(model="gpt-4", messages=conversation)
print("Assistant:", response1.choices[0].message.content)

# Continue conversation
conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content
})
conversation.append({
    "role": "user",
    "content": "Can you show me a more efficient way using slicing?"
})

response2 = litellm.completion(model="gpt-4", messages=conversation)
print("Assistant:", response2.choices[0].message.content)
```
Create a helper class for managing conversations:

```python
class ConversationManager:
    def __init__(self, model="gpt-4", system_message=None):
        self.model = model
        self.messages = []
        if system_message:
            self.messages.append({"role": "system", "content": system_message})

    def add_message(self, role, content):
        """Add a message to the conversation."""
        self.messages.append({"role": role, "content": content})

    def send_message(self, user_message, **kwargs):
        """Send a user message and get a response."""
        self.add_message("user", user_message)
        response = litellm.completion(
            model=self.model,
            messages=self.messages,
            **kwargs
        )
        assistant_message = response.choices[0].message.content
        self.add_message("assistant", assistant_message)
        return assistant_message, response

    def get_history(self):
        """Get a copy of the conversation history."""
        return self.messages.copy()

    def clear_history(self):
        """Clear the history, keeping any system message."""
        system_msg = None
        if self.messages and self.messages[0]["role"] == "system":
            system_msg = self.messages[0]
        self.messages = [system_msg] if system_msg else []

# Usage
chat = ConversationManager(
    model="gpt-4",
    system_message="You are a knowledgeable history teacher."
)

response, _ = chat.send_message("Tell me about the Roman Empire")
print("Response:", response[:200] + "...")

response, _ = chat.send_message("What were their biggest achievements?")
print("Follow-up:", response[:200] + "...")
```
Force specific output formats:

```python
import json

# JSON output
json_prompt = """
Extract the following information from the text and return it as JSON:
- Name
- Age
- Occupation

Text: John Smith is a 35-year-old software engineer from San Francisco.

Return only valid JSON.
"""

response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": json_prompt}],
    temperature=0.1  # Lower temperature for consistent formatting
)

try:
    data = json.loads(response.choices[0].message.content)
    print("Extracted:", data)
except json.JSONDecodeError:
    print("Failed to parse JSON response")
```
Specialized prompts for code:

```python
def generate_code(requirement, language="python"):
    """Generate code based on requirements."""
    prompt = f"""
Write a {language} function that {requirement}.

Requirements:
- Include docstring
- Add type hints
- Handle edge cases
- Include example usage

Return only the code, no explanation or markdown formatting.
"""
    response = litellm.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for code
        stop=["```"]      # Truncate if the model starts a markdown fence anyway
    )
    return response.choices[0].message.content

# Usage
code = generate_code("calculates the factorial of a number with memoization")
print(code)
```
Provide examples for better results:

```python
def analyze_sentiment(text):
    """Analyze sentiment using few-shot examples."""
    examples = """
Here are examples of sentiment analysis:

Text: "I love this product, it's amazing!"
Sentiment: positive

Text: "This is terrible, I hate it."
Sentiment: negative

Text: "It's okay, nothing special."
Sentiment: neutral

Now analyze this text:
"""
    prompt = examples + text + "\n\nSentiment:"
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

# Usage
sentiment = analyze_sentiment("This movie was fantastic!")
print(f"Sentiment: {sentiment}")
```
Encourage step-by-step reasoning:

```python
def solve_problem(problem):
    """Solve a problem with chain-of-thought prompting."""
    prompt = f"""
Solve this problem step by step. Show your work clearly.

Problem: {problem}

Think through this systematically:
1. Understand the problem
2. Identify the key information
3. Consider different approaches
4. Choose the best method
5. Execute the solution
6. Verify the answer

Final Answer: [Your final answer here]
"""
    response = litellm.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=1000
    )
    return response.choices[0].message.content

# Usage
solution = solve_problem("If a train travels at 60 mph for 2 hours, how far does it go?")
print(solution)
```
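Since the prompt fixes a "Final Answer:" marker, the conclusion can be pulled out of the full reasoning trace with a simple pattern match. A sketch, assuming the model followed the template:

```python
import re

def extract_final_answer(solution_text: str) -> str | None:
    """Return the text after the 'Final Answer:' marker, if present."""
    match = re.search(r"Final Answer:\s*(.+)", solution_text)
    return match.group(1).strip() if match else None

print(extract_final_answer(solution))  # e.g. "120 miles"
```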
Generate multiple responses for comparison:

```python
def generate_multiple_responses(prompt, n=3, model="gpt-4"):
    """Generate multiple completions for the same prompt."""
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,              # Number of completions (not supported by every provider)
        temperature=0.8,  # Higher temperature for variety
        max_tokens=200
    )
    return [choice.message.content for choice in response.choices]

# Usage -- note that n > 1 is an OpenAI-style parameter; some providers
# (e.g. Anthropic) do not support it
responses = generate_multiple_responses(
    "Write a creative slogan for a coffee shop",
    n=5,
    model="gpt-4"
)

for i, response in enumerate(responses, 1):
    print(f"{i}. {response}")
```
Leverage unique provider capabilities:

```python
# Anthropic's extended thinking, passed through LiteLLM's `thinking` parameter
# (supported on thinking-capable Claude models, e.g. Claude 3.7 Sonnet and later)
response = litellm.completion(
    model="anthropic/claude-3-7-sonnet-20250219",
    messages=[{"role": "user", "content": "Solve this complex math problem..."}],
    max_tokens=4000,
    thinking={"type": "enabled", "budget_tokens": 2000}
)

# OpenAI's function calling (the legacy `functions` format; newer code
# should prefer the `tools` / `tool_choice` parameters shown below)
function_response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    functions=[
        {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    ],
    function_call="auto"
)

# Check if a function was called
func_call = getattr(function_response.choices[0].message, "function_call", None)
if func_call:
    print(f"Function: {func_call.name}")
    print(f"Arguments: {func_call.arguments}")
```
Robust completion handling:

```python
import time

def safe_completion(model, messages, max_retries=3, **kwargs):
    """Completion with comprehensive error handling."""
    # Default parameters
    defaults = {
        "max_tokens": 1000,
        "temperature": 0.7,
        "timeout": 30
    }
    defaults.update(kwargs)

    # Validate inputs
    if not messages or not isinstance(messages, list):
        raise ValueError("Messages must be a non-empty list")

    for msg in messages:
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError("Each message must have 'role' and 'content' fields")

    try:
        response = litellm.completion(model=model, messages=messages, **defaults)

        # Validate response
        if not response.choices:
            raise ValueError("No choices returned in response")

        content = response.choices[0].message.content
        if not content or not content.strip():
            raise ValueError("Empty response content")

        return response

    except litellm.RateLimitError:
        if max_retries <= 0:
            raise
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)  # Wait 1 minute
        return safe_completion(model, messages, max_retries=max_retries - 1, **kwargs)
    except litellm.AuthenticationError:
        raise ValueError(f"Invalid API key for {model}")
    except litellm.APIError as e:
        print(f"API error: {e}")
        # Could implement a fallback to a different model here (see below)
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

# Usage
try:
    response = safe_completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print("Response:", response.choices[0].message.content)
except Exception as e:
    print(f"Error: {e}")
```
Estimate costs before making calls:

```python
def estimate_completion_cost(model, messages, max_tokens=1000):
    """Estimate cost for a completion request."""
    # Rough token estimation (~4 characters per token)
    total_chars = sum(len(msg["content"]) for msg in messages)
    estimated_input_tokens = total_chars // 4
    estimated_output_tokens = max_tokens

    # Cost per 1K tokens in USD (approximate; prices change over time)
    costs = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "claude-3-opus-20240229": {"input": 0.015, "output": 0.075},
        "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
    }

    if model not in costs:
        return None  # Unknown cost

    model_costs = costs[model]
    input_cost = (estimated_input_tokens / 1000) * model_costs["input"]
    output_cost = (estimated_output_tokens / 1000) * model_costs["output"]

    return {
        "estimated_input_tokens": estimated_input_tokens,
        "estimated_output_tokens": estimated_output_tokens,
        "estimated_cost": input_cost + output_cost,
        "currency": "USD"
    }

# Usage
cost_estimate = estimate_completion_cost(
    "gpt-4",
    [{"role": "user", "content": "Write a 500-word essay about AI"}],
    max_tokens=1000
)

if cost_estimate:
    print(f"Estimated cost: ${cost_estimate['estimated_cost']:.4f}")
    print(f"Input tokens: {cost_estimate['estimated_input_tokens']}")
    print(f"Output tokens: {cost_estimate['estimated_output_tokens']}")
```

- Temperature Tuning: Use lower temperatures (0.1-0.3) for factual/coding tasks, higher (0.7-0.9) for creative tasks
- Max Tokens: Set appropriate limits to control costs and response length
- System Messages: Use system messages to set context and behavior
- Conversation Context: Maintain conversation history for multi-turn interactions
- Error Handling: Always wrap API calls in try-catch blocks
- Cost Monitoring: Track usage and set budgets (see the cost-tracking sketch after this list)
- Prompt Engineering: Craft clear, specific prompts for better results
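For actual rather than estimated spend, LiteLLM ships helpers you can call directly: `litellm.completion_cost` computes the cost of a finished response, and `litellm.token_counter` counts a prompt's tokens before you send it. A minimal sketch:

```python
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Cost of the completed call, in USD
print(f"Actual cost: ${litellm.completion_cost(completion_response=response):.6f}")

# Token count of a prompt before sending it
tokens = litellm.token_counter(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(f"Prompt tokens: {tokens}")
```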
The completion API is your primary interface to LLM capabilities. Mastering these patterns will enable you to build sophisticated AI applications across any provider.