LLM API calls cost money, and controlling usage is essential for production applications. LiteLLM provides several parameters to help you manage costs and computational resources.
Understanding Token Costs
Most LLM providers charge by the token, counting both what you send and what the model generates:
- Input tokens: Your messages (prompts)
- Output tokens: The model’s responses
Pricing varies by provider and model:
- Rates are typically quoted per million (or per thousand) tokens
- Input and output tokens are billed at different rates
- More powerful models typically cost more
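To make the arithmetic concrete, here is a minimal sketch of how per-token pricing adds up. The rates below are made-up placeholders, not any provider's actual prices:
# Back-of-envelope estimate; replace the rates with your provider's real pricing
INPUT_RATE_PER_M = 0.30    # assumed: $0.30 per 1M input tokens
OUTPUT_RATE_PER_M = 2.50   # assumed: $2.50 per 1M output tokens

input_tokens = 1_200       # e.g., a long prompt
output_tokens = 400        # e.g., a medium-length answer

cost = (input_tokens / 1_000_000) * INPUT_RATE_PER_M \
    + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M
print(f"Estimated cost: ${cost:.6f}")   # ~$0.001360 with these assumed rates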
1. Limiting Response Length (max_tokens)
The max_tokens parameter limits how long the response can be:
messages = [{"role": "user", "content": "Explain quantum physics in detail."}]
# Short, cost-effective response
response = litellm.completion(
model="gemini/gemini-2.5-flash",
messages=messages,
max_tokens=100 # Limit to ~100 tokens
)
print(response.choices[0].message.content)
# Result: Brief, concise explanation
# Longer, more detailed response
response = litellm.completion(
model="gemini/gemini-2.5-flash",
messages=messages,
max_tokens=500 # Allow up to ~500 tokens
)
print(response.choices[0].message.content)
# Result: More comprehensive explanationWhen to use max_tokens:
- ✅ Control costs (fewer tokens = lower cost)
- ✅ Enforce brevity (e.g., one-sentence summaries)
- ✅ Prevent overly long responses in chat apps
- ✅ Stay within context limits
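When max_tokens cuts a reply short, the text simply ends mid-thought. Since LiteLLM returns OpenAI-style response objects, you can usually detect truncation by checking finish_reason (a minimal sketch; the exact finish_reason values can vary by provider):
import litellm

response = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Explain quantum physics in detail."}],
    max_tokens=50
)

# "length" usually signals that the max_tokens cap was hit
if response.choices[0].finish_reason == "length":
    print("Response was truncated by max_tokens.")
print(response.choices[0].message.content)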
Note: 1 token ≈ 4 characters in English. Plan accordingly!
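If the 4-characters rule of thumb is too rough, you can count tokens before sending a request. LiteLLM provides a token_counter helper for this; a small sketch (the count depends on the model's tokenizer):
import litellm

messages = [{"role": "user", "content": "Explain quantum physics in detail."}]

# Counts how many tokens this prompt consumes for the given model
prompt_tokens = litellm.token_counter(
    model="gemini/gemini-2.5-flash",
    messages=messages
)
print(f"Prompt tokens: {prompt_tokens}")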
2. Forcing Early Stops (stop)
The stop parameter forces the model to stop generating when it encounters specific sequences:
messages = [{"role": "user", "content": "List programming languages:"}]
response = litellm.completion(
model="gemini/gemini-2.5-flash",
messages=messages,
stop=["5."] # Stop after the 4th item
)
print(response.choices[0].message.content)
# Output: "1. Python\n2. JavaScript\n3. Java\n4. C++" (stops before "5.")Use cases for stop:
- Control output format (stop at section markers)
- Limit list lengths programmatically
- Prevent unwanted continuation
- Save tokens on structured output
- Parse code generation boundaries
Example: Structured output
messages = [{"role": "user", "content": "Write a haiku, then explain it."}]
# Stop after the haiku (before explanation)
response = litellm.completion(
model="gemini/gemini-2.5-flash",
messages=messages,
stop=["\n\n"] # Stop at double newline
)
# Gets just the haiku, saves tokens3. Tracking Costs (litellm.completion_cost)
LiteLLM can automatically calculate the cost of your API calls:
import litellm

messages = [{"role": "user", "content": "Explain machine learning."}]

response = litellm.completion(
    model="gpt-4",
    messages=messages,
    max_tokens=200
)

# Calculate the cost
cost = litellm.completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
# Example output: "Cost: $0.002400"
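Because input and output tokens are billed at different rates, it is also worth looking at the usage breakdown that comes back with the response. Continuing from the example above (LiteLLM follows the OpenAI usage schema):
usage = response.usage
print(f"Input (prompt) tokens:      {usage.prompt_tokens}")
print(f"Output (completion) tokens: {usage.completion_tokens}")
print(f"Total tokens:               {usage.total_tokens}")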
Tracking Costs Across Multiple Calls
total_cost = 0

# Multiple API calls
for question in ["What is AI?", "What is ML?", "What is DL?"]:
    messages = [{"role": "user", "content": question}]
    response = litellm.completion(
        model="gemini/gemini-2.5-flash",
        messages=messages,
        max_tokens=100
    )
    call_cost = litellm.completion_cost(completion_response=response)
    total_cost += call_cost

    print(f"Question: {question}")
    print(f"Response: {response.choices[0].message.content}")
    print(f"Cost: ${call_cost:.6f}\n")

print(f"Total cost: ${total_cost:.6f}")
Cost Tracking with Custom Pricing
If you’re using a custom model or local deployment, you can specify custom pricing:
response = litellm.completion(
    model="custom-model",
    messages=messages,
    input_cost_per_token=0.000001,   # $0.000001 per input token
    output_cost_per_token=0.000002   # $0.000002 per output token
)
cost = litellm.completion_cost(completion_response=response)
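LiteLLM can also register pricing for a model name up front via litellm.register_model, so completion_cost can look it up later without per-call parameters. The exact schema can differ between versions, so treat this as a sketch and check the docs for your installed release:
import litellm

# Illustrative pricing for a model name LiteLLM does not already know
litellm.register_model({
    "my-local-model": {
        "input_cost_per_token": 0.000001,
        "output_cost_per_token": 0.000002,
        "litellm_provider": "openai",   # provider used to route the request
        "mode": "chat"
    }
})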
Combining Cost Control Techniques
For maximum efficiency, combine all three approaches:
messages = [{"role": "user", "content": "Summarize the benefits of renewable energy."}]
response = litellm.completion(
model="gemini/gemini-2.5-flash",
messages=messages,
max_tokens=150, # Limit length
stop=["\n\nConclusion"], # Stop before conclusion section
temperature=0.3 # Lower temperature for consistency
)
# Track the cost
cost = litellm.completion_cost(completion_response=response)
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
print(f"Cost: ${cost:.6f}")Best Practices for Cost Management
✅ Set max_tokens appropriately - Don’t allow unlimited generation
✅ Use cheaper models when possible - Gemini Flash vs GPT-4 for simple tasks
✅ Cache common responses - Store frequently asked questions (see the sketch after this list)
✅ Monitor costs - Track usage with completion_cost()
✅ Use local models (Ollama) for development and testing
✅ Optimize prompts - Shorter, clearer prompts save input tokens
✅ Use stop for structured output - Avoid generating unnecessary text
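As a concrete example of the caching point above, even an in-memory dictionary keyed on the prompt avoids paying twice for identical questions. This is a minimal sketch; a production app would use a real cache (Redis, or LiteLLM's caching support):
import litellm

response_cache = {}   # prompt text -> answer text

def cached_answer(prompt: str) -> str:
    """Return a cached answer when the exact same prompt was seen before."""
    if prompt in response_cache:
        return response_cache[prompt]    # no API call, zero cost
    response = litellm.completion(
        model="gemini/gemini-2.5-flash",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150
    )
    answer = response.choices[0].message.content
    response_cache[prompt] = answer
    return answer

print(cached_answer("What is AI?"))   # paid API call
print(cached_answer("What is AI?"))   # served from the cache, free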
Cost Comparison Example
# Expensive: Long response, powerful model
response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a 500-word essay on AI."}],
    max_tokens=1000
)
# Cost: ~$0.03-0.06

# Cheaper: Concise response, efficient model
response = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Summarize AI in 100 words."}],
    max_tokens=150
)
# Cost: ~$0.0001-0.0003
200x cost difference! Choose wisely based on your needs.
Key Takeaways
✅ Use max_tokens to limit response length and control costs
✅ Use stop to end generation at specific points
✅ Track costs with litellm.completion_cost() for monitoring
✅ Combine techniques for maximum efficiency
✅ Choose appropriate models for the task complexity
✅ Test with local models (Ollama) before using paid APIs
Previous: Lesson 1.4 - System Prompts and Context
Next: Lesson 1.6 - Putting It All Together
