LLM Integration Patterns for Production Applications
Integrating Large Language Models into production applications is fundamentally different from experimenting in a playground. LLMs are non-deterministic, expensive, slow compared to traditional APIs, and fail in novel ways. Building reliable systems on top of them requires thoughtful architecture.
This guide covers the patterns I've learned integrating LLMs into production: prompt engineering for consistency, error handling for unreliable APIs, caching strategies for cost control, and evaluation approaches that actually work.
Problem
LLM integrations fail in ways that traditional software doesn't:
- Non-determinism — The same prompt can produce different outputs
- Latency — Responses take seconds, not milliseconds
- Cost — Token-based pricing adds up quickly at scale
- Hallucinations — Models confidently produce incorrect information
- Rate limits — API quotas throttle high-volume applications
- Context limits — Input size is bounded, often in surprising ways
Teams that treat LLM APIs like regular REST endpoints quickly discover these constraints the hard way.
Why This Matters
LLMs enable capabilities that were previously impossible or required massive ML teams. But realizing that potential requires:
- Reliability — Users can't tolerate random failures
- Consistency — Business logic needs predictable behavior
- Cost efficiency — Unbounded API costs kill projects
- Observability — You need to understand what the model is doing
NOTE: LLM technology is evolving rapidly. These patterns focus on principles that should remain relevant even as specific APIs change.
Solution
Architecture Overview
A production LLM integration typically includes:
┌─────────────────────────────────────────────────────────┐
│                    Application Layer                    │
├─────────────────────────────────────────────────────────┤
│   ┌───────────┐    ┌───────────┐    ┌───────────┐       │
│   │  Prompt   │    │   Cache   │    │  Output   │       │
│   │  Manager  │    │   Layer   │    │  Parser   │       │
│   └─────┬─────┘    └─────┬─────┘    └─────┬─────┘       │
│         │                │                │             │
│   ┌─────┴────────────────┴────────────────┴─────┐       │
│   │              LLM Client Wrapper             │       │
│   │     (retry, rate limit, observability)      │       │
│   └──────────────────────┬──────────────────────┘       │
│                          │                              │
└──────────────────────────┼──────────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │  LLM Provider API   │
                │ (OpenAI, Anthropic) │
                └─────────────────────┘
Implementation: LLM Client Wrapper
Wrap the provider SDK with retry logic, rate limiting, and observability.
import structlog
from openai import AsyncOpenAI, RateLimitError, APIError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

logger = structlog.get_logger()

class LLMClient:
    """Production-ready LLM client with retry and observability."""

    def __init__(
        self,
        api_key: str,
        model: str = "gpt-4o",
        max_retries: int = 3,
        timeout: float = 30.0,
    ):
        self._client = AsyncOpenAI(api_key=api_key, timeout=timeout)
        self._model = model
        # Note: the @retry decorator below is configured at class-definition
        # time, so keep stop_after_attempt in sync with this value.
        self._max_retries = max_retries

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type((RateLimitError, APIError)),
    )
    async def complete(
        self,
        messages: list[dict],
        temperature: float = 0.0,
        max_tokens: int = 1024,
        **kwargs,
    ) -> str:
        """Generate a completion with automatic retry."""
        log = logger.bind(
            model=self._model,
            message_count=len(messages),
            temperature=temperature,
        )
        log.info("llm_request_started")
        try:
            response = await self._client.chat.completions.create(
                model=self._model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                **kwargs,
            )
            result = response.choices[0].message.content
            log.info(
                "llm_request_completed",
                prompt_tokens=response.usage.prompt_tokens,
                completion_tokens=response.usage.completion_tokens,
                total_tokens=response.usage.total_tokens,
            )
            return result
        except RateLimitError as e:
            log.warning("llm_rate_limited", error=str(e))
            raise
        except APIError as e:
            log.error("llm_api_error", error=str(e))
            raise
TIP: Always set temperature=0.0 for structured outputs and deterministic tasks. Use higher temperatures only when creativity is desired.
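The retry policy above backs off exponentially between attempts. A hand-rolled sketch of that schedule (a simplification, not tenacity's exact internals) clamps an exponential delay between a floor and a ceiling:

```python
def backoff_delays(attempts: int, base: float = 1.0,
                   floor: float = 2.0, ceiling: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**i, clamped to [floor, ceiling]."""
    return [min(ceiling, max(floor, base * (2 ** i))) for i in range(attempts)]

# With the decorator's settings (multiplier=1, min=2, max=30),
# delays grow geometrically from 2s and cap at 30s
delays = backoff_delays(8)
```

The cap matters: without it, a few consecutive rate-limit responses would push waits into minutes and blow through request timeouts.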
Prompt Engineering for Production
Prompts are code. Treat them accordingly—version control, test, and iterate.
from dataclasses import dataclass
from typing import Literal
from string import Template

@dataclass
class PromptTemplate:
    """Structured prompt with validation."""
    system: str
    user: Template
    output_format: Literal["json", "text", "markdown"]

    def render(self, **variables) -> list[dict]:
        """Render prompt with variables."""
        return [
            {"role": "system", "content": self.system},
            {"role": "user", "content": self.user.substitute(**variables)},
        ]

# Structured prompts with a clear output format
EXTRACT_ENTITIES_PROMPT = PromptTemplate(
    system="""You are a precise entity extraction system.
Extract entities from the given text and return them in valid JSON.
Only extract entities that are explicitly mentioned.
If no entities are found, return an empty array.

Output Format:
{
  "entities": [
    {"type": "PERSON|ORG|LOCATION|DATE", "value": "extracted text", "confidence": 0.0-1.0}
  ]
}""",
    user=Template("""Extract all entities from this text:

$text

Return only valid JSON, no additional text."""),
    output_format="json",
)

SUMMARIZE_PROMPT = PromptTemplate(
    system="""You are a concise summarizer. Create summaries that:
- Capture the main points
- Are factually accurate to the source
- Use clear, simple language
- Stay within the specified length""",
    user=Template("""Summarize this text in $max_sentences sentences or fewer:

$text"""),
    output_format="text",
)
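To see what rendering produces, here is a condensed stand-in for PromptTemplate.render (not the full class) applied to a summarization-style template:

```python
from string import Template

def render(system: str, user_tmpl: Template, **variables) -> list[dict]:
    """Condensed version of PromptTemplate.render: system + filled user turn."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_tmpl.substitute(**variables)},
    ]

messages = render(
    "You are a concise summarizer.",
    Template("Summarize this text in $max_sentences sentences or fewer:\n\n$text"),
    max_sentences=2,
    text="LLMs are non-deterministic.",
)
# messages[0] is the system prompt; messages[1] has the variables filled in
```

Template.substitute raises KeyError on a missing variable, which is exactly what you want: a prompt with an unfilled placeholder should fail loudly in tests, not silently reach the model.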
Output Parsing with Validation
Never trust raw LLM output. Parse and validate.
import json
from typing import Literal
from pydantic import BaseModel, ValidationError

class LLMOutputParseError(Exception):
    """Raised when LLM output cannot be parsed."""
    pass

class ExtractedEntity(BaseModel):
    """Validated entity from LLM extraction."""
    type: Literal["PERSON", "ORG", "LOCATION", "DATE"]
    value: str
    confidence: float

class EntityExtractionResult(BaseModel):
    """Validated extraction result."""
    entities: list[ExtractedEntity]

class LLMOutputParser:
    """Parse and validate LLM outputs."""

    @staticmethod
    def parse_json(response: str, model: type[BaseModel]) -> BaseModel:
        """Parse a JSON response with Pydantic validation."""
        # Strip markdown code fences the model may wrap around JSON
        cleaned = response.strip()
        if cleaned.startswith("```json"):
            cleaned = cleaned[7:]
        elif cleaned.startswith("```"):
            cleaned = cleaned[3:]
        if cleaned.endswith("```"):
            cleaned = cleaned[:-3]
        try:
            data = json.loads(cleaned.strip())
            return model.model_validate(data)
        except json.JSONDecodeError as e:
            raise LLMOutputParseError(f"Invalid JSON: {e}") from e
        except ValidationError as e:
            raise LLMOutputParseError(f"Validation failed: {e}") from e
WARNING: LLMs often produce invalid JSON—missing quotes, trailing commas, markdown wrappers. Build robust parsing that handles common issues.
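The fence-stripping step alone covers the most common of these issues. A standalone sketch of just that step (without the Pydantic validation), exercised on a typical wrapped response:

```python
import json

def strip_markdown_fences(response: str) -> str:
    """Remove ```json ... ``` wrappers that models often add around JSON."""
    cleaned = response.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[7:]
    elif cleaned.startswith("```"):
        cleaned = cleaned[3:]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    return cleaned.strip()

wrapped = "```json\n{\"entities\": []}\n```"
data = json.loads(strip_markdown_fences(wrapped))
# data == {"entities": []}
```

Note the function is a no-op on already-clean JSON, so it is safe to apply unconditionally before json.loads.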
Caching for Cost Control
Caching is essential for managing costs and latency.
import hashlib
import json
from typing import Optional
import redis.asyncio as redis

class LLMCache:
    """Cache LLM responses to reduce costs and latency."""

    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self._redis = redis_client
        self._ttl = ttl_seconds

    def _cache_key(
        self, model: str, messages: list[dict], temperature: float
    ) -> Optional[str]:
        """Generate a deterministic cache key."""
        # Only cache deterministic requests (temperature=0)
        if temperature > 0:
            return None
        content = json.dumps({
            "model": model,
            "messages": messages,
            "temperature": temperature,
        }, sort_keys=True)
        return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"

    async def get(
        self, model: str, messages: list[dict], temperature: float
    ) -> Optional[str]:
        """Get the cached response if available."""
        key = self._cache_key(model, messages, temperature)
        if not key:
            return None
        cached = await self._redis.get(key)
        return cached.decode() if cached else None

    async def set(
        self, model: str, messages: list[dict], temperature: float, response: str
    ) -> None:
        """Cache a response."""
        key = self._cache_key(model, messages, temperature)
        if key:
            await self._redis.setex(key, self._ttl, response)
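The key derivation is a pure function, so its properties are easy to check in isolation. Here is a standalone copy of that logic (a hypothetical cache_key helper mirroring _cache_key above):

```python
import hashlib
import json
from typing import Optional

def cache_key(model: str, messages: list[dict], temperature: float) -> Optional[str]:
    """Deterministic key; only temperature=0 requests are cacheable."""
    if temperature > 0:
        return None
    # sort_keys makes the serialization canonical, so logically identical
    # requests hash to the same key regardless of dict insertion order
    content = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"

msgs = [{"role": "user", "content": "hi"}]
# Same inputs always produce the same key; temperature > 0 is never cached
```

The sort_keys=True is load-bearing: without canonical serialization, two identical requests built with different dict ordering would miss the cache.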
Example: Complete Service
Putting it all together with a complete entity extraction service:
import time
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    """Result of entity extraction."""
    entities: list[ExtractedEntity]
    cached: bool
    latency_ms: float
    tokens_used: int

class EntityExtractionService:
    """Production service for entity extraction using an LLM."""

    def __init__(
        self,
        llm_client: LLMClient,
        cache: LLMCache,
        parser: LLMOutputParser,
    ):
        self._llm = llm_client
        self._cache = cache
        self._parser = parser

    async def extract_entities(
        self,
        text: str,
        use_cache: bool = True,
    ) -> ExtractionResult:
        """Extract entities from text."""
        start = time.perf_counter()
        messages = EXTRACT_ENTITIES_PROMPT.render(text=text)

        # Check cache
        if use_cache:
            cached_response = await self._cache.get(
                model=self._llm._model,
                messages=messages,
                temperature=0.0,
            )
            if cached_response:
                result = self._parser.parse_json(
                    cached_response, EntityExtractionResult
                )
                return ExtractionResult(
                    entities=result.entities,
                    cached=True,
                    latency_ms=(time.perf_counter() - start) * 1000,
                    tokens_used=0,
                )

        # Call LLM
        response = await self._llm.complete(
            messages=messages,
            temperature=0.0,
            max_tokens=2048,
        )

        # Cache response
        if use_cache:
            await self._cache.set(
                model=self._llm._model,
                messages=messages,
                temperature=0.0,
                response=response,
            )

        # Parse and validate
        result = self._parser.parse_json(response, EntityExtractionResult)
        return ExtractionResult(
            entities=result.entities,
            cached=False,
            latency_ms=(time.perf_counter() - start) * 1000,
            tokens_used=0,  # Would come from the completion response
        )
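The control flow above is classic cache-aside: read, miss, call, write. With in-memory stand-ins for Redis and the LLM (hypothetical FakeLLM and extract names, just to show the flow), a second identical call never reaches the model:

```python
class FakeLLM:
    """Stand-in for LLMClient that counts how often it is called."""
    def __init__(self):
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return '{"entities": []}'

def extract(llm: FakeLLM, cache: dict, prompt: str) -> str:
    """Cache-aside: return the cached response or call the model and store it."""
    if prompt in cache:
        return cache[prompt]
    response = llm.complete(prompt)
    cache[prompt] = response
    return response

llm, cache = FakeLLM(), {}
first = extract(llm, cache, "Extract entities: Alice works at Acme.")
second = extract(llm, cache, "Extract entities: Alice works at Acme.")
# first == second, and the model was called only once
```

The same fakes are what you would inject into EntityExtractionService in tests, which is why it takes its collaborators through the constructor rather than creating them internally.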
Evaluation and Monitoring
LLMs require ongoing monitoring: output quality can drift without any code change on your side, for example when a provider updates the underlying model.
class LLMMetrics:
    """Track LLM usage and quality metrics."""

    def __init__(self, prometheus_client):
        self._prom = prometheus_client
        # Define metrics
        self.request_count = self._prom.Counter(
            "llm_requests_total",
            "Total LLM requests",
            ["model", "operation", "status"],
        )
        self.latency_histogram = self._prom.Histogram(
            "llm_request_duration_seconds",
            "LLM request latency",
            ["model", "operation"],
        )
        self.token_count = self._prom.Counter(
            "llm_tokens_total",
            "Total tokens used",
            ["model", "type"],  # label values: prompt | completion
        )
        self.cache_hits = self._prom.Counter(
            "llm_cache_hits_total",
            "Cache hit count",
            ["operation"],
        )

    def record_request(
        self,
        model: str,
        operation: str,
        status: str,
        latency_seconds: float,
        prompt_tokens: int,
        completion_tokens: int,
        cached: bool,
    ):
        """Record metrics for an LLM request."""
        self.request_count.labels(model, operation, status).inc()
        self.latency_histogram.labels(model, operation).observe(latency_seconds)
        if not cached:
            self.token_count.labels(model, "prompt").inc(prompt_tokens)
            self.token_count.labels(model, "completion").inc(completion_tokens)
        else:
            self.cache_hits.labels(operation).inc()
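The branching in record_request (token counters on a live call, a hit counter on a cached one) can be exercised without Prometheus at all, using plain counters as a sketch:

```python
from collections import Counter

tokens: Counter = Counter()      # keyed by (model, token_type)
cache_hits: Counter = Counter()  # keyed by operation

def record(model: str, operation: str, prompt_tokens: int,
           completion_tokens: int, cached: bool) -> None:
    """Mirror of record_request's branching, minus the Prometheus types."""
    if not cached:
        tokens[(model, "prompt")] += prompt_tokens
        tokens[(model, "completion")] += completion_tokens
    else:
        cache_hits[operation] += 1

record("gpt-4o", "extract", 100, 20, cached=False)
record("gpt-4o", "extract", 100, 20, cached=True)
# The live call counted tokens; the cached call only incremented the hit counter
```

Keeping cached requests out of the token counters is deliberate: token metrics should track what you are billed for, and cache hits cost nothing.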
Common Mistakes
1. No Fallback Strategy
# Wrong - single point of failure
response = await llm.complete(messages)

# Correct - fall back across models
# (assumes a client that accepts a per-call model override and
#  routes to the right provider)
class AllModelsFailedError(Exception):
    pass

async def complete_with_fallback(messages: list[dict]) -> str:
    models = ["gpt-4o", "gpt-4o-mini", "claude-3-sonnet"]
    for model in models:
        try:
            return await llm.complete(messages, model=model)
        except (RateLimitError, APIError):
            continue
    raise AllModelsFailedError("All LLM providers failed")
2. Unbounded Context
# Wrong - can exceed the context limit
prompt = f"Summarize this: {very_long_document}"

# Correct - chunk and process
import tiktoken

def chunk_text(text: str, max_tokens: int = 3000) -> list[str]:
    # Use tiktoken to count tokens accurately
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoder.decode(chunk_tokens))
    return chunks
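tiktoken gives exact counts for a specific model; the sliding-window idea itself works with any tokenizer. A dependency-free sketch using whitespace tokens (word counts only approximate real token counts, so this is for illustration):

```python
def chunk_words(text: str, max_words: int = 3000) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_words("one two three four five", max_words=2)
# → ["one two", "three four", "five"]
```

Whatever tokenizer you use, verify the invariant that no content is dropped: rejoining the chunks should reproduce the original token sequence.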
3. No Cost Monitoring
Track costs per request, user, and feature. Set budgets and alerts.
# Track cost per request
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    return (prompt_tokens / 1000 * rates["input"]
            + completion_tokens / 1000 * rates["output"])
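Plugging the rates above into calculate_cost gives a quick sanity check, e.g. a gpt-4o call with 1,000 prompt tokens and 500 completion tokens:

```python
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models fall back to zero-cost rates rather than raising
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    return (prompt_tokens / 1000 * rates["input"]
            + completion_tokens / 1000 * rates["output"])

cost = calculate_cost("gpt-4o", 1000, 500)
# 1.0 * 0.005 + 0.5 * 0.015 = 0.0125 dollars
```

Note that output tokens cost several times more than input tokens, so a tight max_tokens limit is also a cost control, not just a safety rail.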
4. Testing Against Live APIs
# Use recorded responses for tests
import pytest

@pytest.fixture
def mock_llm_response():
    return {
        "choices": [{
            "message": {"content": '{"entities": []}'}
        }],
        "usage": {"prompt_tokens": 100, "completion_tokens": 20},
    }
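A test then feeds the recorded payload through the parsing path instead of the network. Stripped of the pytest machinery, the essence is:

```python
import json

recorded = {
    "choices": [{"message": {"content": '{"entities": []}'}}],
    "usage": {"prompt_tokens": 100, "completion_tokens": 20},
}

# Pull the content the same way the client wrapper would, then parse it
content = recorded["choices"][0]["message"]["content"]
parsed = json.loads(content)
# parsed["entities"] is an empty list; no live API call was required
```

Recorded responses make tests fast, free, and deterministic; keep a small set of live smoke tests in a separate suite to catch provider-side changes.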
Conclusion
Production LLM integration requires treating LLMs as unreliable external services. Wrap API calls with retries and fallbacks. Cache aggressively. Parse outputs defensively. Monitor costs and quality continuously.
These patterns will evolve as LLM technology matures, but the principles—reliability, observability, cost control—remain constant. Build systems that gracefully handle the unique failure modes of language models, and you'll unlock capabilities that would have been impossible just a few years ago.