LLM Integration Patterns for Production Applications

· 11 min read · AI

Learn battle-tested patterns for integrating Large Language Models into production systems, including prompt engineering, error handling, and cost optimization.

Integrating Large Language Models into production applications is fundamentally different from experimenting in a playground. LLMs are non-deterministic, expensive, slow compared to traditional APIs, and fail in novel ways. Building reliable systems on top of them requires thoughtful architecture.

This guide covers the patterns I've learned integrating LLMs into production: prompt engineering for consistency, error handling for unreliable APIs, caching strategies for cost control, and evaluation approaches that actually work.

Problem

LLM integrations fail in ways that traditional software doesn't:

  • Non-determinism — The same prompt can produce different outputs
  • Latency — Responses take seconds, not milliseconds
  • Cost — Token-based pricing adds up quickly at scale
  • Hallucinations — Models confidently produce incorrect information
  • Rate limits — API quotas throttle high-volume applications
  • Context limits — Input size is bounded, often in surprising ways

Teams that treat LLM APIs like regular REST endpoints quickly discover these constraints the hard way.

Why This Matters

LLMs enable capabilities that were previously impossible or required massive ML teams. But realizing that potential requires:

  1. Reliability — Users can't tolerate random failures
  2. Consistency — Business logic needs predictable behavior
  3. Cost efficiency — Unbounded API costs kill projects
  4. Observability — You need to understand what the model is doing

NOTE: LLM technology is evolving rapidly. These patterns focus on principles that should remain relevant even as specific APIs change.

Solution

Architecture Overview

A production LLM integration typically includes:

┌─────────────────────────────────────────────────────────┐
│                    Application Layer                     │
├─────────────────────────────────────────────────────────┤
│  ┌───────────┐  ┌───────────┐  ┌───────────┐           │
│  │  Prompt   │  │   Cache   │  │  Output   │           │
│  │  Manager  │  │   Layer   │  │  Parser   │           │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘           │
│        │              │              │                  │
│  ┌─────┴──────────────┴──────────────┴─────┐           │
│  │           LLM Client Wrapper            │           │
│  │   (retry, rate limit, observability)    │           │
│  └─────────────────────┬───────────────────┘           │
│                        │                                │
└────────────────────────┼────────────────────────────────┘
                         │
              ┌──────────┴──────────┐
              │   LLM Provider API  │
              │  (OpenAI, Anthropic)│
              └─────────────────────┘

Implementation: LLM Client Wrapper

Wrap the provider SDK with retry logic, rate limiting, and observability.

import structlog
from openai import AsyncOpenAI, RateLimitError, APIError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)

logger = structlog.get_logger()

class LLMClient:
    """Production-ready LLM client with retry and observability."""
    
    def __init__(
        self,
        api_key: str,
        model: str = "gpt-4o",
        timeout: float = 30.0
    ):
        self._client = AsyncOpenAI(api_key=api_key, timeout=timeout)
        self._model = model
    
    # tenacity configures the retry policy at class-definition time, so the
    # attempt count is fixed here rather than passed per instance.
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type((RateLimitError, APIError))
    )
    async def complete(
        self,
        messages: list[dict],
        temperature: float = 0.0,
        max_tokens: int = 1024,
        **kwargs
    ) -> str:
        """Generate completion with automatic retry."""
        log = logger.bind(
            model=self._model,
            message_count=len(messages),
            temperature=temperature
        )
        
        log.info("llm_request_started")
        
        try:
            response = await self._client.chat.completions.create(
                model=self._model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                **kwargs
            )
            
            result = response.choices[0].message.content
            
            log.info(
                "llm_request_completed",
                prompt_tokens=response.usage.prompt_tokens,
                completion_tokens=response.usage.completion_tokens,
                total_tokens=response.usage.total_tokens
            )
            
            return result
            
        except RateLimitError as e:
            log.warning("llm_rate_limited", error=str(e))
            raise
        except APIError as e:
            log.error("llm_api_error", error=str(e))
            raise

TIP: Set temperature=0.0 for structured outputs and repeatable tasks. Even at zero, outputs are not guaranteed to be byte-identical across calls, but variance drops sharply. Use higher temperatures only when creativity is desired.

Prompt Engineering for Production

Prompts are code. Treat them accordingly—version control, test, and iterate.

from dataclasses import dataclass
from typing import Literal
from string import Template

@dataclass
class PromptTemplate:
    """Structured prompt with validation."""
    
    system: str
    user: Template
    output_format: Literal["json", "text", "markdown"]
    
    def render(self, **variables) -> list[dict]:
        """Render prompt with variables."""
        return [
            {"role": "system", "content": self.system},
            {"role": "user", "content": self.user.substitute(**variables)}
        ]

# Structured prompts with clear output format
EXTRACT_ENTITIES_PROMPT = PromptTemplate(
    system="""You are a precise entity extraction system.
Extract entities from the given text and return them in valid JSON.
Only extract entities that are explicitly mentioned.
If no entities are found, return an empty array.

Output Format:
{
  "entities": [
    {"type": "PERSON|ORG|LOCATION|DATE", "value": "extracted text", "confidence": 0.0-1.0}
  ]
}""",
    user=Template("""Extract all entities from this text:

$text

Return only valid JSON, no additional text."""),
    output_format="json"
)

SUMMARIZE_PROMPT = PromptTemplate(
    system="""You are a concise summarizer. Create summaries that:
- Capture the main points
- Are factually accurate to the source
- Use clear, simple language
- Stay within the specified length""",
    user=Template("""Summarize this text in $max_sentences sentences or fewer:

$text"""),
    output_format="text"
)

Output Parsing with Validation

Never trust raw LLM output. Parse and validate.

import json
from typing import Literal
from pydantic import BaseModel, ValidationError

class ExtractedEntity(BaseModel):
    """Validated entity from LLM extraction."""
    type: Literal["PERSON", "ORG", "LOCATION", "DATE"]
    value: str
    confidence: float

class EntityExtractionResult(BaseModel):
    """Validated extraction result."""
    entities: list[ExtractedEntity]

class LLMOutputParser:
    """Parse and validate LLM outputs."""
    
    @staticmethod
    def parse_json(response: str, model: type[BaseModel]) -> BaseModel:
        """Parse JSON response with Pydantic validation."""
        # Handle markdown code blocks
        cleaned = response.strip()
        if cleaned.startswith("```json"):
            cleaned = cleaned[7:]
        if cleaned.startswith("```"):
            cleaned = cleaned[3:]
        if cleaned.endswith("```"):
            cleaned = cleaned[:-3]
        
        try:
            data = json.loads(cleaned.strip())
            return model.model_validate(data)
        except json.JSONDecodeError as e:
            raise LLMOutputParseError(f"Invalid JSON: {e}")
        except ValidationError as e:
            raise LLMOutputParseError(f"Validation failed: {e}")

class LLMOutputParseError(Exception):
    """Raised when LLM output cannot be parsed."""
    pass

WARNING: LLMs often produce invalid JSON—missing quotes, trailing commas, markdown wrappers. Build robust parsing that handles common issues.
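To make the fence handling concrete, here is the stripping logic from parse_json above restated as a standalone function, applied to a typical markdown-wrapped reply:

```python
import json

def strip_code_fences(response: str) -> str:
    """Mirror of the markdown-fence stripping in LLMOutputParser.parse_json."""
    cleaned = response.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[7:]
    if cleaned.startswith("```"):
        cleaned = cleaned[3:]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    return cleaned.strip()

# A typical model reply wrapped in a markdown code block:
raw = "```json\n{\"entities\": []}\n```"
print(json.loads(strip_code_fences(raw)))  # {'entities': []}
```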

Caching for Cost Control

Caching is essential for managing costs and latency.

import hashlib
import json
from typing import Optional
import redis.asyncio as redis

class LLMCache:
    """Cache LLM responses to reduce costs and latency."""
    
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self._redis = redis_client
        self._ttl = ttl_seconds
    
    def _cache_key(
        self, model: str, messages: list[dict], temperature: float
    ) -> Optional[str]:
        """Generate a deterministic cache key, or None if the request is uncacheable."""
        # Only cache deterministic requests (temperature=0)
        if temperature > 0:
            return None
            
        content = json.dumps({
            "model": model,
            "messages": messages,
            "temperature": temperature
        }, sort_keys=True)
        
        return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"
    
    async def get(
        self, model: str, messages: list[dict], temperature: float
    ) -> Optional[str]:
        """Get cached response if available."""
        key = self._cache_key(model, messages, temperature)
        if not key:
            return None
        
        cached = await self._redis.get(key)
        return cached.decode() if cached else None
    
    async def set(
        self, model: str, messages: list[dict], temperature: float, response: str
    ) -> None:
        """Cache response."""
        key = self._cache_key(model, messages, temperature)
        if key:
            await self._redis.setex(key, self._ttl, response)

Example: Complete Service

Putting it all together with a complete entity extraction service:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionResult:
    """Result of entity extraction."""
    entities: list[ExtractedEntity]
    cached: bool
    latency_ms: float
    tokens_used: int

class EntityExtractionService:
    """Production service for entity extraction using LLM."""
    
    def __init__(
        self,
        llm_client: LLMClient,
        cache: LLMCache,
        parser: LLMOutputParser
    ):
        self._llm = llm_client
        self._cache = cache
        self._parser = parser
    
    async def extract_entities(
        self,
        text: str,
        use_cache: bool = True
    ) -> ExtractionResult:
        """Extract entities from text."""
        import time
        start = time.perf_counter()
        
        messages = EXTRACT_ENTITIES_PROMPT.render(text=text)
        
        # Check cache
        if use_cache:
            cached_response = await self._cache.get(
                model=self._llm._model,
                messages=messages,
                temperature=0.0
            )
            if cached_response:
                result = self._parser.parse_json(
                    cached_response, EntityExtractionResult
                )
                return ExtractionResult(
                    entities=result.entities,
                    cached=True,
                    latency_ms=(time.perf_counter() - start) * 1000,
                    tokens_used=0
                )
        
        # Call LLM
        response = await self._llm.complete(
            messages=messages,
            temperature=0.0,
            max_tokens=2048
        )
        
        # Cache response
        if use_cache:
            await self._cache.set(
                model=self._llm._model,
                messages=messages,
                temperature=0.0,
                response=response
            )
        
        # Parse and validate
        result = self._parser.parse_json(response, EntityExtractionResult)
        
        return ExtractionResult(
            entities=result.entities,
            cached=False,
            latency_ms=(time.perf_counter() - start) * 1000,
            tokens_used=0  # Would come from completion response
        )

Evaluation and Monitoring

LLMs require ongoing monitoring—they can degrade without code changes.

from datetime import datetime
from typing import Any

class LLMMetrics:
    """Track LLM usage and quality metrics."""
    
    def __init__(self, prometheus_client):
        self._prom = prometheus_client
        
        # Define metrics
        self.request_count = self._prom.Counter(
            "llm_requests_total",
            "Total LLM requests",
            ["model", "operation", "status"]
        )
        self.latency_histogram = self._prom.Histogram(
            "llm_request_duration_seconds",
            "LLM request latency",
            ["model", "operation"]
        )
        self.token_count = self._prom.Counter(
            "llm_tokens_total",
            "Total tokens used",
            ["model", "type"]  # type: prompt|completion
        )
        self.cache_hits = self._prom.Counter(
            "llm_cache_hits_total",
            "Cache hit count",
            ["operation"]
        )
    
    def record_request(
        self,
        model: str,
        operation: str,
        status: str,
        latency_seconds: float,
        prompt_tokens: int,
        completion_tokens: int,
        cached: bool
    ):
        """Record metrics for an LLM request."""
        self.request_count.labels(model, operation, status).inc()
        self.latency_histogram.labels(model, operation).observe(latency_seconds)
        
        if not cached:
            self.token_count.labels(model, "prompt").inc(prompt_tokens)
            self.token_count.labels(model, "completion").inc(completion_tokens)
        else:
            self.cache_hits.labels(operation).inc()

Common Mistakes

1. No Fallback Strategy

# Wrong - single point of failure
response = await llm.complete(messages)

# Correct - fallback models
class AllModelsFailedError(Exception):
    """Raised when every model in the fallback chain fails."""

async def complete_with_fallback(messages: list[dict]) -> str:
    # Assumes a client whose complete() accepts a per-call model override.
    models = ["gpt-4o", "gpt-4o-mini", "claude-3-sonnet"]
    
    for model in models:
        try:
            return await llm.complete(messages, model=model)
        except (RateLimitError, APIError):
            continue
    
    raise AllModelsFailedError("All LLM providers failed")

2. Unbounded Context

# Wrong - can exceed context limit
prompt = f"Summarize this: {very_long_document}"

# Correct - chunk and process
import tiktoken

def chunk_text(text: str, max_tokens: int = 3000) -> list[str]:
    # Use tiktoken to count tokens accurately
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(text)
    
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoder.decode(chunk_tokens))
    
    return chunks
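tiktoken is a third-party dependency. When it is unavailable, a rough stdlib fallback is to chunk on whitespace and leave headroom: one token is roughly 0.75 English words, so a word budget should sit well below the model's token limit. A minimal sketch:

```python
def chunk_words(text: str, max_words: int = 2000) -> list[str]:
    """Stdlib fallback for chunking when a real tokenizer is unavailable.
    Word counts only approximate model tokens (one token is roughly 0.75
    English words), so keep max_words well below the context limit."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

print(chunk_words("one two three four five", max_words=2))
# ['one two', 'three four', 'five']
```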

3. No Cost Monitoring

Track costs per request, user, and feature. Set budgets and alerts.

# Track cost per request
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    return (prompt_tokens / 1000 * rates["input"] + 
            completion_tokens / 1000 * rates["output"])
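Worked through at the rates above, a gpt-4o call with 1,000 prompt tokens and 500 completion tokens costs 1000/1000 × 0.005 + 500/1000 × 0.015 = $0.0125. (Prices change frequently; treat the table as a snapshot, not a reference.) The same calculation, self-contained:

```python
# Snapshot of per-1K-token rates; real prices change frequently.
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models fall back to zero rates, which silently undercounts;
    # in production, prefer raising on an unrecognized model.
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    return (prompt_tokens / 1000 * rates["input"]
            + completion_tokens / 1000 * rates["output"])

print(round(calculate_cost("gpt-4o", 1000, 500), 6))  # 0.0125
```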

4. Testing Against Live APIs

# Use recorded responses for tests
import pytest

@pytest.fixture
def mock_llm_response():
    return {
        "choices": [{
            "message": {"content": '{"entities": []}'}
        }],
        "usage": {"prompt_tokens": 100, "completion_tokens": 20}
    }

Conclusion

Production LLM integration requires treating LLMs as unreliable external services. Wrap API calls with retries and fallbacks. Cache aggressively. Parse outputs defensively. Monitor costs and quality continuously.

These patterns will evolve as LLM technology matures, but the principles—reliability, observability, cost control—remain constant. Build systems that gracefully handle the unique failure modes of language models, and you'll unlock capabilities that would have been impossible just a few years ago.