Observability for Backend Services: Logs, Traces, and Metrics That Matter
Overview
Monitoring tells you something is broken. Observability tells you why. This post covers how to instrument a Python backend so you can diagnose production issues in minutes instead of hours.
Problem
You deploy a FastAPI service. It works. A week later, users report intermittent slowness. You check the logs — nothing obvious. You check the database — queries look fine. You check CPU and memory — normal.
The issue is a third-party API call that times out 2% of the time, causing a cascade of retries that saturate your connection pool. Without structured observability, finding this takes days. With it, it takes one query.
Solution
Implement the three pillars of observability: structured logs, distributed traces, and application metrics.
Pillar 1: Structured Logging
Stop using print statements. Stop using unstructured log lines. Every log entry should be a JSON object with consistent fields.
import uuid

import structlog
from fastapi import FastAPI, Request

structlog.configure(
    processors=[
        # Merge in values set via bound_contextvars (request_id, etc.);
        # without this processor, the bound context never reaches the output.
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()
app = FastAPI()

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    with structlog.contextvars.bound_contextvars(
        request_id=request_id,
        method=request.method,
        path=request.url.path,
    ):
        logger.info("request_started")
        response = await call_next(request)
        logger.info("request_completed", status=response.status_code)
    response.headers["X-Request-ID"] = request_id
    return response
Every log line now includes the request ID, so you can trace a single request across all log entries.
Pillar 2: Distributed Tracing
Traces show you the full journey of a request — which services it touched, how long each step took, and where the time went.
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=engine)

# Custom spans for business logic
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order_id)
        with tracer.start_as_current_span("charge_payment"):
            await charge_payment(order_id)
        with tracer.start_as_current_span("send_confirmation"):
            await send_email(order_id)
Each span shows its duration, parent span, and custom attributes. View them in Jaeger, Grafana Tempo, or any OpenTelemetry-compatible backend.
Pillar 3: Application Metrics
Metrics answer aggregate questions: How many requests per second? What is the p99 latency? How many database connections are in use?
import time

from prometheus_client import Counter, Histogram, Gauge

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Current database connection pool size",
    ["state"],  # active, idle, waiting
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    endpoint = request.url.path
    REQUEST_COUNT.labels(request.method, endpoint, response.status_code).inc()
    REQUEST_LATENCY.labels(request.method, endpoint).observe(duration)
    return response
Implementation
Connecting the Three Pillars
The real power comes from correlating logs, traces, and metrics. Include the trace ID in every log entry:
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

# Register add_trace_context in the processors list passed to
# structlog.configure(), before JSONRenderer, so every rendered
# entry carries the IDs.
Now when a metric spikes, you find the trace. The trace shows which span is slow. The logs for that span show exactly what went wrong.
What to Alert On
Do not alert on everything. Alert on symptoms, not causes:
- Error rate > 1% over 5 minutes
- p99 latency > 2x the baseline
- Database connection pool utilization > 80%
- Queue depth growing faster than drain rate
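The first rule above can be expressed directly against the `http_requests_total` counter defined earlier. A sketch of a Prometheus alerting rule (the threshold and window come from the list; the `status` label matching and severity label are assumptions about your setup):

```yaml
groups:
  - name: service-slos
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```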
Challenges
Log volume: Structured logging generates more data. Use sampling — log 100% of errors, 10% of successful requests. Traces can be sampled similarly.
Cardinality explosion: Do not use user IDs or request IDs as metric labels. Prometheus (like most time-series databases) struggles with high-cardinality labels, since every distinct label combination creates a new time series. Keep label values to a small, bounded set.
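A common source of the explosion is the raw request path itself: `/users/123` and `/users/456` become distinct label values. A sketch of normalizing paths before labeling (the regexes are illustrative; with FastAPI you can often read the matched route template from `request.scope["route"].path` after routing instead):

```python
import re

# Collapse UUID-like and numeric path segments into placeholders so one
# route produces one label value, not one per entity.
_UUID_SEGMENT = re.compile(r"/[0-9a-fA-F-]{32,36}(?=/|$)")
_ID_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path: str) -> str:
    path = _UUID_SEGMENT.sub("/{uuid}", path)
    path = _ID_SEGMENT.sub("/{id}", path)
    return path
```

Apply this to `endpoint` in the metrics middleware so the `endpoint` label stays bounded by the number of routes, not the number of entities.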
Conclusion
Start with structured logging — it has the lowest cost and highest immediate value. Add tracing when you need to debug cross-service latency. Add metrics when you need dashboards and alerts. The goal is not to collect all possible data, but to be able to answer 'why is this slow?' within 5 minutes.