Observability for Backend Services: Logs, Traces, and Metrics That Matter

13 min read · DevOps & Cloud

How to instrument a FastAPI backend with structured logging, distributed tracing, and metrics — without drowning in data.

Overview

Monitoring tells you something is broken. Observability tells you why. This post covers how to instrument a Python backend so you can diagnose production issues in minutes instead of hours.

Problem

You deploy a FastAPI service. It works. A week later, users report intermittent slowness. You check the logs — nothing obvious. You check the database — queries look fine. You check CPU and memory — normal.

The issue is a third-party API call that times out 2% of the time, causing a cascade of retries that saturate your connection pool. Without structured observability, finding this takes days. With it, it takes one query.

Solution

Implement the three pillars of observability: structured logs, distributed traces, and application metrics.

Pillar 1: Structured Logging

Stop using print statements. Stop using unstructured log lines. Every log entry should be a JSON object with consistent fields.

import structlog
import uuid
from fastapi import FastAPI, Request

app = FastAPI()

structlog.configure(
    processors=[
        # Merge bound contextvars (request_id etc.) into every event
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    
    with structlog.contextvars.bound_contextvars(
        request_id=request_id,
        method=request.method,
        path=request.url.path,
    ):
        logger.info("request_started")
        response = await call_next(request)
        logger.info("request_completed", status=response.status_code)
        response.headers["X-Request-ID"] = request_id
        return response

Every log line now includes the request ID, so you can trace a single request across all log entries.
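Because every entry is a JSON object, pulling one request out of mixed log output is a single filter rather than a grep-and-squint session. A minimal sketch using only the standard library (the sample lines are illustrative, not real output):

```python
import json

# Illustrative lines as the JSON renderer would emit them
raw_logs = [
    '{"event": "request_started", "request_id": "a1b2c3d4", "path": "/orders"}',
    '{"event": "request_started", "request_id": "e5f6a7b8", "path": "/health"}',
    '{"event": "request_completed", "request_id": "a1b2c3d4", "status": 502}',
]

def logs_for_request(lines, request_id):
    """Parse each line and keep only the entries for one request."""
    return [
        entry
        for line in lines
        if (entry := json.loads(line)).get("request_id") == request_id
    ]

entries = logs_for_request(raw_logs, "a1b2c3d4")
```

The same filter works in any log backend that understands JSON fields; the Python version is just the idea in miniature.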

Pillar 2: Distributed Tracing

Traces show you the full journey of a request — which services it touched, how long each step took, and where the time went.

from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Auto-instrument frameworks (each instrumentor ships as its own
# opentelemetry-instrumentation-* package)
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=engine)  # your SQLAlchemy engine

# Custom spans for business logic
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order_id)
        
        with tracer.start_as_current_span("charge_payment"):
            await charge_payment(order_id)
        
        with tracer.start_as_current_span("send_confirmation"):
            await send_email(order_id)

Each span shows its duration, parent span, and custom attributes. View them in Jaeger, Grafana Tempo, or any OpenTelemetry-compatible backend.

Pillar 3: Application Metrics

Metrics answer aggregate questions: How many requests per second? What is the p99 latency? How many database connections are in use?

from prometheus_client import Counter, Histogram, Gauge

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Current database connection pool size",
    ["state"]  # active, idle, waiting
)

import time

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    
    # NOTE: raw paths with IDs (/orders/123) inflate label cardinality;
    # normalize to route templates (/orders/{id}) in production
    endpoint = request.url.path
    REQUEST_COUNT.labels(request.method, endpoint, response.status_code).inc()
    REQUEST_LATENCY.labels(request.method, endpoint).observe(duration)
    
    return response

Implementation

Connecting the Three Pillars

The real power comes from correlating logs, traces, and metrics. Include the trace ID in every log entry:

def add_trace_context(logger, method_name, event_dict):
    """structlog processor: stamp the active trace/span IDs onto each entry."""
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

# Register it in structlog.configure(processors=[...]), before the renderer

Now when a metric spikes, you find the trace. The trace shows which span is slow. The logs for that span show exactly what went wrong.

What to Alert On

Do not alert on everything. Alert on symptoms, not causes:

  • Error rate > 1% over 5 minutes
  • p99 latency > 2x the baseline
  • Database connection pool utilization > 80%
  • Queue depth growing faster than drain rate

Challenges

Log volume: Structured logging generates more data. Use sampling — log 100% of errors, 10% of successful requests. Traces can be sampled similarly.

Cardinality explosion: Do not use user IDs or request IDs as metric labels. Prometheus (and most TSDB systems) struggle with high-cardinality labels. Keep labels low-cardinality.

Conclusion

Start with structured logging — it has the lowest cost and highest immediate value. Add tracing when you need to debug cross-service latency. Add metrics when you need dashboards and alerts. The goal is not to collect all possible data, but to be able to answer 'why is this slow?' within 5 minutes.