Debugging Hard Production Bugs: Techniques Used by Senior Engineers
The hardest bugs do not happen in development. They happen in production, under load, intermittently, and without stack traces. The log says "error," the user says "it does not work," and you have no way to reproduce it locally.
Senior engineers do not debug these faster because they are smarter. They debug faster because they have a systematic process. This guide covers that process.
Problem
Hard production bugs share common characteristics:
- They cannot be reproduced locally
- They happen intermittently or under specific conditions
- Logs are insufficient or misleading
- Multiple systems interact to cause the failure
- The bug was introduced weeks or months ago
Step 1: Gather Evidence Before Guessing
Most engineers start debugging by guessing. Seniors start by collecting data:
Evidence checklist:
□ When did the bug first appear? (check deploy history)
□ What changed? (git log --since="2 days ago" --oneline)
□ Who is affected? (all users? specific region? specific plan?)
□ What is the error rate? (1%? 50%? 100%?)
□ What do the logs show at the exact timestamp?
□ What do monitoring dashboards show?
The answer to "when did it start" usually narrows the search space more than any other question, because it ties the failure to a specific deploy, config change, or traffic shift.
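One way to answer "when did it start" from raw data is to bucket requests by time and line the first bad bucket up against the deploy history. The sketch below assumes you can export requests as (timestamp, succeeded) pairs with timezone-aware timestamps. The record shape, the function names, and the 5% threshold are illustrative assumptions, not the output of any particular logging tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def error_rate_by_bucket(records, bucket=timedelta(minutes=5)):
    """Return {bucket_start: error_rate} for (timestamp, succeeded) records.

    Timestamps are assumed to be timezone-aware.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    bucket_seconds = int(bucket.total_seconds())
    for ts, succeeded in records:
        # Round the timestamp down to the start of its bucket.
        epoch = int(ts.timestamp())
        start = datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)
        totals[start] += 1
        if not succeeded:
            errors[start] += 1
    return {start: errors[start] / totals[start] for start in sorted(totals)}


def first_bad_bucket(rates, threshold=0.05):
    """Return the earliest bucket whose error rate exceeds the threshold, or None."""
    for start, rate in rates.items():
        if rate > threshold:
            return start
    return None


def last_deploy_before(deploys, moment):
    """Return the most recent deploy at or before `moment`, or None."""
    earlier = [d for d in deploys if d <= moment]
    return max(earlier) if earlier else None
```

Feeding the first bad bucket into last_deploy_before gives a first suspect to investigate. It does not prove that deploy is the cause.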
Step 2: Reproduce With Constraints
If you cannot reproduce locally, reproduce the conditions:
# Reproduce timing issues with artificial latency
import asyncio
import random
async def fetch_with_simulated_latency(url: str):
    # http_client stands for the application's own async HTTP client
    await asyncio.sleep(random.uniform(0.1, 2.0))  # Simulate network jitter
    return await http_client.get(url)
# Reproduce concurrency issues with parallel requests
# create_order and random_user stand for the application's own helpers
async def stress_test():
    tasks = [create_order(random_user()) for _ in range(100)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    errors = [r for r in results if isinstance(r, Exception)]
    print(f"{len(errors)} failures out of {len(results)}")
Step 3: Binary Search Through Time
If you know a commit or release where the behavior was still correct, use git bisect to find the commit that broke it:
git bisect start
git bisect bad # Current commit is broken
git bisect good v1.2.0 # This release was working
# Git checks out a middle commit
# Test it, then:
git bisect good # or
git bisect bad
# Repeat until Git reports the first bad commit, then:
git bisect reset # Return to the commit you started from
This finds the offending commit in O(log n) steps. For 1,000 commits, that is about 10 tests.
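To see where the "about 10 tests" figure comes from, here is a minimal sketch of the halving idea in Python. It is not how Git implements bisect (Git also has to handle branching history). It assumes a linear list of commits ordered oldest to newest, a known-good first entry, a known-bad last entry, and a bug that stays present once introduced. The function name and the is_bad callback are hypothetical.

```python
def find_first_bad(commits, is_bad):
    """Return (first_bad_commit, tests_run).

    Assumes commits[0] is known good, commits[-1] is known bad,
    and every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1  # Indices of the known-good and known-bad ends
    tests_run = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests_run += 1
        if is_bad(commits[mid]):
            bad = mid   # The first bad commit is at mid or earlier
        else:
            good = mid  # The first bad commit is after mid
    return commits[bad], tests_run
```

Each iteration halves the gap between the last known-good and first known-bad commit, so a gap of 1,000 commits closes in at most 10 tests.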
Step 4: Add Observability Retroactively
When existing logs are insufficient, add structured logging to the suspicious area:
import structlog
logger = structlog.get_logger()
async def process_payment(order_id: str, amount: float):
    logger.info("payment_started", order_id=order_id, amount=amount)
    try:
        result = await payment_gateway.charge(amount)
        logger.info(
            "payment_succeeded",
            order_id=order_id,
            transaction_id=result.id,
            latency_ms=result.latency,
        )
        return result
    except TimeoutError:
        logger.error(
            "payment_timeout",
            order_id=order_id,
            gateway_host=payment_gateway.host,
        )
        raise
    except Exception as e:
        logger.error(
            "payment_failed",
            order_id=order_id,
            error_type=type(e).__name__,
            error_msg=str(e),
        )
        raise
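Structured logs with context fields can be filtered and aggregated by field, for example every event for one order_id. String-interpolated logs can only be searched as free text, and those searches break whenever the message wording changes.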
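To make the difference concrete, the sketch below uses only the standard library: it renders events as JSON lines and then filters them by field. The log_event and filter_events helpers, the order IDs, and the gateway host are hypothetical stand-ins. In practice a structured logger and your log search tool play these roles.

```python
import json


def log_event(event, **fields):
    """Render one log event as a JSON line (a stand-in for a structured logger)."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


def filter_events(lines, **criteria):
    """Return parsed events whose fields match every key/value in criteria."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches


lines = [
    log_event("payment_started", order_id="ord_1", amount=20.0),
    log_event("payment_started", order_id="ord_2", amount=35.5),
    log_event("payment_timeout", order_id="ord_2", gateway_host="gw.example.internal"),
    log_event("payment_succeeded", order_id="ord_1", transaction_id="txn_9"),
]

# Everything that happened to one order, regardless of message wording
for record in filter_events(lines, order_id="ord_2"):
    print(record["event"])
```

The same query against free-text messages would need a pattern per message format.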
Step 5: Narrow the Blast Radius
Isolate the problem by eliminating components:
System: Frontend → API Gateway → Auth Service → Database
Test: Can you reproduce with a direct API call (skip frontend)?
  Yes → Problem is in the API or downstream
  No → Problem is in the frontend
Test: Can you reproduce with a different auth token?
  Yes → Problem is not auth-related
  No → Problem is tied to the original token or user; look at the Auth Service
Test: Can you reproduce with a direct database query?
  Yes → Problem is in the data
  No → Problem is in the application logic
Each test rules out part of the system. For this four-component chain, three tests are enough to point at a single component.
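The elimination tests above can be written down as a small decision function, which makes it harder to skip a step or misread a result. This is a sketch of this example's four-component chain only. The function name and the boolean parameters are hypothetical, and a real system will need its own tests.

```python
def locate_fault(fails_via_direct_api, fails_with_other_token, fails_via_direct_query):
    """Map the three elimination results to the most likely faulty component.

    Each argument is True if the bug still reproduces under that test.
    """
    if not fails_via_direct_api:
        return "frontend"           # Bypassing the frontend made the bug disappear
    if not fails_with_other_token:
        return "auth service"       # The bug follows the original token or user
    if fails_via_direct_query:
        return "data"               # The bad result is already in the database
    return "application logic"      # Data is fine, so the code path is suspect
```

For example, locate_fault(True, True, False) returns "application logic": the bug survives without the frontend and with another token, but the data itself looks correct.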
Step 6: The Rubber Duck Technique (For Real)
Explain the bug out loud. The listener does not have to be a person: a notepad, a duck, anything works. The act of articulating the problem forces you to examine your assumptions.
Write down:
- What should happen
- What actually happens
- What is different between those two states
The bug is in the difference.
Patterns Senior Engineers Recognize
Race Conditions
Two operations that work individually but fail when concurrent:
# BUG: Two requests read balance, both subtract, one write is lost
balance = await get_balance(user_id) # Both read 100
new_balance = balance - amount # Both compute 80
await set_balance(user_id, new_balance) # Second write overwrites first
# FIX: Use database-level atomicity
await pool.execute(
    "UPDATE accounts SET balance = balance - $1 WHERE user_id = $2",
    amount, user_id,
)
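The lost update can be reproduced deterministically without a database. The sketch below is an in-memory analogue under stated assumptions: a dict stands in for the accounts table, asyncio.sleep(0) stands in for the round trip between read and write, and an asyncio.Lock stands in for the atomicity the database gives you in the fix above. The lock is an illustration of serializing the read-modify-write, not a replacement for the database-level fix, and all names here are hypothetical.

```python
import asyncio


async def withdraw_racy(store, lock, user_id, amount):
    # The lock is accepted but deliberately unused, so both helpers share a signature.
    balance = store[user_id]            # Read
    await asyncio.sleep(0)              # Yield, as a real round trip would
    store[user_id] = balance - amount   # Write based on a possibly stale read


async def withdraw_locked(store, lock, user_id, amount):
    async with lock:                    # Only one read-modify-write at a time
        balance = store[user_id]
        await asyncio.sleep(0)
        store[user_id] = balance - amount


async def run_concurrently(withdraw, n=2, amount=20):
    """Run n concurrent withdrawals against a fresh balance of 100."""
    store = {"alice": 100}
    lock = asyncio.Lock()
    await asyncio.gather(*(withdraw(store, lock, "alice", amount) for _ in range(n)))
    return store["alice"]
```

With two withdrawals of 20, the racy version ends at 80 because both tasks read 100 before either writes. The locked version ends at 60.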
Off-By-One in Pagination
# BUG: Skips one item between pages
page_1 = items[0:20] # Items 0-19
page_2 = items[21:40] # Skips item 20
# FIX
page_1 = items[0:20] # Items 0-19
page_2 = items[20:40] # Items 20-39
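Hand-written slice bounds are where this bug creeps in, so one option is to compute them in a single place. A minimal sketch, assuming zero-indexed page numbers and a default page size of 20 (the paginate helper is hypothetical):

```python
def paginate(items, page, page_size=20):
    """Return the zero-indexed page of items; an out-of-range page is empty."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size                  # First index of this page
    return items[start:start + page_size]     # Slice end is exclusive
```

Because each page starts exactly where the previous one ended, concatenating all pages gives back the original list with no gaps or duplicates.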
Timezone Confusion
# BUG: Comparing timezone-naive and timezone-aware datetimes
from datetime import datetime, timezone
now = datetime.now() # Naive (no timezone)
stored = datetime.now(timezone.utc) # Aware (UTC)
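# now == stored → always False; now < stored or now - stored → TypeError
# FIX: Use timezone-aware datetimes on both sides
now = datetime.now(timezone.utc)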
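The failure mode differs by operator, which is part of why this bug hides. The sketch below uses fixed datetimes so the behavior is repeatable. It also includes a hypothetical to_utc helper, which assumes that naive values are already in UTC. That assumption needs checking against where your data actually comes from.

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 1, 12, 0)                        # No tzinfo
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)   # Explicit UTC

# Equality between naive and aware values is simply False, which hides the bug.
print(naive == aware)

# Ordering comparisons fail loudly instead.
try:
    naive < aware
except TypeError as exc:
    print(f"TypeError: {exc}")


def to_utc(value, assume_tz=timezone.utc):
    """Return an aware UTC datetime; naive input is assumed to be in assume_tz."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=assume_tz)
    return value.astimezone(timezone.utc)
```

Normalizing every datetime through one helper at the system boundary means later comparisons only ever see aware UTC values.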
Common Mistakes
Mistake 1: Debugging by changing code randomly. Every change should test a specific hypothesis. If it does not, revert it.
Mistake 2: Not checking what changed. The first question should always be: what was deployed recently?
Mistake 3: Debugging alone for too long. If you have been stuck for more than an hour, explain the problem to someone else. A fresh perspective finds the assumption you missed.
Takeaways
Hard bugs yield to systematic investigation. Collect evidence before guessing. Use git bisect to find when it broke. Add structured logging for the next time. Eliminate components to narrow the search. The debugging skills that separate senior engineers from junior ones are less about technical knowledge than about discipline and process.