
SKILL.md

name: fail-fast-no-hedging
description: Eliminate component hedging anti-patterns that mask infrastructure failures. Build systems that fail loudly when broken instead of limping along in degraded states. Critical for production reliability and operational visibility.

Fail-Fast Engineering: Architectural Honesty Over Silent Degradation

🚨 Core Principle

Hedging is treating required infrastructure as optional through defensive try/except blocks that mask failures as normal operation.

The Critical Question: Is this component required for the system to work correctly, or is it a genuine optional enhancement?

When required infrastructure fails, the system MUST fail loudly. When optional features fail, logging and continuing may be appropriate.
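
For example, a minimal sketch of the two cases (the `db`/`cache` handles and `UserNotFoundError` are illustrative placeholders, not from any particular codebase):

import logging

logger = logging.getLogger(__name__)

class UserNotFoundError(Exception):
    """Legitimate domain outcome - distinct from the database being down."""

# Required infrastructure: let failures propagate loudly.
def load_user(db, user_id: str) -> dict:
    row = db.fetch_user(user_id)  # Raises ConnectionError/DatabaseError if the DB is down
    if row is None:
        raise UserNotFoundError(user_id)
    return row

# Genuine optional enhancement: logging and continuing is acceptable.
def warm_cache(cache, user_id: str, row: dict) -> None:
    try:
        cache.set(f"user:{user_id}", row)
    except ConnectionError as e:
        logger.warning(f"Cache warm failed (non-critical): {e}")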

📋 Assessment Output Format (REQUIRED)

When analyzing a codebase, conclude with this structured list of all hedging issues found:

## Hedging Anti-Patterns Found in [directory/module name]

**N files with component hedging violations:**

1. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails

2. **filename.py** (SEVERITY)
   - Location: `function_name()` lines X-Y
   - Issue: Brief description of what's wrong
   - Hedging: Explanation of how it's treating required infrastructure as optional
   - Impact: What happens in production when infrastructure fails

Example (from lt_memory/ audit):

## Hedging Anti-Patterns Found in lt_memory/

**3 files with component hedging violations:**

1. **db_access.py** (HIGH SEVERITY)
   - Location: `get_or_create_entity()` lines 916-933
   - Issue: Returns `None` when database INSERT...RETURNING fails, masking query failure as "entity not found"
   - Hedging: Docstring declares return type as `Entity`, implementation silently returns `None` on infrastructure failure
   - Impact: Caller cannot distinguish between "entity doesn't exist" (legitimate) vs "database query failed" (infrastructure down)

2. **extraction.py** (HIGH SEVERITY)
   - Location: `_parse_extraction_response()` lines 396-397 and 430
   - Issue: Returns `[]` (empty list) when JSON parsing fails, instead of raising as docstring declares
   - Hedging: Docstring says "Raises: ValueError" but code returns `[]` on parse failures
   - Impact: Caller sees empty list and treats it as "no memories extracted" when actually LLM returned invalid JSON

This format makes all issues immediately scannable before remediation work begins.


🎯 Quick-Start Guide

When analyzing a new codebase:

  1. Identify Infrastructure Dependencies

    • Database connections
    • Cache/session stores
    • External APIs
    • Message queues
    • File systems
  2. Apply Three Diagnostic Tests

    • Semantic Distinction Test: Can you distinguish "no data" from "infrastructure down"?
    • Never Executes Test: Will this fallback realistically run during normal operation?
    • Contract Match Test: Does behavior match the docstring/type hints?
  3. Look for These Red Flags

    except Exception: return []     # Infrastructure failure → empty data
    except Exception: return False  # Connection error → "not allowed"
    except Exception: return None   # Database down → "not found"
    

🌍 Real-World Patterns from Production Codebases

Based on systematic removal of 40+ hedging anti-patterns across production systems, these specific patterns emerge repeatedly:

1. Infrastructure Failures Converted to Client Errors (CRITICAL)

Pattern: Catching database/service failures and raising ValidationError (400) instead of letting them propagate as 500s.

# REAL EXAMPLE: CNS API Layer
def execute_action(self, action: str, data: Dict) -> Dict:
    try:
        session_manager = get_shared_session_manager()
        lt_db = LTMemoryDB(session_manager)
    except Exception as e:
        # Database down converted to "your input is invalid"!
        if "connection" in str(e) or "database" in str(e):
            raise ValidationError(f"Database connection failed: {e}")

Impact: Users see "Bad Request" when database is down. Monitoring doesn't alert (watches 500s, not 400s). Operators think users are sending bad data while infrastructure burns.

Fix: Remove business-layer exception translation. Let infrastructure exceptions bubble to API boundary where proper HTTP status translation happens.
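
A minimal sketch of that fix, reusing the names from the example above (the `_dispatch` helper, `handler`, and the endpoint wiring are illustrative):

# Business layer: no translation - infrastructure failures propagate as-is
def execute_action(self, action: str, data: Dict) -> Dict:
    session_manager = get_shared_session_manager()  # Raises if the database is unreachable
    lt_db = LTMemoryDB(session_manager)
    return self._dispatch(action, data, lt_db)

# API boundary: the one place exceptions become HTTP status codes
@router.post("/actions")
def actions_endpoint(request: ActionRequest):
    try:
        return handler.execute_action(request.action, request.data)
    except ValidationError as e:
        return JSONResponse(status_code=400, content={"error": str(e)})  # Genuine client error
    except Exception as e:
        logger.error(f"Actions endpoint error: {e}", exc_info=True)
        return JSONResponse(status_code=500, content={"error": "Internal error"})  # Infrastructure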

2. The Availability Flag Plague (PERVASIVE)

Pattern: Setting self.component_available = False during init, then checking it hundreds of times throughout the codebase.

# REAL EXAMPLE: ValkeyClient with 200+ defensive checks
class ValkeyClient:
    def __init__(self):
        try:
            self._init_connections()
            self.valkey_available = True
        except Exception:
            self.valkey_available = False

    def get(self, key: str) -> Optional[Any]:
        if not self.valkey_available:  # One of 200+ checks!
            return None
        return self.valkey.get(key)

Impact:

  • Every operation has defensive check overhead
  • Infrastructure failures silently masked as "feature disabled"
  • Dead code paths when component is actually required
  • False sense of "graceful degradation"

Fix: Remove availability tracking entirely. If component is required, fail at initialization. The "graceful degradation" never actually helps - it just delays the inevitable failure.
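
A minimal sketch of the fail-fast version (same `ValkeyClient` shape as above; `_init_connections()` is assumed to raise on connection failure and to set `self.valkey`):

from typing import Any, Optional

class ValkeyClient:
    def __init__(self):
        # No availability flag: if Valkey is required, a broken connection
        # stops the process at startup instead of degrading 200+ call sites.
        self._init_connections()  # Raises ConnectionError if Valkey is unreachable

    def get(self, key: str) -> Optional[Any]:
        return self.valkey.get(key)  # None only means "key not present"; outages raise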

3. Silent Success Claims on Failure (DECEPTIVE)

Pattern: Returning {"success": True, "value": None} when infrastructure fails.

# REAL EXAMPLE: Calendar configuration endpoint
def get_calendar_config(self, user_id: str) -> Dict:
    try:
        config = credential_service.get_credential(user_id, "calendar_url")
        return {"success": True, "calendar_url": config}
    except Exception:
        # Vault down? Claim success anyway!
        return {"success": True, "calendar_url": None, "message": "Not configured"}

Impact: Client cannot distinguish "user hasn't configured calendar" from "credential service is down". UI shows "not configured" while Vault burns.
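
A minimal sketch of the honest version; it assumes the credential service raises a specific `CredentialNotFoundError` for a genuinely missing entry (that exception name is illustrative), while vault/connection failures propagate:

def get_calendar_config(self, user_id: str) -> Dict:
    try:
        config = credential_service.get_credential(user_id, "calendar_url")
    except CredentialNotFoundError:
        # Legitimate "user never configured a calendar" state
        return {"success": True, "calendar_url": None, "message": "Not configured"}
    # Vault down / connection errors propagate and surface as 500s at the API boundary
    return {"success": True, "calendar_url": config}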

4. Job Registration Boolean Swallowing (TIME BOMB)

Pattern: Critical scheduled jobs return False on registration failure instead of raising.

# REAL EXAMPLE: Token renewal that fails silently
def register_jobs(self) -> bool:
    try:
        scheduler = get_scheduler()
        success = scheduler.register_job(
            job_id="vault-token-renewer",
            func=self.renew_token,
            trigger="interval",
            days=7
        )
        return success  # False if registration failed
    except Exception:
        return False  # Import error? Scheduler down? Who knows!

Impact: Token renewal silently fails to register. System runs fine for 32 days, then authentication mysteriously breaks when token expires. No errors logged at startup.
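
A minimal sketch of fail-fast registration (same scheduler calls as the example above):

def register_jobs(self) -> None:
    scheduler = get_scheduler()  # Import/infrastructure errors propagate at startup
    registered = scheduler.register_job(
        job_id="vault-token-renewer",
        func=self.renew_token,
        trigger="interval",
        days=7,
    )
    if not registered:
        raise RuntimeError(
            "Failed to register vault-token-renewer - token renewal will not run"
        )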

5. Lazy Loading Creating False Optionality

Pattern: Deferring initialization with lazy loading, making required components appear optional.

# REAL EXAMPLE: Email service that's actually required
class Service:
    def __init__(self):
        self._email_service = None  # Lazy loaded

    def send_welcome_email(self, user_id: str):
        if hasattr(self, '_email_service') and self._email_service:
            self._email_service.send(...)  # Silently skip if not loaded!

Impact: Required functionality (welcome emails) silently skipped when service fails to initialize. Users don't get onboarding emails, no errors logged.
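
A minimal sketch of the eager version (assuming `EmailService()` raises if its transport cannot be initialized; the `send(...)` signature is illustrative):

class Service:
    def __init__(self):
        self._email_service = EmailService()  # Required: fail at startup, not at send time

    def send_welcome_email(self, user_id: str) -> None:
        self._email_service.send(user_id, template="welcome")  # No existence checks needed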

6. Multi-Layer Exception Masking (DIAGNOSTIC HELL)

Pattern: Base class catches exceptions, subclass adds another layer, framework adds third layer.

# REAL EXAMPLE: Working memory trinket system
# Layer 1: Base trinket class
class BaseTrinket:
    def handle_update_request(self):
        try:
            return self._generate_content()
        except Exception:
            return None  # Mask all errors

# Layer 2: Event handler
def _handle_update_trinket(self, event):
    try:
        content = trinket.handle_update_request()
    except Exception as e:
        logger.warning(f"Trinket failed: {e}")  # Second mask

# Layer 3: Individual trinket
class ReminderTrinket(BaseTrinket):
    def _generate_content(self):
        try:
            reminders = self.get_reminders()
        except Exception:
            return []  # Third mask!

Impact: Database failure → returns [] → caught and returns None → caught and logged as warning. Original DatabaseError completely lost. Operators see "Trinket failed: 'NoneType' has no attribute 'format'" instead of "Database connection lost".
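
A minimal sketch of collapsing those layers so the original exception survives (same trinket names as above); only the event handler logs, with the full traceback, and it re-raises so the failure stays visible upstream:

# Layer 1: Base trinket class - no blanket mask
class BaseTrinket:
    def handle_update_request(self):
        return self._generate_content()  # DatabaseError propagates intact

# Layer 2: Event handler - the single place that logs, then re-raises
def _handle_update_trinket(self, event):
    try:
        content = trinket.handle_update_request()
    except Exception:
        logger.error("Trinket update failed", exc_info=True)  # Original traceback preserved
        raise

# Layer 3: Individual trinket - no mask
class ReminderTrinket(BaseTrinket):
    def _generate_content(self):
        return self.get_reminders()  # Raises on database failure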

7. Data Corruption Hidden as Empty State

Pattern: JSON decode errors returning None instead of raising.

# REAL EXAMPLE: Database JSON columns
def fetch_as_dict(self, query: str) -> Optional[Dict]:
    result = self.execute(query)
    if result and result[0]:
        try:
            return json.loads(result[0])
        except json.JSONDecodeError:
            logger.warning("Failed to parse JSON")
            return None  # Corrupted data = no data!

Impact: Data corruption indistinguishable from NULL values. Corrupted user preferences silently become "no preferences". Data quality issues invisible until user complaints.
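
A minimal sketch that surfaces corruption instead of hiding it; `DataCorruptionError` is an illustrative domain exception, not from any specific library:

import json
from typing import Dict, Optional

class DataCorruptionError(Exception):
    """Stored value exists but cannot be parsed - a data-quality failure, not 'no data'."""

def fetch_as_dict(self, query: str) -> Optional[Dict]:
    result = self.execute(query)
    if not result or not result[0]:
        return None  # Legitimately no data (NULL column)
    try:
        return json.loads(result[0])
    except json.JSONDecodeError as e:
        raise DataCorruptionError(f"Corrupted JSON returned by query: {query}") from e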

8. The hasattr() Vestigial Pattern

Pattern: Checking for attributes that are never set anywhere in the codebase.

# REAL EXAMPLE: Cache updater that doesn't exist
def _update_cache_if_pooled(self) -> None:
    if hasattr(self, '_cache_updater') and self._cache_updater:
        # This attribute is NEVER set anywhere!
        try:
            self._cache_updater(self.user_id, self._message_cache)
        except Exception:
            logger.warning("Cache update failed")

Impact: Dead code creating false architectural impressions. Future developers think "cache updates must be optional" when they're actually handled elsewhere entirely.

9. Partial Success Masquerading as Complete Success

Pattern: Continuing after required sub-operations fail.

# REAL EXAMPLE: User creation with broken setup
def create_user(email: str) -> str:
    user_id = db.insert_user(email)

    try:
        initialize_user_preferences(user_id)
        create_welcome_reminders(user_id)
        send_welcome_email(user_id)
    except Exception as e:
        logger.error(f"User setup failed: {e}")
        # Continue - user exists but broken!

    return user_id  # "Success!"

Impact: Users created in broken state. Can log in but missing preferences, reminders, welcome email. Support tickets: "Features don't work for some users."
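
A minimal sketch of the all-or-nothing version, assuming the database layer exposes a transaction context manager (the `db.transaction()` name is illustrative):

def create_user(email: str) -> str:
    with db.transaction():  # Any failed required step rolls the whole creation back
        user_id = db.insert_user(email)
        initialize_user_preferences(user_id)
        create_welcome_reminders(user_id)
    send_welcome_email(user_id)  # Raises loudly if the email service is down
    return user_id  # Only reached when setup completed

Whether the welcome email is required (raise) or a genuine optional enhancement (log and continue) is exactly the architectural decision this skill asks you to make explicitly.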

10. Rate Limiter Fail-Closed Masquerade

Pattern: Security-motivated hedging that creates operational blindness.

# REAL EXAMPLE: Rate limiter "failing closed"
def is_allowed(self, identifier: str) -> Tuple[bool, int]:
    try:
        count = valkey.increment(f"rate:{identifier}")
        return count <= self.limit, 0
    except Exception:
        # "Fail closed for security" - but identical to rate limited!
        return False, self.window_seconds

Impact: Valkey outage = all users rate limited. Support flooded with "I can't log in" while ops thinks "high traffic day". Security theater creating availability problems.
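
A minimal sketch of the transparent version (same `valkey.increment` call as above):

def is_allowed(self, identifier: str) -> Tuple[bool, int]:
    count = valkey.increment(f"rate:{identifier}")  # ConnectionError propagates if Valkey is down
    if count <= self.limit:
        return True, 0
    return False, self.window_seconds  # False now means exactly one thing: rate limited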

📐 Pattern Library

Pattern 1: Infrastructure Masquerading

# BAD: Hedging
def operation([PARAMS]) -> [RETURN_TYPE]:
    try:
        result = [REQUIRED_INFRASTRUCTURE].query()
        return result
    except Exception:
        return [SAFE_DEFAULT]  # Infrastructure failure looks like [NORMAL_STATE]

# GOOD: Fail-fast
def operation([PARAMS]) -> [RETURN_TYPE]:
    result = [REQUIRED_INFRASTRUCTURE].query()  # Raises on infrastructure failure
    return result  # Only returns actual data or legitimate empty state

Key Question: Can the caller distinguish between "[NO_DATA]" and "[INFRASTRUCTURE_DOWN]"?

Examples of [SAFE_DEFAULT]: [], None, False, 0, {}
Examples of [REQUIRED_INFRASTRUCTURE]: database, cache, auth_service, message_queue

Pattern 2: False Optionality

# BAD: Lazy loading creates a false impression of optionality
class Service:
    def __init__(self):
        self._component = None  # Lazy loaded

    @property
    def component(self):
        if self._component is None:
            self._component = initialize_component()
        return self._component

    def operation(self):
        if hasattr(self, '_component') and self._component:  # Defensive checks
            return self._component.do_work()
        return [DEGRADED_RESULT]

# GOOD: Required component initialized eagerly
class Service:
    def __init__(self):
        self.component = initialize_component()  # Fail at startup if broken

    def operation(self):
        return self.component.do_work()  # No defensive checks needed

Key Question: Is the component truly optional, or are we just afraid of startup failures?

Pattern 3: Partial Success Masking

# BAD: Partial success returned as complete
def create_resource([PARAMS]) -> [ID_TYPE]:
    resource_id = database.insert([DATA])

    try:
        initialize_related([resource_id])  # Required setup
    except Exception:
        logger.error("Setup failed")
        # Continue - resource exists but broken

    return resource_id  # Partial success returned as complete

# GOOD: Setup failure aborts creation
def create_resource([PARAMS]) -> [ID_TYPE]:
    resource_id = database.insert([DATA])
    initialize_related([resource_id])  # Fails → transaction rolls back
    return resource_id  # Only returns if fully initialized

Key Question: Are we creating broken/incomplete entities that will fail later?

Pattern 4: Safe Defaults on Error

# BAD: Safe defaults mask service failures
def get_metric([PARAMS]) -> [NUMERIC_TYPE]:
    try:
        result = metrics_service.query([PARAMS])
        return result.value
    except Exception:
        return 0  # Service down? Return zero!

def get_config([PARAMS]) -> Dict:
    try:
        return config_service.fetch([KEY])
    except Exception:
        return {"success": True, "value": None}  # Claim success on failure!

# GOOD: Let service failures raise
def get_metric([PARAMS]) -> [NUMERIC_TYPE]:
    result = metrics_service.query([PARAMS])  # Raises on service failure
    return result.value  # Zero only when metric genuinely is zero

def get_config([PARAMS]) -> Dict:
    config = config_service.fetch([KEY])  # Raises on service failure
    return {"success": True, "value": config}

Key Question: Does the "safe" default hide infrastructure problems from operators?

Pattern 5: The "Never Executes" Fallback

# BAD: Fallback that never executes in correct operation
def get_or_create_resource([PARAMS]) -> [RESOURCE_TYPE]:
    result = database.insert_returning([DATA])

    if result:
        return [RESOURCE_TYPE](**result)

    # This fallback will NEVER execute in correct operation
    # INSERT...RETURNING should always return a result
    # If it doesn't, infrastructure is broken
    fetch_query = "SELECT * FROM [TABLE] WHERE ..."
    existing = database.query(fetch_query)
    return [RESOURCE_TYPE](**existing) if existing else None

# GOOD: Treat the impossible case as the infrastructure failure it is
def get_or_create_resource([PARAMS]) -> [RESOURCE_TYPE]:
    result = database.insert_returning([DATA])

    if not result:
        raise RuntimeError(
            "INSERT...RETURNING failed - database operation broken, "
            "not a recoverable race condition"
        )

    return [RESOURCE_TYPE](**result)

Critical Question: "Will this fallback code realistically execute when the system is operating correctly?"

  • YES → Legitimate defensive programming (real race conditions)
  • NO → Hedging that masks the real error

🔍 Analysis Protocol

Step-by-Step Failure Analysis

For each try/except block, perform this systematic analysis:

FAILURE SCENARIO ANALYSIS
=========================

1. What operations are in the try block?
   - Database calls?
   - External API calls?
   - Cache/session operations?

2. What can fail, and why?
   - Network timeout
   - Service unavailable
   - Invalid data format
   - Resource exhausted

3. Are these failures EXPECTED or UNEXPECTED?
   - Expected: Retry logic, validation errors, race conditions
   - Unexpected: Infrastructure down, OOM, disk full

4. What does the except block do?
   - Return safe default (None, [], False, 0)?
   - Log and re-raise?
   - Convert to different exception?
   - Continue with degraded state?

5. Can the caller handle this error?
   - Yes: Specific exception type (UserNotFoundError)
   - No: Infrastructure failure (DatabaseError, ConnectionError)

Decision Framework

For each try/except block, ask:

  1. Is this infrastructure required?

    • Required → Must fail-fast
    • Optional → May catch and continue
  2. What type of operation?

    • Synchronous request → Fail-fast on infrastructure errors
    • Async event handler → May log and retry
    • Background job → Consider retry strategy
  3. What's the failure mode?

    • Infrastructure down → Propagate immediately
    • Transient network → Retry with backoff
    • Invalid input → Return error response
  4. What does the catch block do?

    • Returns safe default → Likely hedging
    • Adds context and re-raises → Appropriate
    • Logs and continues → Check if truly optional

Severity Classification

HIGH Severity

  • Silent data corruption (partial user creation, orphaned records)
  • Security controls disabled on infrastructure failure
  • Core business operations succeed in broken states
  • Job registration failures that manifest weeks/months later
  • User creation succeeds but required setup fails

MEDIUM Severity

  • Infrastructure failures indistinguishable from normal operation
  • Authentication/authorization degrades silently
  • Resource management (locks, sessions) masks failures
  • Rate limiting returns "limited" when infrastructure down
  • Configuration failures return success with null values

LOW Severity

  • Unnecessary exception re-wrapping
  • Logging before re-raising
  • Overly broad exception handling with proper re-raise
  • JSON parsing errors masked as "not found" (when field is optional)

✅ When Try/Except IS Appropriate

Rule 1: Translating Across Abstraction Boundaries

# GOOD: Database layer translates SQLAlchemy exceptions to domain exceptions
def get_user(user_id: str) -> User:
    try:
        return session.query(User).filter_by(id=user_id).one()
    except NoResultFound:
        raise UserNotFoundError(f"User {user_id} not found")
    except SQLAlchemyError as e:
        raise DatabaseError(f"Database query failed: {e}")

# GOOD: API endpoint translates domain exceptions to HTTP status codes
@router.post("/actions")
def actions_endpoint(request_data: ActionRequest):
    try:
        response = handler.handle_request(request_data)
        return response.to_dict()
    except ValidationError as e:
        return JSONResponse(status_code=400, ...)  # Client error
    except Exception as e:
        logger.error(f"Actions endpoint error: {e}", exc_info=True)
        return JSONResponse(status_code=500, ...)  # Infrastructure error

# BAD: Business logic converting infrastructure failures to client errors
def execute_action(self, action: str) -> Dict:
    try:
        session_manager = get_session_manager()
        lt_db = LTMemoryDB(session_manager)
    except Exception as e:
        # WRONG: Makes infrastructure failure (500) look like client error (400)
        raise ValidationError(f"Database connection failed: {e}")

Why: Infrastructure exception handling belongs at architectural boundaries (API endpoints), not in business logic. Converting DatabaseError → ValidationError makes 500s appear as 400s.

Rule 2: Recovering from Expected, Recoverable Errors

# GOOD: Retry on transient failures with backoff
def fetch_with_retry(url: str, attempts: int = 3) -> Response:
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=5)
        except requests.Timeout:
            if attempt == attempts - 1:
                raise  # Final attempt failed
            logger.warning(f"Attempt {attempt+1} timed out, retrying...")
            time.sleep(2 ** attempt)  # Exponential backoff

Why: Transient network failures are expected and retrying can succeed. Each retry is logged for visibility.

Rule 3: Adding Business Context Not in Stack Trace

# GOOD: Add context that's not obvious from code
def process_batch(batch_id: str, items: List[Item]):
    try:
        for i, item in enumerate(items):
            process_item(item)
    except ProcessingError as e:
        # Stack trace shows WHERE but not WHICH batch/item
        logger.error(
            f"Batch {batch_id} failed at item {i+1}/{len(items)} "
            f"(item_id={item.id}): {e}"
        )
        raise ProcessingError(f"Batch {batch_id} processing failed") from e

Why: Adds business context (batch ID, item position) that's not in the stack trace but crucial for debugging.

Rule 4: Handling Non-Fatal Failures

# GOOD: Optional features can fail gracefully
def save_with_cache(data: Dict) -> None:
    # Critical: Save to database
    db.save(data)  # Must succeed, raises on failure

    # Optional: Update cache
    try:
        cache.set(data['id'], data)
    except CacheError as e:
        logger.warning(f"Cache update failed (non-critical): {e}")
        # Continue - cache miss is acceptable

Why: Cache failure shouldn't break the operation. Log because we're suppressing the error.

❌ When Try/Except is WRONG

  1. Converting infrastructure failures to None/False/[]
  2. Catching broad exceptions without re-raising
  3. Logging and re-raising without adding context (redundant log noise)
  4. Masking required service failures as optional

🛠️ Refactoring Steps

  1. Write test for infrastructure failure propagation

    def test_propagates_database_failure():
        with mock.patch('db.query', side_effect=DatabaseError):
            with pytest.raises(DatabaseError):
                service.get_data()
    
  2. Remove defensive try/except

  3. Update docstring/types if needed

  4. Run test to verify failure propagates

  5. Add monitoring for the new exception type

🔬 Testing Fail-Fast Behavior

Test Infrastructure Failures Propagate

def test_infrastructure_failure_propagates():
    """Verify infrastructure failure raises, doesn't return safe default."""
    service = AuthService()

    with mock.patch('valkey.get', side_effect=ConnectionError("Valkey unreachable")):
        with pytest.raises(ConnectionError):
            service.validate_session("test-token")

    # NOT: assert service.validate_session(...) is None

Test Legitimate Empty States

def test_user_with_no_tokens_returns_empty_list():
    """Verify empty result is legitimate, not masked error."""
    service = AuthService()

    # User exists but has no tokens
    result = service.list_tokens(user_id="test-user")

    assert result == []  # Legitimately empty
    assert result is not None  # Not None from a masked error

Production Impact Scenarios

Before Fail-Fast Refactoring (Valkey Down)

User Experience:
- "Too many requests" errors (rate limiter returns False)
- "Invalid session" errors (session validation returns None)
- "Resource locked" timeouts (locks return False)

Operator Experience:
- No monitoring alerts
- Users complaining about "weird behavior"
- 2 hours to diagnose: "Oh, Valkey has been down since 3pm"

After Fail-Fast Refactoring (Valkey Down)

User Experience:
- "Service temporarily unavailable" (clear 500 errors)

Operator Experience:
- PagerDuty alert: "VALKEY_CONNECTION_ERROR" at 3:01pm
- Clear stack trace pointing to infrastructure
- Fixed in 5 minutes

🚩 Warning Signs in Code Reviews

When reviewing code, these patterns indicate likely hedging:

Textual Clues

  • Comments like "Fail closed for security", "Graceful degradation", "Continue anyway"
  • Variable names: component_available, is_enabled, has_feature
  • Return values: {"success": True, "value": None} when operations fail

Structural Clues

  • try: blocks wrapping infrastructure calls with except Exception:
  • Methods returning both bool for success AND results
  • Optional[T] return types for operations that should always succeed
  • Multiple defensive checks: if hasattr(self, 'thing') and self.thing and self.thing.ready:

Behavioral Clues

  • Init methods that set availability flags instead of failing
  • Config/credential lookups that return defaults on failure
  • Database queries returning [] in except blocks
  • Scheduled job registration returning False instead of raising

Architecture Smells

  • Base classes with catch-all exception handlers
  • Lazy loading of required components
  • Services checking if self.required_dependency: before every operation
  • Factory functions returning None when service creation fails

🎯 The Three Diagnostic Tests

Test 1: Semantic Distinction

Can you distinguish between legitimate empty states and infrastructure failures?

  • ✅ Raises DatabaseError when down, returns [] when no data
  • ❌ Returns [] for both cases

Test 2: Never Executes

Will this fallback code run during normal operation?

  • ✅ Handles genuine race conditions or expected states
  • ❌ "Defensive" code that never executes when system works

Test 3: Contract Match

Does the implementation match documented behavior?

  • ✅ Docstring says "raises X" and code raises X
  • ❌ Docstring says "raises X" but code returns None

📝 Summary

Goal: Systems that crash obviously when broken are easier to operate than systems that limp along mysteriously.

Remember:

  • Required dependencies must fail loudly
  • Optional enhancements may fail quietly
  • The distinction is an architectural decision, not an implementation detail
  • When in doubt, fail fast and let operators decide

📊 Lessons from Production Refactoring

After removing 40+ hedging patterns across multiple systems, key insights emerge:

1. Hedging Compounds Over Time

  • Starts with one "harmless" try/except
  • Other developers add more defensive layers
  • Eventually 3-4 layers of exception masking
  • Original errors become completely untraceable

2. The Availability Flag Anti-Pattern

  • component_available flags spread like a virus
  • One component adds flag → dependent components check it
  • Soon 200+ defensive checks throughout codebase
  • Massive dead code when component is actually required

3. API Layer Is The Only Translation Point

  • Business logic should NEVER convert infrastructure errors to ValidationError
  • HTTP status translation happens ONCE at the API boundary
  • Everything else just propagates exceptions naturally

4. "Fail Closed" Security Theater

  • Rate limiting that returns False when infrastructure down
  • Makes outages look like normal rate limiting
  • Better: Let it fail with clear error so ops can fix infrastructure
  • Real security comes from working infrastructure, not silent failures

5. Test Suites Reveal Hidden Assumptions

  • Removing hedging often breaks tests that assumed silent degradation
  • These test failures are GOOD - they reveal incorrect assumptions
  • Tests checking for None when they should expect exceptions

6. The Time Bomb Pattern

  • Scheduled job registration failures are the worst
  • System appears to start fine
  • Weeks/months later, mysterious failures when jobs never ran
  • Always raise immediately on registration failure

7. Startup Failures Are A Gift

  • Fail at startup > fail at first request > fail silently
  • Operators can fix config/infrastructure before accepting traffic
  • Silent degradation means discovering issues only after user complaints