πŸ”βŒ˜K

Start typing to search docs.


Infrastructure Troubleshooting Runbook

This runbook provides systematic approaches to diagnosing and resolving common infrastructure issues in App Factory deployments.

🎯 Overview

This runbook covers:

  • Common infrastructure issues
  • Diagnostic procedures
  • Step-by-step resolution guides
  • Escalation procedures
  • Prevention strategies

🚨 Emergency Response Procedures

Incident Classification

Severity      | Description                  | Response Time | Examples
P0 - Critical | Complete service outage      | 15 minutes    | Database down, API unreachable
P1 - High     | Major functionality impacted | 1 hour        | Slow response times, partial outage
P2 - Medium   | Minor functionality impacted | 4 hours       | Non-critical features down
P3 - Low      | Cosmetic or enhancement      | 24 hours      | Documentation updates

Initial Response Checklist

#!/bin/bash
# incident-response.sh

echo "🚨 Infrastructure Incident Response"
echo "=================================="

# 1. Gather basic information
echo "πŸ“Š System Status Check:"
echo "Timestamp: $(date)"
echo "Reporter: $USER"
echo "Environment: ${ENVIRONMENT:-unknown}"

# 2. Quick health check
echo ""
echo "πŸ” Quick Health Assessment:"

# Check if Pulumi stack is accessible
if command -v pulumi >/dev/null 2>&1; then
    echo "Pulumi Status: $(pulumi stack --show-name 2>/dev/null || echo 'Not available')"
    
    # Get stack outputs
    OUTPUTS=$(pulumi stack output --json 2>/dev/null)
    if [ $? -eq 0 ]; then
        echo "Stack outputs available: βœ…"
        
        # Test key endpoints
        API_URL=$(echo "$OUTPUTS" | jq -r '.apiUrl // empty')
        if [ -n "$API_URL" ]; then
            HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health" 2>/dev/null || echo "000")
            echo "API Health Check: $HTTP_STATUS"
        fi
        
        DATABASE_URL=$(echo "$OUTPUTS" | jq -r '.databaseUrl // empty')
        if [ -n "$DATABASE_URL" ]; then
            pg_isready -d "$DATABASE_URL" >/dev/null 2>&1
            if [ $? -eq 0 ]; then
                echo "Database Status: βœ… Connected"
            else
                echo "Database Status: ❌ Connection failed"
            fi
        fi
    else
        echo "Stack outputs unavailable: ❌"
    fi
else
    echo "Pulumi CLI not available"
fi

echo ""
echo "πŸ“‹ Next Steps:"
echo "1. Document the issue in detail"
echo "2. Check relevant logs"
echo "3. Follow specific troubleshooting guide"
echo "4. Escalate if needed"

πŸ—„οΈ Database Issues

Database Connection Problems

Symptoms

  • Connection timeouts
  • "Connection refused" errors
  • Authentication failures
  • SSL certificate errors

Diagnostic Steps

#!/bin/bash
# diagnose-database.sh

echo "πŸ—„οΈ Database Connection Diagnostics"
echo "================================="

# Get database connection details
DATABASE_URL=$(pulumi stack output databaseUrl 2>/dev/null)

if [ -z "$DATABASE_URL" ]; then
    echo "❌ Database URL not available from stack outputs"
    exit 1
fi

# Parse connection details
DB_HOST=$(echo "$DATABASE_URL" | sed -n 's/.*@\([^:]*\):.*/\1/p')
DB_PORT=$(echo "$DATABASE_URL" | sed -n 's/.*:\([0-9]*\)\/.*/\1/p')
DB_NAME=$(echo "$DATABASE_URL" | sed -n 's/.*\/\([^?]*\).*/\1/p')

echo "πŸ“Š Connection Details:"
echo "Host: $DB_HOST"
echo "Port: $DB_PORT"
echo "Database: $DB_NAME"

# Test network connectivity
echo ""
echo "🌐 Network Connectivity:"
if command -v nc >/dev/null 2>&1; then
    if nc -z -w 5 "$DB_HOST" "$DB_PORT" 2>/dev/null; then
        echo "βœ… Network connection successful"
    else
        echo "❌ Network connection failed"
        echo "Check security groups/firewall rules"
    fi
else
    echo "nc not available, falling back to bash /dev/tcp"
    if timeout 5 bash -c ": </dev/tcp/$DB_HOST/$DB_PORT" 2>/dev/null; then
        echo "βœ… Network connection successful"
    else
        echo "❌ Network connection failed"
        echo "Check security groups/firewall rules"
    fi
fi

# Test database connectivity
echo ""
echo "πŸ” Database Authentication:"
pg_isready -d "$DATABASE_URL" >/dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "βœ… Database authentication successful"
    
    # Test actual connection
    psql "$DATABASE_URL" -c "SELECT version();" >/dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "βœ… Database query successful"
    else
        echo "❌ Database query failed"
    fi
else
    echo "❌ Database authentication failed"
    echo "Check credentials and SSL configuration"
fi

# Check recent logs (PROVIDER must be set in the environment: "aws" or "gcp")
echo ""
echo "πŸ“‹ Recent Database Logs:"
if [ "$PROVIDER" = "aws" ]; then
    DB_INSTANCE=$(pulumi stack output databaseInstanceId 2>/dev/null)
    if [ -n "$DB_INSTANCE" ]; then
        aws logs describe-log-groups --log-group-name-prefix "/aws/rds/instance/$DB_INSTANCE" 2>/dev/null | head -10
    fi
elif [ "$PROVIDER" = "gcp" ]; then
    DB_INSTANCE=$(pulumi stack output databaseInstanceName 2>/dev/null)
    if [ -n "$DB_INSTANCE" ]; then
        gcloud logging read "resource.type=cloud_sql_database AND resource.labels.database_id=$DB_INSTANCE" --limit=5 --format="value(timestamp,severity,jsonPayload.message)"
    fi
fi

Resolution Steps

  1. Network Issues:

    # AWS: Check security groups
    aws ec2 describe-security-groups --group-ids $(pulumi stack output databaseSecurityGroupId)
    
    # GCP: Check firewall rules
    gcloud compute firewall-rules list --filter="name~.*sql.*"
    
  2. Authentication Issues:

    # Reset database password (AWS)
    aws rds modify-db-instance \
      --db-instance-identifier $(pulumi stack output databaseInstanceId) \
      --master-user-password NEW_PASSWORD \
      --apply-immediately
    
    # Reset database password (GCP)
    gcloud sql users set-password postgres \
      --instance=$(pulumi stack output databaseInstanceName) \
      --password=NEW_PASSWORD
    
  3. SSL Certificate Issues:

    # Download SSL certificates
    # AWS (current global RDS CA bundle)
    wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
    
    # GCP
    gcloud sql ssl-certs create client-cert client-key.pem \
      --instance=$(pulumi stack output databaseInstanceName)
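
Once a CA bundle is in place, the chain can be verified end to end by forcing an SSL connection. A minimal sketch, assuming the AWS bundle was saved as global-bundle.pem and reusing the connection details parsed by diagnose-database.sh (adjust the user to your setup):

# Hypothetical verification command; host, port, and database come from your stack outputs
psql "host=$DB_HOST port=$DB_PORT dbname=$DB_NAME user=postgres sslmode=verify-full sslrootcert=global-bundle.pem" -c "SELECT 1;"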
    

Database Performance Issues

Symptoms

  • Slow query responses
  • High CPU utilization
  • Connection pool exhaustion
  • Lock timeouts

Diagnostic Steps

#!/bin/bash
# diagnose-db-performance.sh

echo "⚑ Database Performance Diagnostics"
echo "=================================="

DATABASE_URL=$(pulumi stack output databaseUrl 2>/dev/null)

if [ -z "$DATABASE_URL" ]; then
    echo "❌ Database URL not available"
    exit 1
fi

# Check active connections
echo "πŸ”— Active Connections:"
psql "$DATABASE_URL" -c "
SELECT 
    count(*) as total_connections,
    count(*) FILTER (WHERE state = 'active') as active_connections,
    count(*) FILTER (WHERE state = 'idle') as idle_connections
FROM pg_stat_activity;"

# Check long-running queries
echo ""
echo "⏱️ Long-running Queries:"
psql "$DATABASE_URL" -c "
SELECT 
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity 
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;"

# Check database size
echo ""
echo "πŸ’Ύ Database Size:"
psql "$DATABASE_URL" -c "
SELECT 
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;"

# Check table sizes
echo ""
echo "πŸ“Š Largest Tables:"
psql "$DATABASE_URL" -c "
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables 
WHERE schemaname NOT IN ('information_schema', 'pg_catalog')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
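
When the pg_stat_statements extension is available (an assumption; it is not enabled by default on every managed instance), the heaviest statements can be ranked directly. A sketch:

# Top statements by total execution time (pg_stat_statements required;
# on PostgreSQL 12 and older the column is total_time instead of total_exec_time)
psql "$DATABASE_URL" -c "
SELECT
    round(total_exec_time::numeric, 1) AS total_ms,
    calls,
    left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;"

# Compare the connection counts above against the configured ceiling
psql "$DATABASE_URL" -c "SHOW max_connections;"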

πŸ”— API and Compute Issues

API Gateway/Load Balancer Issues

Symptoms

  • 502/503 errors
  • High latency
  • CORS failures
  • Rate limiting errors

Diagnostic Steps

#!/bin/bash
# diagnose-api.sh

echo "πŸ”— API Diagnostics"
echo "=================="

API_URL=$(pulumi stack output apiUrl 2>/dev/null)

if [ -z "$API_URL" ]; then
    echo "❌ API URL not available"
    exit 1
fi

echo "🌐 API Endpoint: $API_URL"

# Test API health
echo ""
echo "πŸ₯ Health Check:"
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health" 2>/dev/null || echo "000")
echo "Health endpoint status: $HTTP_STATUS"

if [ "$HTTP_STATUS" != "200" ]; then
    echo "❌ Health check failed"
    
    # Test basic connectivity
    curl -v "$API_URL" 2>&1 | head -20
else
    echo "βœ… Health check passed"
fi

# Test response times
echo ""
echo "⏱️ Response Time Analysis:"
for i in {1..5}; do
    RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$API_URL/health" 2>/dev/null || echo "timeout")
    echo "Request $i: ${RESPONSE_TIME}s"
    sleep 1
done

# Check recent logs
echo ""
echo "πŸ“‹ Recent API Logs:"
if [ "$PROVIDER" = "aws" ]; then
    # Lambda logs
    FUNCTION_NAME=$(pulumi stack output lambdaFunctionName 2>/dev/null)
    if [ -n "$FUNCTION_NAME" ]; then
        aws logs tail "/aws/lambda/$FUNCTION_NAME" --since 1h | head -20
    fi
    
    # API Gateway logs
    aws logs tail "/aws/apigateway/$(pulumi stack output apiGatewayId)" --since 1h | head -10
elif [ "$PROVIDER" = "gcp" ]; then
    # Cloud Run logs
    SERVICE_NAME=$(pulumi stack output apiServiceName 2>/dev/null)
    if [ -n "$SERVICE_NAME" ]; then
        gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=$SERVICE_NAME" --limit=10 --format="value(timestamp,severity,textPayload)"
    fi
fi
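
The CORS failures listed under symptoms can be reproduced with a preflight request. A sketch, using a placeholder origin (replace https://app.example.com with the real frontend origin) and inspecting the Access-Control-* response headers:

# Simulated browser preflight against the health endpoint
curl -s -o /dev/null -D - \
  -X OPTIONS "$API_URL/health" \
  -H "Origin: https://app.example.com" \
  -H "Access-Control-Request-Method: GET" | grep -i "access-control" || echo "No CORS headers returned"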

Resolution Steps

  1. 502/503 Errors:

    # Check backend health
    # AWS Lambda
    aws lambda invoke --function-name $(pulumi stack output lambdaFunctionName) response.json
    cat response.json
    
    # GCP Cloud Run
    gcloud run services describe $(pulumi stack output apiServiceName) --region=$(pulumi config get gcp:region)
    
  2. High Latency:

    # Check resource utilization
    # AWS CloudWatch
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Lambda \
      --metric-name Duration \
      --dimensions Name=FunctionName,Value=$(pulumi stack output lambdaFunctionName) \
      --start-time $(date -d "1 hour ago" --iso-8601) \
      --end-time $(date --iso-8601) \
      --period 300 \
      --statistics Average
    
    # GCP Monitoring
    gcloud logging read "resource.type=cloud_run_revision" --limit=50 | grep -i "latency\|duration"
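
For the rate-limiting symptom, a short burst of requests shows whether the gateway is returning 429s. This is a generic probe, not tied to any specific throttling configuration:

# Fire 20 quick requests and count the HTTP status codes (429 indicates throttling)
for i in {1..20}; do
  curl -s -o /dev/null -w "%{http_code}\n" "$API_URL/health"
done | sort | uniq -c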
    

Serverless Function Issues

Cold Start Problems

#!/bin/bash
# diagnose-cold-starts.sh

echo "πŸ₯Ά Cold Start Diagnostics"
echo "========================"

if [ "$PROVIDER" = "aws" ]; then
    FUNCTION_NAME=$(pulumi stack output lambdaFunctionName 2>/dev/null)
    
    echo "πŸ“Š Lambda Function: $FUNCTION_NAME"
    
    # Check function configuration
    aws lambda get-function --function-name "$FUNCTION_NAME" --query 'Configuration.[Runtime,MemorySize,Timeout]' --output table
    
    # Check cold start metrics
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Lambda \
      --metric-name InitDuration \
      --dimensions Name=FunctionName,Value="$FUNCTION_NAME" \
      --start-time $(date -d "24 hours ago" --iso-8601) \
      --end-time $(date --iso-8601) \
      --period 3600 \
      --statistics Average,Maximum
      
elif [ "$PROVIDER" = "gcp" ]; then
    SERVICE_NAME=$(pulumi stack output apiServiceName 2>/dev/null)
    
    echo "πŸ“Š Cloud Run Service: $SERVICE_NAME"
    
    # Check service configuration
    gcloud run services describe "$SERVICE_NAME" --region=$(pulumi config get gcp:region) --format="value(spec.template.spec.containers[0].resources)"
    
    # Check cold start logs
    gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=$SERVICE_NAME AND textPayload:\"cold start\"" --limit=10
fi
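
If cold starts dominate, the usual mitigations are provisioned concurrency on Lambda and a minimum instance count on Cloud Run. A sketch; the alias name and counts are placeholders to adapt:

# AWS: keep warm execution environments for a published alias (placeholder alias "live")
aws lambda put-provisioned-concurrency-config \
  --function-name "$FUNCTION_NAME" \
  --qualifier live \
  --provisioned-concurrent-executions 2

# GCP: keep at least one Cloud Run instance warm
gcloud run services update "$SERVICE_NAME" \
  --min-instances=1 \
  --region=$(pulumi config get gcp:region)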

πŸ’Ύ Storage Issues

S3/Cloud Storage Access Issues

Symptoms

  • 403 Forbidden errors
  • Upload failures
  • CORS issues
  • Slow transfer speeds

Diagnostic Steps

#!/bin/bash
# diagnose-storage.sh

echo "πŸ’Ύ Storage Diagnostics"
echo "====================="

if [ "$PROVIDER" = "aws" ]; then
    ASSETS_BUCKET=$(pulumi stack output assetsBucketName 2>/dev/null)
    UPLOADS_BUCKET=$(pulumi stack output uploadsBucketName 2>/dev/null)
    
    echo "πŸͺ£ S3 Buckets:"
    echo "Assets: $ASSETS_BUCKET"
    echo "Uploads: $UPLOADS_BUCKET"
    
    # Check bucket policies
    echo ""
    echo "πŸ”’ Bucket Policies:"
    for bucket in "$ASSETS_BUCKET" "$UPLOADS_BUCKET"; do
        if [ -n "$bucket" ]; then
            echo "Policy for $bucket:"
            aws s3api get-bucket-policy --bucket "$bucket" 2>/dev/null | jq '.Policy | fromjson' || echo "No policy found"
        fi
    done
    
    # Test upload
    echo ""
    echo "πŸ“€ Upload Test:"
    echo "test content" > test-file.txt
    aws s3 cp test-file.txt "s3://$UPLOADS_BUCKET/test-file.txt" 2>&1
    if [ $? -eq 0 ]; then
        echo "βœ… Upload successful"
        aws s3 rm "s3://$UPLOADS_BUCKET/test-file.txt"
    else
        echo "❌ Upload failed"
    fi
    rm -f test-file.txt

elif [ "$PROVIDER" = "gcp" ]; then
    ASSETS_BUCKET=$(pulumi stack output assetsBucketName 2>/dev/null)
    UPLOADS_BUCKET=$(pulumi stack output uploadsBucketName 2>/dev/null)
    
    echo "πŸͺ£ Cloud Storage Buckets:"
    echo "Assets: $ASSETS_BUCKET"
    echo "Uploads: $UPLOADS_BUCKET"
    
    # Check bucket IAM policies
    echo ""
    echo "πŸ”’ Bucket IAM Policies:"
    for bucket in "$ASSETS_BUCKET" "$UPLOADS_BUCKET"; do
        if [ -n "$bucket" ]; then
            echo "IAM policy for $bucket:"
            gsutil iam get "gs://$bucket" 2>/dev/null || echo "No policy found"
        fi
    done
    
    # Test upload
    echo ""
    echo "πŸ“€ Upload Test:"
    echo "test content" > test-file.txt
    gsutil cp test-file.txt "gs://$UPLOADS_BUCKET/test-file.txt" 2>&1
    if [ $? -eq 0 ]; then
        echo "βœ… Upload successful"
        gsutil rm "gs://$UPLOADS_BUCKET/test-file.txt"
    else
        echo "❌ Upload failed"
    fi
    rm -f test-file.txt
fi
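
For the CORS symptom, inspect the bucket-level CORS configuration directly; an error or empty result usually means no CORS rules are set:

# AWS: show the bucket CORS rules
aws s3api get-bucket-cors --bucket "$UPLOADS_BUCKET" 2>/dev/null || echo "No CORS configuration found"

# GCP: show the bucket CORS rules
gsutil cors get "gs://$UPLOADS_BUCKET"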

🌐 CDN and Networking Issues

CDN Performance Issues

Symptoms

  • Slow content delivery
  • Cache misses
  • Origin server overload
  • Geographic latency

Diagnostic Steps

#!/bin/bash
# diagnose-cdn.sh

echo "🌐 CDN Diagnostics"
echo "=================="

CDN_URL=$(pulumi stack output cdnUrl 2>/dev/null)

if [ -z "$CDN_URL" ]; then
    echo "❌ CDN URL not available"
    exit 1
fi

echo "πŸ”— CDN URL: $CDN_URL"

# Test CDN response
echo ""
echo "πŸ“Š CDN Response Analysis:"
curl -I "$CDN_URL" 2>/dev/null | grep -E "(HTTP|Cache|X-|Age|Server)"

# Test cache behavior
echo ""
echo "πŸ—„οΈ Cache Behavior Test:"
for i in {1..3}; do
    echo "Request $i:"
    CACHE_STATUS=$(curl -s -I "$CDN_URL" | grep -i "x-cache\|cf-cache-status\|x-served-by" || echo "No cache headers")
    echo "$CACHE_STATUS"
    sleep 2
done

# Test from multiple locations (if available)
echo ""
echo "🌍 Geographic Performance:"
# This would require external tools or services
echo "Consider using tools like:"
echo "- curl from different regions"
echo "- Online CDN testing tools"
echo "- Monitoring services"

if [ "$PROVIDER" = "aws" ]; then
    # CloudFront specific diagnostics
    DISTRIBUTION_ID=$(pulumi stack output cdnDistributionId 2>/dev/null)
    if [ -n "$DISTRIBUTION_ID" ]; then
        echo ""
        echo "☁️ CloudFront Distribution: $DISTRIBUTION_ID"
        aws cloudfront get-distribution --id "$DISTRIBUTION_ID" --query 'Distribution.DistributionConfig.Origins' 2>/dev/null
    fi
elif [ "$PROVIDER" = "gcp" ]; then
    # Cloud CDN specific diagnostics
    echo ""
    echo "☁️ Cloud CDN Configuration:"
    gcloud compute url-maps list --filter="name~.*cdn.*" --format="table(name,defaultService)"
fi
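
If stale or mis-cached content is the root cause, an explicit invalidation forces a refresh. Both commands below invalidate every path, so scope them down in production:

# AWS CloudFront: invalidate all cached paths for the distribution
aws cloudfront create-invalidation --distribution-id "$DISTRIBUTION_ID" --paths "/*"

# GCP Cloud CDN: invalidate through the URL map (find the name with the url-maps list command above)
gcloud compute url-maps invalidate-cdn-cache URL_MAP_NAME --path "/*"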

πŸ”§ Infrastructure State Issues

Pulumi State Problems

Symptoms

  • "Stack is locked" errors
  • State drift warnings
  • Resource import failures
  • Deployment hangs

Diagnostic Steps

#!/bin/bash
# diagnose-pulumi-state.sh

echo "πŸ”§ Pulumi State Diagnostics"
echo "=========================="

# Check current stack
echo "πŸ“Š Current Stack Information:"
pulumi stack --show-name 2>/dev/null || echo "No active stack"
pulumi whoami 2>/dev/null || echo "Not logged in"

# Check stack status
echo ""
echo "πŸ”’ Stack Lock Status:"
pulumi stack ls 2>/dev/null | grep "$(pulumi stack --show-name 2>/dev/null)" || echo "Stack not found"

# Check for pending operations
echo ""
echo "⏳ Pending Operations:"
pulumi history --show-secrets=false | head -5

# Check state file (applies only when using a local file:// backend rooted in this project)
echo ""
echo "πŸ’Ύ State File Information:"
if [ -f ".pulumi/stacks/$(pulumi stack --show-name 2>/dev/null).json" ]; then
    echo "βœ… Local state file exists"
    ls -la ".pulumi/stacks/$(pulumi stack --show-name 2>/dev/null).json"
else
    echo "❌ Local state file not found"
fi

# Check for drift
echo ""
echo "πŸ”„ State Drift Check:"
pulumi preview --diff 2>/dev/null | head -20 || echo "Preview failed"

Resolution Steps

  1. Unlock Stack:

    # Cancel current operation
    pulumi cancel
    
    # If the stack stays locked, confirm no teammate or CI job is mid-update,
    # then re-run pulumi cancel; with a self-managed (DIY) backend the lock file
    # typically lives under the backend's .pulumi/locks/ directory (clear it only as a last resort)
    
  2. Fix State Drift:

    # Refresh state from cloud provider
    pulumi refresh
    
    # Import missing resources
    pulumi import <resource-type> <resource-name> <resource-id>
    
  3. Recover from Corruption:

    # Export current state
    pulumi stack export > backup-$(date +%Y%m%d-%H%M%S).json
    
    # Restore from backup if needed
    pulumi stack import < backup-file.json
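
When a single resource is wedged (for example, deleted out of band but still tracked in state), it can be removed from state by URN instead of restoring a full backup. List the URNs first and double-check before deleting:

# Find the resource URN, then drop just that entry from state
pulumi stack --show-urns
pulumi state delete <urn>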
    

πŸ“ž Escalation Procedures

When to Escalate

  • Immediate Escalation (P0):

    • Complete service outage > 15 minutes
    • Data loss or corruption
    • Security breach suspected
    • Multiple systems affected
  • Standard Escalation (P1):

    • Issue not resolved within SLA
    • Requires specialized expertise
    • Affects critical business functions

Escalation Contacts

#!/bin/bash
# escalation-contacts.sh

echo "πŸ“ž Escalation Contact Information"
echo "================================"

echo "🚨 P0 - Critical Issues:"
echo "- On-Call Engineer: [Phone/Slack]"
echo "- Infrastructure Lead: [Contact]"
echo "- Engineering Manager: [Contact]"

echo ""
echo "⚠️ P1 - High Priority:"
echo "- DevOps Team: [Slack Channel]"
echo "- Cloud Provider Support:"
echo "  - AWS: 1-800-xxx-xxxx"
echo "  - GCP: 1-855-xxx-xxxx"

echo ""
echo "πŸ“‹ Information to Include:"
echo "- Issue description and impact"
echo "- Environment affected"
echo "- Steps already taken"
echo "- Relevant logs and metrics"
echo "- Timeline of events"

Post-Incident Review

#!/bin/bash
# post-incident-template.sh

echo "πŸ“‹ Post-Incident Review Template"
echo "==============================="

cat << EOF
# Incident Report: [INCIDENT-ID]

## Summary
- **Date/Time**: $(date)
- **Duration**: [X hours/minutes]
- **Severity**: [P0/P1/P2/P3]
- **Services Affected**: [List services]
- **Users Impacted**: [Number/percentage]

## Timeline
- **Detection**: [How was it detected?]
- **Response**: [Initial response time]
- **Resolution**: [When was it resolved?]

## Root Cause
[Detailed explanation of what caused the issue]

## Resolution
[What was done to fix the issue]

## Action Items
- [ ] [Immediate fixes]
- [ ] [Process improvements]
- [ ] [Monitoring enhancements]
- [ ] [Documentation updates]

## Lessons Learned
[What can be improved for next time]
EOF

Version: 1.0.0