Troubleshooting Runbook
1.0.0Incident triage and remediation guidance.
On this page
- Infrastructure Troubleshooting Runbook
- π― Overview
- π¨ Emergency Response Procedures
- Incident Classification
- Initial Response Checklist
- ποΈ Database Issues
- Database Connection Problems
- Symptoms
- Diagnostic Steps
- Resolution Steps
- Database Performance Issues
- Symptoms
- Diagnostic Steps
- π API and Compute Issues
- API Gateway/Load Balancer Issues
- Symptoms
- Diagnostic Steps
- Resolution Steps
- Serverless Function Issues
- Cold Start Problems
- πΎ Storage Issues
- S3/Cloud Storage Access Issues
- Symptoms
- Diagnostic Steps
- π CDN and Networking Issues
- CDN Performance Issues
- Symptoms
- Diagnostic Steps
- π§ Infrastructure State Issues
- Pulumi State Problems
- Symptoms
- Diagnostic Steps
- Resolution Steps
- π Escalation Procedures
- When to Escalate
- Escalation Contacts
- Post-Incident Review
Infrastructure Troubleshooting Runbook
This runbook provides systematic approaches to diagnosing and resolving common infrastructure issues in App Factory deployments.
π― Overview
This runbook covers:
- Common infrastructure issues
- Diagnostic procedures
- Step-by-step resolution guides
- Escalation procedures
- Prevention strategies
π¨ Emergency Response Procedures
Incident Classification
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P0 - Critical | Complete service outage | 15 minutes | Database down, API unreachable |
| P1 - High | Major functionality impacted | 1 hour | Slow response times, partial outage |
| P2 - Medium | Minor functionality impacted | 4 hours | Non-critical features down |
| P3 - Low | Cosmetic or enhancement | 24 hours | Documentation updates |
Initial Response Checklist
#!/bin/bash
# incident-response.sh
echo "π¨ Infrastructure Incident Response"
echo "=================================="
# 1. Gather basic information
echo "π System Status Check:"
echo "Timestamp: $(date)"
echo "Reporter: $USER"
echo "Environment: ${ENVIRONMENT:-unknown}"
# 2. Quick health check
echo ""
echo "π Quick Health Assessment:"
# Check if Pulumi stack is accessible
if command -v pulumi >/dev/null 2>&1; then
echo "Pulumi Status: $(pulumi stack --show-name 2>/dev/null || echo 'Not available')"
# Get stack outputs
OUTPUTS=$(pulumi stack output --json 2>/dev/null)
if [ $? -eq 0 ]; then
echo "Stack outputs available: β
"
# Test key endpoints
API_URL=$(echo "$OUTPUTS" | jq -r '.apiUrl // empty')
if [ -n "$API_URL" ]; then
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health" 2>/dev/null || echo "000")
echo "API Health Check: $HTTP_STATUS"
fi
DATABASE_URL=$(echo "$OUTPUTS" | jq -r '.databaseUrl // empty')
if [ -n "$DATABASE_URL" ]; then
pg_isready -d "$DATABASE_URL" >/dev/null 2>&1
if [ $? -eq 0 ]; then
echo "Database Status: β
Connected"
else
echo "Database Status: β Connection failed"
fi
fi
else
echo "Stack outputs unavailable: β"
fi
else
echo "Pulumi CLI not available"
fi
echo ""
echo "π Next Steps:"
echo "1. Document the issue in detail"
echo "2. Check relevant logs"
echo "3. Follow specific troubleshooting guide"
echo "4. Escalate if needed"
ποΈ Database Issues
Database Connection Problems
Symptoms
- Connection timeouts
- "Connection refused" errors
- Authentication failures
- SSL certificate errors
Diagnostic Steps
#!/bin/bash
# diagnose-database.sh
echo "ποΈ Database Connection Diagnostics"
echo "================================="
# Get database connection details
DATABASE_URL=$(pulumi stack output databaseUrl 2>/dev/null)
if [ -z "$DATABASE_URL" ]; then
echo "β Database URL not available from stack outputs"
exit 1
fi
# Parse connection details
DB_HOST=$(echo "$DATABASE_URL" | sed -n 's/.*@\([^:]*\):.*/\1/p')
DB_PORT=$(echo "$DATABASE_URL" | sed -n 's/.*:\([0-9]*\)\/.*/\1/p')
DB_NAME=$(echo "$DATABASE_URL" | sed -n 's/.*\/\([^?]*\).*/\1/p')
echo "π Connection Details:"
echo "Host: $DB_HOST"
echo "Port: $DB_PORT"
echo "Database: $DB_NAME"
# Test network connectivity
echo ""
echo "π Network Connectivity:"
if command -v telnet >/dev/null 2>&1; then
timeout 5 telnet "$DB_HOST" "$DB_PORT" 2>/dev/null
if [ $? -eq 0 ]; then
echo "β
Network connection successful"
else
echo "β Network connection failed"
echo "Check security groups/firewall rules"
fi
else
echo "telnet not available, using nc"
timeout 5 nc -z "$DB_HOST" "$DB_PORT" 2>/dev/null
if [ $? -eq 0 ]; then
echo "β
Network connection successful"
else
echo "β Network connection failed"
fi
fi
# Test database connectivity
echo ""
echo "π Database Authentication:"
pg_isready -d "$DATABASE_URL" >/dev/null 2>&1
if [ $? -eq 0 ]; then
echo "β
Database authentication successful"
# Test actual connection
psql "$DATABASE_URL" -c "SELECT version();" >/dev/null 2>&1
if [ $? -eq 0 ]; then
echo "β
Database query successful"
else
echo "β Database query failed"
fi
else
echo "β Database authentication failed"
echo "Check credentials and SSL configuration"
fi
# Check recent logs
echo ""
echo "π Recent Database Logs:"
if [ "$PROVIDER" = "aws" ]; then
DB_INSTANCE=$(pulumi stack output databaseInstanceId 2>/dev/null)
if [ -n "$DB_INSTANCE" ]; then
aws logs describe-log-groups --log-group-name-prefix "/aws/rds/instance/$DB_INSTANCE" 2>/dev/null | head -10
fi
elif [ "$PROVIDER" = "gcp" ]; then
DB_INSTANCE=$(pulumi stack output databaseInstanceName 2>/dev/null)
if [ -n "$DB_INSTANCE" ]; then
gcloud logging read "resource.type=cloud_sql_database AND resource.labels.database_id=$DB_INSTANCE" --limit=5 --format="value(timestamp,severity,jsonPayload.message)"
fi
fi
Resolution Steps
-
Network Issues:
# AWS: Check security groups aws ec2 describe-security-groups --group-ids $(pulumi stack output databaseSecurityGroupId) # GCP: Check firewall rules gcloud compute firewall-rules list --filter="name~.*sql.*" -
Authentication Issues:
# Reset database password (AWS) aws rds modify-db-instance \ --db-instance-identifier $(pulumi stack output databaseInstanceId) \ --master-user-password NEW_PASSWORD \ --apply-immediately # Reset database password (GCP) gcloud sql users set-password postgres \ --instance=$(pulumi stack output databaseInstanceName) \ --password=NEW_PASSWORD -
SSL Certificate Issues:
# Download SSL certificates # AWS wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem # GCP gcloud sql ssl-certs create client-cert client-key.pem \ --instance=$(pulumi stack output databaseInstanceName)
Database Performance Issues
Symptoms
- Slow query responses
- High CPU utilization
- Connection pool exhaustion
- Lock timeouts
Diagnostic Steps
#!/bin/bash
# diagnose-db-performance.sh
echo "β‘ Database Performance Diagnostics"
echo "=================================="
DATABASE_URL=$(pulumi stack output databaseUrl 2>/dev/null)
if [ -z "$DATABASE_URL" ]; then
echo "β Database URL not available"
exit 1
fi
# Check active connections
echo "π Active Connections:"
psql "$DATABASE_URL" -c "
SELECT
count(*) as total_connections,
count(*) FILTER (WHERE state = 'active') as active_connections,
count(*) FILTER (WHERE state = 'idle') as idle_connections
FROM pg_stat_activity;"
# Check long-running queries
echo ""
echo "β±οΈ Long-running Queries:"
psql "$DATABASE_URL" -c "
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;"
# Check database size
echo ""
echo "πΎ Database Size:"
psql "$DATABASE_URL" -c "
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;"
# Check table sizes
echo ""
echo "π Largest Tables:"
psql "$DATABASE_URL" -c "
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('information_schema', 'pg_catalog')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
π API and Compute Issues
API Gateway/Load Balancer Issues
Symptoms
- 502/503 errors
- High latency
- CORS failures
- Rate limiting errors
Diagnostic Steps
#!/bin/bash
# diagnose-api.sh
echo "π API Diagnostics"
echo "=================="
API_URL=$(pulumi stack output apiUrl 2>/dev/null)
if [ -z "$API_URL" ]; then
echo "β API URL not available"
exit 1
fi
echo "π API Endpoint: $API_URL"
# Test API health
echo ""
echo "π₯ Health Check:"
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health" 2>/dev/null || echo "000")
echo "Health endpoint status: $HTTP_STATUS"
if [ "$HTTP_STATUS" != "200" ]; then
echo "β Health check failed"
# Test basic connectivity
curl -v "$API_URL" 2>&1 | head -20
else
echo "β
Health check passed"
fi
# Test response times
echo ""
echo "β±οΈ Response Time Analysis:"
for i in {1..5}; do
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$API_URL/health" 2>/dev/null || echo "timeout")
echo "Request $i: ${RESPONSE_TIME}s"
sleep 1
done
# Check recent logs
echo ""
echo "π Recent API Logs:"
if [ "$PROVIDER" = "aws" ]; then
# Lambda logs
FUNCTION_NAME=$(pulumi stack output lambdaFunctionName 2>/dev/null)
if [ -n "$FUNCTION_NAME" ]; then
aws logs tail "/aws/lambda/$FUNCTION_NAME" --since 1h | head -20
fi
# API Gateway logs
aws logs tail "/aws/apigateway/$(pulumi stack output apiGatewayId)" --since 1h | head -10
elif [ "$PROVIDER" = "gcp" ]; then
# Cloud Run logs
SERVICE_NAME=$(pulumi stack output apiServiceName 2>/dev/null)
if [ -n "$SERVICE_NAME" ]; then
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=$SERVICE_NAME" --limit=10 --format="value(timestamp,severity,textPayload)"
fi
fi
Resolution Steps
-
502/503 Errors:
# Check backend health # AWS Lambda aws lambda invoke --function-name $(pulumi stack output lambdaFunctionName) response.json cat response.json # GCP Cloud Run gcloud run services describe $(pulumi stack output apiServiceName) --region=$(pulumi config get gcp:region) -
High Latency:
# Check resource utilization # AWS CloudWatch aws cloudwatch get-metric-statistics \ --namespace AWS/Lambda \ --metric-name Duration \ --dimensions Name=FunctionName,Value=$(pulumi stack output lambdaFunctionName) \ --start-time $(date -d "1 hour ago" --iso-8601) \ --end-time $(date --iso-8601) \ --period 300 \ --statistics Average # GCP Monitoring gcloud logging read "resource.type=cloud_run_revision" --limit=50 | grep -i "latency\|duration"
Serverless Function Issues
Cold Start Problems
#!/bin/bash
# diagnose-cold-starts.sh
echo "π₯Ά Cold Start Diagnostics"
echo "========================"
if [ "$PROVIDER" = "aws" ]; then
FUNCTION_NAME=$(pulumi stack output lambdaFunctionName 2>/dev/null)
echo "π Lambda Function: $FUNCTION_NAME"
# Check function configuration
aws lambda get-function --function-name "$FUNCTION_NAME" --query 'Configuration.[Runtime,MemorySize,Timeout]' --output table
# Check cold start metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name InitDuration \
--dimensions Name=FunctionName,Value="$FUNCTION_NAME" \
--start-time $(date -d "24 hours ago" --iso-8601) \
--end-time $(date --iso-8601) \
--period 3600 \
--statistics Average,Maximum
elif [ "$PROVIDER" = "gcp" ]; then
SERVICE_NAME=$(pulumi stack output apiServiceName 2>/dev/null)
echo "π Cloud Run Service: $SERVICE_NAME"
# Check service configuration
gcloud run services describe "$SERVICE_NAME" --region=$(pulumi config get gcp:region) --format="value(spec.template.spec.containers[0].resources)"
# Check cold start logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=$SERVICE_NAME AND textPayload:\"cold start\"" --limit=10
fi
πΎ Storage Issues
S3/Cloud Storage Access Issues
Symptoms
- 403 Forbidden errors
- Upload failures
- CORS issues
- Slow transfer speeds
Diagnostic Steps
#!/bin/bash
# diagnose-storage.sh
echo "πΎ Storage Diagnostics"
echo "====================="
if [ "$PROVIDER" = "aws" ]; then
ASSETS_BUCKET=$(pulumi stack output assetsBucketName 2>/dev/null)
UPLOADS_BUCKET=$(pulumi stack output uploadsBucketName 2>/dev/null)
echo "πͺ£ S3 Buckets:"
echo "Assets: $ASSETS_BUCKET"
echo "Uploads: $UPLOADS_BUCKET"
# Check bucket policies
echo ""
echo "π Bucket Policies:"
for bucket in "$ASSETS_BUCKET" "$UPLOADS_BUCKET"; do
if [ -n "$bucket" ]; then
echo "Policy for $bucket:"
aws s3api get-bucket-policy --bucket "$bucket" 2>/dev/null | jq '.Policy | fromjson' || echo "No policy found"
fi
done
# Test upload
echo ""
echo "π€ Upload Test:"
echo "test content" > test-file.txt
aws s3 cp test-file.txt "s3://$UPLOADS_BUCKET/test-file.txt" 2>&1
if [ $? -eq 0 ]; then
echo "β
Upload successful"
aws s3 rm "s3://$UPLOADS_BUCKET/test-file.txt"
else
echo "β Upload failed"
fi
rm -f test-file.txt
elif [ "$PROVIDER" = "gcp" ]; then
ASSETS_BUCKET=$(pulumi stack output assetsBucketName 2>/dev/null)
UPLOADS_BUCKET=$(pulumi stack output uploadsBucketName 2>/dev/null)
echo "πͺ£ Cloud Storage Buckets:"
echo "Assets: $ASSETS_BUCKET"
echo "Uploads: $UPLOADS_BUCKET"
# Check bucket IAM policies
echo ""
echo "π Bucket IAM Policies:"
for bucket in "$ASSETS_BUCKET" "$UPLOADS_BUCKET"; do
if [ -n "$bucket" ]; then
echo "IAM policy for $bucket:"
gsutil iam get "gs://$bucket" 2>/dev/null || echo "No policy found"
fi
done
# Test upload
echo ""
echo "π€ Upload Test:"
echo "test content" > test-file.txt
gsutil cp test-file.txt "gs://$UPLOADS_BUCKET/test-file.txt" 2>&1
if [ $? -eq 0 ]; then
echo "β
Upload successful"
gsutil rm "gs://$UPLOADS_BUCKET/test-file.txt"
else
echo "β Upload failed"
fi
rm -f test-file.txt
fi
π CDN and Networking Issues
CDN Performance Issues
Symptoms
- Slow content delivery
- Cache misses
- Origin server overload
- Geographic latency
Diagnostic Steps
#!/bin/bash
# diagnose-cdn.sh
echo "π CDN Diagnostics"
echo "=================="
CDN_URL=$(pulumi stack output cdnUrl 2>/dev/null)
if [ -z "$CDN_URL" ]; then
echo "β CDN URL not available"
exit 1
fi
echo "π CDN URL: $CDN_URL"
# Test CDN response
echo ""
echo "π CDN Response Analysis:"
curl -I "$CDN_URL" 2>/dev/null | grep -E "(HTTP|Cache|X-|Age|Server)"
# Test cache behavior
echo ""
echo "ποΈ Cache Behavior Test:"
for i in {1..3}; do
echo "Request $i:"
CACHE_STATUS=$(curl -s -I "$CDN_URL" | grep -i "x-cache\|cf-cache-status\|x-served-by" || echo "No cache headers")
echo "$CACHE_STATUS"
sleep 2
done
# Test from multiple locations (if available)
echo ""
echo "π Geographic Performance:"
# This would require external tools or services
echo "Consider using tools like:"
echo "- curl from different regions"
echo "- Online CDN testing tools"
echo "- Monitoring services"
if [ "$PROVIDER" = "aws" ]; then
# CloudFront specific diagnostics
DISTRIBUTION_ID=$(pulumi stack output cdnDistributionId 2>/dev/null)
if [ -n "$DISTRIBUTION_ID" ]; then
echo ""
echo "βοΈ CloudFront Distribution: $DISTRIBUTION_ID"
aws cloudfront get-distribution --id "$DISTRIBUTION_ID" --query 'Distribution.DistributionConfig.Origins' 2>/dev/null
fi
elif [ "$PROVIDER" = "gcp" ]; then
# Cloud CDN specific diagnostics
echo ""
echo "βοΈ Cloud CDN Configuration:"
gcloud compute url-maps list --filter="name~.*cdn.*" --format="table(name,defaultService)"
fi
π§ Infrastructure State Issues
Pulumi State Problems
Symptoms
- "Stack is locked" errors
- State drift warnings
- Resource import failures
- Deployment hangs
Diagnostic Steps
#!/bin/bash
# diagnose-pulumi-state.sh
echo "π§ Pulumi State Diagnostics"
echo "=========================="
# Check current stack
echo "π Current Stack Information:"
pulumi stack --show-name 2>/dev/null || echo "No active stack"
pulumi whoami 2>/dev/null || echo "Not logged in"
# Check stack status
echo ""
echo "π Stack Lock Status:"
pulumi stack ls 2>/dev/null | grep "$(pulumi stack --show-name 2>/dev/null)" || echo "Stack not found"
# Check for pending operations
echo ""
echo "β³ Pending Operations:"
pulumi history --show-secrets=false | head -5
# Check state file
echo ""
echo "πΎ State File Information:"
if [ -f ".pulumi/stacks/$(pulumi stack --show-name 2>/dev/null).json" ]; then
echo "β
Local state file exists"
ls -la ".pulumi/stacks/$(pulumi stack --show-name 2>/dev/null).json"
else
echo "β Local state file not found"
fi
# Check for drift
echo ""
echo "π State Drift Check:"
pulumi preview --diff 2>/dev/null | head -20 || echo "Preview failed"
Resolution Steps
-
Unlock Stack:
# Cancel current operation pulumi cancel # If still locked, force unlock (use with caution) pulumi stack --show-name # Contact team members before force unlocking -
Fix State Drift:
# Refresh state from cloud provider pulumi refresh # Import missing resources pulumi import <resource-type> <resource-name> <resource-id> -
Recover from Corruption:
# Export current state pulumi stack export > backup-$(date +%Y%m%d-%H%M%S).json # Restore from backup if needed pulumi stack import < backup-file.json
π Escalation Procedures
When to Escalate
-
Immediate Escalation (P0):
- Complete service outage > 15 minutes
- Data loss or corruption
- Security breach suspected
- Multiple systems affected
-
Standard Escalation (P1):
- Issue not resolved within SLA
- Requires specialized expertise
- Affects critical business functions
Escalation Contacts
#!/bin/bash
# escalation-contacts.sh
echo "π Escalation Contact Information"
echo "================================"
echo "π¨ P0 - Critical Issues:"
echo "- On-Call Engineer: [Phone/Slack]"
echo "- Infrastructure Lead: [Contact]"
echo "- Engineering Manager: [Contact]"
echo ""
echo "β οΈ P1 - High Priority:"
echo "- DevOps Team: [Slack Channel]"
echo "- Cloud Provider Support:"
echo " - AWS: 1-800-xxx-xxxx"
echo " - GCP: 1-855-xxx-xxxx"
echo ""
echo "π Information to Include:"
echo "- Issue description and impact"
echo "- Environment affected"
echo "- Steps already taken"
echo "- Relevant logs and metrics"
echo "- Timeline of events"
Post-Incident Review
#!/bin/bash
# post-incident-template.sh
echo "π Post-Incident Review Template"
echo "==============================="
cat << EOF
# Incident Report: [INCIDENT-ID]
## Summary
- **Date/Time**: $(date)
- **Duration**: [X hours/minutes]
- **Severity**: [P0/P1/P2/P3]
- **Services Affected**: [List services]
- **Users Impacted**: [Number/percentage]
## Timeline
- **Detection**: [How was it detected?]
- **Response**: [Initial response time]
- **Resolution**: [When was it resolved?]
## Root Cause
[Detailed explanation of what caused the issue]
## Resolution
[What was done to fix the issue]
## Action Items
- [ ] [Immediate fixes]
- [ ] [Process improvements]
- [ ] [Monitoring enhancements]
- [ ] [Documentation updates]
## Lessons Learned
[What can be improved for next time]
EOF
Last updated: $(date) Version: 1.0.0