Monitoring & Observability
Overview
This document defines the monitoring and observability strategy for the Digital Product Creation System, focusing on metrics collection, logging, distributed tracing, and alerting to ensure system reliability and performance visibility.
Logging Strategy
Structured Logging
- Format: JSON-structured logs for consistent parsing
- Key Fields: timestamp, request_id, user_id, agent_type, operation, duration, error_details
- Log Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
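A minimal sketch of how the JSON format and key fields above could be produced with Python's standard `logging` module. The `JsonFormatter` and `build_logger` names are hypothetical, and the `level` and `message` fields are assumptions added for readability; this is not the project's `StructuredLogger`.

```python
import json
import logging
from datetime import datetime, timezone

# Context fields from the list above; callers attach them via `extra=`.
CONTEXT_FIELDS = (
    "request_id", "user_id", "agent_type",
    "operation", "duration", "error_details",
)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,       # assumed extra field
            "message": record.getMessage(),  # assumed extra field
        }
        # Copy only the context fields that were actually supplied.
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload, default=str)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage might look like `build_logger("agent").info("job finished", extra={"request_id": "abc", "duration": 0.25})`, which emits one JSON object per line for Filebeat or a similar shipper to pick up.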
Log Aggregation
- Stack: ELK (Elasticsearch, Logstash, Kibana)
- Collection: Filebeat on each container
- Retention: 30 days for standard logs, 90 days for errors
- Indexing: By service type and date
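As an illustration of indexing by service type and date together with the two retention windows, the sketch below derives an index name and checks whether an index has aged out. The `logs-<service>-<stream>-<date>` pattern, the split into `standard` and `error` streams, and both function names are assumptions; the source does not specify them.

```python
from datetime import date, timedelta

# Retention windows from the list above, in days.
RETENTION_DAYS = {"standard": 30, "error": 90}


def index_name(service: str, day: date, stream: str = "standard") -> str:
    """Build a per-service, per-day index name (hypothetical naming pattern)."""
    return f"logs-{service}-{stream}-{day:%Y.%m.%d}"


def is_expired(index_day: date, today: date, stream: str = "standard") -> bool:
    """True when an index is older than the retention window for its stream."""
    return today - index_day > timedelta(days=RETENTION_DAYS[stream])
```

In an actual ELK deployment this policy would more likely live in the Elasticsearch or Logstash configuration than in application code; the functions only make the rule explicit.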
Key Classes
- StructuredLogger: Base logger with JSON formatting
- AgentLogger: Specialized logger for agent operations
- RequestContextLogger: Adds request context to all logs
Metrics Collection
Prometheus Metrics
- Collection Interval: 15 seconds
- Retention: 15 days
- Aggregation: 1-minute, 5-minute, 1-hour buckets
Key Metrics
Agent Metrics
- Request count by type and status
- Response time histograms
- LLM token usage and costs
- Tool execution counts
- Cache hit rates
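The agent metrics above map onto two common metric shapes: labelled counters and latency histograms. The following standard-library sketch shows the bookkeeping involved; a real deployment would presumably use a Prometheus client library instead. The class name, bucket bounds, and label values are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

# Hypothetical bucket upper bounds in seconds; tune to observed latencies.
BUCKETS = (0.1, 0.5, 1.0, 2.0, 5.0, float("inf"))


class AgentMetricsSketch:
    """In-memory stand-in for agent counters and a response-time histogram."""

    def __init__(self) -> None:
        self.requests = Counter()                  # keyed by (agent_type, status)
        self.latency_buckets = [0] * len(BUCKETS)  # per-bucket (non-cumulative) counts
        self.cache_hits = 0
        self.cache_misses = 0

    def observe_request(self, agent_type: str, status: str, seconds: float) -> None:
        self.requests[(agent_type, status)] += 1
        # Index of the first bucket whose upper bound is >= the observation.
        self.latency_buckets[bisect_left(BUCKETS, seconds)] += 1

    def observe_cache(self, hit: bool) -> None:
        if hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def cumulative_buckets(self) -> list:
        """Cumulative counts, the form in which Prometheus exposes histograms."""
        return list(accumulate(self.latency_buckets))

    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0
```

Keeping the label set small (here only agent type and status) is one way to apply the cardinality control mentioned under best practices.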
System Metrics
- CPU and memory utilization
- Database connection pool status
- Message queue depth
- WebSocket connection count
- HTTP client connection pool
Business Metrics
- Jobs created per hour
- Job success/failure rates
- Average processing time by model
- Assets approved per project
- User activity patterns
Key Classes
- MetricsCollector: Base metrics collection
- AgentMetrics: Agent-specific metrics
- SystemMetrics: Infrastructure metrics
- BusinessMetrics: Application-level metrics
Distributed Tracing
OpenTelemetry Implementation
- Backend: Jaeger for trace storage and visualization
- Sampling: 10% for normal traffic, 100% for errors
- Propagation: W3C Trace Context standard
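To make the sampling and propagation settings concrete, the sketch below builds and parses a W3C `traceparent` header (`version-traceid-parentid-flags`) and applies a 10% / 100% sampling rule. It uses only the standard library rather than the OpenTelemetry SDK, and the function names are hypothetical. It also assumes the error status is known when the decision is made, which in practice usually means a tail-based sampler.

```python
import re
import secrets

# traceparent, version 00: 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
SAMPLE_RATE = 0.10  # normal traffic, from the list above


def should_sample(trace_id: str, is_error: bool, rate: float = SAMPLE_RATE) -> bool:
    """Keep every error trace and a deterministic fraction of the rest."""
    if is_error:
        return True
    # Deriving the decision from the trace id lets every service agree on it.
    return int(trace_id[-16:], 16) < int(rate * 2**64)


def new_traceparent(sampled: bool) -> str:
    """Create a header for a new root span."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return the header's parts, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A service receiving a request would parse the incoming header, reuse its `trace_id`, and forward a new header carrying its own span id as the parent.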
Trace Points
- API Gateway entry/exit
- Agent message processing
- Tool execution
- Database queries
- External service calls
- Message broker operations
Key Classes
- TracingMiddleware: Automatic trace injection
- SpanManager: Manual span creation
- TraceContextPropagator: Context propagation between services
Health Checks
Health Check Levels
- Liveness: Basic service responsiveness
- Readiness: Full dependency checks
- Startup: Initial service validation
Monitored Dependencies
- Database connectivity
- Message broker connection
- MCP server availability
- LLM service responsiveness
- Disk space availability
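One way the liveness and readiness levels could differ in code: liveness runs no dependency checks, while readiness aggregates one check per monitored dependency. This is a sketch under assumptions; `run_checks`, `liveness`, `disk_space_ok`, and the 10% free-space floor are hypothetical and do not describe the project's `HealthChecker`.

```python
import shutil
from typing import Callable, Dict

Check = Callable[[], bool]


def disk_space_ok(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Example dependency check: enough free disk space (threshold assumed)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


def run_checks(checks: Dict[str, Check]) -> dict:
    """Run every check and summarise the result for an HTTP health endpoint."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    return {
        "status": "ok" if healthy else "unavailable",
        "http_status": 200 if healthy else 503,
        "checks": results,
    }


def liveness() -> dict:
    """Liveness only proves the process can answer; no dependencies involved."""
    return run_checks({})
```

A readiness probe would then call something like `run_checks({"database": db_ping, "disk": disk_space_ok})`, where `db_ping` stands for whatever connectivity test the service already has.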
Key Classes
- HealthChecker: Aggregates all health checks
- DependencyChecker: Individual dependency validation
- HealthEndpoint: HTTP endpoints for health status
Alerting Strategy
Alert Categories
Critical Alerts (Immediate Action)
- Service down
- Database connection pool exhausted
- Disk space < 10%
- Error rate > 10%
Warning Alerts (Investigation Needed)
- Response time > 5 seconds (p95)
- Queue depth > 100 jobs
- Memory usage > 80%
- Failed job rate > 5%
Info Alerts (Awareness)
- New deployment completed
- Scheduled maintenance
- Configuration changes
Alert Routing
- Critical: PagerDuty + Slack
- Warning: Slack + Email
- Info: Slack only
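The numeric thresholds and the routing table above can be expressed as data plus a small evaluation loop. The sketch below is illustrative only: the real rules would be Prometheus rule definitions, the metric names are hypothetical, and non-numeric conditions such as "service down" are left out.

```python
# Severity -> notification channels, from the routing list above.
ROUTES = {
    "critical": ("pagerduty", "slack"),
    "warning": ("slack", "email"),
    "info": ("slack",),
}

# (metric name, threshold, breach direction, severity); metric names are assumed.
RULES = [
    ("disk_free_percent", 10.0, "below", "critical"),
    ("error_rate_percent", 10.0, "above", "critical"),
    ("response_time_p95_seconds", 5.0, "above", "warning"),
    ("queue_depth_jobs", 100, "above", "warning"),
    ("memory_usage_percent", 80.0, "above", "warning"),
    ("failed_job_rate_percent", 5.0, "above", "warning"),
]


def evaluate(snapshot: dict) -> list:
    """Return one alert per breached rule, with the channels to notify."""
    alerts = []
    for metric, threshold, direction, severity in RULES:
        value = snapshot.get(metric)
        if value is None:
            continue  # metric not reported in this snapshot
        breached = value < threshold if direction == "below" else value > threshold
        if breached:
            alerts.append({
                "metric": metric,
                "value": value,
                "severity": severity,
                "channels": ROUTES[severity],
            })
    return alerts
```

Comparisons are strict, so a queue depth of exactly 100 jobs does not fire the warning.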
Key Components
- AlertManager: Alert routing and deduplication
- AlertRules: Prometheus rule definitions
- NotificationChannels: Integration with external services
Performance Profiling
Continuous Profiling
- Tool: pyinstrument for Python profiling
- Trigger: Operations exceeding 1 second
- Storage: Profile snapshots for analysis
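Because an operation's duration is only known once it finishes, a conditional profiler has to record every run and discard the profile when the run was fast. The sketch below shows that pattern with the standard library's `cProfile` standing in for pyinstrument; `profile_if_slow` and the in-memory `snapshots` list are hypothetical, and real storage would persist the reports.

```python
import cProfile
import io
import pstats
import time
from contextlib import contextmanager

SLOW_THRESHOLD_SECONDS = 1.0  # trigger from the list above
snapshots = []                # in-memory stand-in for profile storage


@contextmanager
def profile_if_slow(operation: str, threshold: float = SLOW_THRESHOLD_SECONDS):
    """Profile the wrapped block; keep the report only if it ran too long."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        elapsed = time.perf_counter() - start
        if elapsed > threshold:
            buffer = io.StringIO()
            stats = pstats.Stats(profiler, stream=buffer)
            stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
            snapshots.append({
                "operation": operation,
                "seconds": elapsed,
                "report": buffer.getvalue(),
            })
```

Wrapping a call site as `with profile_if_slow("agent_message"): handle(message)` (with `handle` being the existing handler) adds profiling overhead to every call, so a sampling profiler such as pyinstrument is the lighter choice for production use.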
Profiling Targets
- Agent message processing
- Database query execution
- LLM inference calls
- Serialization/deserialization
Key Classes
- PerformanceProfiler: Conditional profiling wrapper
- ProfileStorage: Profile data persistence
- ProfileAnalyzer: Performance bottleneck detection
Debugging Tools
Request Tracking
- Request ID: UUID propagated through all services
- Context Variables: Thread-local request context
- Correlation: Link logs, metrics, and traces
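A sketch of request ID propagation using Python's `contextvars`, whose values are isolated per thread and per asyncio task. The logging filter stamps the current ID onto every record so that logs can be joined with traces and metrics. `start_request` and `RequestIdFilter` are hypothetical names, not the project's `RequestTracker`.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled in this context.
request_id_var = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_id: Optional[str] = None) -> str:
    """Adopt the caller's request ID, or mint a new UUID at the edge."""
    request_id = incoming_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record
```

Attaching the filter once with `handler.addFilter(RequestIdFilter())` makes the ID available to any formatter without changing individual log calls.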
Debug Endpoints
- /debug/config: Current configuration
- /debug/connections: Active connection stats
- /debug/memory: Memory usage breakdown
- /debug/profile: Trigger manual profiling
Key Classes
- RequestTracker: Request ID management
- DebugContext: Debug information aggregation
- DiagnosticEndpoints: Protected debug routes
Operational Dashboards
Service Overview Dashboard
- Request rate and error trends
- Response time percentiles
- Active users and connections
- Resource utilization
Job Processing Dashboard
- Queue depth by status
- Processing time by model
- Success/failure rates
- Retry attempts
Agent Performance Dashboard
- Message processing times
- Tool usage patterns
- LLM token consumption
- Cache effectiveness
Infrastructure Dashboard
- Container health and restarts
- Database performance
- Message broker throughput
- Network latency
SLA Monitoring
Key SLA Metrics
- Availability: 99.9% uptime
- Response Time: 95% of requests complete in under 2 seconds
- Error Rate: < 1%
- Job Processing: 95% of jobs complete in under 5 minutes
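The targets above translate into simple arithmetic: 99.9% availability leaves roughly 43 minutes of downtime in a 30-day window, and the percentile targets reduce to the fraction of samples under a limit. A rough sketch follows; the function names and the 30-day window are assumptions, and it treats the error rate as errors divided by total requests.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def fraction_within(samples: list, limit: float) -> float:
    """Share of samples strictly below the limit (1.0 for no samples)."""
    if not samples:
        return 1.0
    return sum(1 for value in samples if value < limit) / len(samples)


def sla_report(response_seconds: list, job_seconds: list,
               errors: int, requests: int) -> dict:
    """Evaluate the three non-availability targets from the list above."""
    error_rate = errors / requests if requests else 0.0
    return {
        "response_time_ok": fraction_within(response_seconds, 2.0) >= 0.95,
        "error_rate_ok": error_rate < 0.01,
        "job_processing_ok": fraction_within(job_seconds, 5 * 60) >= 0.95,
    }
```

Feeding the report with a rolling window of recent samples would approximate the real-time calculation attributed to `SLAMonitor`.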
SLA Tracking
- SLAMonitor: Real-time SLA calculation
- SLAReporter: Daily/weekly reports
- SLAAlerts: Breach notifications
Incident Response
Runbooks
- Service restart procedures
- Database connection issues
- High memory usage
- Queue backlog resolution
Post-Mortem Process
- Incident timeline construction
- Root cause analysis
- Action items tracking
- Knowledge base updates
Monitoring Best Practices
- Cardinality Control: Limit label combinations
- Sampling Strategy: Balance detail vs. overhead
- Alert Fatigue: Tune thresholds based on patterns
- Dashboard Organization: Role-based views
- Retention Policies: Cost-effective data storage