Monitoring & Observability

Overview

This document defines the monitoring and observability strategy for the Digital Product Creation System, focusing on metrics collection, logging, distributed tracing, and alerting to ensure system reliability and performance visibility.

Logging Strategy

Structured Logging

Format: JSON-structured logs for consistent parsing
Key Fields: timestamp, request_id, user_id, agent_type, operation, duration, error_details
Log Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
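
The format and key fields above can be sketched with the standard library alone. This is a minimal illustration, not the project's StructuredLogger: the names `JsonFormatter`, `CONTEXT_FIELDS`, and `build_logger` are assumptions introduced here, and the extra `level` and `message` keys are additions beyond the listed key fields.

```python
import json
import logging
from datetime import datetime, timezone

# Fields from the "Key Fields" list that callers may attach via `extra=`.
CONTEXT_FIELDS = (
    "request_id", "user_id", "agent_type", "operation", "duration", "error_details",
)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical class)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were actually supplied.
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("agent").info("tool finished", extra={"request_id": "r-1", "operation": "render", "duration": 0.42})` would then emit a single JSON line that a log shipper can parse without custom patterns.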

Log Aggregation

Stack: ELK (Elasticsearch, Logstash, Kibana)
Collection: Filebeat on each container
Retention: 30 days for standard logs, 90 days for errors
Indexing: By service type and date

Key Classes

StructuredLogger: Base logger with JSON formatting
AgentLogger: Specialized logger for agent operations
RequestContextLogger: Adds request context to all logs

Metrics Collection

Prometheus Metrics

Collection Interval: 15 seconds
Retention: 15 days
Aggregation: 1-minute, 5-minute, 1-hour buckets

Key Metrics

Agent Metrics

Request count by type and status
Response time histograms
LLM token usage and costs
Tool execution counts
Cache hit rates
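
To make two of the agent metrics above concrete, the sketch below shows how a response time histogram with cumulative buckets and a cache hit rate could be computed in memory. It is an illustration only: a real deployment would normally rely on a Prometheus client library, and the bucket bounds and the names `LatencyHistogram` and `CacheStats` are assumptions introduced here.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in seconds; tune them to the observed latency range.
BUCKETS = (0.1, 0.5, 1.0, 2.0, 5.0)


class LatencyHistogram:
    """Count observations per bucket and expose cumulative counts."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = tuple(sorted(bounds))
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left returns the index of the first bound >= seconds,
        # which is the smallest "less than or equal" bucket for this value.
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    def cumulative(self) -> dict:
        """Return {upper_bound: number of observations <= upper_bound}."""
        result, running = {}, 0
        for bound, count in zip(self.bounds + (float("inf"),), self.counts):
            running += count
            result[bound] = running
        return result


class CacheStats:
    """Track cache hits and misses and derive the hit rate."""

    def __init__(self):
        self.events = Counter()

    def record(self, hit: bool) -> None:
        self.events["hit" if hit else "miss"] += 1

    def hit_rate(self) -> float:
        total = self.events["hit"] + self.events["miss"]
        return self.events["hit"] / total if total else 0.0
```

Cumulative buckets are what allow percentiles such as p95 to be estimated later from the stored counts rather than from raw samples.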

System Metrics

CPU and memory utilization
Database connection pool status
Message queue depth
WebSocket connection count
HTTP client connection pool

Business Metrics

Jobs created per hour
Job success/failure rates
Average processing time by model
Assets approved per project
User activity patterns

Key Classes

MetricsCollector: Base metrics collection
AgentMetrics: Agent-specific metrics
SystemMetrics: Infrastructure metrics
BusinessMetrics: Application-level metrics

Distributed Tracing

OpenTelemetry Implementation

Backend: Jaeger for trace storage and visualization
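Sampling: 10% of normal traffic, 100% of traces that contain errors (keeping every error trace requires a tail-based decision, because errors are not known when a trace starts)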
Propagation: W3C Trace Context standard
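
The propagation and sampling settings above can be illustrated with a small sketch of the W3C `traceparent` header, whose format is `version-traceid-parentid-flags`. In practice the OpenTelemetry SDK handles this; the code below only shows the mechanics. The helper names are assumptions introduced here, the parser deliberately accepts only the fixed four-field layout, and the ratio check is one simple way to make a deterministic 10% decision, not necessarily the algorithm the SDK uses.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")
SAMPLE_RATE = 0.10  # normal-traffic rate from the sampling policy


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header, returning None if it is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context as a version-00 traceparent header."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Decide from the trace ID alone, so every service reaches the same answer."""
    return int(trace_id[-16:], 16) < rate * 2**64


def new_child(ctx: TraceContext) -> TraceContext:
    """Keep the trace ID and sampling flag, generate a fresh span ID."""
    return ctx._replace(span_id=secrets.token_hex(8))
```

A service receiving a request would parse the incoming header, create a child context for its own span, and send `format_traceparent(child)` on outgoing calls.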

Trace Points

API Gateway entry/exit
Agent message processing
Tool execution
Database queries
External service calls
Message broker operations

Key Classes

TracingMiddleware: Automatic trace injection
SpanManager: Manual span creation
TraceContextPropagator: Context propagation between services

Health Checks

Health Check Levels

Liveness: Basic service responsiveness
Readiness: Full dependency checks
Startup: Initial service validation

Monitored Dependencies

Database connectivity
Message broker connection
MCP server availability
LLM service responsiveness
Disk space availability

Key Classes

HealthChecker: Aggregates all health checks
DependencyChecker: Individual dependency validation
HealthEndpoint: HTTP endpoints for health status
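
The split between liveness and readiness can be sketched as follows. This is a hedged illustration of how a HealthChecker might aggregate per-dependency checks; the method names, the `CheckResult` type, and the `disk_space_check` factory are assumptions introduced here, and the real dependency probes (database, message broker, MCP server, LLM service) are left to the caller.

```python
import shutil
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass
class CheckResult:
    healthy: bool
    detail: str = ""


class HealthChecker:
    """Aggregate named dependency checks into liveness and readiness reports."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], CheckResult]] = {}

    def register(self, name: str, check: Callable[[], CheckResult]) -> None:
        self._checks[name] = check

    def liveness(self) -> dict:
        # Liveness only proves the process can answer; no dependencies are touched.
        return {"status": "ok"}

    def readiness(self) -> dict:
        results = {}
        for name, check in self._checks.items():
            try:
                results[name] = check()
            except Exception as exc:  # a crashing check counts as unhealthy
                results[name] = CheckResult(False, f"{type(exc).__name__}: {exc}")
        healthy = all(result.healthy for result in results.values())
        return {
            "status": "ok" if healthy else "unavailable",
            "checks": {name: asdict(result) for name, result in results.items()},
        }


def disk_space_check(path: str = "/", min_free_ratio: float = 0.10):
    """Build a check that fails when free disk space drops below the ratio."""
    def check() -> CheckResult:
        usage = shutil.disk_usage(path)
        free_ratio = usage.free / usage.total
        return CheckResult(free_ratio >= min_free_ratio, f"{free_ratio:.0%} free")
    return check
```

Keeping liveness free of dependency calls avoids restarting a healthy process just because a downstream service is temporarily unavailable.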

Alerting Strategy

Alert Categories

Critical Alerts (Immediate Action)

Service down
Database connection pool exhausted
Free disk space < 10%
Error rate > 10%

Warning Alerts (Investigation Needed)

Response time > 5 seconds (p95)
Queue depth > 100 jobs
Memory usage > 80%
Failed job rate > 5%

Info Alerts (Awareness)

New deployment completed
Scheduled maintenance
Configuration changes

Alert Routing

Critical: PagerDuty + Slack
Warning: Slack + Email
Info: Slack only
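
The thresholds and routing above can be expressed as data plus a small evaluation function. In production these would normally live in Prometheus rule files and Alertmanager routes; the Python below is only a sketch of the logic. The metric keys (`service_up`, `db_pool_available`, and so on) and rule names are hypothetical, and info alerts are omitted because they are event-driven rather than threshold-driven.

```python
from typing import Callable, Dict, List, NamedTuple

# Severity to notification channels, as listed under Alert Routing.
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack", "email"],
    "info": ["slack"],
}


class Rule(NamedTuple):
    name: str
    severity: str
    metric: str  # hypothetical metric key
    breached: Callable[[float], bool]


RULES = [
    Rule("ServiceDown", "critical", "service_up", lambda v: v == 0),
    Rule("DbPoolExhausted", "critical", "db_pool_available", lambda v: v == 0),
    Rule("LowDiskSpace", "critical", "disk_free_pct", lambda v: v < 10),
    Rule("HighErrorRate", "critical", "error_rate_pct", lambda v: v > 10),
    Rule("SlowResponses", "warning", "p95_response_s", lambda v: v > 5),
    Rule("QueueBacklog", "warning", "queue_depth", lambda v: v > 100),
    Rule("HighMemory", "warning", "memory_used_pct", lambda v: v > 80),
    Rule("FailedJobs", "warning", "failed_job_rate_pct", lambda v: v > 5),
]


def evaluate(metrics: Dict[str, float]) -> List[dict]:
    """Return one alert per breached rule, with the channels to notify."""
    alerts = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is not None and rule.breached(value):
            alerts.append({
                "alert": rule.name,
                "severity": rule.severity,
                "channels": ROUTES[rule.severity],
                "value": value,
            })
    return alerts
```

For example, `evaluate({"disk_free_pct": 8, "queue_depth": 150})` yields a critical LowDiskSpace alert routed to PagerDuty and Slack, and a warning QueueBacklog alert routed to Slack and email.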

Key Components

AlertManager: Alert routing and deduplication
AlertRules: Prometheus rule definitions
NotificationChannels: Integration with external services

Performance Profiling

Continuous Profiling

Tool: pyinstrument for Python profiling
Trigger: Operations exceeding 1 second
Storage: Profile snapshots for analysis
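
Because an operation's duration is only known once it finishes, a conditional profiler has to record every call and keep the profile only when the threshold is exceeded. The sketch below shows that pattern with the standard library's cProfile as a stand-in for pyinstrument, so that it runs without extra dependencies. The decorator name `profile_if_slow` and the in-memory `captured_profiles` list (a placeholder for ProfileStorage) are assumptions introduced here.

```python
import cProfile
import functools
import io
import pstats
import time

SLOW_THRESHOLD_S = 1.0  # trigger from the profiling policy
captured_profiles = []  # in-memory placeholder for ProfileStorage


def profile_if_slow(threshold_s: float = SLOW_THRESHOLD_S, store=captured_profiles):
    """Profile each call, but keep the report only when the call was slow."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            start = time.perf_counter()
            try:
                # runcall enables the profiler, calls func, then disables it.
                return profiler.runcall(func, *args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold_s:
                    buffer = io.StringIO()
                    stats = pstats.Stats(profiler, stream=buffer)
                    stats.sort_stats("cumulative").print_stats(10)
                    store.append({
                        "operation": func.__name__,
                        "duration": elapsed,
                        "report": buffer.getvalue(),
                    })
        return wrapper
    return decorator
```

cProfile is a deterministic profiler with noticeable overhead, which is one reason a sampling profiler such as pyinstrument is the better fit for always-on use. This sketch also does not handle nested decorated calls or coroutines.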

Profiling Targets

Agent message processing
Database query execution
LLM inference calls
Serialization/deserialization

Key Classes

PerformanceProfiler: Conditional profiling wrapper
ProfileStorage: Profile data persistence
ProfileAnalyzer: Performance bottleneck detection

Debugging Tools

Request Tracking

Request ID: UUID propagated through all services
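Context Variables: Per-request context held in context-local storage (for example Python contextvars), which stays isolated per thread and per async task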
Correlation: Link logs, metrics, and traces
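
Request ID propagation within a single service can be sketched with Python's `contextvars` module, which keeps a separate value for each asyncio task. The function names below are assumptions introduced for illustration and are not the project's RequestTracker API.

```python
import asyncio
import contextvars
import uuid
from typing import Optional

# Holds the request ID for the current task or thread; None outside a request.
request_id_var: contextvars.ContextVar = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_id: Optional[str] = None) -> str:
    """Adopt the caller's request ID, or mint a new UUID at the system edge."""
    request_id = incoming_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


def current_request_id() -> Optional[str]:
    return request_id_var.get()


def correlation_fields() -> dict:
    """Fields to attach to logs, metrics exemplars, and span attributes."""
    return {"request_id": current_request_id()}


async def handle(incoming_id: Optional[str] = None):
    request_id = start_request(incoming_id)
    await asyncio.sleep(0)  # yield so that other requests interleave
    # The value read after the await still belongs to this request.
    return request_id, current_request_id()


async def demo():
    # gather runs each coroutine as its own task with its own context copy.
    return await asyncio.gather(handle("req-a"), handle("req-b"), handle())
```

Running `asyncio.run(demo())` shows that each concurrent request reads back its own ID after the await, which plain thread-local storage would not guarantee when many tasks share one thread.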

Debug Endpoints

/debug/config: Current configuration
/debug/connections: Active connection stats
/debug/memory: Memory usage breakdown
/debug/profile: Trigger manual profiling

Key Classes

RequestTracker: Request ID management
DebugContext: Debug information aggregation
DiagnosticEndpoints: Protected debug routes

Operational Dashboards

Service Overview Dashboard

Request rate and error trends
Response time percentiles
Active users and connections
Resource utilization

Job Processing Dashboard

Queue depth by status
Processing time by model
Success/failure rates
Retry attempts

Agent Performance Dashboard

Message processing times
Tool usage patterns
LLM token consumption
Cache effectiveness

Infrastructure Dashboard

Container health and restarts
Database performance
Message broker throughput
Network latency

SLA Monitoring

Key SLA Metrics

Availability: 99.9% uptime
Response Time: 95% of requests complete in < 2 seconds
Error Rate: < 1% of requests
Job Processing: 95% of jobs complete in < 5 minutes
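
As a worked example of these targets, a 99.9% availability objective over a 30-day window leaves an error budget of 0.001 × 30 × 24 × 60 = 43.2 minutes of downtime. The sketch below evaluates all four targets for one reporting window. The function names and the shape of the report are assumptions introduced here, not the SLAMonitor interface.

```python
from typing import Dict, Sequence

AVAILABILITY_TARGET = 0.999
RESPONSE_LIMIT_S = 2.0
JOB_LIMIT_S = 5 * 60
FAST_SHARE_TARGET = 0.95
ERROR_RATE_LIMIT = 0.01


def allowed_downtime_minutes(target: float = AVAILABILITY_TARGET, days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1 - target) * days * 24 * 60


def share_below(values: Sequence[float], limit: float) -> float:
    """Fraction of values strictly below the limit (1.0 for an empty window)."""
    return sum(v < limit for v in values) / len(values) if values else 1.0


def sla_report(
    request_latencies_s: Sequence[float],
    error_count: int,
    job_durations_s: Sequence[float],
    downtime_s: float,
    window_s: float,
) -> Dict[str, bool]:
    """Return True for each SLA metric that is met in this window."""
    requests = len(request_latencies_s)
    error_rate = error_count / requests if requests else 0.0
    return {
        "availability": 1 - downtime_s / window_s >= AVAILABILITY_TARGET,
        "response_time": share_below(request_latencies_s, RESPONSE_LIMIT_S) >= FAST_SHARE_TARGET,
        "error_rate": error_rate < ERROR_RATE_LIMIT,
        "job_processing": share_below(job_durations_s, JOB_LIMIT_S) >= FAST_SHARE_TARGET,
    }
```

Computing the share of requests under the limit directly, rather than a percentile, keeps the check equivalent to the stated target without depending on an interpolation method.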

SLA Tracking

SLAMonitor: Real-time SLA calculation
SLAReporter: Daily/weekly reports
SLAAlerts: Breach notifications

Incident Response

Runbooks

Service restart procedures
Database connection issues
High memory usage
Queue backlog resolution

Post-Mortem Process

Incident timeline construction
Root cause analysis
Action items tracking
Knowledge base updates

Monitoring Best Practices

Cardinality Control: Limit label combinations
Sampling Strategy: Balance detail vs. overhead
Alert Fatigue: Tune thresholds based on patterns
Dashboard Organization: Role-based views
Retention Policies: Cost-effective data storage