Monitoring & Observability

Overview

This document defines the monitoring and observability strategy for the Digital Product Creation System, focusing on metrics collection, logging, distributed tracing, and alerting to ensure system reliability and performance visibility.

Logging Strategy

Structured Logging

  • Format: JSON-structured logs for consistent parsing
  • Key Fields: timestamp, request_id, user_id, agent_type, operation, duration, error_details
  • Log Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
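
As an illustration of the format above, the following is a minimal sketch of a JSON formatter built on Python's standard logging module. The names JsonFormatter and build_logger are hypothetical and are not the StructuredLogger class this document refers to; the extra "level", "logger", and "message" keys are also assumptions added so each line is self-describing.

```python
import json
import logging
from datetime import datetime, timezone

# Context fields named in the list above; copied onto the output when present.
CONTEXT_FIELDS = (
    "request_id", "user_id", "agent_type",
    "operation", "duration", "error_details",
)


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        # Fall back to the traceback when no explicit error_details was given.
        if record.exc_info and "error_details" not in payload:
            payload["error_details"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger("agent").info("job created", extra={"request_id": "abc", "operation": "create_job"}) would then emit a single JSON line carrying those fields.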

Log Aggregation

  • Stack: ELK (Elasticsearch, Logstash, Kibana)
  • Collection: Filebeat on each container
  • Retention: 30 days for standard logs, 90 days for errors
  • Indexing: By service type and date

Key Classes

  • StructuredLogger: Base logger with JSON formatting
  • AgentLogger: Specialized logger for agent operations
  • RequestContextLogger: Adds request context to all logs

Metrics Collection

Prometheus Metrics

  • Collection Interval: 15 seconds
  • Retention: 15 days
  • Aggregation: 1-minute, 5-minute, 1-hour buckets

Key Metrics

Agent Metrics

  • Request count by type and status
  • Response time histograms
  • LLM token usage and costs
  • Tool execution counts
  • Cache hit rates
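
To make "response time histograms" concrete, here is a small standard-library sketch of a cumulative-bucket histogram keyed by agent type and status. It is a stand-in that mimics how Prometheus histograms count observations, not the Prometheus client library itself; the class name LatencyHistogram and the bucket bounds are assumptions chosen for illustration.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, 2.0, 5.0)


class LatencyHistogram:
    """Hypothetical histogram with Prometheus-style cumulative buckets."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(sorted(buckets))
        # One extra slot per series for the implicit +Inf bucket.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, agent_type: str, status: str, seconds: float) -> None:
        key = (agent_type, status)
        # bisect_left puts a value equal to a bound into that bound's bucket,
        # matching the "less than or equal" meaning of each upper bound.
        index = bisect.bisect_left(self.buckets, seconds)
        self.counts[key][index] += 1
        self.sums[key] += seconds

    def cumulative(self, agent_type: str, status: str) -> dict:
        """Return, per bound, the count of observations at or below it."""
        running, result = 0, {}
        raw = self.counts[(agent_type, status)]
        for bound, count in zip(self.buckets + (float("inf"),), raw):
            running += count
            result[bound] = running
        return result
```

Cumulative buckets are what make percentile estimates such as p95 cheap to compute at query time, at the cost of precision limited by the chosen bounds.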

System Metrics

  • CPU and memory utilization
  • Database connection pool status
  • Message queue depth
  • WebSocket connection count
  • HTTP client connection pool

Business Metrics

  • Jobs created per hour
  • Job success/failure rates
  • Average processing time by model
  • Assets approved per project
  • User activity patterns

Key Classes

  • MetricsCollector: Base metrics collection
  • AgentMetrics: Agent-specific metrics
  • SystemMetrics: Infrastructure metrics
  • BusinessMetrics: Application-level metrics

Distributed Tracing

OpenTelemetry Implementation

  • Backend: Jaeger for trace storage and visualization
  • Sampling: 10% for normal traffic, 100% for errors
  • Propagation: W3C Trace Context standard
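
The sampling rule above can be sketched as a deterministic decision derived from the trace ID, so that every service reaches the same verdict for a given trace. The function name should_sample is hypothetical. One caveat: whether a request is an error is normally known only once it finishes, so keeping 100% of error traces usually requires the decision to be made after the fact (tail-based sampling); the sketch simply assumes the error flag is available.

```python
NORMAL_SAMPLE_RATE = 0.10  # 10% of normal traffic, per the list above


def should_sample(trace_id: str, is_error: bool,
                  normal_rate: float = NORMAL_SAMPLE_RATE) -> bool:
    """Hypothetical sampler: keep all errors and a fixed share of the rest.

    `trace_id` is assumed to be the 32-character hex ID used by
    W3C Trace Context.
    """
    if is_error:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1) and compare to the rate.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < normal_rate
```

Deriving the decision from the trace ID rather than from a random draw avoids traces where some services recorded spans and others did not.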

Trace Points

  • API Gateway entry/exit
  • Agent message processing
  • Tool execution
  • Database queries
  • External service calls
  • Message broker operations

Key Classes

  • TracingMiddleware: Automatic trace injection
  • SpanManager: Manual span creation
  • TraceContextPropagator: Context propagation between services
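
For context propagation, the W3C Trace Context standard carries trace state in a traceparent header of the form version-traceid-parentid-flags. The sketch below parses and re-emits that header using only the standard library. It handles the version 00 layout only, and the names TraceContext, parse_traceparent, and child_traceparent are hypothetical; they are not the TraceContextPropagator class listed above.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid under the standard.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

A service would parse the incoming header at its entry point and attach the result of child_traceparent to each outgoing request.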

Health Checks

Health Check Levels

  1. Liveness: Basic service responsiveness
  2. Readiness: Full dependency checks
  3. Startup: Initial service validation

Monitored Dependencies

  • Database connectivity
  • Message broker connection
  • MCP server availability
  • LLM service responsiveness
  • Disk space availability

Key Classes

  • HealthChecker: Aggregates all health checks
  • DependencyChecker: Individual dependency validation
  • HealthEndpoint: HTTP endpoints for health status
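
A minimal sketch of how the liveness and readiness levels might differ in code: liveness answers without touching dependencies, while readiness runs every dependency check and reports 503 if any fails. The function names and the "." default path are assumptions for illustration; real checks for the database, message broker, MCP server, and LLM service would be supplied as callables.

```python
import shutil
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def liveness() -> Tuple[int, dict]:
    # Liveness does no dependency work: answering at all is the signal.
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run each dependency check; return an HTTP-style status and a report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = "ok" if healthy else "unavailable"
    return (200 if healthy else 503), {"status": status, "checks": results}


def disk_space_ok(path: str = ".", min_free_ratio: float = 0.10) -> bool:
    """Example dependency check: at least 10% of the volume is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio
```

Keeping liveness free of dependency checks avoids restarting a healthy process merely because a downstream service is slow.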

Alerting Strategy

Alert Categories

Critical Alerts (Immediate Action)

  • Service down
  • Database connection pool exhausted
  • Disk space < 10%
  • Error rate > 10%

Warning Alerts (Investigation Needed)

  • Response time > 5 seconds (p95)
  • Queue depth > 100 jobs
  • Memory usage > 80%
  • Failed job rate > 5%

Info Alerts (Awareness)

  • New deployment completed
  • Scheduled maintenance
  • Configuration changes

Alert Routing

  • Critical: PagerDuty + Slack
  • Warning: Slack + Email
  • Info: Slack only
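
The thresholds and routing above can be sketched as a small rule table evaluated against a snapshot of current metric values. The metric names (disk_free_pct, queue_depth, and so on) and the evaluate function are hypothetical; in the described system these rules would live in Prometheus rule definitions, and this Python version only illustrates the logic.

```python
from typing import Dict, List

# Channels per severity, as listed under Alert Routing.
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack", "email"],
    "info": ["slack"],
}

# (hypothetical metric name, severity, breach predicate)
RULES = [
    ("service_up", "critical", lambda v: v == 0),
    ("db_pool_available", "critical", lambda v: v == 0),
    ("disk_free_pct", "critical", lambda v: v < 10),
    ("error_rate_pct", "critical", lambda v: v > 10),
    ("response_time_p95_s", "warning", lambda v: v > 5),
    ("queue_depth", "warning", lambda v: v > 100),
    ("memory_used_pct", "warning", lambda v: v > 80),
    ("failed_job_rate_pct", "warning", lambda v: v > 5),
]


def evaluate(snapshot: Dict[str, float]) -> List[dict]:
    """Return one alert per breached rule, with its target channels."""
    alerts = []
    for metric, severity, breached in RULES:
        if metric in snapshot and breached(snapshot[metric]):
            alerts.append({
                "metric": metric,
                "severity": severity,
                "value": snapshot[metric],
                "channels": ROUTES[severity],
            })
    return alerts
```

For example, a snapshot with 8% free disk and a queue depth of 150 yields one critical alert routed to PagerDuty and Slack and one warning routed to Slack and email.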

Key Components

  • AlertManager: Alert routing and deduplication
  • AlertRules: Prometheus rule definitions
  • NotificationChannels: Integration with external services

Performance Profiling

Continuous Profiling

  • Tool: pyinstrument for Python profiling
  • Trigger: Operations exceeding 1 second
  • Storage: Profile snapshots for analysis
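
The "profile only slow operations" trigger could look roughly like the decorator below. To stay within the standard library, this sketch swaps in cProfile for pyinstrument, which is a deliberate substitution: cProfile instruments every call and adds more overhead than a sampling profiler. The names profile_if_slow and slow_profiles are hypothetical, and the in-memory list stands in for real profile storage.

```python
import cProfile
import functools
import io
import pstats
import time

SLOW_THRESHOLD_SECONDS = 1.0  # trigger from the list above
slow_profiles = []            # hypothetical stand-in for profile storage


def profile_if_slow(threshold: float = SLOW_THRESHOLD_SECONDS):
    """Profile every call, but keep the report only when it ran too long."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            start = time.perf_counter()
            try:
                return profiler.runcall(func, *args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold:
                    buffer = io.StringIO()
                    stats = pstats.Stats(profiler, stream=buffer)
                    # Keep the ten most expensive entries by cumulative time.
                    stats.sort_stats("cumulative").print_stats(10)
                    slow_profiles.append({
                        "operation": func.__name__,
                        "seconds": elapsed,
                        "report": buffer.getvalue(),
                    })
        return wrapper
    return decorator
```

Because the profiler must already be running when the call starts, the cost is paid on every call; only the storage step is conditional on the threshold.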

Profiling Targets

  • Agent message processing
  • Database query execution
  • LLM inference calls
  • Serialization/deserialization

Key Classes

  • PerformanceProfiler: Conditional profiling wrapper
  • ProfileStorage: Profile data persistence
  • ProfileAnalyzer: Performance bottleneck detection

Debugging Tools

Request Tracking

  • Request ID: UUID propagated through all services
  • Context Variables: Request context held in context variables, scoped to the current thread or async task
  • Correlation: Link logs, metrics, and traces
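
One way to carry a request ID through the code path is Python's contextvars module combined with a logging filter, so every log record picks up the ID without it being passed explicitly. This is a sketch under assumptions: start_request and RequestIdFilter are hypothetical names, not the RequestTracker class this document lists.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the current request ID; each thread or async task sees its own value.
request_id_var: contextvars.ContextVar = contextvars.ContextVar(
    "request_id", default=None
)


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse an upstream request ID if one was supplied, else mint a UUID."""
    request_id = incoming_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


class RequestIdFilter(logging.Filter):
    """Stamp the current request ID onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True
```

Attaching RequestIdFilter to a handler makes the same ID appear in logs, and the ID can also be set as a span attribute to link traces back to those log lines.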

Debug Endpoints

  • /debug/config: Current configuration
  • /debug/connections: Active connection stats
  • /debug/memory: Memory usage breakdown
  • /debug/profile: Trigger manual profiling

Key Classes

  • RequestTracker: Request ID management
  • DebugContext: Debug information aggregation
  • DiagnosticEndpoints: Protected debug routes

Operational Dashboards

Service Overview Dashboard

  • Request rate and error trends
  • Response time percentiles
  • Active users and connections
  • Resource utilization

Job Processing Dashboard

  • Queue depth by status
  • Processing time by model
  • Success/failure rates
  • Retry attempts

Agent Performance Dashboard

  • Message processing times
  • Tool usage patterns
  • LLM token consumption
  • Cache effectiveness

Infrastructure Dashboard

  • Container health and restarts
  • Database performance
  • Message broker throughput
  • Network latency

SLA Monitoring

Key SLA Metrics

  • Availability: 99.9% uptime
  • Response Time: 95% of requests complete in under 2 seconds
  • Error Rate: Under 1% of requests
  • Job Processing: 95% of jobs complete in under 5 minutes
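
As a worked example of these targets, the sketch below computes a nearest-rank percentile, checks a measurement window against the availability, response time, and error rate thresholds, and converts an availability target into a downtime budget. The function names and the choice of the nearest-rank method are assumptions; a 99.9% target over a 30-day window works out to roughly 43.2 minutes of allowed downtime.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def downtime_budget_minutes(target_pct: float, window_minutes: float) -> float:
    """Minutes of downtime allowed while still meeting the target."""
    return window_minutes * (1 - target_pct / 100)


def sla_report(latencies_s: Sequence[float], error_count: int,
               total_requests: int, downtime_minutes: float,
               window_minutes: float) -> dict:
    """Compare one measurement window against the targets listed above."""
    p95 = percentile(latencies_s, 95)
    availability = 100 * (1 - downtime_minutes / window_minutes)
    error_rate = 100 * error_count / total_requests
    return {
        "availability_pct": availability,
        "p95_seconds": p95,
        "error_rate_pct": error_rate,
        "availability_ok": availability >= 99.9,
        "response_time_ok": p95 < 2.0,
        "error_rate_ok": error_rate < 1.0,
    }
```

The job processing target can be checked the same way by passing job durations in seconds and comparing the 95th percentile against 300.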

SLA Tracking

  • SLAMonitor: Real-time SLA calculation
  • SLAReporter: Daily/weekly reports
  • SLAAlerts: Breach notifications

Incident Response

Runbooks

  • Service restart procedures
  • Database connection issues
  • High memory usage
  • Queue backlog resolution

Post-Mortem Process

  • Incident timeline construction
  • Root cause analysis
  • Action items tracking
  • Knowledge base updates

Monitoring Best Practices

  1. Cardinality Control: Limit label combinations
  2. Sampling Strategy: Balance detail vs. overhead
  3. Alert Fatigue: Tune thresholds based on patterns
  4. Dashboard Organization: Role-based views
  5. Retention Policies: Cost-effective data storage
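
Cardinality control in particular lends itself to a small illustration: each distinct combination of label values creates a separate time series, so an unbounded label such as a user ID can multiply storage costs. The sketch below caps the number of distinct label sets per metric and folds any overflow into a single "other" series. LabelGuard is a hypothetical name and this is only one possible approach.

```python
class LabelGuard:
    """Hypothetical guard: cap the distinct label sets a metric may emit."""

    def __init__(self, max_series: int):
        self.max_series = max_series
        self.seen = set()

    def normalize(self, labels: dict) -> dict:
        """Return the labels unchanged, or collapsed once the cap is hit."""
        key = tuple(sorted(labels.items()))
        if key in self.seen:
            return labels
        if len(self.seen) < self.max_series:
            self.seen.add(key)
            return labels
        # Over the cap: fold this combination into one overflow series.
        return {name: "other" for name in labels}
```

Label sets that were admitted before the cap was reached keep passing through unchanged, so existing series stay stable while new combinations are collapsed.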