Monitoring and Observability
Metrics
- Request latency (p50, p95, p99)
- Agent response time
- Job queue depth
- Generation success/failure rates
- System resource utilization
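
As an illustration of the latency percentiles and success/failure rates listed above, here is a minimal sketch using only the Python standard library. The function names (`percentile`, `latency_summary`, `failure_rate`) and the nearest-rank percentile method are assumptions made for this example; the project may well rely on a metrics library or a different percentile definition.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small pct values still work.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize request latencies (in milliseconds) as p50, p95 and p99."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


def failure_rate(successes, failures):
    """Fraction of generations that failed; 0.0 when nothing has run yet."""
    total = successes + failures
    return failures / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical latencies of 1..100 ms, one sample each.
    print(latency_summary(list(range(1, 101))))
    print(failure_rate(successes=90, failures=10))
```

Tail percentiles (p95, p99) are reported alongside the median because a small share of slow requests can be invisible in an average while still dominating the user experience.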
Logging
- Structured JSON logging
- Correlation IDs for request tracing
- Log aggregation and search
- Alert rules for critical errors
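
The first two items above can be sketched with the standard `logging` and `contextvars` modules. This is only one possible arrangement, not necessarily how the project does it: the `JsonFormatter` class, the `build_logger` and `handle_request` helpers, the field names in the JSON payload, and the `"-"` placeholder ID are all assumptions made for this example.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID of the request currently being processed.
# "-" is a placeholder used outside of any request (an assumption).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name="app", stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Hypothetical request handler: reuse the caller's ID or mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        # Restore the previous value so IDs do not leak between requests.
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger())
```

Because every line carries the same `correlation_id`, a log aggregation system can filter on that one field to reconstruct a single request across components.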
Health Checks
- Liveness probes for container health
- Readiness probes for traffic routing
- Deep health checks including dependencies
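
The difference between the three kinds of check can be sketched as plain functions that return an HTTP-style status code and a body. The function names, the response shape, the use of 503 for failure, and the idea of passing dependency checks as callables are all assumptions made for this example; wiring them to real endpoints depends on the web framework in use.

```python
import time


def liveness():
    """Liveness: the process is up and able to answer. No dependencies are
    touched, so a slow database cannot cause the container to be restarted."""
    return 200, {"status": "alive"}


def deep_health(checks):
    """Deep check: run every dependency check and report each result.

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unavailable.
    """
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()
            entry = {"ok": True}
        except Exception as exc:  # any failure marks the dependency as down
            entry = {"ok": False, "error": str(exc)}
        entry["elapsed_ms"] = round((time.monotonic() - started) * 1000, 1)
        results[name] = entry
    healthy = all(entry["ok"] for entry in results.values())
    status = "ok" if healthy else "degraded"
    return (200 if healthy else 503), {"status": status, "dependencies": results}


def readiness(checks):
    """Readiness: accept traffic only when all required dependencies pass."""
    code, _ = deep_health(checks)
    return code, {"status": "ready" if code == 200 else "not ready"}


if __name__ == "__main__":
    def database_ok():
        pass  # hypothetical check, e.g. run a trivial query

    def queue_down():
        raise RuntimeError("connection refused")

    print(liveness())
    print(readiness({"database": database_ok}))
    print(deep_health({"database": database_ok, "queue": queue_down}))
```

Keeping liveness free of dependency checks is a deliberate choice in this sketch: a failed liveness probe typically restarts the container, which does not help when the real problem is an unavailable downstream service.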