Monitoring and Observability

Metrics

  • Request latency (p50, p95, p99)
  • Agent response time
  • Job queue depth
  • Generation success/failure rates
  • System resource utilization
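
As a rough illustration of the latency percentiles listed above, the sketch below computes p50, p95, and p99 from a batch of recorded request durations using the nearest-rank method. The function names (percentile, latency_summary) and the millisecond unit are assumptions made for this example; a production system would more likely rely on histogram-based metrics from its monitoring library.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so very small pct values stay in range.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize request latencies (assumed to be in milliseconds)."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Hypothetical measurements: 1 ms through 100 ms, one sample each.
    print(latency_summary(list(range(1, 101))))
```

Nearest-rank always returns a value that was actually observed, which keeps the numbers easy to explain; interpolating methods give smoother results on small samples.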

Logging

  • Structured JSON logging
  • Correlation IDs for request tracing
  • Log aggregation and search
  • Alert rules for critical errors
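
One possible way to combine structured JSON logging with correlation IDs, using only the Python standard library, is sketched below. The field names, the JsonFormatter class, and the build_logger and handle_request helpers are hypothetical and chosen for illustration; they are not taken from this system.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name="app", stream=None):
    """Create a logger that writes JSON lines to stream (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Tag every log line emitted during one request with the same ID."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        # Restore the previous value so IDs do not leak between requests.
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger())
```

Because every line carries the same correlation_id, a log aggregation tool can reassemble the full trace of one request with a single field search.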

Health Checks

  • Liveness probes for container health
  • Readiness probes for traffic routing
  • Deep health checks including dependencies
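
The three kinds of health check above can be sketched as plain functions that return an HTTP-style status code and a body. This is a minimal illustration under assumed names (liveness, readiness, deep_health) and an assumed convention of 200 for healthy and 503 for unhealthy; the dependency checks passed in are placeholders.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """Liveness: the process is running and able to answer at all."""
    return 200, {"status": "alive"}


def readiness(ready: bool) -> Tuple[int, dict]:
    """Readiness: whether this instance should receive traffic right now."""
    if ready:
        return 200, {"status": "ready"}
    return 503, {"status": "not ready"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency check and report each result."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises counts as unhealthy, with the reason kept.
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = "healthy" if healthy else "unhealthy"
    return (200 if healthy else 503), {"status": status, "dependencies": results}


if __name__ == "__main__":
    # Hypothetical dependency checks standing in for real connectivity tests.
    print(deep_health({"database": lambda: True, "job_queue": lambda: True}))
```

Keeping the liveness check free of dependency calls avoids restarting a container just because a downstream service is slow; dependency status belongs in the readiness or deep check.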