
Telemetry Service

Overview

The Telemetry Service acts as a reliable, digital "black box" recorder for the entire platform. Its main job is to collect all operational and usage data to ensure trust and compliance. Specifically, it is designed to help the framework meet strict regulations and standards such as GDPR, HIPAA, and SOC 2. Everything that happens, especially who accesses what and when, is logged, providing an immutable record for complete auditability. The service also tracks usage for cost accountability and enforces real-time limits, or quotas, for every user.

From a technical perspective, the service tracks every major event, especially those in the Model Execution Lifecycle, such as a model starting, completing, or failing. It also records raw resource usage, including GPU and CPU consumption and LLM token counts. The service then performs enrichment: it combines the raw event data with related metrics (often pulled from Prometheus) to connect the usage to a specific customer, project, and department. This keeps the data complete and tied to the right source, which is essential for accurate billing and internal cost allocation.
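
The enrichment step described above can be sketched as a join between a raw usage event and an ownership lookup. This is a minimal illustration in Python (chosen for brevity; it is not the service's C# code). The field names (`job_id`, `gpu_seconds`, `llm_tokens`), the `OWNERSHIP` table, and the `enrich` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ownership lookup: job ID -> (customer, project, department).
# In a real deployment this context would come from platform metadata.
OWNERSHIP = {
    "job-1": ("acme", "render-farm", "vfx"),
}


@dataclass(frozen=True)
class EnrichedEvent:
    job_id: str
    gpu_seconds: float
    llm_tokens: int
    customer: str
    project: str
    department: str


def enrich(raw: dict) -> EnrichedEvent:
    """Attach customer, project, and department to a raw usage event."""
    owner = OWNERSHIP.get(raw["job_id"])
    if owner is None:
        # Refuse to emit unattributed usage; billing needs a known owner.
        raise KeyError(f"no ownership record for {raw['job_id']}")
    customer, project, department = owner
    return EnrichedEvent(
        job_id=raw["job_id"],
        gpu_seconds=raw["gpu_seconds"],
        llm_tokens=raw["llm_tokens"],
        customer=customer,
        project=project,
        department=department,
    )
```

Rejecting events with no known owner is one possible policy; routing them to a quarantine stream for later reconciliation would be another.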

The architecture is built for speed and massive scale, handling a constant flow of data using an event-driven design. Events from various services are first streamed through a high-throughput system like Kafka. The Telemetry Service processes this stream in real time, then stores the final, enriched audit logs and metrics in specialized analytical databases, such as ClickHouse, which are designed to handle huge volumes of time-series data. Finally, data visualization tools like Grafana read this stored information to generate tailored dashboards and reports for different roles, such as financial directors or technical teams.
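
The stream-then-store flow above can be illustrated with a small batching loop. This is a sketch under stated assumptions, written in Python for brevity: `stream` stands in for a Kafka consumer and `sink` for a writer to the analytical store. Both are hypothetical placeholders, and no real client library is used.

```python
from typing import Callable, Iterable


def run_pipeline(
    stream: Iterable[dict],
    sink: Callable[[list], None],
    batch_size: int = 1000,
) -> int:
    """Read events from `stream` and flush them to `sink` in batches.

    Returns the total number of events written.
    """
    batch: list = []
    written = 0
    for event in stream:
        batch.append(event)
        if len(batch) >= batch_size:
            sink(batch)
            written += len(batch)
            batch = []  # start a fresh list so flushed batches are not mutated
    if batch:
        # Flush the final partial batch.
        sink(batch)
        written += len(batch)
    return written
```

Writing in batches rather than row by row is a common choice for analytical stores; the batch size here is an arbitrary illustrative default.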

Functional Requirements

Core System Requirements

  • Real-time Metrics Collection: System must collect and process metrics in real-time with <5 second latency
  • Event-driven Architecture: All metrics must be captured as events with structured metadata
  • Multi-tenant Support: System must track metrics per user, department, and project
  • Historical Data: System must retain metrics data for at least 12 months for trend analysis
  • High Availability: System must have 99.9% uptime to ensure continuous monitoring
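
As an illustration of the "structured metadata" and latency requirements above, the following Python sketch models a metric event and checks it against the 5-second budget. The `MetricEvent` type, its field names, and the event name used in the comment are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_LATENCY = timedelta(seconds=5)  # from the real-time collection requirement


@dataclass(frozen=True)
class MetricEvent:
    name: str             # e.g. "model.completed" (hypothetical naming)
    user: str
    department: str
    project: str
    emitted_at: datetime  # when the source service produced the event
    metadata: dict = field(default_factory=dict)


def within_latency_budget(event: MetricEvent, processed_at: datetime) -> bool:
    """True if the event was processed in under 5 seconds."""
    return processed_at - event.emitted_at < MAX_LATENCY
```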

Data Collection Requirements

  • Model Execution Tracking: Capture all model runs with complete execution context
  • Resource Monitoring: Monitor CPU, GPU, RAM, disk, and network metrics at 30-second intervals
  • Job Lifecycle Tracking: Track complete job lifecycle from queue to completion
  • Cost Attribution: Associate all resource usage with specific users, projects, and departments
  • Error Tracking: Capture and categorize all execution errors with context
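
Job lifecycle tracking "from queue to completion" can be modeled as a small state machine. The states and transitions below are a hypothetical minimal set for illustration, sketched in Python; the actual lifecycle may include more states (for example, cancellation or retry).

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; terminal states have no successors.
TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Rejecting illegal transitions at ingestion time makes out-of-order or duplicated lifecycle events visible instead of silently corrupting the job history.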

Reporting & Analytics Requirements

  • Dashboard Access: Role-based dashboards for each user type (Pipeline TD, Admin, Financial Director)
  • Custom Queries: Support for ad-hoc queries and custom metric combinations
  • Alerting: Configurable alerts for resource usage, costs, and performance thresholds
  • Export Capabilities: Ability to export data for external analysis and reporting
  • Drill-down Analysis: Ability to drill down from high-level metrics to detailed execution logs
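
The configurable alerting requirement above amounts to comparing metric samples against per-metric thresholds. A minimal Python sketch follows; the `AlertRule` type, the `evaluate` function, and the metric names are hypothetical and do not describe an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # e.g. "gpu_utilization" (hypothetical metric name)
    threshold: float  # the alert fires when the sample exceeds this value


def evaluate(rules: list, sample: dict) -> list:
    """Return the names of metrics whose sample value exceeds its threshold."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped, not treated as zero.
        if value is not None and value > rule.threshold:
            fired.append(rule.metric)
    return fired
```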

Integration Requirements

  • Event Bus Integration: Integrate with the existing event system (Kafka)
  • Monitoring Integration: Work with existing monitoring tools (Prometheus, Grafana)
  • Database Support: Support for both operational (PostgreSQL) and analytical (ClickHouse) databases
  • API Access: RESTful API for programmatic access to metrics and reports

Performance Requirements

  • Query Performance: Dashboard queries must complete within 5 seconds for standard time ranges
  • Data Ingestion: System must handle 10,000 requests per second (10k rps) during peak usage
  • Storage Efficiency: Efficient storage of time-series data with automatic data compression
  • Scalability: System must scale horizontally to handle growing data volumes
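
To give a sense of scale for the ingestion figure above, the short Python calculation below derives an upper bound on event counts. It assumes the 10k rps peak is sustained around the clock and that each request carries one event; both are worst-case simplifications, so real volumes should be lower.

```python
SECONDS_PER_DAY = 86_400


def events_upper_bound(peak_rps: int, days: int) -> int:
    """Upper bound on events if the peak rate were sustained for `days`."""
    return peak_rps * SECONDS_PER_DAY * days


print(events_upper_bound(10_000, 1))    # 864000000 events per day
print(events_upper_bound(10_000, 365))  # 315360000000 events over 12 months
```

Numbers of this order are why horizontal scaling and compressed time-series storage appear among the requirements.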

Security & Compliance Requirements

  • Data Privacy: Ensure user data privacy and secure handling of sensitive metrics
  • Access Control: Role-based access control for different metric types and time ranges
  • Audit Trail: Complete audit trail of who accessed what data and when
  • Data Retention: Configurable data retention policies to meet compliance requirements
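
Role-based access control and the audit trail can be combined so that every read attempt, allowed or denied, is recorded. The Python sketch below is illustrative only: the role-to-metric mapping, the metric type names, and the in-memory `audit_log` are assumptions, and a real audit trail would be written to durable, append-only storage.

```python
from datetime import datetime, timezone

# Hypothetical mapping of roles to the metric types they may read.
ROLE_PERMISSIONS = {
    "financial_director": {"cost"},
    "pipeline_td": {"performance", "errors"},
    "admin": {"cost", "performance", "errors"},
}

audit_log: list = []  # in-memory stand-in for a durable audit store


def read_metric(user: str, role: str, metric_type: str) -> bool:
    """Check access and record who tried to read what, and when."""
    allowed = metric_type in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "user": user,
        "role": role,
        "metric_type": metric_type,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return allowed
```

Logging denied attempts as well as successful reads keeps the trail complete for compliance reviews.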

Technical Requirements

  • .NET 9.0. The service is developed on the .NET 9 platform.
  • C# 14. The service is written in C# version 14.
  • Prometheus. For storing and querying time-series metrics.
  • Prometheus Operator. Should be deployed on each Kubernetes node to collect metrics and send them to Prometheus.
  • ClickHouse. For storing Telemetry Service data.
  • Kafka. For event-driven communication between event sources and the Telemetry Service.
  • Grafana. For visualization and monitoring.