
Telemetry Service - Functional Specification

The Telemetry Service collects and analyzes the metrics of a model process. It receives notifications when a model process starts and when it ends. Using the data and timestamps in those notifications, it requests the metrics for that time window from Prometheus, then analyzes, aggregates, and saves them in its own storage.
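The window lookup described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `ProcessWindow`, `build_range_query_url`, the base URL, and the PromQL expression are hypothetical. The only external detail relied on is the Prometheus HTTP API endpoint `/api/v1/query_range` with its `query`, `start`, `end`, and `step` parameters.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlencode


@dataclass(frozen=True)
class ProcessWindow:
    """Time window of one model process, built from the start and end notifications."""
    task_id: str
    node: str
    started_at: datetime
    finished_at: datetime


def build_range_query_url(base_url: str, window: ProcessWindow,
                          promql: str, step_seconds: int = 15) -> str:
    """Build a Prometheus range-query URL covering the model process window."""
    if window.finished_at < window.started_at:
        raise ValueError("end notification precedes start notification")
    params = {
        "query": promql,
        # Prometheus accepts Unix timestamps (seconds) for start and end.
        "start": window.started_at.timestamp(),
        "end": window.finished_at.timestamp(),
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    window = ProcessWindow(
        task_id="task-1",
        node="node-a",
        started_at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
        finished_at=datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc),
    )
    # The metric and label below are placeholders; use whatever the cluster exports.
    print(build_range_query_url("http://prometheus:9090", window, 'up{node="node-a"}'))
```

The returned URL can be fetched with any HTTP client; aggregation and storage of the response are left out of this sketch.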

Individual topics can be uploaded directly to ClickHouse without additional enrichment.

The Quota Service quickly allocates quotas per user. It receives real-time calculation events from telemetry. For users who are not held in memory, it makes an additional request to ClickHouse, where quotas are counted directly from the events. For more information, see the article "Using the Kafka table engine".
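The in-memory lookup with a ClickHouse fallback can be sketched roughly as below. All names here (`QuotaCache`, `load_usage_from_store`, `apply_event`) are hypothetical, and the fallback is passed in as a plain callable so that the sketch does not assume any particular ClickHouse client or table layout.

```python
from typing import Callable, Dict


class QuotaCache:
    """Per-user usage counters kept in memory, with a fallback loader for cache misses."""

    def __init__(self, load_usage_from_store: Callable[[str], float]) -> None:
        # In the real service the loader would presumably run a ClickHouse query
        # that counts usage from stored events; that query is not shown here.
        self._load = load_usage_from_store
        self._usage: Dict[str, float] = {}

    def usage(self, user_id: str) -> float:
        """Return current usage, querying the store only if the user is not in memory."""
        if user_id not in self._usage:
            self._usage[user_id] = self._load(user_id)
        return self._usage[user_id]

    def apply_event(self, user_id: str, amount: float) -> float:
        """Add the amount carried by a real-time calculation event to the user's usage."""
        self._usage[user_id] = self.usage(user_id) + amount
        return self._usage[user_id]


if __name__ == "__main__":
    stored = {"user-1": 10.0}  # stand-in for usage already counted in ClickHouse
    cache = QuotaCache(lambda user_id: stored.get(user_id, 0.0))
    print(cache.apply_event("user-1", 2.5))
```

One design point to watch in a real implementation: if an event has already landed in ClickHouse when the fallback query runs, applying it again in memory would count it twice, so the two paths need a consistent cut-off.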

Miro Board

Event Naming Rules

Each event must follow the rules outlined below:

  1. All letters in the event name must be lowercase.
  2. Start the event name with the name of the system or service that generates it.
  3. Next, include the specific component or subsystem responsible for the event.
  4. Then, indicate the action being performed.
  5. Finally, add an optional note if additional context is needed.
Examples (segments are separated by dots):
  xumi.weights.load.start
  xumi.weights.load.finish
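The naming rules can be checked mechanically. The sketch below is one possible validator, under two assumptions that the rules do not state explicitly: segments are joined with dots (as in the examples), and each segment consists of lowercase letters, digits, or underscores. The function name `is_valid_event_name` is hypothetical.

```python
import re

# system.component.action, plus an optional fourth "note" segment.
# Assumption: a segment starts with a lowercase letter and may contain
# lowercase letters, digits, and underscores.
_SEGMENT = r"[a-z][a-z0-9_]*"
_EVENT_NAME = re.compile(rf"{_SEGMENT}(\.{_SEGMENT}){{2,3}}")


def is_valid_event_name(name: str) -> bool:
    """Return True if the name has three or four lowercase dot-separated segments."""
    return _EVENT_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("xumi.weights.load.start", "Xumi.weights.load", "xumi.weights"):
        print(candidate, is_valid_event_name(candidate))
```

A check like this could run in the event service before a message is sent, so that malformed names never reach Kafka.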

XUMI Telemetry Specification

Component | Description | Owner
Kafka REST Proxy | REST API for Kafka (https://github.com/confluentinc/kafka-rest/tree/master, https://github.com/strimzi/strimzi-kafka-bridge/tree/main) | devops
KAFKA_PROXY_URL | Environment variable in the Docker container. Connection string for the Kafka proxy | devops
NODE_ID | Environment variable in the Docker container. Node UUID | devops, ml
VM_TYPE | Virtual machine instance type (NV6, x4GPUlarge) | devops
HOST_NAME | Host name | devops
STORAGE_SIZE | Dynamic storage size | devops
STORAGE_TYPE | Dynamic storage type | devops
TENANT_ID | Environment variable in the Docker container. Tenant UUID | devops, ml
Python Event Service | Service classes that send messages to the REST API for Kafka | ml
Events | | ml
Kafka Topics | xumi_load, xumi_task: topics for messages about loading weights and starting tasks | devops
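A Python event service class built on these components might assemble its request as sketched below. The record fields (`event`, `timestamp`, `node_id`, `tenant_id`, `payload`) and the function name are assumptions, not the specified message schema. The request shape (POST to `/topics/<topic>` with a `records` array and the `application/vnd.kafka.json.v2+json` content type) follows the Confluent REST Proxy v2 API; a different proxy may need adjustments.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone
from typing import Mapping, Optional

CONTENT_TYPE = "application/vnd.kafka.json.v2+json"


def build_event_request(topic: str, event_name: str,
                        payload: Optional[dict] = None,
                        env: Mapping[str, str] = os.environ) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes one event to the Kafka proxy."""
    record = {
        # Hypothetical message layout; the real schema is defined by the Events component.
        "event": event_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node_id": env["NODE_ID"],
        "tenant_id": env["TENANT_ID"],
        "payload": payload or {},
    }
    body = json.dumps({"records": [{"value": record}]}).encode("utf-8")
    url = f"{env['KAFKA_PROXY_URL'].rstrip('/')}/topics/{topic}"
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": CONTENT_TYPE}, method="POST"
    )


if __name__ == "__main__":
    demo_env = {
        "KAFKA_PROXY_URL": "http://kafka-proxy:8082",  # placeholder address
        "NODE_ID": "00000000-0000-0000-0000-000000000001",
        "TENANT_ID": "00000000-0000-0000-0000-000000000002",
    }
    request = build_event_request("xumi_load", "xumi.weights.load.start", env=demo_env)
    print(request.full_url)
    # To actually send: urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it keeps the message format easy to test without a running proxy.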

PBAC Telemetry Specification

All audit events are recorded directly in the ClickHouse database. An audit event response service may be added in the future; it would read data from the corresponding Kafka topics.

Component | Description | Owner
KAFKA_DIRECT_URL | Environment variable in the Docker container. Connection string for Kafka | devops
Golang Event Service | Service classes that send messages to Kafka | backend
.NET Event Service | Service classes that send messages to Kafka | backend
Events | | backend
Kafka Topics | pbac_access, pbac_politics, pbac_service: topics for messages about access audits, policy change audits, and service messages (update, load, policy changes) | devops

Asset Storage Telemetry Specification

Component | Description | Owner
KAFKA_DIRECT_URL | Environment variable in the Docker container. Connection string for Kafka | devops
Golang Event Service | Service classes that send messages to Kafka | backend
.NET Event Service | Service classes that send messages to Kafka | backend
Events | | backend
Kafka Topics | asset_access, asset_transfer: topics for messages about access audits and asset uploads/downloads in storage | devops

Workflow Telemetry Specification

Component | Description | Owner
KAFKA_DIRECT_URL | Environment variable in the Docker container. Connection string for Kafka | devops
.NET Event Service | Service classes that send messages to Kafka | backend
Events | | backend
Kafka Topics | wkf_cpu, wkf_gpu: topics for messages about credits for CPU and GPU usage | devops

Kubernetes Telemetry Specification

Component | Description | Owner
KAFKA_DIRECT_URL | Environment variable in the Docker container. Connection string for Kafka | devops
.NET Event Service | Service classes that send messages to Kafka | backend
Events | | backend
Kafka Topics | kuber_cpu, kuber_gpu: topics for messages about credits for CPU and GPU usage | devops

MCP Telemetry Specification

The process starts with the Workflow service, which initializes processing of the XUMI model and sends the corresponding event to Azure Event Hubs. The event body carries the following data:

  • The identifier of the user who initiated the model processing
  • The identifier of the model processing task
  • The name of the Kubernetes node on which the model processing is running

When XUMI finishes model processing, the Workflow service sends one of two events, depending on the outcome: TaskCompleted on success or TaskFailed on failure. The event body carries the identifier of the model processing task.
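The event bodies described in this section could be modeled as in the sketch below. The event names TaskCompleted and TaskFailed come from the text; everything else is an assumption for illustration, including the start event's class name `TaskStarted`, the field names, and the `type`/`body` envelope. Sending to Azure Event Hubs is deliberately left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskStarted:
    """Body of the start event (hypothetical name; the text does not name this event)."""
    user_id: str    # user who initiated the model processing
    task_id: str    # model processing task
    node_name: str  # Kubernetes node running the processing


def start_event(user_id: str, task_id: str, node_name: str) -> dict:
    """Build the start event with the three fields listed in the specification."""
    return {"type": "TaskStarted", "body": asdict(TaskStarted(user_id, task_id, node_name))}


def completion_event(task_id: str, succeeded: bool) -> dict:
    """Build TaskCompleted or TaskFailed; both carry only the task identifier."""
    name = "TaskCompleted" if succeeded else "TaskFailed"
    return {"type": name, "body": {"task_id": task_id}}


if __name__ == "__main__":
    print(start_event("user-1", "task-42", "aks-node-0"))
    print(completion_event("task-42", succeeded=False))
```

Because the completion events carry only the task identifier, a consumer has to keep the start event's user and node data keyed by `task_id` until the matching completion event arrives.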