Event-Based System Design
Event Architecture Overview
The Xavier telemetry system uses an event-driven architecture in which every metric is captured as a structured event with rich metadata. Each event is published to a message bus (Kafka or Azure Event Hubs) and then stored in two databases that serve different query patterns: an operational store (PostgreSQL) and an analytical store (ClickHouse).
Event Overview Table
| Event Type | Category | Source Service | Purpose | Frequency | Retention | Key Metadata Fields |
|---|---|---|---|---|---|---|
| MODEL_EXECUTION_STARTED | Model Execution | spi-service | Tracks start of model inference with checkpoint details | Per job execution | 7 days | jobId, modelId, checkpointId, executionParameters, checkpointLoadingTimeMs, checkpointCacheHit |
| MODEL_EXECUTION_COMPLETED | Model Execution | spi-service | Records successful completion with performance metrics | Per successful job | 7 days | jobId, modelId, checkpointId, inferenceTimeMs, costBreakdown, memoryUsageMB, performanceMetrics |
| MODEL_EXECUTION_FAILED | Model Execution | spi-service | Captures failure details with error analysis | Per failed job | 30 days | jobId, modelId, checkpointId, errorType, failureStage, checkpointFallbackUsed, retryCount |
| CHECKPOINT_LOADED | Checkpoint Management | checkpoint-service | Tracks checkpoint loading operations and cache behavior | Per checkpoint load | 7 days | checkpointId, loadingTimeMs, cacheHit, sourceLocation, vmId, compressionRatio, verificationPassed |
| CHECKPOINT_CACHED | Checkpoint Management | checkpoint-service | Monitors cache operations and storage optimization | Per cache operation | 7 days | checkpointId, cacheAction, cacheUsagePercent, cacheHitRate, evictedCheckpoints, retentionPolicyApplied |
| VM_METRICS_SNAPSHOT | Infrastructure | Telemetry Service | Periodic system resource utilization data | Every 30 seconds | 30 days | vmId, cpuMetrics, gpuMetrics, memoryMetrics, diskMetrics, networkMetrics |
| JOB_QUEUED | Job Lifecycle | Workflow Service | Tracks job entry into processing queue | Per job submission | 7 days | jobId, priority, queueName, estimatedDurationMs, requiredResources, queuePosition |
| JOB_ASSIGNED | Job Lifecycle | Workflow Service | Records job assignment to specific VM | Per job assignment | 7 days | jobId, vmId, queueTimeMs, assignmentReason |
| COST_ALLOCATION | Financial | billing-service | Detailed cost breakdown with checkpoint attribution | Per completed job | 365 days | jobId, modelId, checkpointId, costBreakdown, resourceUsage, checkpointUsageMetrics |
| ALERT_TRIGGERED | System Health | monitoring-service | System alerts and threshold violations | As needed | 30 days | alertId, alertType, severity, vmId, currentValue, thresholdValue, duration |
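The retention column can be expressed as a lookup that a cleanup job or query layer consults. The sketch below is one hypothetical way to do that: the `RETENTION_DAYS` values are copied from the table, while the `is_expired` helper and its signature are assumptions made for illustration and are not part of the Xavier codebase.

```python
from datetime import datetime, timedelta, timezone

# Retention periods in days, copied from the Event Overview Table.
RETENTION_DAYS = {
    "MODEL_EXECUTION_STARTED": 7,
    "MODEL_EXECUTION_COMPLETED": 7,
    "MODEL_EXECUTION_FAILED": 30,
    "CHECKPOINT_LOADED": 7,
    "CHECKPOINT_CACHED": 7,
    "VM_METRICS_SNAPSHOT": 30,
    "JOB_QUEUED": 7,
    "JOB_ASSIGNED": 7,
    "COST_ALLOCATION": 365,
    "ALERT_TRIGGERED": 30,
}


def is_expired(event_type: str, event_time: datetime, now: datetime) -> bool:
    """Return True when the event is older than its retention window."""
    return now - event_time > timedelta(days=RETENTION_DAYS[event_type])


# Hypothetical check: an event from 2025-08-04 evaluated 16 days later.
now = datetime(2025, 8, 20, tzinfo=timezone.utc)
started = datetime(2025, 8, 4, 10, 30, tzinfo=timezone.utc)
print(is_expired("MODEL_EXECUTION_STARTED", started, now))  # 7-day window
print(is_expired("COST_ALLOCATION", started, now))          # 365-day window
```

Keeping the periods in one mapping means the operational store, the analytical store, and the topic configuration can all be checked against the same source.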
Event Relationships & Dependencies
Primary Execution Flow
JOB_QUEUED → JOB_ASSIGNED → MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → MODEL_EXECUTION_COMPLETED/FAILED → COST_ALLOCATION
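A consumer can use this ordering to detect jobs whose events arrive out of sequence. The following is a minimal sketch, assuming each job's events are already grouped and sorted by timestamp; the `PRIMARY_FLOW` constant and the `follows_primary_flow` function are hypothetical names introduced here.

```python
# Expected order of events for one job, taken from the Primary Execution Flow.
PRIMARY_FLOW = [
    "JOB_QUEUED",
    "JOB_ASSIGNED",
    "MODEL_EXECUTION_STARTED",
    "CHECKPOINT_LOADED",
    # The execution stage ends with exactly one of these two events.
    ("MODEL_EXECUTION_COMPLETED", "MODEL_EXECUTION_FAILED"),
    "COST_ALLOCATION",
]


def follows_primary_flow(event_types: list[str]) -> bool:
    """Check that a job's events form a prefix of the primary flow, in order."""
    if len(event_types) > len(PRIMARY_FLOW):
        return False
    for seen, expected in zip(event_types, PRIMARY_FLOW):
        allowed = expected if isinstance(expected, tuple) else (expected,)
        if seen not in allowed:
            return False
    return True


in_progress = ["JOB_QUEUED", "JOB_ASSIGNED", "MODEL_EXECUTION_STARTED"]
out_of_order = ["JOB_ASSIGNED", "JOB_QUEUED"]
print(follows_primary_flow(in_progress), follows_primary_flow(out_of_order))
```

Accepting prefixes lets the same check run on jobs that are still in progress.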
Infrastructure Events
VM_PROVISIONED → VM_METRICS_SNAPSHOT (continuous) → VM_DEPROVISIONED
Checkpoint Management Flow
MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → CHECKPOINT_CACHED → (used in)
Event Categories
Model Execution Events (Critical Business Logic)
| Aspect | Details |
|---|---|
| Business Impact | Direct revenue/cost impact, user experience |
| Volume | High - 1000s per day |
| Dependencies | Requires VM and checkpoint events |
| Analytics Use | Performance optimization, cost analysis, user behavior |
| SLA | < 5 second processing latency |
Checkpoint Management Events
| Aspect | Details |
|---|---|
| Business Impact | Performance optimization, cost reduction |
| Volume | Medium - 100s per day |
| Dependencies | Linked to model execution events |
| Analytics Use | Cache optimization, storage planning, cost reduction |
| SLA | < 2 second processing latency |
Infrastructure Events (Operations & Capacity)
| Aspect | Details |
|---|---|
| Business Impact | Operational efficiency, capacity planning |
| Volume | Very High - VM_METRICS_SNAPSHOT every 30s |
| Dependencies | Foundation for all other events |
| Analytics Use | Resource optimization, scaling decisions, cost allocation |
| SLA | < 1 second processing latency for metrics |
Job Lifecycle Events (Queue Management)
| Aspect | Details |
|---|---|
| Business Impact | User experience, queue optimization |
| Volume | High - matches job volume |
| Dependencies | Precedes model execution events |
| Analytics Use | Queue optimization, capacity planning, user experience |
| SLA | < 1 second processing latency |
Financial Events (Business Intelligence)
| Aspect | Details |
|---|---|
| Business Impact | Direct revenue/cost tracking, billing |
| Volume | High - one per completed job |
| Dependencies | Requires all execution and infrastructure events |
| Analytics Use | Cost optimization, chargeback, ROI analysis |
| SLA | < 10 second processing latency |
System Health Events (Reliability)
| Aspect | Details |
|---|---|
| Business Impact | System reliability, uptime |
| Volume | Low - only when thresholds exceeded |
| Dependencies | Based on infrastructure metrics |
| Analytics Use | Performance monitoring, capacity planning, incident response |
| SLA | < 1 second processing latency |
Event Processing Patterns
Real-time Processing
- MODEL_EXECUTION events → Dashboard updates, alerting
- ALERT_TRIGGERED → Immediate notification systems
- VM_METRICS_SNAPSHOT → Real-time monitoring dashboards
Batch Processing
- COST_ALLOCATION → Daily/monthly billing reports
- VM_PROVISIONED/DEPROVISIONED → Capacity planning analysis
- CHECKPOINT_CACHED → Storage optimization reports
Stream Processing
- JOB_QUEUED/ASSIGNED → Queue depth monitoring
- MODEL_EXECUTION → Performance trend analysis
- CHECKPOINT_LOADED → Cache hit rate calculations
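As an illustration of the stream-processing pattern, the cache hit rate can be derived from CHECKPOINT_LOADED events alone. This is a minimal sketch over an in-memory list rather than a real stream; the `cache_hit_rate` function is a hypothetical helper, and it assumes each event carries the `metadata.cacheHit` flag listed among the key metadata fields.

```python
def cache_hit_rate(events: list[dict]) -> float:
    """Fraction of CHECKPOINT_LOADED events that were served from cache."""
    loads = [e for e in events if e["eventType"] == "CHECKPOINT_LOADED"]
    if not loads:
        return 0.0  # No loads observed yet, so report zero rather than divide by zero.
    hits = sum(1 for e in loads if e["metadata"]["cacheHit"])
    return hits / len(loads)


# Hypothetical window: three cache hits, one miss, and one unrelated event.
window = [
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": True}},
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": True}},
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": False}},
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": True}},
    {"eventType": "JOB_QUEUED", "metadata": {"jobId": "job-12345"}},
]
print(cache_hit_rate(window))
```

In a real stream processor the same ratio would typically be maintained per time window and per VM.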
Data Volume Estimates
| Event Type | Daily Volume | Weekly Volume | Monthly Volume | Storage per Event |
|---|---|---|---|---|
| MODEL_EXECUTION_STARTED | 5,000 | 35,000 | 150,000 | 2KB |
| MODEL_EXECUTION_COMPLETED | 4,850 | 33,950 | 145,500 | 3KB |
| MODEL_EXECUTION_FAILED | 150 | 1,050 | 4,500 | 2.5KB |
| CHECKPOINT_LOADED | 1,200 | 8,400 | 36,000 | 1.5KB |
| CHECKPOINT_CACHED | 300 | 2,100 | 9,000 | 1KB |
| VM_METRICS_SNAPSHOT | 2,880,000 | 20,160,000 | 86,400,000 | 4KB |
| VM_PROVISIONED | 50 | 350 | 1,500 | 1KB |
| VM_DEPROVISIONED | 50 | 350 | 1,500 | 1.5KB |
| JOB_QUEUED | 5,000 | 35,000 | 150,000 | 1.5KB |
| JOB_ASSIGNED | 5,000 | 35,000 | 150,000 | 1KB |
| COST_ALLOCATION | 4,850 | 33,950 | 145,500 | 2.5KB |
| ALERT_TRIGGERED | 20 | 140 | 600 | 2KB |
Total Daily Storage: ~11.5GB/day
Total Monthly Storage: ~350GB/month
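The totals follow directly from the per-event figures. The sketch below reproduces the arithmetic, assuming decimal units (1 GB = 1,000,000 KB) and a 30-day month, which is what the monthly column implies; the variable names are introduced here for illustration.

```python
# (daily event count, storage per event in KB), copied from the volume table.
DAILY_VOLUME = {
    "MODEL_EXECUTION_STARTED": (5_000, 2.0),
    "MODEL_EXECUTION_COMPLETED": (4_850, 3.0),
    "MODEL_EXECUTION_FAILED": (150, 2.5),
    "CHECKPOINT_LOADED": (1_200, 1.5),
    "CHECKPOINT_CACHED": (300, 1.0),
    "VM_METRICS_SNAPSHOT": (2_880_000, 4.0),
    "VM_PROVISIONED": (50, 1.0),
    "VM_DEPROVISIONED": (50, 1.5),
    "JOB_QUEUED": (5_000, 1.5),
    "JOB_ASSIGNED": (5_000, 1.0),
    "COST_ALLOCATION": (4_850, 2.5),
    "ALERT_TRIGGERED": (20, 2.0),
}

KB_PER_GB = 1_000_000  # Assumption: decimal units.
DAYS_PER_MONTH = 30    # Assumption: matches the table's monthly column.

daily_kb = sum(count * size_kb for count, size_kb in DAILY_VOLUME.values())
daily_gb = daily_kb / KB_PER_GB
monthly_gb = daily_gb * DAYS_PER_MONTH

snapshot_count, snapshot_kb = DAILY_VOLUME["VM_METRICS_SNAPSHOT"]
snapshot_share = snapshot_count * snapshot_kb / daily_kb

print(f"{daily_gb:.2f} GB/day, {monthly_gb:.0f} GB/month")
print(f"VM_METRICS_SNAPSHOT share: {snapshot_share:.1%}")
```

Under these assumptions the result is about 11.57 GB/day and 347 GB/month, consistent with the rounded totals above. VM_METRICS_SNAPSHOT accounts for more than 99% of the volume, so the snapshot interval and payload size dominate storage planning.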
1. Model Execution Events
MODEL_EXECUTION_STARTED
{
"eventType": "MODEL_EXECUTION_STARTED",
"timestamp": "2025-08-04T10:30:00Z",
"tenantId": "company-123",
"userId": "user-456",
"sessionId": "session-789",
"source": "spi-service",
"metadata": {
"jobId": "job-12345",
"modelId": "stable-diffusion-xl",
"modelName": "Stable Diffusion XL",
"modelVersion": "v1.0.5",
"modelHash": "abc123def456",
"modelType": "text-to-image",
"modelSizeMB": 6900,COST_ALLOCATION
"checkpointId": "protovision-xl-v6.6",
"checkpointName": "ProtoVision XL HighFidelity 3D",
"checkpointVersion": "v6.6.0",
"checkpointHash": "xyz789abc123",
"checkpointSizeMB": 3840,
"checkpointSource": "civitai",
"checkpointLoadingTimeMs": 2100,
"runType": "inference",
"vmId": "vm-instance-001",
"nodeId": "k8s-node-01",
"executionParameters": {
"prompt": "a beautiful photograph of a landscape",
"negative_prompt": "low quality, blurry, bad anatomy",
"width": 1024,
"height": 1024,
"num_inference_steps": 20,
"guidance_scale": 5.0,
"sampler": "euler",
"scheduler": "karras"
},
"inputSizeBytes": 245,
"expectedOutputSizeBytes": 8388608,
"checkpointCacheHit": false
},
"tags": {
"department": "marketing",
"project": "campaign-2025",
"priority": "normal",
"environment": "production",
"modelFamily": "stable-diffusion",
"checkpointCategory": "photorealistic"
}
}
MODEL_EXECUTION_COMPLETED
{
"eventType": "MODEL_EXECUTION_COMPLETED",
"timestamp": "2025-08-04T10:30:15Z",
"tenantId": "company-123",
"userId": "user-456",
"sessionId": "session-789",
"source": "spi-service",
"metadata": {
"jobId": "job-12345",
"modelId": "stable-diffusion-xl",
"checkpointId": "protovision-xl-v6.6",
"status": "success",
"checkpointLoadingTimeMs": 2100,
"modelInitializationTimeMs": 400,
"inferenceTimeMs": 12500,
"totalExecutionTimeMs": 15000,
"outputSizeBytes": 8324567,
"memoryUsageMB": {
"peak": 12800,
"average": 11200,
"checkpointOverhead": 3840
},
"gpuUtilization": {
"peak": 98.5,
"average": 92.3
},
"throughputItemsPerSecond": 0.067,
"costBreakdown": {
"computeCostUSD": 0.035,
"checkpointLoadingCostUSD": 0.008,
"storageCostUSD": 0.002,
"totalCostUSD": 0.045
},
"performanceMetrics": {
"stepsPerSecond": 1.6,
"vramEfficiency": 87.3,
"checkpointEfficiency": 94.1
}
},
"tags": {
"department": "marketing",
"project": "campaign-2025",
"priority": "normal",
"environment": "production",
"modelFamily": "stable-diffusion",
"checkpointCategory": "photorealistic"
}
}
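Because the completed event carries both stage timings and a total, and both cost components and a total, a consumer can sanity-check each event before storing it. The sketch below assumes that `totalExecutionTimeMs` is the sum of the three stage timings and that `totalCostUSD` is the sum of the other `costBreakdown` entries, as is the case in the sample above; `check_completed_event` is a hypothetical helper.

```python
import math


def check_completed_event(metadata: dict) -> list[str]:
    """Return a list of consistency problems found in the event metadata."""
    problems = []

    # Assumption: the total is the sum of the three stage timings.
    stages_ms = (
        metadata["checkpointLoadingTimeMs"]
        + metadata["modelInitializationTimeMs"]
        + metadata["inferenceTimeMs"]
    )
    if stages_ms != metadata["totalExecutionTimeMs"]:
        problems.append("stage timings do not add up to totalExecutionTimeMs")

    # Assumption: every costBreakdown entry except the total is a component.
    cost = metadata["costBreakdown"]
    components = sum(v for k, v in cost.items() if k != "totalCostUSD")
    if not math.isclose(components, cost["totalCostUSD"], abs_tol=1e-9):
        problems.append("cost components do not add up to totalCostUSD")

    return problems


sample = {
    "checkpointLoadingTimeMs": 2100,
    "modelInitializationTimeMs": 400,
    "inferenceTimeMs": 12500,
    "totalExecutionTimeMs": 15000,
    "costBreakdown": {
        "computeCostUSD": 0.035,
        "checkpointLoadingCostUSD": 0.008,
        "storageCostUSD": 0.002,
        "totalCostUSD": 0.045,
    },
}
print(check_completed_event(sample))
```

The cost comparison uses a tolerance because the components are floating-point values.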
MODEL_EXECUTION_FAILED
{
"eventType": "MODEL_EXECUTION_FAILED",
"timestamp": "2025-08-04T10:30:08Z",
"tenantId": "company-123",
"userId": "user-456",
"sessionId": "session-789",
"source": "spi-service",
"metadata": {
"jobId": "job-12345",
"modelId": "stable-diffusion-xl",
"checkpointId": "protovision-xl-v6.6",
"status": "failed",
"failureStage": "checkpoint_loading", // "checkpoint_loading", "model_init", "inference"
"checkpointLoadingTimeMs": 1800,
"executionTimeMs": 8000,
"errorType": "CHECKPOINT_CORRUPTION",
"errorCode": "E2003",
"errorMessage": "Checkpoint file corrupted: invalid tensor dimensions",
"stackTrace": "...",
"retryCount": 2,
"checkpointFallbackUsed": true,
"fallbackCheckpointId": "stable-diffusion-xl-base",
"costBreakdown": {
"computeCostUSD": 0.015,
"checkpointLoadingCostUSD": 0.006,
"totalCostUSD": 0.021
}
},
"tags": {
"department": "marketing",
"project": "campaign-2025",
"priority": "normal",
"environment": "production",
"modelFamily": "stable-diffusion",
"checkpointCategory": "photorealistic"
}
}
2. Checkpoint Management Events
CHECKPOINT_LOADED
{
"eventType": "CHECKPOINT_LOADED",
"timestamp": "2025-08-04T10:29:58Z",
"tenantId": "company-123",
"source": "checkpoint-manager",
"metadata": {
"checkpointId": "protovision-xl-v6.6",
"checkpointName": "ProtoVision XL HighFidelity 3D",
"checkpointVersion": "v6.6.0",
"checkpointHash": "xyz789abc123",
"checkpointSizeMB": 3840,
"loadingTimeMs": 2100,
"sourceLocation": "s3://checkpoints/stable-diffusion/",
"cacheHit": false,
"vmId": "vm-instance-001",
"modelId": "stable-diffusion-xl",
"loadingMethod": "direct_download", // "cache_hit", "direct_download", "preloaded"
"compressionRatio": 0.73,
"verificationPassed": true
},
"tags": {
"environment": "production",
"checkpointCategory": "photorealistic",
"modelFamily": "stable-diffusion"
}
}
CHECKPOINT_CACHED
{
"eventType": "CHECKPOINT_CACHED",
"timestamp": "2025-08-04T10:32:05Z",
"tenantId": "company-123",
"source": "checkpoint-cache",
"metadata": {
"checkpointId": "protovision-xl-v6.6",
"checkpointSizeMB": 3840,
"vmId": "vm-instance-001",
"cacheAction": "stored", // "stored", "evicted", "preloaded"
"cacheUsagePercent": 78.5,
"cacheHitRate": 0.87,
"evictedCheckpoints": ["old-checkpoint-v1.2"],
"retentionPolicyApplied": "lru"
},
"tags": {
"environment": "production",
"cacheStrategy": "lru"
}
}
VM_METRICS_SNAPSHOT
{
"eventType": "VM_METRICS_SNAPSHOT",
"timestamp": "2025-08-04T10:30:00Z",
"tenantId": "company-123",
"source": "prometheus-agent",
"metadata": {
"vmId": "vm-instance-001",
"instanceType": "g4dn.2xlarge",
"zone": "us-west-2a",
"cpuMetrics": {
"utilizationPercent": 75.2,
"coreCount": 8,
"loadAverage": 4.2
},
"gpuMetrics": {
"utilizationPercent": 92.3,
"memoryUtilizationPercent": 85.7,
"memoryTotalMB": 16384,
"memoryUsedMB": 14031,
"temperatureCelsius": 78
},
"memoryMetrics": {
"utilizationPercent": 68.4,
"totalMB": 32768,
"usedMB": 22420,
"availableMB": 10348
},
"diskMetrics": {
"utilizationPercent": 45.2,
"totalGB": 500,
"usedGB": 226,
"readIOPS": 120,
"writeIOPS": 85,
"readThroughputMBps": 45.2,
"writeThroughputMBps": 23.1
},
"networkMetrics": {
"inboundMBps": 12.5,
"outboundMBps": 8.7,
"packetsInPerSec": 1250,
"packetsOutPerSec": 980,
"packetLossPercent": 0.01
}
},
"tags": {
"environment": "production",
"region": "us-west-2",
"costCenter": "ai-infrastructure"
}
}
VM_PROVISIONED
{
"eventType": "VM_PROVISIONED",
"timestamp": "2025-08-04T10:25:00Z",
"tenantId": "company-123",
"source": "vm-manager",
"metadata": {
"vmId": "vm-instance-001",
"instanceType": "g4dn.2xlarge",
"zone": "us-west-2a",
"provisioningTimeMs": 45000,
"costPerHourUSD": 0.752,
"requestedBy": "auto-scaler",
"reason": "high_queue_depth"
},
"tags": {
"environment": "production",
"region": "us-west-2",
"costCenter": "ai-infrastructure"
}
}
VM_DEPROVISIONED
{
"eventType": "VM_DEPROVISIONED",
"timestamp": "2025-08-04T12:30:00Z",
"tenantId": "company-123",
"source": "vm-manager",
"metadata": {
"vmId": "vm-instance-001",
"uptimeMinutes": 125,
"totalCostUSD": 1.567,
"reason": "idle_timeout",
"jobsCompleted": 23,
"utilizationSummary": {
"avgCpuPercent": 42.1,
"avgGpuPercent": 67.3,
"avgMemoryPercent": 55.8
}
},
"tags": {
"environment": "production",
"region": "us-west-2",
"costCenter": "ai-infrastructure"
}
}
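The `totalCostUSD` in this sample is consistent with the `costPerHourUSD` reported by VM_PROVISIONED. A short worked check, assuming cost is billed linearly on uptime and rounded to three decimals (the `vm_total_cost_usd` function is a hypothetical helper):

```python
def vm_total_cost_usd(uptime_minutes: float, cost_per_hour_usd: float) -> float:
    """Linear cost for the VM's uptime, rounded to three decimal places."""
    return round(uptime_minutes / 60 * cost_per_hour_usd, 3)


# Values from the VM_PROVISIONED and VM_DEPROVISIONED samples.
total = vm_total_cost_usd(uptime_minutes=125, cost_per_hour_usd=0.752)
print(total)  # 1.567, matching totalCostUSD in the sample
```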
3. Job Lifecycle Events
JOB_QUEUED
{
"eventType": "JOB_QUEUED",
"timestamp": "2025-08-04T10:29:45Z",
"tenantId": "company-123",
"userId": "user-456",
"source": "job-scheduler",
"metadata": {
"jobId": "job-12345",
"priority": "normal",
"queueName": "text-to-image",
"estimatedDurationMs": 15000,
"requiredResources": {
"cpuCores": 2,
"memoryMB": 8192,
"gpuMemoryMB": 12288
},
"queuePosition": 3,
"queueDepth": 8
},
"tags": {
"department": "marketing",
"project": "campaign-2025",
"priority": "normal"
}
}
JOB_ASSIGNED
{
"eventType": "JOB_ASSIGNED",
"timestamp": "2025-08-04T10:29:55Z",
"tenantId": "company-123",
"userId": "user-456",
"source": "job-scheduler",
"metadata": {
"jobId": "job-12345",
"vmId": "vm-instance-001",
"queueTimeMs": 10000,
"assignmentReason": "best_fit"
},
"tags": {
"department": "marketing",
"project": "campaign-2025"
}
}
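The `queueTimeMs` field corresponds to the gap between the JOB_QUEUED and JOB_ASSIGNED timestamps for the same job. A minimal sketch of that derivation, assuming both timestamps are ISO 8601 strings in UTC with a trailing `Z` as in the samples; `parse_ts` and `queue_time_ms` are hypothetical helpers:

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting the trailing 'Z' used in the samples."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def queue_time_ms(queued_ts: str, assigned_ts: str) -> int:
    """Milliseconds a job spent in the queue before assignment."""
    delta = parse_ts(assigned_ts) - parse_ts(queued_ts)
    return int(delta.total_seconds() * 1000)


# Timestamps from the JOB_QUEUED and JOB_ASSIGNED samples.
waited = queue_time_ms("2025-08-04T10:29:45Z", "2025-08-04T10:29:55Z")
print(waited)  # 10000, matching queueTimeMs in the sample
```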
4. Cost & Billing Events
COST_ALLOCATION
{
"eventType": "COST_ALLOCATION",
"timestamp": "2025-08-04T10:30:15Z",
"tenantId": "company-123",
"userId": "user-456",
"source": "billing-service",
"metadata": {
"jobId": "job-12345",
"modelId": "stable-diffusion-xl",
"checkpointId": "protovision-xl-v6.6",
"costBreakdown": {
"computeCostUSD": 0.025,
"checkpointLoadingCostUSD": 0.008,
"checkpointStorageCostUSD": 0.005,
"networkTransferCostUSD": 0.004,
"cachingCostUSD": 0.003,
"totalCostUSD": 0.045
},
"resourceUsage": {
"cpuHours": 0.0042,
"gpuHours": 0.0042,
"storageGBHours": 2.1,
"networkGB": 0.85,
"checkpointCacheHours": 0.0083
},
"billingPeriod": "2025-08",
"allocationMethod": "direct",
"checkpointUsageMetrics": {
"loadingCycles": 1,
"cacheHits": 0,
"cacheMisses": 1,
"retentionHours": 2.5
}
},
"tags": {
"department": "marketing",
"project": "campaign-2025",
"costCenter": "creative-ai",
"billable": "true",
"modelFamily": "stable-diffusion",
"checkpointCategory": "photorealistic"
}
}
5. System Health Events
ALERT_TRIGGERED
{
"eventType": "ALERT_TRIGGERED",
"timestamp": "2025-08-04T10:35:00Z",
"tenantId": "system",
"source": "monitoring-service",
"metadata": {
"alertId": "high-gpu-utilization-001",
"alertType": "RESOURCE_THRESHOLD",
"severity": "warning",
"description": "GPU utilization > 95% for 5 minutes",
"vmId": "vm-instance-001",
"currentValue": 97.2,
"thresholdValue": 95.0,
"duration": 300000
},
"tags": {
"environment": "production",
"component": "compute"
}
}
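An alert such as "GPU utilization > 95% for 5 minutes" can be evaluated from consecutive VM_METRICS_SNAPSHOT readings. The sketch below is one possible approach, not the monitoring-service implementation: it measures the trailing run of samples above the threshold and compares it with the required duration. The function name and the `(timestamp_ms, value)` sample shape are assumptions.

```python
def sustained_breach_ms(samples: list[tuple[int, float]], threshold: float) -> int:
    """Length in ms of the trailing run of samples strictly above threshold.

    `samples` is a time-ordered list of (timestamp_ms, value) pairs.
    """
    run_start = None
    for ts_ms, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts_ms  # A new run of breaching samples begins.
        else:
            run_start = None       # Any sample at or below threshold resets the run.
    return 0 if run_start is None else samples[-1][0] - run_start


THRESHOLD = 95.0
REQUIRED_MS = 300_000  # 5 minutes, as in the sample's duration field.

# Eleven snapshots taken 30 seconds apart, all at 97.2% GPU utilization.
readings = [(i * 30_000, 97.2) for i in range(11)]
breach = sustained_breach_ms(readings, THRESHOLD)
print(breach, breach >= REQUIRED_MS)
```

With a 30-second snapshot interval, eleven consecutive breaching samples span exactly the five-minute window.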
Event Topics/Queues Organization
Kafka Topics Structure
Topics:
  model-execution:
    partitions: 12
    retention: 7 days
    key: modelId + userId
  vm-performance:
    partitions: 8
    retention: 30 days
    key: vmId
  job-lifecycle:
    partitions: 6
    retention: 7 days
    key: jobId
  cost-billing:
    partitions: 4
    retention: 365 days
    key: tenantId + userId
  system-alerts:
    partitions: 2
    retention: 30 days
    key: alertType
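The message key determines which partition an event lands on, so all events with the same key stay ordered relative to each other. The sketch below illustrates the idea with a stable SHA-256 hash; it is not Kafka's default partitioner, and the `:` separator in the composite key, along with the function names, are assumptions made for illustration.

```python
import hashlib

# Partition counts copied from the topic structure.
TOPIC_PARTITIONS = {
    "model-execution": 12,
    "vm-performance": 8,
    "job-lifecycle": 6,
    "cost-billing": 4,
    "system-alerts": 2,
}


def model_execution_key(event: dict) -> str:
    """Build the 'modelId + userId' key; the ':' separator is an assumption."""
    return f'{event["metadata"]["modelId"]}:{event["userId"]}'


def partition_for(topic: str, key: str) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOPIC_PARTITIONS[topic]


event = {"userId": "user-456", "metadata": {"modelId": "stable-diffusion-xl"}}
key = model_execution_key(event)
print(key, partition_for("model-execution", key))
```

One consequence of keying by `modelId + userId` is that a heavily used model and user pair concentrates its traffic on a single partition.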
ClickHouse Schema Design
Model Execution Table
CREATE TABLE model_executions (
event_time DateTime64,
tenant_id String,
user_id String,
job_id String,
model_id String,
model_name String,
model_version String,
model_type String,
checkpoint_id String,
checkpoint_name String,
checkpoint_version String,
checkpoint_hash String,
status String,
checkpoint_loading_time_ms UInt32,
model_initialization_time_ms UInt32,
inference_time_ms UInt32,
total_execution_time_ms UInt32,
checkpoint_cache_hit Boolean,
cost_total_usd Float64,
cost_compute_usd Float64,
cost_checkpoint_loading_usd Float64,
memory_peak_mb Float32,
gpu_utilization_avg Float32,
metadata JSON,
tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, tenant_id, model_id, checkpoint_id)
PARTITION BY toYYYYMM(event_time);
Checkpoint Operations Table
CREATE TABLE checkpoint_operations (
event_time DateTime64,
tenant_id String,
checkpoint_id String,
checkpoint_name String,
checkpoint_version String,
checkpoint_size_mb UInt32,
operation_type String, -- 'loaded', 'cached', 'evicted'
loading_time_ms UInt32,
vm_id String,
model_id String,
cache_hit Boolean,
source_location String,
cost_usd Float64,
metadata JSON,
tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, checkpoint_id, vm_id)
PARTITION BY toYYYYMM(event_time);
VM Performance Table
CREATE TABLE vm_performance (
event_time DateTime64,
vm_id String,
instance_type String,
cpu_utilization Float32,
gpu_utilization Float32,
memory_utilization Float32,
disk_utilization Float32,
cost_per_hour_usd Float64,
active_model_id String,
active_checkpoint_id String,
checkpoint_cache_usage_mb UInt32,
metadata JSON,
tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, vm_id)
PARTITION BY toYYYYMM(event_time);
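Before insertion, each nested camelCase event has to be flattened into the snake_case columns defined above. The following is a minimal sketch of that mapping for a subset of the `model_executions` columns; the function name, the chosen defaults for missing fields, and the decision to pass the timestamp through as a string are assumptions.

```python
def to_model_execution_row(event: dict) -> dict:
    """Flatten a model execution event into a subset of model_executions columns."""
    md = event["metadata"]
    cost = md.get("costBreakdown", {})
    return {
        "event_time": event["timestamp"],
        "tenant_id": event["tenantId"],
        "user_id": event.get("userId", ""),
        "job_id": md["jobId"],
        "model_id": md["modelId"],
        "checkpoint_id": md["checkpointId"],
        "status": md.get("status", ""),
        # Fields absent from STARTED or FAILED events fall back to zero.
        "inference_time_ms": md.get("inferenceTimeMs", 0),
        "total_execution_time_ms": md.get("totalExecutionTimeMs", 0),
        "cost_total_usd": cost.get("totalCostUSD", 0.0),
        "cost_compute_usd": cost.get("computeCostUSD", 0.0),
        "memory_peak_mb": md.get("memoryUsageMB", {}).get("peak", 0),
        "gpu_utilization_avg": md.get("gpuUtilization", {}).get("average", 0.0),
    }


completed = {
    "eventType": "MODEL_EXECUTION_COMPLETED",
    "timestamp": "2025-08-04T10:30:15Z",
    "tenantId": "company-123",
    "userId": "user-456",
    "metadata": {
        "jobId": "job-12345",
        "modelId": "stable-diffusion-xl",
        "checkpointId": "protovision-xl-v6.6",
        "status": "success",
        "inferenceTimeMs": 12500,
        "totalExecutionTimeMs": 15000,
        "memoryUsageMB": {"peak": 12800, "average": 11200},
        "gpuUtilization": {"peak": 98.5, "average": 92.3},
        "costBreakdown": {"computeCostUSD": 0.035, "totalCostUSD": 0.045},
    },
}
row = to_model_execution_row(completed)
print(row["job_id"], row["cost_total_usd"], row["memory_peak_mb"])
```

Promoting frequently queried fields to typed columns while keeping the full payload in the `metadata` column lets common aggregations avoid JSON parsing.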
Event Processing Pipeline
---
config:
layout: elk
---
graph LR
A[Application/XUMI] --> B[Event Publisher]
B --> C[Kafka/Event Hubs]
C --> D[Stream Processor]
D --> E[ClickHouse]
D --> F[PostgreSQL]
E --> G[Grafana]
F --> H[API Service]
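The fan-out step in the diagram, where the stream processor writes each event to both stores, can be sketched as follows. This is an in-memory illustration only: the `StreamProcessor` class and the list-backed sinks are hypothetical stand-ins for real ClickHouse and PostgreSQL writers.

```python
from typing import Callable

# A sink is any callable that accepts one event.
Sink = Callable[[dict], None]


class StreamProcessor:
    """Deliver every consumed event to all configured sinks."""

    def __init__(self, sinks: list[Sink]) -> None:
        self.sinks = sinks

    def handle(self, event: dict) -> None:
        for sink in self.sinks:
            sink(event)


# Lists stand in for the analytical and operational stores.
clickhouse_rows: list[dict] = []
postgres_rows: list[dict] = []

processor = StreamProcessor([clickhouse_rows.append, postgres_rows.append])
processor.handle({"eventType": "JOB_QUEUED", "metadata": {"jobId": "job-12345"}})
print(len(clickhouse_rows), len(postgres_rows))
```

A production version would also need to decide what happens when one sink fails after the other has already accepted the event.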
This event-based design provides:
- Rich Context: Every event contains comprehensive metadata
- Flexible Querying: Events can be aggregated and filtered by any metadata field
- Real-time Processing: Events are processed as they occur
- Historical Analysis: Events are retained for 7 to 365 days, depending on event type, for trend analysis
- Cost Attribution: Every action is tied to users, projects, and costs