Event-Based System Design

Event Architecture Overview

The Xavier telemetry system uses an event-driven architecture where all metrics are captured as structured events with rich metadata. Each event is published to a message bus (Kafka/Azure Event Hubs) and then stored in both operational (PostgreSQL) and analytical (ClickHouse) databases for different query patterns.

Event Overview Table

| Event Type | Category | Source Service | Purpose | Frequency | Retention | Key Metadata Fields |
|---|---|---|---|---|---|---|
| MODEL_EXECUTION_STARTED | Model Execution | spi-service | Tracks start of model inference with checkpoint details | Per job execution | 7 days | jobId, modelId, checkpointId, executionParameters, checkpointLoadingTimeMs, checkpointCacheHit |
| MODEL_EXECUTION_COMPLETED | Model Execution | spi-service | Records successful completion with performance metrics | Per successful job | 7 days | jobId, modelId, checkpointId, inferenceTimeMs, costBreakdown, memoryUsageMB, performanceMetrics |
| MODEL_EXECUTION_FAILED | Model Execution | spi-service | Captures failure details with error analysis | Per failed job | 30 days | jobId, modelId, checkpointId, errorType, failureStage, checkpointFallbackUsed, retryCount |
| CHECKPOINT_LOADED | Checkpoint Management | checkpoint-service | Tracks checkpoint loading operations and cache behavior | Per checkpoint load | 7 days | checkpointId, loadingTimeMs, cacheHit, sourceLocation, vmId, compressionRatio, verificationPassed |
| CHECKPOINT_CACHED | Checkpoint Management | checkpoint-service | Monitors cache operations and storage optimization | Per cache operation | 7 days | checkpointId, cacheAction, cacheUsagePercent, cacheHitRate, evictedCheckpoints, retentionPolicyApplied |
| VM_METRICS_SNAPSHOT | Infrastructure | Telemetry Service | Periodic system resource utilization data | Every 30 seconds | 30 days | vmId, cpuMetrics, gpuMetrics, memoryMetrics, diskMetrics, networkMetrics |
| JOB_QUEUED | Job Lifecycle | Workflow Service | Tracks job entry into processing queue | Per job submission | 7 days | jobId, priority, queueName, estimatedDurationMs, requiredResources, queuePosition |
| JOB_ASSIGNED | Job Lifecycle | Workflow Service | Records job assignment to specific VM | Per job assignment | 7 days | jobId, vmId, queueTimeMs, assignmentReason |
| COST_ALLOCATION | Financial | billing-service | Detailed cost breakdown with checkpoint attribution | Per completed job | 365 days | jobId, modelId, checkpointId, costBreakdown, resourceUsage, checkpointUsageMetrics |
| ALERT_TRIGGERED | System Health | monitoring-service | System alerts and threshold violations | As needed | 30 days | alertId, alertType, severity, vmId, currentValue, thresholdValue, duration |

Event Relationships & Dependencies

Primary Execution Flow

JOB_QUEUED → JOB_ASSIGNED → MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → MODEL_EXECUTION_COMPLETED/FAILED → COST_ALLOCATION
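The ordering in this flow can be checked mechanically for the events of a single job. The following is a minimal sketch, assuming the event types for one job are available as a time-ordered list of strings; the names `STAGE` and `follows_primary_flow` are hypothetical and not part of the Xavier codebase.

```python
# Position of each event type in the primary execution flow.
# COMPLETED and FAILED share a stage because a job ends in exactly one of them.
STAGE = {
    "JOB_QUEUED": 0,
    "JOB_ASSIGNED": 1,
    "MODEL_EXECUTION_STARTED": 2,
    "CHECKPOINT_LOADED": 3,
    "MODEL_EXECUTION_COMPLETED": 4,
    "MODEL_EXECUTION_FAILED": 4,
    "COST_ALLOCATION": 5,
}


def follows_primary_flow(event_types):
    """Return True if stages only move forward and no stage repeats.

    Event types outside the primary flow (for example VM_METRICS_SNAPSHOT)
    are ignored, and skipped stages are tolerated.
    """
    stages = [STAGE[t] for t in event_types if t in STAGE]
    return all(a < b for a, b in zip(stages, stages[1:]))


happy_path = [
    "JOB_QUEUED",
    "JOB_ASSIGNED",
    "MODEL_EXECUTION_STARTED",
    "CHECKPOINT_LOADED",
    "MODEL_EXECUTION_COMPLETED",
    "COST_ALLOCATION",
]
out_of_order = ["JOB_ASSIGNED", "JOB_QUEUED"]
```

Tolerating skipped stages is a design choice in this sketch: a job that fails before its checkpoint loads would otherwise be flagged as malformed.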

Infrastructure Events

VM_PROVISIONED → VM_METRICS_SNAPSHOT (continuous) → VM_DEPROVISIONED

Checkpoint Management Flow

MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → CHECKPOINT_CACHED → (used in)

Event Categories

Model Execution Events (Critical Business Logic)

Aspect Details
Business Impact Direct revenue/cost impact, user experience
Volume High - 1000s per day
Dependencies Requires VM and checkpoint events
Analytics Use Performance optimization, cost analysis, user behavior
SLA < 5 second processing latency

Checkpoint Management Events (Performance Optimization)

Aspect Details
Business Impact Performance optimization, cost reduction
Volume Medium - 100s per day
Dependencies Linked to model execution events
Analytics Use Cache optimization, storage planning, cost reduction
SLA < 2 second processing latency

Infrastructure Events (Operations & Capacity)

Aspect Details
Business Impact Operational efficiency, capacity planning
Volume Very High - VM_METRICS_SNAPSHOT every 30s
Dependencies Foundation for all other events
Analytics Use Resource optimization, scaling decisions, cost allocation
SLA < 1 second processing latency for metrics

Job Lifecycle Events (Queue Management)

Aspect Details
Business Impact User experience, queue optimization
Volume High - matches job volume
Dependencies Precedes model execution events
Analytics Use Queue optimization, capacity planning, user experience
SLA < 1 second processing latency

Financial Events (Business Intelligence)

Aspect Details
Business Impact Direct revenue/cost tracking, billing
Volume High - one per completed job
Dependencies Requires all execution and infrastructure events
Analytics Use Cost optimization, chargeback, ROI analysis
SLA < 10 second processing latency

System Health Events (Reliability)

Aspect Details
Business Impact System reliability, uptime
Volume Low - only when thresholds exceeded
Dependencies Based on infrastructure metrics
Analytics Use Performance monitoring, capacity planning, incident response
SLA < 1 second processing latency
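The per-category SLAs listed in these tables can be collected into a single lookup. This is a rough sketch only: the dictionary and the function `within_sla` are hypothetical names, and treating every SLA as a strict upper bound on processing latency is an assumption.

```python
# Processing-latency SLA per event category, in seconds (from the tables above).
SLA_SECONDS = {
    "Model Execution": 5.0,
    "Checkpoint Management": 2.0,
    "Infrastructure": 1.0,
    "Job Lifecycle": 1.0,
    "Financial": 10.0,
    "System Health": 1.0,
}


def within_sla(category, processing_latency_s):
    """Return True if the observed latency is strictly below the category SLA."""
    return processing_latency_s < SLA_SECONDS[category]


# A COST_ALLOCATION event processed in 7.5 s is inside the Financial SLA,
# while a metrics snapshot that takes 1.2 s misses the Infrastructure SLA.
financial_ok = within_sla("Financial", 7.5)
infrastructure_ok = within_sla("Infrastructure", 1.2)
```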

Event Processing Patterns

Real-time Processing

  • MODEL_EXECUTION events → Dashboard updates, alerting
  • ALERT_TRIGGERED → Immediate notification systems
  • VM_METRICS_SNAPSHOT → Real-time monitoring dashboards

Batch Processing

  • COST_ALLOCATION → Daily/monthly billing reports
  • VM_PROVISIONED/DEPROVISIONED → Capacity planning analysis
  • CHECKPOINT_CACHED → Storage optimization reports

Stream Processing

  • JOB_QUEUED/ASSIGNED → Queue depth monitoring
  • MODEL_EXECUTION → Performance trend analysis
  • CHECKPOINT_LOADED → Cache hit rate calculations
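As an illustration of the last stream-processing item, a cache hit rate could be derived from a window of CHECKPOINT_LOADED events roughly as follows. The function name `cache_hit_rate` and the in-memory list of events are assumptions made for this sketch; a real stream processor would compute the same ratio over a time window.

```python
def cache_hit_rate(events):
    """Fraction of CHECKPOINT_LOADED events whose metadata.cacheHit is true.

    Returns None when the window contains no checkpoint loads, because the
    rate is undefined in that case.
    """
    loads = [e for e in events if e.get("eventType") == "CHECKPOINT_LOADED"]
    if not loads:
        return None
    hits = sum(1 for e in loads if e["metadata"].get("cacheHit"))
    return hits / len(loads)


# Three loads (two served from cache) plus one unrelated event.
window = [
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": False}},
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": True}},
    {"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": True}},
    {"eventType": "JOB_QUEUED", "metadata": {}},
]
rate = cache_hit_rate(window)  # two hits out of three loads
```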

Data Volume Estimates

| Event Type | Daily Volume | Weekly Volume | Monthly Volume | Storage per Event |
|---|---|---|---|---|
| MODEL_EXECUTION_STARTED | 5,000 | 35,000 | 150,000 | 2KB |
| MODEL_EXECUTION_COMPLETED | 4,850 | 33,950 | 145,500 | 3KB |
| MODEL_EXECUTION_FAILED | 150 | 1,050 | 4,500 | 2.5KB |
| CHECKPOINT_LOADED | 1,200 | 8,400 | 36,000 | 1.5KB |
| CHECKPOINT_CACHED | 300 | 2,100 | 9,000 | 1KB |
| VM_METRICS_SNAPSHOT | 2,880,000 | 20,160,000 | 86,400,000 | 4KB |
| VM_PROVISIONED | 50 | 350 | 1,500 | 1KB |
| VM_DEPROVISIONED | 50 | 350 | 1,500 | 1.5KB |
| JOB_QUEUED | 5,000 | 35,000 | 150,000 | 1.5KB |
| JOB_ASSIGNED | 5,000 | 35,000 | 150,000 | 1KB |
| COST_ALLOCATION | 4,850 | 33,950 | 145,500 | 2.5KB |
| ALERT_TRIGGERED | 20 | 140 | 600 | 2KB |

Total Daily Storage: ~11.5GB/day

Total Monthly Storage: ~350GB/month
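The storage totals can be reproduced from the per-event figures in the table. The sketch below assumes decimal units (1 GB = 10^6 KB) and a 30-day month, which is how the table's monthly volumes relate to its daily volumes; the variable names are illustrative.

```python
# (daily event count, storage per event in KB), copied from the table above.
DAILY_VOLUME = {
    "MODEL_EXECUTION_STARTED": (5_000, 2.0),
    "MODEL_EXECUTION_COMPLETED": (4_850, 3.0),
    "MODEL_EXECUTION_FAILED": (150, 2.5),
    "CHECKPOINT_LOADED": (1_200, 1.5),
    "CHECKPOINT_CACHED": (300, 1.0),
    "VM_METRICS_SNAPSHOT": (2_880_000, 4.0),
    "VM_PROVISIONED": (50, 1.0),
    "VM_DEPROVISIONED": (50, 1.5),
    "JOB_QUEUED": (5_000, 1.5),
    "JOB_ASSIGNED": (5_000, 1.0),
    "COST_ALLOCATION": (4_850, 2.5),
    "ALERT_TRIGGERED": (20, 2.0),
}

KB_PER_GB = 1_000_000  # assumption: decimal units, the table does not say
DAYS_PER_MONTH = 30    # the table's monthly volumes are 30 x the daily ones

daily_kb = sum(count * kb for count, kb in DAILY_VOLUME.values())
daily_gb = daily_kb / KB_PER_GB
monthly_gb = daily_gb * DAYS_PER_MONTH

# Share of daily storage taken by the 30-second VM snapshots.
snap_count, snap_kb = DAILY_VOLUME["VM_METRICS_SNAPSHOT"]
snapshot_share = snap_count * snap_kb / daily_kb

print(f"{daily_gb:.2f} GB/day, {monthly_gb:.0f} GB/month")
print(f"VM_METRICS_SNAPSHOT share: {snapshot_share:.1%}")
```

Under these assumptions the script reports about 11.57 GB/day and 347 GB/month, in line with the rounded totals. VM_METRICS_SNAPSHOT accounts for more than 99% of the volume, so the snapshot interval and payload size dominate any storage planning.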

Event Types & Metadata

1. Model Execution Events

MODEL_EXECUTION_STARTED

{
  "eventType": "MODEL_EXECUTION_STARTED",
  "timestamp": "2025-08-04T10:30:00Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "modelName": "Stable Diffusion XL",
    "modelVersion": "v1.0.5",
    "modelHash": "abc123def456",
    "modelType": "text-to-image",
    "modelSizeMB": 6900,
    "checkpointId": "protovision-xl-v6.6",
    "checkpointName": "ProtoVision XL HighFidelity 3D",
    "checkpointVersion": "v6.6.0",
    "checkpointHash": "xyz789abc123",
    "checkpointSizeMB": 3840,
    "checkpointSource": "civitai",
    "checkpointLoadingTimeMs": 2100,
    "runType": "inference",
    "vmId": "vm-instance-001",
    "nodeId": "k8s-node-01",
    "executionParameters": {
      "prompt": "a beautiful photograph of a landscape",
      "negative_prompt": "low quality, blurry, bad anatomy",
      "width": 1024,
      "height": 1024,
      "num_inference_steps": 20,
      "guidance_scale": 5.0,
      "sampler": "euler",
      "scheduler": "karras"
    },
    "inputSizeBytes": 245,
    "expectedOutputSizeBytes": 8388608,
    "checkpointCacheHit": false
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}
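Every sample event in this document shares the same top-level envelope: `eventType`, `timestamp`, `tenantId`, `source`, `metadata` and `tags`, with `userId` and `sessionId` present only on some events. A publisher could check that envelope before sending. The following is a minimal sketch under that reading of the samples; `REQUIRED_FIELDS` and `envelope_errors` are hypothetical names, not an existing Xavier API.

```python
from datetime import datetime

# Top-level fields that appear on every sample event in this document.
REQUIRED_FIELDS = ("eventType", "timestamp", "tenantId", "source", "metadata", "tags")


def envelope_errors(event):
    """Return a list of envelope problems; an empty list means it looks well-formed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]

    ts = event.get("timestamp")
    if isinstance(ts, str):
        try:
            # fromisoformat() accepts a trailing "Z" only from Python 3.11,
            # so rewrite it as an explicit UTC offset first.
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"timestamp is not ISO 8601: {ts!r}")

    for name in ("metadata", "tags"):
        if name in event and not isinstance(event[name], dict):
            errors.append(f"{name} must be an object")
    return errors


started = {
    "eventType": "MODEL_EXECUTION_STARTED",
    "timestamp": "2025-08-04T10:30:00Z",
    "tenantId": "company-123",
    "source": "spi-service",
    "metadata": {"jobId": "job-12345"},
    "tags": {"environment": "production"},
}
problems = envelope_errors(started)  # empty list for this event
```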

MODEL_EXECUTION_COMPLETED

{
  "eventType": "MODEL_EXECUTION_COMPLETED",
  "timestamp": "2025-08-04T10:30:15Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "status": "success",
    "checkpointLoadingTimeMs": 2100,
    "modelInitializationTimeMs": 400,
    "inferenceTimeMs": 12500,
    "totalExecutionTimeMs": 15000,
    "outputSizeBytes": 8324567,
    "memoryUsageMB": {
      "peak": 12800,
      "average": 11200,
      "checkpointOverhead": 3840
    },
    "gpuUtilization": {
      "peak": 98.5,
      "average": 92.3
    },
    "throughputItemsPerSecond": 0.067,
    "costBreakdown": {
      "computeCostUSD": 0.035,
      "checkpointLoadingCostUSD": 0.008,
      "storageCostUSD": 0.002,
      "totalCostUSD": 0.045
    },
    "performanceMetrics": {
      "stepsPerSecond": 1.6,
      "vramEfficiency": 87.3,
      "checkpointEfficiency": 94.1
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}
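The derived fields in this sample are mutually consistent: the three phase timings add up to `totalExecutionTimeMs`, the cost components add up to `totalCostUSD`, and `stepsPerSecond` equals the 20 inference steps of the matching MODEL_EXECUTION_STARTED event divided by the inference time. The sketch below recomputes them. These relationships are inferred from the sample values rather than stated by the source, and the function name `derive_metrics` is hypothetical.

```python
def derive_metrics(metadata, num_inference_steps):
    """Recompute derived fields of a MODEL_EXECUTION_COMPLETED event."""
    phase_total_ms = (
        metadata["checkpointLoadingTimeMs"]
        + metadata["modelInitializationTimeMs"]
        + metadata["inferenceTimeMs"]
    )
    cost = metadata["costBreakdown"]
    # Sum every cost component except the reported total itself.
    component_cost = sum(v for k, v in cost.items() if k != "totalCostUSD")
    return {
        "phase_total_ms": phase_total_ms,
        "steps_per_second": num_inference_steps / (metadata["inferenceTimeMs"] / 1000),
        # Assumes one output item per job.
        "items_per_second": 1000 / metadata["totalExecutionTimeMs"],
        "component_cost_usd": round(component_cost, 6),
    }


completed = {
    "checkpointLoadingTimeMs": 2100,
    "modelInitializationTimeMs": 400,
    "inferenceTimeMs": 12500,
    "totalExecutionTimeMs": 15000,
    "costBreakdown": {
        "computeCostUSD": 0.035,
        "checkpointLoadingCostUSD": 0.008,
        "storageCostUSD": 0.002,
        "totalCostUSD": 0.045,
    },
}
derived = derive_metrics(completed, num_inference_steps=20)
```

A check like this could run in the stream processor to flag events whose reported totals drift from their components.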

MODEL_EXECUTION_FAILED

{
  "eventType": "MODEL_EXECUTION_FAILED",
  "timestamp": "2025-08-04T10:30:08Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "status": "failed",
    "failureStage": "checkpoint_loading", // "checkpoint_loading", "model_init", "inference"
    "checkpointLoadingTimeMs": 1800,
    "executionTimeMs": 8000,
    "errorType": "CHECKPOINT_CORRUPTION",
    "errorCode": "E2003",
    "errorMessage": "Checkpoint file corrupted: invalid tensor dimensions",
    "stackTrace": "...",
    "retryCount": 2,
    "checkpointFallbackUsed": true,
    "fallbackCheckpointId": "stable-diffusion-xl-base",
    "costBreakdown": {
      "computeCostUSD": 0.015,
      "checkpointLoadingCostUSD": 0.006,
      "totalCostUSD": 0.021
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}

2. Checkpoint Management Events

CHECKPOINT_LOADED

{
  "eventType": "CHECKPOINT_LOADED",
  "timestamp": "2025-08-04T10:29:58Z",
  "tenantId": "company-123",
  "source": "checkpoint-manager",
  "metadata": {
    "checkpointId": "protovision-xl-v6.6",
    "checkpointName": "ProtoVision XL HighFidelity 3D",
    "checkpointVersion": "v6.6.0",
    "checkpointHash": "xyz789abc123",
    "checkpointSizeMB": 3840,
    "loadingTimeMs": 2100,
    "sourceLocation": "s3://checkpoints/stable-diffusion/",
    "cacheHit": false,
    "vmId": "vm-instance-001",
    "modelId": "stable-diffusion-xl",
    "loadingMethod": "direct_download", // "cache_hit", "direct_download", "preloaded"
    "compressionRatio": 0.73,
    "verificationPassed": true
  },
  "tags": {
    "environment": "production",
    "checkpointCategory": "photorealistic",
    "modelFamily": "stable-diffusion"
  }
}

CHECKPOINT_CACHED

{
  "eventType": "CHECKPOINT_CACHED",
  "timestamp": "2025-08-04T10:32:05Z",
  "tenantId": "company-123",
  "source": "checkpoint-cache",
  "metadata": {
    "checkpointId": "protovision-xl-v6.6",
    "checkpointSizeMB": 3840,
    "vmId": "vm-instance-001",
    "cacheAction": "stored", // "stored", "evicted", "preloaded"
    "cacheUsagePercent": 78.5,
    "cacheHitRate": 0.87,
    "evictedCheckpoints": ["old-checkpoint-v1.2"],
    "retentionPolicyApplied": "lru"
  },
  "tags": {
    "environment": "production",
    "cacheStrategy": "lru"
  }
}

3. VM Performance Events

VM_METRICS_SNAPSHOT

{
  "eventType": "VM_METRICS_SNAPSHOT",
  "timestamp": "2025-08-04T10:30:00Z",
  "tenantId": "company-123",
  "source": "prometheus-agent",
  "metadata": {
    "vmId": "vm-instance-001",
    "instanceType": "g4dn.2xlarge",
    "zone": "us-west-2a",
    "cpuMetrics": {
      "utilizationPercent": 75.2,
      "coreCount": 8,
      "loadAverage": 4.2
    },
    "gpuMetrics": {
      "utilizationPercent": 92.3,
      "memoryUtilizationPercent": 85.7,
      "memoryTotalMB": 16384,
      "memoryUsedMB": 14031,
      "temperatureCelsius": 78
    },
    "memoryMetrics": {
      "utilizationPercent": 68.4,
      "totalMB": 32768,
      "usedMB": 22420,
      "availableMB": 10348
    },
    "diskMetrics": {
      "utilizationPercent": 45.2,
      "totalGB": 500,
      "usedGB": 226,
      "readIOPS": 120,
      "writeIOPS": 85,
      "readThroughputMBps": 45.2,
      "writeThroughputMBps": 23.1
    },
    "networkMetrics": {
      "inboundMBps": 12.5,
      "outboundMBps": 8.7,
      "packetsInPerSec": 1250,
      "packetsOutPerSec": 980,
      "packetLossPercent": 0.01
    }
  },
  "tags": {
    "environment": "production",
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}

VM_PROVISIONED

{
  "eventType": "VM_PROVISIONED",
  "timestamp": "2025-08-04T10:25:00Z",
  "tenantId": "company-123",
  "source": "vm-manager",
  "metadata": {
    "vmId": "vm-instance-001",
    "instanceType": "g4dn.2xlarge",
    "zone": "us-west-2a",
    "provisioningTimeMs": 45000,
    "costPerHourUSD": 0.752,
    "requestedBy": "auto-scaler",
    "reason": "high_queue_depth"
  },
  "tags": {
    "environment": "production",
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}

VM_DEPROVISIONED

{
  "eventType": "VM_DEPROVISIONED", 
  "timestamp": "2025-08-04T12:30:00Z",
  "tenantId": "company-123",
  "source": "vm-manager",
  "metadata": {
    "vmId": "vm-instance-001",
    "uptimeMinutes": 125,
    "totalCostUSD": 1.567,
    "reason": "idle_timeout",
    "jobsCompleted": 23,
    "utilizationSummary": {
      "avgCpuPercent": 42.1,
      "avgGpuPercent": 67.3,
      "avgMemoryPercent": 55.8
    }
  },
  "tags": {
    "environment": "production", 
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}
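The `uptimeMinutes` and `totalCostUSD` values in this sample follow from the VM_PROVISIONED event for the same VM: the two timestamps are 125 minutes apart, and 125 minutes at 0.752 USD per hour is about 1.567 USD. A short sketch of that arithmetic, assuming linear per-minute billing (the source does not state the billing granularity) and using hypothetical function names:

```python
from datetime import datetime


def minutes_between(start_iso, end_iso):
    """Minutes elapsed between two ISO 8601 UTC timestamps ending in 'Z'."""
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds() / 60


def vm_total_cost_usd(uptime_minutes, cost_per_hour_usd):
    """VM lifetime cost, assuming cost accrues linearly per minute."""
    return round(uptime_minutes / 60 * cost_per_hour_usd, 3)


# Timestamps of the VM_PROVISIONED and VM_DEPROVISIONED samples.
uptime = minutes_between("2025-08-04T10:25:00Z", "2025-08-04T12:30:00Z")
cost = vm_total_cost_usd(uptime, cost_per_hour_usd=0.752)
print(uptime, cost)
```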

4. Job Lifecycle Events

JOB_QUEUED

{
  "eventType": "JOB_QUEUED",
  "timestamp": "2025-08-04T10:29:45Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "source": "job-scheduler",
  "metadata": {
    "jobId": "job-12345",
    "priority": "normal",
    "queueName": "text-to-image",
    "estimatedDurationMs": 15000,
    "requiredResources": {
      "cpuCores": 2,
      "memoryMB": 8192,
      "gpuMemoryMB": 12288
    },
    "queuePosition": 3,
    "queueDepth": 8
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal"
  }
}

JOB_ASSIGNED

{
  "eventType": "JOB_ASSIGNED",
  "timestamp": "2025-08-04T10:29:55Z", 
  "tenantId": "company-123",
  "userId": "user-456",
  "source": "job-scheduler",
  "metadata": {
    "jobId": "job-12345",
    "vmId": "vm-instance-001",
    "queueTimeMs": 10000,
    "assignmentReason": "best_fit"
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025"
  }
}

5. Cost & Billing Events

COST_ALLOCATION

{
  "eventType": "COST_ALLOCATION",
  "timestamp": "2025-08-04T10:30:15Z",
  "tenantId": "company-123", 
  "userId": "user-456",
  "source": "billing-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "costBreakdown": {
      "computeCostUSD": 0.025,
      "checkpointLoadingCostUSD": 0.008,
      "checkpointStorageCostUSD": 0.005,
      "networkTransferCostUSD": 0.004,
      "cachingCostUSD": 0.003,
      "totalCostUSD": 0.045
    },
    "resourceUsage": {
      "cpuHours": 0.0042,
      "gpuHours": 0.0042,
      "storageGBHours": 2.1,
      "networkGB": 0.85,
      "checkpointCacheHours": 0.0083
    },
    "billingPeriod": "2025-08",
    "allocationMethod": "direct",
    "checkpointUsageMetrics": {
      "loadingCycles": 1,
      "cacheHits": 0,
      "cacheMisses": 1,
      "retentionHours": 2.5
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "costCenter": "creative-ai",
    "billable": "true",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}
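Because each COST_ALLOCATION event carries both a `costBreakdown` and `tags`, chargeback totals can be produced by grouping on any tag. The following is an illustrative sketch: the function name `cost_by_tag`, the `"untagged"` bucket and the second and third sample events are made up for the example.

```python
from collections import defaultdict


def cost_by_tag(events, tag):
    """Sum costBreakdown.totalCostUSD of COST_ALLOCATION events per tag value."""
    totals = defaultdict(float)
    for event in events:
        if event.get("eventType") != "COST_ALLOCATION":
            continue
        # Events without the tag are grouped under a hypothetical "untagged" bucket.
        key = event.get("tags", {}).get(tag, "untagged")
        totals[key] += event["metadata"]["costBreakdown"]["totalCostUSD"]
    return {key: round(total, 6) for key, total in totals.items()}


def allocation(total_usd, **tags):
    """Build a minimal COST_ALLOCATION event for the example."""
    return {
        "eventType": "COST_ALLOCATION",
        "metadata": {"costBreakdown": {"totalCostUSD": total_usd}},
        "tags": tags,
    }


events = [
    allocation(0.045, department="marketing", project="campaign-2025"),
    allocation(0.030, department="marketing", project="campaign-2025"),  # made-up value
    allocation(0.012, department="research"),                            # made-up value
]
per_department = cost_by_tag(events, "department")
per_project = cost_by_tag(events, "project")
```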

6. System Health Events

ALERT_TRIGGERED

{
  "eventType": "ALERT_TRIGGERED",
  "timestamp": "2025-08-04T10:35:00Z",
  "tenantId": "system",
  "source": "monitoring-service",
  "metadata": {
    "alertId": "high-gpu-utilization-001",
    "alertType": "RESOURCE_THRESHOLD",
    "severity": "warning",
    "description": "GPU utilization > 95% for 5 minutes",
    "vmId": "vm-instance-001", 
    "currentValue": 97.2,
    "thresholdValue": 95.0,
    "duration": 300000
  },
  "tags": {
    "environment": "production",
    "component": "compute"
  }
}
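An alert such as "GPU utilization > 95% for 5 minutes" needs a sustained-breach check over consecutive VM_METRICS_SNAPSHOT values, not a single-sample comparison. The sketch below shows one way to do that over `(timestamp_ms, value)` pairs. The function names and the rule that any sample at or below the threshold resets the run are assumptions, not the monitoring-service's documented behaviour.

```python
def sustained_breach_ms(samples, threshold):
    """Length in ms of the trailing run of samples above the threshold.

    samples is a list of (timestamp_ms, value) pairs sorted by time.
    """
    run_start = None
    for timestamp_ms, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp_ms
        else:
            run_start = None  # a sample at or below the threshold ends the run
    if run_start is None:
        return 0
    return samples[-1][0] - run_start


def should_alert(samples, threshold=95.0, duration_ms=300_000):
    """True once the value has stayed above the threshold for duration_ms."""
    return sustained_breach_ms(samples, threshold) >= duration_ms


# Eleven snapshots at 30-second spacing cover exactly five minutes.
steady = [(i * 30_000, 97.2) for i in range(11)]
# Same series with one dip at the sixth snapshot, which resets the run.
dipped = [(t, 90.0 if t == 150_000 else v) for t, v in steady]
```

With 30-second snapshots, the 300000 ms `duration` in the sample corresponds to eleven consecutive breaching samples under this definition.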

Event Topics/Queues Organization

Kafka Topics Structure

Topics:
  model-execution:
    partitions: 12
    retention: 7 days
    key: modelId + userId

  vm-performance:
    partitions: 8
    retention: 30 days
    key: vmId

  job-lifecycle:
    partitions: 6
    retention: 7 days
    key: jobId

  cost-billing:
    partitions: 4
    retention: 365 days
    key: tenantId + userId

  system-alerts:
    partitions: 2
    retention: 30 days
    key: alertType
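The `key` entries determine which partition an event lands on, and therefore which events keep their relative order. The sketch below builds the key for each topic and maps it to a partition. Two things are assumptions: the `:` separator for composite keys (the configuration only says `modelId + userId`), and the CRC32 hash, which stands in for a real partitioner. Kafka's Java client hashes keys with murmur2 by default, so actual partition numbers would differ.

```python
import zlib

# Partition counts from the topic configuration above.
TOPIC_PARTITIONS = {
    "model-execution": 12,
    "vm-performance": 8,
    "job-lifecycle": 6,
    "cost-billing": 4,
    "system-alerts": 2,
}


def partition_key(topic, event):
    """Build the message key for a topic from the fields named in its config."""
    meta = event.get("metadata", {})
    if topic == "model-execution":
        return f'{meta["modelId"]}:{event["userId"]}'
    if topic == "vm-performance":
        return meta["vmId"]
    if topic == "job-lifecycle":
        return meta["jobId"]
    if topic == "cost-billing":
        return f'{event["tenantId"]}:{event["userId"]}'
    if topic == "system-alerts":
        return meta["alertType"]
    raise KeyError(f"unknown topic: {topic}")


def partition_for(topic, key):
    """Illustrative hash partitioner (CRC32, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % TOPIC_PARTITIONS[topic]


event = {
    "tenantId": "company-123",
    "userId": "user-456",
    "metadata": {"jobId": "job-12345", "modelId": "stable-diffusion-xl"},
}
key = partition_key("model-execution", event)
partition = partition_for("model-execution", key)
```

Keying `job-lifecycle` by `jobId` keeps JOB_QUEUED and JOB_ASSIGNED for one job on the same partition, which preserves their order for consumers.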

ClickHouse Schema Design

Model Execution Table

CREATE TABLE model_executions (
    event_time DateTime64,
    tenant_id String,
    user_id String,
    job_id String,
    model_id String,
    model_name String,
    model_version String,
    model_type String,
    checkpoint_id String,
    checkpoint_name String,
    checkpoint_version String,
    checkpoint_hash String,
    status String,
    checkpoint_loading_time_ms UInt32,
    model_initialization_time_ms UInt32,
    inference_time_ms UInt32,
    total_execution_time_ms UInt32,
    checkpoint_cache_hit Boolean,
    cost_total_usd Float64,
    cost_compute_usd Float64,
    cost_checkpoint_loading_usd Float64,
    memory_peak_mb Float32,
    gpu_utilization_avg Float32,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, tenant_id, model_id, checkpoint_id)
PARTITION BY toYYYYMM(event_time);
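Before insertion, a nested event has to be flattened into the columns of `model_executions`. The following is a minimal sketch of that mapping for a MODEL_EXECUTION_COMPLETED event. The function name is hypothetical, the `metadata` and `tags` columns are serialised as JSON strings (an assumption about how the JSON columns are loaded), and columns such as `model_name` or `checkpoint_hash` are left out because they only appear on the MODEL_EXECUTION_STARTED event.

```python
import json


def to_model_execution_row(event):
    """Flatten a MODEL_EXECUTION_COMPLETED event into model_executions columns."""
    meta = event["metadata"]
    cost = meta.get("costBreakdown", {})
    return {
        "event_time": event["timestamp"],
        "tenant_id": event["tenantId"],
        "user_id": event.get("userId", ""),
        "job_id": meta["jobId"],
        "model_id": meta["modelId"],
        "checkpoint_id": meta["checkpointId"],
        "status": meta.get("status", ""),
        "checkpoint_loading_time_ms": meta.get("checkpointLoadingTimeMs", 0),
        "model_initialization_time_ms": meta.get("modelInitializationTimeMs", 0),
        "inference_time_ms": meta.get("inferenceTimeMs", 0),
        "total_execution_time_ms": meta.get("totalExecutionTimeMs", 0),
        "cost_total_usd": cost.get("totalCostUSD", 0.0),
        "cost_compute_usd": cost.get("computeCostUSD", 0.0),
        "cost_checkpoint_loading_usd": cost.get("checkpointLoadingCostUSD", 0.0),
        "memory_peak_mb": meta.get("memoryUsageMB", {}).get("peak", 0.0),
        "gpu_utilization_avg": meta.get("gpuUtilization", {}).get("average", 0.0),
        # Keep the full payload so fields without a dedicated column stay queryable.
        "metadata": json.dumps(meta),
        "tags": json.dumps(event.get("tags", {})),
    }


event = {
    "eventType": "MODEL_EXECUTION_COMPLETED",
    "timestamp": "2025-08-04T10:30:15Z",
    "tenantId": "company-123",
    "userId": "user-456",
    "metadata": {
        "jobId": "job-12345",
        "modelId": "stable-diffusion-xl",
        "checkpointId": "protovision-xl-v6.6",
        "status": "success",
        "inferenceTimeMs": 12500,
        "totalExecutionTimeMs": 15000,
        "memoryUsageMB": {"peak": 12800, "average": 11200},
        "gpuUtilization": {"peak": 98.5, "average": 92.3},
        "costBreakdown": {"computeCostUSD": 0.035, "totalCostUSD": 0.045},
    },
    "tags": {"department": "marketing"},
}
row = to_model_execution_row(event)
```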

Checkpoint Operations Table

CREATE TABLE checkpoint_operations (
    event_time DateTime64,
    tenant_id String,
    checkpoint_id String,
    checkpoint_name String,
    checkpoint_version String,
    checkpoint_size_mb UInt32,
    operation_type String, -- 'loaded', 'cached', 'evicted'
    loading_time_ms UInt32,
    vm_id String,
    model_id String,
    cache_hit Boolean,
    source_location String,
    cost_usd Float64,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, checkpoint_id, vm_id)
PARTITION BY toYYYYMM(event_time);

VM Performance Table

CREATE TABLE vm_performance (
    event_time DateTime64,
    vm_id String,
    instance_type String,
    cpu_utilization Float32,
    gpu_utilization Float32,
    memory_utilization Float32,
    disk_utilization Float32,
    cost_per_hour_usd Float64,
    active_model_id String,
    active_checkpoint_id String,
    checkpoint_cache_usage_mb UInt32,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, vm_id)
PARTITION BY toYYYYMM(event_time);

Event Processing Pipeline

---
config:
  layout: elk
---
graph LR
  A[Application/XUMI] --> B[Event Publisher]
  B --> C[Kafka/Event Hubs]
  C --> D[Stream Processor]
  D --> E[ClickHouse]
  D --> F[PostgreSQL]
  E --> G[Grafana]
  F --> H[API Service]
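The fan-out step in this pipeline, where the stream processor writes each event to both ClickHouse and PostgreSQL, can be illustrated with an in-memory stand-in. Everything here is a toy sketch: `queue.Queue` plays the role of the message bus, and `ListSink` is a hypothetical placeholder for the two database writers.

```python
import queue


class ListSink:
    """Hypothetical stand-in for a database writer; it only collects events."""

    def __init__(self):
        self.rows = []

    def write(self, event):
        self.rows.append(event)


def run_stream_processor(bus, sinks):
    """Drain the bus and write every event to every sink; return the event count."""
    processed = 0
    while True:
        try:
            event = bus.get_nowait()
        except queue.Empty:
            return processed
        for sink in sinks:
            sink.write(event)
        processed += 1


# The event publisher puts events on the bus.
bus = queue.Queue()
bus.put({"eventType": "JOB_QUEUED", "metadata": {"jobId": "job-12345"}})
bus.put({"eventType": "JOB_ASSIGNED", "metadata": {"jobId": "job-12345"}})

clickhouse, postgres = ListSink(), ListSink()
count = run_stream_processor(bus, [clickhouse, postgres])
```

In a real deployment the two writes are not atomic, so the processor would need retries or idempotent inserts to keep both stores consistent.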

This event-based design provides:

  • Rich Context: Every event contains comprehensive metadata
  • Flexible Querying: Events can be aggregated and filtered by any metadata field
  • Real-time Processing: Events are processed as they occur
  • Historical Analysis: All events are stored for trend analysis
  • Cost Attribution: Every action is tied to users, projects, and costs