Event-Based System Design¶

Event Architecture Overview¶

The Xavier telemetry system uses an event-driven architecture where all metrics are captured as structured events with rich metadata. Each event is published to a message bus (Kafka/Azure Event Hubs) and then stored in both operational (PostgreSQL) and analytical (ClickHouse) databases for different query patterns.
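As a rough illustration of the shared envelope, the sketch below models the top-level fields that recur in the example events later in this document (eventType, timestamp, tenantId, userId, sessionId, source, metadata, tags). The `Event` class and its `to_json` method are hypothetical names introduced here for illustration; they are not part of the Xavier codebase.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Optional


def utc_now() -> str:
    # Same timestamp layout as the example events, e.g. 2025-08-04T10:30:00Z
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class Event:
    """Hypothetical envelope mirroring the top-level fields of the examples."""
    eventType: str
    tenantId: str
    source: str
    metadata: dict[str, Any]
    tags: dict[str, str] = field(default_factory=dict)
    userId: Optional[str] = None      # not present on infrastructure events
    sessionId: Optional[str] = None   # only present on model execution events
    timestamp: str = field(default_factory=utc_now)

    def to_json(self) -> str:
        # Drop unset optional fields so the payload matches the documented shape.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


event = Event(
    eventType="JOB_ASSIGNED",
    tenantId="company-123",
    userId="user-456",
    source="job-scheduler",
    metadata={"jobId": "job-12345", "vmId": "vm-instance-001"},
    timestamp="2025-08-04T10:29:55Z",
)
encoded = event.to_json()
```

The resulting string is what a publisher would hand to the message bus client; the client itself is left out because the source does not specify one.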

Event Overview Table¶

Event Type	Category	Source Service	Purpose	Frequency	Retention	Key Metadata Fields
MODEL_EXECUTION_STARTED	Model Execution	spi-service	Tracks start of model inference with checkpoint details	Per job execution	7 days	jobId, modelId, checkpointId, executionParameters, checkpointLoadingTimeMs, checkpointCacheHit
MODEL_EXECUTION_COMPLETED	Model Execution	spi-service	Records successful completion with performance metrics	Per successful job	7 days	jobId, modelId, checkpointId, inferenceTimeMs, costBreakdown, memoryUsageMB, performanceMetrics
MODEL_EXECUTION_FAILED	Model Execution	spi-service	Captures failure details with error analysis	Per failed job	30 days	jobId, modelId, checkpointId, errorType, failureStage, checkpointFallbackUsed, retryCount
CHECKPOINT_LOADED	Checkpoint Management	checkpoint-service	Tracks checkpoint loading operations and cache behavior	Per checkpoint load	7 days	checkpointId, loadingTimeMs, cacheHit, sourceLocation, vmId, compressionRatio, verificationPassed
CHECKPOINT_CACHED	Checkpoint Management	checkpoint-service	Monitors cache operations and storage optimization	Per cache operation	7 days	checkpointId, cacheAction, cacheUsagePercent, cacheHitRate, evictedCheckpoints, retentionPolicyApplied
VM_METRICS_SNAPSHOT	Infrastructure	Telemetry Service	Periodic system resource utilization data	Every 30 seconds	30 days	vmId, cpuMetrics, gpuMetrics, memoryMetrics, diskMetrics, networkMetrics
JOB_QUEUED	Job Lifecycle	Workflow Service	Tracks job entry into processing queue	Per job submission	7 days	jobId, priority, queueName, estimatedDurationMs, requiredResources, queuePosition
JOB_ASSIGNED	Job Lifecycle	Workflow Service	Records job assignment to specific VM	Per job assignment	7 days	jobId, vmId, queueTimeMs, assignmentReason
COST_ALLOCATION	Financial	billing-service	Detailed cost breakdown with checkpoint attribution	Per completed job	365 days	jobId, modelId, checkpointId, costBreakdown, resourceUsage, checkpointUsageMetrics
ALERT_TRIGGERED	System Health	monitoring-service	System alerts and threshold violations	As needed	30 days	alertId, alertType, severity, vmId, currentValue, thresholdValue, duration
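The retention column above can be applied mechanically. The following is a minimal sketch, assuming retention is measured from the event's own timestamp; the `is_expired` helper is a hypothetical name, and the actual cleanup mechanism is not described in this document.

```python
from datetime import datetime, timedelta, timezone

# Retention periods in days, copied from the Event Overview Table.
RETENTION_DAYS = {
    "MODEL_EXECUTION_STARTED": 7,
    "MODEL_EXECUTION_COMPLETED": 7,
    "MODEL_EXECUTION_FAILED": 30,
    "CHECKPOINT_LOADED": 7,
    "CHECKPOINT_CACHED": 7,
    "VM_METRICS_SNAPSHOT": 30,
    "JOB_QUEUED": 7,
    "JOB_ASSIGNED": 7,
    "COST_ALLOCATION": 365,
    "ALERT_TRIGGERED": 30,
}


def parse_ts(ts: str) -> datetime:
    """Parse the ISO-8601 UTC timestamps used in the example events."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_expired(event: dict, now: datetime) -> bool:
    """ASSUMPTION: an event expires once its age exceeds its retention period."""
    retention = timedelta(days=RETENTION_DAYS[event["eventType"]])
    return now - parse_ts(event["timestamp"]) > retention


now = parse_ts("2025-08-14T10:30:00Z")
started = {"eventType": "MODEL_EXECUTION_STARTED", "timestamp": "2025-08-04T10:30:00Z"}
failed = {"eventType": "MODEL_EXECUTION_FAILED", "timestamp": "2025-08-04T10:30:08Z"}
```

Ten days after the run, the started event is past its 7-day window while the failure record, kept for 30 days, is still retained.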

Event Relationships & Dependencies¶

Primary Execution Flow¶

JOB_QUEUED → JOB_ASSIGNED → MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → MODEL_EXECUTION_COMPLETED/FAILED → COST_ALLOCATION
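One way to use this flow is as a sanity check on the events collected for a single job. The sketch below assumes each job emits exactly one event per stage in the order shown; `follows_primary_flow` is a hypothetical helper, and real traffic (retries, cache hits that skip loading) may legitimately deviate.

```python
# Each stage lists the event types accepted at that position in the flow.
PRIMARY_FLOW = [
    ("JOB_QUEUED",),
    ("JOB_ASSIGNED",),
    ("MODEL_EXECUTION_STARTED",),
    ("CHECKPOINT_LOADED",),
    ("MODEL_EXECUTION_COMPLETED", "MODEL_EXECUTION_FAILED"),
    ("COST_ALLOCATION",),
]


def follows_primary_flow(event_types: list[str]) -> bool:
    """Return True when the sequence matches the documented flow stage by stage."""
    if len(event_types) != len(PRIMARY_FLOW):
        return False
    return all(seen in allowed for seen, allowed in zip(event_types, PRIMARY_FLOW))


ok = follows_primary_flow([
    "JOB_QUEUED", "JOB_ASSIGNED", "MODEL_EXECUTION_STARTED",
    "CHECKPOINT_LOADED", "MODEL_EXECUTION_COMPLETED", "COST_ALLOCATION",
])
out_of_order = follows_primary_flow([
    "JOB_ASSIGNED", "JOB_QUEUED", "MODEL_EXECUTION_STARTED",
    "CHECKPOINT_LOADED", "MODEL_EXECUTION_COMPLETED", "COST_ALLOCATION",
])
```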

Infrastructure Events¶

VM_PROVISIONED → VM_METRICS_SNAPSHOT (continuous) → VM_DEPROVISIONED

Checkpoint Management Flow¶

MODEL_EXECUTION_STARTED → CHECKPOINT_LOADED → CHECKPOINT_CACHED → (used in)

Event Categories¶

Model Execution Events (Critical Business Logic)¶

Aspect	Details
Business Impact	Direct revenue/cost impact, user experience
Volume	High - 1000s per day
Dependencies	Requires VM and checkpoint events
Analytics Use	Performance optimization, cost analysis, user behavior
SLA	< 5 second processing latency

Checkpoint Management Events (Performance Optimization)¶

Aspect	Details
Business Impact	Performance optimization, cost reduction
Volume	Medium - roughly 1,500 per day (1,200 checkpoint loads plus 300 cache operations in the volume estimates)
Dependencies	Linked to model execution events
Analytics Use	Cache optimization, storage planning, cost reduction
SLA	< 2 second processing latency

Infrastructure Events (Operations & Capacity)¶

Aspect	Details
Business Impact	Operational efficiency, capacity planning
Volume	Very High - VM_METRICS_SNAPSHOT every 30s
Dependencies	Foundation for all other events
Analytics Use	Resource optimization, scaling decisions, cost allocation
SLA	< 1 second processing latency for metrics

Job Lifecycle Events (Queue Management)¶

Aspect	Details
Business Impact	User experience, queue optimization
Volume	High - matches job volume
Dependencies	Precedes model execution events
Analytics Use	Queue optimization, capacity planning, user experience
SLA	< 1 second processing latency

Financial Events (Business Intelligence)¶

Aspect	Details
Business Impact	Direct revenue/cost tracking, billing
Volume	High - one per completed job
Dependencies	Requires all execution and infrastructure events
Analytics Use	Cost optimization, chargeback, ROI analysis
SLA	< 10 second processing latency

System Health Events (Reliability)¶

Aspect	Details
Business Impact	System reliability, uptime
Volume	Low - only when thresholds exceeded
Dependencies	Based on infrastructure metrics
Analytics Use	Performance monitoring, capacity planning, incident response
SLA	< 1 second processing latency
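The per-category SLAs above lend themselves to a simple compliance report. This is a sketch under stated assumptions: latency is taken to be the delay between an event's timestamp and the moment it finishes processing, and the Job Lifecycle target is treated as an upper bound of 1 second like the others. The function names are hypothetical.

```python
from collections import defaultdict

# Processing-latency targets in seconds, one per event category.
SLA_SECONDS = {
    "Model Execution": 5.0,
    "Checkpoint Management": 2.0,
    "Infrastructure": 1.0,
    "Job Lifecycle": 1.0,
    "Financial": 10.0,
    "System Health": 1.0,
}


def meets_sla(category: str, latency_seconds: float) -> bool:
    """True when the observed latency is strictly under the category target."""
    return latency_seconds < SLA_SECONDS[category]


def compliance_by_category(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Fraction of samples per category that met their SLA."""
    within: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for category, latency in samples:
        total[category] += 1
        within[category] += meets_sla(category, latency)
    return {category: within[category] / total[category] for category in total}


samples = [
    ("Model Execution", 1.2),
    ("Model Execution", 6.0),   # misses the 5 second target
    ("Financial", 9.5),
    ("System Health", 0.4),
]
report = compliance_by_category(samples)
```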

Event Processing Patterns¶

Real-time Processing¶

MODEL_EXECUTION events → Dashboard updates, alerting
ALERT_TRIGGERED → Immediate notification systems
VM_METRICS_SNAPSHOT → Real-time monitoring dashboards

Batch Processing¶

COST_ALLOCATION → Daily/monthly billing reports
VM_PROVISIONED/DEPROVISIONED → Capacity planning analysis
CHECKPOINT_CACHED → Storage optimization reports

Stream Processing¶

JOB_QUEUED/ASSIGNED → Queue depth monitoring
MODEL_EXECUTION → Performance trend analysis
CHECKPOINT_LOADED → Cache hit rate calculations
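To make the last stream-processing item concrete, here is a minimal sketch of a rolling cache hit rate computed from CHECKPOINT_LOADED events. The count-based window and the `CacheHitRate` class are assumptions for illustration; the source does not say which stream processor or window definition is used.

```python
from collections import deque


class CacheHitRate:
    """Rolling hit rate over the most recent `window` CHECKPOINT_LOADED events."""

    def __init__(self, window: int = 100):
        self._hits: deque[bool] = deque(maxlen=window)

    def observe(self, event: dict) -> None:
        # Other event types pass through without affecting the rate.
        if event["eventType"] == "CHECKPOINT_LOADED":
            self._hits.append(bool(event["metadata"]["cacheHit"]))

    def rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0


tracker = CacheHitRate(window=4)
for hit in [True, False, True, True, True]:
    tracker.observe({"eventType": "CHECKPOINT_LOADED", "metadata": {"cacheHit": hit}})
tracker.observe({"eventType": "JOB_QUEUED", "metadata": {}})
current_rate = tracker.rate()
```

With a window of four, the oldest observation falls out, leaving one miss and three hits.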

Data Volume Estimates¶

Event Type	Daily Volume	Weekly Volume	Monthly Volume	Storage per Event
MODEL_EXECUTION_STARTED	5,000	35,000	150,000	2KB
MODEL_EXECUTION_COMPLETED	4,850	33,950	145,500	3KB
MODEL_EXECUTION_FAILED	150	1,050	4,500	2.5KB
CHECKPOINT_LOADED	1,200	8,400	36,000	1.5KB
CHECKPOINT_CACHED	300	2,100	9,000	1KB
VM_METRICS_SNAPSHOT	2,880,000	20,160,000	86,400,000	4KB
VM_PROVISIONED	50	350	1,500	1KB
VM_DEPROVISIONED	50	350	1,500	1.5KB
JOB_QUEUED	5,000	35,000	150,000	1.5KB
JOB_ASSIGNED	5,000	35,000	150,000	1KB
COST_ALLOCATION	4,850	33,950	145,500	2.5KB
ALERT_TRIGGERED	20	140	600	2KB

Total Daily Storage: ~11.5GB/day
Total Monthly Storage: ~350GB/month
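The totals can be reproduced from the table. The sketch below multiplies each daily volume by its per-event size, assuming decimal units (1 GB = 1,000,000 KB) and a 30-day month; it comes out at about 11.57 GB/day and 347 GB/month, in line with the rounded figures. Note that at one snapshot every 30 seconds a single VM emits 2,880 snapshots per day, so the 2,880,000 daily figure corresponds to roughly 1,000 reporting VMs.

```python
# (daily volume, storage per event in KB), copied from the table above.
ESTIMATES = {
    "MODEL_EXECUTION_STARTED": (5_000, 2.0),
    "MODEL_EXECUTION_COMPLETED": (4_850, 3.0),
    "MODEL_EXECUTION_FAILED": (150, 2.5),
    "CHECKPOINT_LOADED": (1_200, 1.5),
    "CHECKPOINT_CACHED": (300, 1.0),
    "VM_METRICS_SNAPSHOT": (2_880_000, 4.0),
    "VM_PROVISIONED": (50, 1.0),
    "VM_DEPROVISIONED": (50, 1.5),
    "JOB_QUEUED": (5_000, 1.5),
    "JOB_ASSIGNED": (5_000, 1.0),
    "COST_ALLOCATION": (4_850, 2.5),
    "ALERT_TRIGGERED": (20, 2.0),
}

daily_kb = sum(count * size_kb for count, size_kb in ESTIMATES.values())
daily_gb = daily_kb / 1_000_000          # decimal GB
monthly_gb = daily_gb * 30               # 30-day month

snapshot_count, snapshot_kb = ESTIMATES["VM_METRICS_SNAPSHOT"]
snapshot_share = snapshot_count * snapshot_kb / daily_kb

print(f"{daily_gb:.2f} GB/day, {monthly_gb:.0f} GB/month, "
      f"snapshots {snapshot_share:.1%} of volume")
```

VM_METRICS_SNAPSHOT accounts for more than 99% of the bytes, so snapshot interval and payload size are the levers that matter for storage planning.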

Event Types & Metadata¶

1. Model Execution Events¶

MODEL_EXECUTION_STARTED¶

{
  "eventType": "MODEL_EXECUTION_STARTED",
  "timestamp": "2025-08-04T10:30:00Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "modelName": "Stable Diffusion XL",
    "modelVersion": "v1.0.5",
    "modelHash": "abc123def456",
    "modelType": "text-to-image",
    "modelSizeMB": 6900,
    "checkpointId": "protovision-xl-v6.6",
    "checkpointName": "ProtoVision XL HighFidelity 3D",
    "checkpointVersion": "v6.6.0",
    "checkpointHash": "xyz789abc123",
    "checkpointSizeMB": 3840,
    "checkpointSource": "civitai",
    "checkpointLoadingTimeMs": 2100,
    "runType": "inference",
    "vmId": "vm-instance-001",
    "nodeId": "k8s-node-01",
    "executionParameters": {
      "prompt": "a beautiful photograph of a landscape",
      "negative_prompt": "low quality, blurry, bad anatomy",
      "width": 1024,
      "height": 1024,
      "num_inference_steps": 20,
      "guidance_scale": 5.0,
      "sampler": "euler",
      "scheduler": "karras"
    },
    "inputSizeBytes": 245,
    "expectedOutputSizeBytes": 8388608,
    "checkpointCacheHit": false
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}

MODEL_EXECUTION_COMPLETED¶

{
  "eventType": "MODEL_EXECUTION_COMPLETED",
  "timestamp": "2025-08-04T10:30:15Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "status": "success",
    "checkpointLoadingTimeMs": 2100,
    "modelInitializationTimeMs": 400,
    "inferenceTimeMs": 12500,
    "totalExecutionTimeMs": 15000,
    "outputSizeBytes": 8324567,
    "memoryUsageMB": {
      "peak": 12800,
      "average": 11200,
      "checkpointOverhead": 3840
    },
    "gpuUtilization": {
      "peak": 98.5,
      "average": 92.3
    },
    "throughputItemsPerSecond": 0.067,
    "costBreakdown": {
      "computeCostUSD": 0.035,
      "checkpointLoadingCostUSD": 0.008,
      "storageCostUSD": 0.002,
      "totalCostUSD": 0.045
    },
    "performanceMetrics": {
      "stepsPerSecond": 1.6,
      "vramEfficiency": 87.3,
      "checkpointEfficiency": 94.1
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}
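In the example above the three phase timings add up to totalExecutionTimeMs (2100 + 400 + 12500 = 15000) and the cost components add up to totalCostUSD. A consumer could check both on ingest. The sketch below assumes these identities are meant to hold for every completed event, which the source does not state explicitly; `check_completed` is a hypothetical helper.

```python
import math


def check_completed(metadata: dict) -> list[str]:
    """Return a list of consistency problems found in a COMPLETED event's metadata."""
    problems = []

    phases = (
        metadata["checkpointLoadingTimeMs"]
        + metadata["modelInitializationTimeMs"]
        + metadata["inferenceTimeMs"]
    )
    if phases != metadata["totalExecutionTimeMs"]:
        problems.append("phase timings do not sum to totalExecutionTimeMs")

    costs = metadata["costBreakdown"]
    parts = sum(value for name, value in costs.items() if name != "totalCostUSD")
    # Compare with a tolerance because the costs are floating-point values.
    if not math.isclose(parts, costs["totalCostUSD"], abs_tol=1e-9):
        problems.append("cost components do not sum to totalCostUSD")

    return problems


sample = {
    "checkpointLoadingTimeMs": 2100,
    "modelInitializationTimeMs": 400,
    "inferenceTimeMs": 12500,
    "totalExecutionTimeMs": 15000,
    "costBreakdown": {
        "computeCostUSD": 0.035,
        "checkpointLoadingCostUSD": 0.008,
        "storageCostUSD": 0.002,
        "totalCostUSD": 0.045,
    },
}
problems = check_completed(sample)
```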

MODEL_EXECUTION_FAILED¶

{
  "eventType": "MODEL_EXECUTION_FAILED",
  "timestamp": "2025-08-04T10:30:08Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "sessionId": "session-789",
  "source": "spi-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "status": "failed",
    "failureStage": "checkpoint_loading", // "checkpoint_loading", "model_init", "inference"
    "checkpointLoadingTimeMs": 1800,
    "executionTimeMs": 8000,
    "errorType": "CHECKPOINT_CORRUPTION",
    "errorCode": "E2003",
    "errorMessage": "Checkpoint file corrupted: invalid tensor dimensions",
    "stackTrace": "...",
    "retryCount": 2,
    "checkpointFallbackUsed": true,
    "fallbackCheckpointId": "stable-diffusion-xl-base",
    "costBreakdown": {
      "computeCostUSD": 0.015,
      "checkpointLoadingCostUSD": 0.006,
      "totalCostUSD": 0.021
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal",
    "environment": "production",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}

2. Checkpoint Management Events¶

CHECKPOINT_LOADED¶

{
  "eventType": "CHECKPOINT_LOADED",
  "timestamp": "2025-08-04T10:29:58Z",
  "tenantId": "company-123",
  "source": "checkpoint-manager",
  "metadata": {
    "checkpointId": "protovision-xl-v6.6",
    "checkpointName": "ProtoVision XL HighFidelity 3D",
    "checkpointVersion": "v6.6.0",
    "checkpointHash": "xyz789abc123",
    "checkpointSizeMB": 3840,
    "loadingTimeMs": 2100,
    "sourceLocation": "s3://checkpoints/stable-diffusion/",
    "cacheHit": false,
    "vmId": "vm-instance-001",
    "modelId": "stable-diffusion-xl",
    "loadingMethod": "direct_download", // "cache_hit", "direct_download", "preloaded"
    "compressionRatio": 0.73,
    "verificationPassed": true
  },
  "tags": {
    "environment": "production",
    "checkpointCategory": "photorealistic",
    "modelFamily": "stable-diffusion"
  }
}

CHECKPOINT_CACHED¶

{
  "eventType": "CHECKPOINT_CACHED",
  "timestamp": "2025-08-04T10:32:05Z",
  "tenantId": "company-123",
  "source": "checkpoint-cache",
  "metadata": {
    "checkpointId": "protovision-xl-v6.6",
    "checkpointSizeMB": 3840,
    "vmId": "vm-instance-001",
    "cacheAction": "stored", // "stored", "evicted", "preloaded"
    "cacheUsagePercent": 78.5,
    "cacheHitRate": 0.87,
    "evictedCheckpoints": ["old-checkpoint-v1.2"],
    "retentionPolicyApplied": "lru"
  },
  "tags": {
    "environment": "production",
    "cacheStrategy": "lru"
  }
}

3. VM Performance Events¶

VM_METRICS_SNAPSHOT¶

{
  "eventType": "VM_METRICS_SNAPSHOT",
  "timestamp": "2025-08-04T10:30:00Z",
  "tenantId": "company-123",
  "source": "prometheus-agent",
  "metadata": {
    "vmId": "vm-instance-001",
    "instanceType": "g4dn.2xlarge",
    "zone": "us-west-2a",
    "cpuMetrics": {
      "utilizationPercent": 75.2,
      "coreCount": 8,
      "loadAverage": 4.2
    },
    "gpuMetrics": {
      "utilizationPercent": 92.3,
      "memoryUtilizationPercent": 85.7,
      "memoryTotalMB": 16384,
      "memoryUsedMB": 14031,
      "temperatureCelsius": 78
    },
    "memoryMetrics": {
      "utilizationPercent": 68.4,
      "totalMB": 32768,
      "usedMB": 22420,
      "availableMB": 10348
    },
    "diskMetrics": {
      "utilizationPercent": 45.2,
      "totalGB": 500,
      "usedGB": 226,
      "readIOPS": 120,
      "writeIOPS": 85,
      "readThroughputMBps": 45.2,
      "writeThroughputMBps": 23.1
    },
    "networkMetrics": {
      "inboundMBps": 12.5,
      "outboundMBps": 8.7,
      "packetsInPerSec": 1250,
      "packetsOutPerSec": 980,
      "packetLossPercent": 0.01
    }
  },
  "tags": {
    "environment": "production",
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}

VM_PROVISIONED¶

{
  "eventType": "VM_PROVISIONED",
  "timestamp": "2025-08-04T10:25:00Z",
  "tenantId": "company-123",
  "source": "vm-manager",
  "metadata": {
    "vmId": "vm-instance-001",
    "instanceType": "g4dn.2xlarge",
    "zone": "us-west-2a",
    "provisioningTimeMs": 45000,
    "costPerHourUSD": 0.752,
    "requestedBy": "auto-scaler",
    "reason": "high_queue_depth"
  },
  "tags": {
    "environment": "production",
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}

VM_DEPROVISIONED¶

{
  "eventType": "VM_DEPROVISIONED", 
  "timestamp": "2025-08-04T12:30:00Z",
  "tenantId": "company-123",
  "source": "vm-manager",
  "metadata": {
    "vmId": "vm-instance-001",
    "uptimeMinutes": 125,
    "totalCostUSD": 1.567,
    "reason": "idle_timeout",
    "jobsCompleted": 23,
    "utilizationSummary": {
      "avgCpuPercent": 42.1,
      "avgGpuPercent": 67.3,
      "avgMemoryPercent": 55.8
    }
  },
  "tags": {
    "environment": "production", 
    "region": "us-west-2",
    "costCenter": "ai-infrastructure"
  }
}
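The figures in the two VM lifecycle examples are consistent with simple proration: 125 minutes at 0.752 USD per hour gives 1.567 USD. The sketch below assumes per-minute proration of the hourly rate, which matches the example numbers but is not confirmed as the billing rule; `vm_total_cost_usd` is a hypothetical helper.

```python
def vm_total_cost_usd(uptime_minutes: float, cost_per_hour_usd: float) -> float:
    """ASSUMPTION: cost is the hourly rate prorated by uptime, rounded to 3 decimals."""
    return round(uptime_minutes / 60 * cost_per_hour_usd, 3)


# uptimeMinutes from VM_DEPROVISIONED, costPerHourUSD from VM_PROVISIONED.
cost = vm_total_cost_usd(uptime_minutes=125, cost_per_hour_usd=0.752)
```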

3. Job Lifecycle Events¶

JOB_QUEUED¶

{
  "eventType": "JOB_QUEUED",
  "timestamp": "2025-08-04T10:29:45Z",
  "tenantId": "company-123",
  "userId": "user-456",
  "source": "job-scheduler",
  "metadata": {
    "jobId": "job-12345",
    "priority": "normal",
    "queueName": "text-to-image",
    "estimatedDurationMs": 15000,
    "requiredResources": {
      "cpuCores": 2,
      "memoryMB": 8192,
      "gpuMemoryMB": 12288
    },
    "queuePosition": 3,
    "queueDepth": 8
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "priority": "normal"
  }
}

JOB_ASSIGNED¶

{
  "eventType": "JOB_ASSIGNED",
  "timestamp": "2025-08-04T10:29:55Z", 
  "tenantId": "company-123",
  "userId": "user-456",
  "source": "job-scheduler",
  "metadata": {
    "jobId": "job-12345",
    "vmId": "vm-instance-001",
    "queueTimeMs": 10000,
    "assignmentReason": "best_fit"
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025"
  }
}
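The queueTimeMs value in JOB_ASSIGNED matches the gap between the two example timestamps (10:29:45 to 10:29:55). A minimal sketch of that derivation follows, assuming queue time is defined as the difference between the JOB_ASSIGNED and JOB_QUEUED timestamps for the same jobId; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse the ISO-8601 UTC timestamps used in the example events."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def queue_time_ms(queued: dict, assigned: dict) -> int:
    """Milliseconds a job spent in the queue before assignment."""
    if queued["metadata"]["jobId"] != assigned["metadata"]["jobId"]:
        raise ValueError("events belong to different jobs")
    delta = parse_ts(assigned["timestamp"]) - parse_ts(queued["timestamp"])
    return int(delta.total_seconds() * 1000)


queued = {"timestamp": "2025-08-04T10:29:45Z", "metadata": {"jobId": "job-12345"}}
assigned = {"timestamp": "2025-08-04T10:29:55Z", "metadata": {"jobId": "job-12345"}}
waited = queue_time_ms(queued, assigned)
```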

4. Cost & Billing Events¶

COST_ALLOCATION¶

{
  "eventType": "COST_ALLOCATION",
  "timestamp": "2025-08-04T10:30:15Z",
  "tenantId": "company-123", 
  "userId": "user-456",
  "source": "billing-service",
  "metadata": {
    "jobId": "job-12345",
    "modelId": "stable-diffusion-xl",
    "checkpointId": "protovision-xl-v6.6",
    "costBreakdown": {
      "computeCostUSD": 0.025,
      "checkpointLoadingCostUSD": 0.008,
      "checkpointStorageCostUSD": 0.005,
      "networkTransferCostUSD": 0.004,
      "cachingCostUSD": 0.003,
      "totalCostUSD": 0.045
    },
    "resourceUsage": {
      "cpuHours": 0.0042,
      "gpuHours": 0.0042,
      "storageGBHours": 2.1,
      "networkGB": 0.85,
      "checkpointCacheHours": 0.0083
    },
    "billingPeriod": "2025-08",
    "allocationMethod": "direct",
    "checkpointUsageMetrics": {
      "loadingCycles": 1,
      "cacheHits": 0,
      "cacheMisses": 1,
      "retentionHours": 2.5
    }
  },
  "tags": {
    "department": "marketing",
    "project": "campaign-2025",
    "costCenter": "creative-ai",
    "billable": "true",
    "modelFamily": "stable-diffusion",
    "checkpointCategory": "photorealistic"
  }
}
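Because every COST_ALLOCATION event carries tags, chargeback reduces to a grouped sum. The sketch below totals totalCostUSD per tag value; skipping events whose billable tag is not "true" is an assumption about how that tag is meant to be used, and `chargeback` is a hypothetical helper.

```python
from collections import defaultdict


def chargeback(events: list[dict], tag: str = "department") -> dict[str, float]:
    """Sum totalCostUSD of billable COST_ALLOCATION events, grouped by a tag."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        if event["eventType"] != "COST_ALLOCATION":
            continue
        tags = event.get("tags", {})
        if tags.get("billable") != "true":   # ASSUMPTION: non-billable events are excluded
            continue
        group = tags.get(tag, "unassigned")
        totals[group] += event["metadata"]["costBreakdown"]["totalCostUSD"]
    # Round to suppress floating-point noise in the report.
    return {group: round(total, 6) for group, total in totals.items()}


def cost_event(department: str, total: float, billable: str = "true") -> dict:
    """Build a trimmed COST_ALLOCATION event for the example."""
    return {
        "eventType": "COST_ALLOCATION",
        "metadata": {"costBreakdown": {"totalCostUSD": total}},
        "tags": {"department": department, "billable": billable},
    }


report = chargeback([
    cost_event("marketing", 0.045),
    cost_event("marketing", 0.045),
    cost_event("engineering", 0.021),
    cost_event("marketing", 0.100, billable="false"),
])
```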

5. System Health Events¶

ALERT_TRIGGERED¶

{
  "eventType": "ALERT_TRIGGERED",
  "timestamp": "2025-08-04T10:35:00Z",
  "tenantId": "system",
  "source": "monitoring-service",
  "metadata": {
    "alertId": "high-gpu-utilization-001",
    "alertType": "RESOURCE_THRESHOLD",
    "severity": "warning",
    "description": "GPU utilization > 95% for 5 minutes",
    "vmId": "vm-instance-001", 
    "currentValue": 97.2,
    "thresholdValue": 95.0,
    "duration": 300000
  },
  "tags": {
    "environment": "production",
    "component": "compute"
  }
}
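The example alert describes a sustained condition: GPU utilization above 95% for 5 minutes (300000 ms). One way such a rule could be evaluated over VM_METRICS_SNAPSHOT readings is sketched below. The evaluation logic and function names are assumptions; the source does not describe how monitoring-service implements its rules.

```python
from typing import Optional


def sustained_breach_ms(samples: list[tuple[int, float]], threshold: float) -> int:
    """Length in ms of the trailing run of samples above threshold.

    `samples` holds (epoch_ms, value) pairs sorted by time.
    """
    run_start: Optional[int] = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
        else:
            run_start = None   # any dip resets the run
    return samples[-1][0] - run_start if run_start is not None else 0


def evaluate_alert(samples: list[tuple[int, float]],
                   threshold: float = 95.0,
                   duration_ms: int = 300_000) -> Optional[dict]:
    """Return ALERT_TRIGGERED-style metadata if the breach lasted long enough."""
    breached = sustained_breach_ms(samples, threshold)
    if breached < duration_ms:
        return None
    return {
        "alertType": "RESOURCE_THRESHOLD",
        "severity": "warning",
        "currentValue": samples[-1][1],
        "thresholdValue": threshold,
        "duration": breached,
    }


# Eleven snapshots, 30 seconds apart, all at 97.2% GPU utilization.
steady = [(i * 30_000, 97.2) for i in range(11)]
alert = evaluate_alert(steady)

# The same series with one dip in the middle never reaches five minutes.
dipped = [(ts, 90.0 if i == 5 else v) for i, (ts, v) in enumerate(steady)]
no_alert = evaluate_alert(dipped)
```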

Event Topics/Queues Organization¶

Kafka Topics Structure¶

Topics:
  model-execution:
    partitions: 12
    retention: 7 days
    key: modelId + userId

  vm-performance:
    partitions: 8
    retention: 30 days
    key: vmId

  job-lifecycle:
    partitions: 6
    retention: 7 days
    key: jobId

  cost-billing:
    partitions: 4
    retention: 365 days
    key: tenantId + userId

  system-alerts:
    partitions: 2
    retention: 30 days
    key: alertType
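The sketch below shows how an event might be routed to one of these topics and how its partition key could be built. Several parts are assumptions: the mapping from event type to topic by prefix, the ":" separator for composite keys, and the use of CRC32 as a stand-in hash (Kafka's Java client hashes keyed records with murmur2 by default). No topic is listed above for CHECKPOINT_* events, so the sketch leaves them unmapped.

```python
import zlib

# Partition counts and key fields copied from the topic structure above.
TOPICS = {
    "model-execution": {"partitions": 12, "key": ("modelId", "userId")},
    "vm-performance": {"partitions": 8, "key": ("vmId",)},
    "job-lifecycle": {"partitions": 6, "key": ("jobId",)},
    "cost-billing": {"partitions": 4, "key": ("tenantId", "userId")},
    "system-alerts": {"partitions": 2, "key": ("alertType",)},
}

# ASSUMPTION: event types map to topics by name prefix.
PREFIX_TO_TOPIC = {
    "MODEL_EXECUTION_": "model-execution",
    "VM_": "vm-performance",
    "JOB_": "job-lifecycle",
    "COST_": "cost-billing",
    "ALERT_": "system-alerts",
}


def topic_for(event_type: str) -> str:
    for prefix, topic in PREFIX_TO_TOPIC.items():
        if event_type.startswith(prefix):
            return topic
    raise KeyError(f"no topic listed for {event_type}")


def field_value(event: dict, name: str) -> str:
    # Envelope fields (tenantId, userId) are top-level; the rest sit under metadata.
    return str(event.get(name, event["metadata"].get(name, "")))


def route(event: dict) -> tuple[str, str, int]:
    """Return (topic, partition key, partition index) for an event."""
    topic = topic_for(event["eventType"])
    spec = TOPICS[topic]
    key = ":".join(field_value(event, name) for name in spec["key"])
    # Stand-in hash; a real Kafka producer would apply its own partitioner.
    partition = zlib.crc32(key.encode("utf-8")) % spec["partitions"]
    return topic, key, partition


started = {
    "eventType": "MODEL_EXECUTION_STARTED",
    "tenantId": "company-123",
    "userId": "user-456",
    "metadata": {"jobId": "job-12345", "modelId": "stable-diffusion-xl"},
}
topic, key, partition = route(started)
```

Keying by modelId and userId keeps one user's runs of a given model on a single partition, which preserves their relative order for downstream consumers.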

ClickHouse Schema Design¶

Model Execution Table¶

CREATE TABLE model_executions (
    event_time DateTime64,
    tenant_id String,
    user_id String,
    job_id String,
    model_id String,
    model_name String,
    model_version String,
    model_type String,
    checkpoint_id String,
    checkpoint_name String,
    checkpoint_version String,
    checkpoint_hash String,
    status String,
    checkpoint_loading_time_ms UInt32,
    model_initialization_time_ms UInt32,
    inference_time_ms UInt32,
    total_execution_time_ms UInt32,
    checkpoint_cache_hit Boolean,
    cost_total_usd Float64,
    cost_compute_usd Float64,
    cost_checkpoint_loading_usd Float64,
    memory_peak_mb Float32,
    gpu_utilization_avg Float32,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, tenant_id, model_id, checkpoint_id)
PARTITION BY toYYYYMM(event_time);
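Loading this table means flattening the nested event JSON into columns. A minimal sketch of that mapping for a MODEL_EXECUTION_COMPLETED event follows. The function name and the choice of defaults for missing fields are assumptions; columns not populated here (for example model_name) would fall back to their ClickHouse defaults, and the insert call itself is omitted because the source does not name a client library.

```python
import json


def to_model_execution_row(event: dict) -> dict:
    """Flatten an execution event into a dict keyed by model_executions columns."""
    md = event["metadata"]
    cost = md.get("costBreakdown", {})
    return {
        "event_time": event["timestamp"],
        "tenant_id": event["tenantId"],
        "user_id": event.get("userId", ""),
        "job_id": md["jobId"],
        "model_id": md["modelId"],
        "checkpoint_id": md["checkpointId"],
        "status": md.get("status", ""),
        "checkpoint_loading_time_ms": md.get("checkpointLoadingTimeMs", 0),
        "model_initialization_time_ms": md.get("modelInitializationTimeMs", 0),
        "inference_time_ms": md.get("inferenceTimeMs", 0),
        "total_execution_time_ms": md.get("totalExecutionTimeMs", 0),
        "cost_total_usd": cost.get("totalCostUSD", 0.0),
        "cost_compute_usd": cost.get("computeCostUSD", 0.0),
        "cost_checkpoint_loading_usd": cost.get("checkpointLoadingCostUSD", 0.0),
        "memory_peak_mb": md.get("memoryUsageMB", {}).get("peak", 0.0),
        "gpu_utilization_avg": md.get("gpuUtilization", {}).get("average", 0.0),
        # Keep the full payload so fields without a dedicated column stay queryable.
        "metadata": json.dumps(md),
        "tags": json.dumps(event.get("tags", {})),
    }


completed = {
    "eventType": "MODEL_EXECUTION_COMPLETED",
    "timestamp": "2025-08-04T10:30:15Z",
    "tenantId": "company-123",
    "userId": "user-456",
    "metadata": {
        "jobId": "job-12345",
        "modelId": "stable-diffusion-xl",
        "checkpointId": "protovision-xl-v6.6",
        "status": "success",
        "inferenceTimeMs": 12500,
        "totalExecutionTimeMs": 15000,
        "memoryUsageMB": {"peak": 12800},
        "gpuUtilization": {"average": 92.3},
        "costBreakdown": {"computeCostUSD": 0.035, "totalCostUSD": 0.045},
    },
    "tags": {"department": "marketing"},
}
row = to_model_execution_row(completed)
```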

Checkpoint Operations Table¶

CREATE TABLE checkpoint_operations (
    event_time DateTime64,
    tenant_id String,
    checkpoint_id String,
    checkpoint_name String,
    checkpoint_version String,
    checkpoint_size_mb UInt32,
    operation_type String, -- 'loaded', 'cached', 'evicted'
    loading_time_ms UInt32,
    vm_id String,
    model_id String,
    cache_hit Boolean,
    source_location String,
    cost_usd Float64,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, checkpoint_id, vm_id)
PARTITION BY toYYYYMM(event_time);

VM Performance Table¶

CREATE TABLE vm_performance (
    event_time DateTime64,
    vm_id String,
    instance_type String,
    cpu_utilization Float32,
    gpu_utilization Float32,
    memory_utilization Float32,
    disk_utilization Float32,
    cost_per_hour_usd Float64,
    active_model_id String,
    active_checkpoint_id String,
    checkpoint_cache_usage_mb UInt32,
    metadata JSON,
    tags JSON
) ENGINE = MergeTree()
ORDER BY (event_time, vm_id)
PARTITION BY toYYYYMM(event_time);

Event Processing Pipeline¶

---
config:
  layout: elk
---
graph LR
  A[Application/XUMI] --> B[Event Publisher]
  B --> C[Kafka/Event Hubs]
  C --> D[Stream Processor]
  D --> E[ClickHouse]
  D --> F[PostgreSQL]
  E --> G[Grafana]
  F --> H[API Service]
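The fan-out step in the pipeline, where the stream processor writes each event to both stores, can be sketched with in-memory stand-ins. The `StreamProcessor` class and the list-backed sinks are hypothetical; real sinks would wrap ClickHouse and PostgreSQL clients, which the source does not specify.

```python
from typing import Callable

Sink = Callable[[dict], None]


class StreamProcessor:
    """Delivers every consumed event to each registered sink."""

    def __init__(self, sinks: list[Sink]):
        self._sinks = sinks

    def process(self, event: dict) -> None:
        for sink in self._sinks:
            sink(event)


# In-memory stand-ins for the analytical and operational stores.
clickhouse_rows: list[dict] = []
postgres_rows: list[dict] = []

processor = StreamProcessor([clickhouse_rows.append, postgres_rows.append])
processor.process({"eventType": "JOB_QUEUED", "metadata": {"jobId": "job-12345"}})
```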

This event-based design provides:

Rich Context: Every event contains comprehensive metadata
Flexible Querying: Events can be aggregated and filtered by any metadata field
Real-time Processing: Events are processed as they occur
Historical Analysis: All events are stored for trend analysis
Cost Attribution: Every action is tied to users, projects, and costs