Telemetry Scenarios

Predefined User Roles

Pipeline TD (Technical Director): Manages ML pipelines, monitors model performance, and optimizes execution
Admin: System administrator who manages infrastructure, VMs, user audits, and system health
Financial Director: Tracks costs, resource utilization, and billing across departments

User Stories by Role

Pipeline TD User Stories

Model Performance & Execution

📙 As a Pipeline TD, I want to see real-time model execution metrics, so that I can monitor inference latency and throughput across different model types

📙 As a Pipeline TD, I want to track model success/failure rates by model version AND checkpoint version, so that I can quickly identify problematic model/checkpoint combinations

📙 As a Pipeline TD, I want to compare inference times between different model versions and checkpoint combinations, so that I can validate performance improvements and identify optimal configurations

📙 As a Pipeline TD, I want to monitor GPU memory utilization during model execution broken down by base model and applied checkpoint, so that I can optimize resource allocation for specific combinations

📙 As a Pipeline TD, I want to see checkpoint loading times vs model initialization times vs inference times, so that I can identify bottlenecks in the pipeline and optimize checkpoint switching

📙 As a Pipeline TD, I want to track input/output sizes and their correlation with execution time per model/checkpoint combination, so that I can optimize request handling for specific configurations

📙 As a Pipeline TD, I want to monitor error types and frequencies by model AND checkpoint, so that I can prioritize bug fixes and identify problematic checkpoint versions

📙 As a Pipeline TD, I want to see which checkpoint versions our users run most often for each base model, so that I can focus optimization efforts on high-usage combinations

📙 As a Pipeline TD, I want to track checkpoint switching frequency and performance impact, so that I can optimize checkpoint caching and loading strategies
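
Most of the stories above need every execution record to carry both the model version and the checkpoint version, with the load, initialization, and inference phases timed separately. The following is a minimal sketch of such a record and a per-combination summary. The `ExecutionEvent` class, its field names, and the `summarize` helper are all hypothetical and are not part of any existing telemetry schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ExecutionEvent:
    """One model execution. All field names are hypothetical."""
    model_version: str
    checkpoint_version: str
    load_seconds: float       # checkpoint loading time
    init_seconds: float       # model initialization time
    inference_seconds: float  # inference time
    succeeded: bool


def summarize(events):
    """Group events by (model, checkpoint) and compute per-combination stats."""
    groups = defaultdict(list)
    for event in events:
        groups[(event.model_version, event.checkpoint_version)].append(event)

    summary = {}
    for key, items in groups.items():
        summary[key] = {
            "runs": len(items),
            "success_rate": sum(e.succeeded for e in items) / len(items),
            "mean_load_seconds": mean(e.load_seconds for e in items),
            "mean_init_seconds": mean(e.init_seconds for e in items),
            "mean_inference_seconds": mean(e.inference_seconds for e in items),
        }
    return summary
```

Keeping the three phase timings as separate fields is what lets a dashboard show whether checkpoint loading or inference dominates for a given combination. The `runs` count doubles as a simple popularity measure per checkpoint.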

Job Management

📙 As a Pipeline TD, I want to see job queue times and processing times broken down by model/checkpoint combinations, so that I can identify capacity constraints for specific configurations

📙 As a Pipeline TD, I want to track job retries and failures by user/application AND by model/checkpoint combination, so that I can identify problematic integrations and configurations

📙 As a Pipeline TD, I want to monitor concurrent job execution across VMs with visibility into which model/checkpoint combinations are running, so that I can optimize job scheduling and resource allocation

📙 As a Pipeline TD, I want to see checkpoint cache hit/miss rates and loading times, so that I can optimize checkpoint storage and caching strategies
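
As an illustration of the cache hit/miss story, the sketch below derives a hit rate per checkpoint from a stream of lookup records. The `(checkpoint_version, was_hit)` pair shape and the `cache_hit_rates` name are assumptions made for this example only.

```python
from collections import Counter


def cache_hit_rates(lookups):
    """Return the cache hit rate per checkpoint.

    lookups: iterable of (checkpoint_version, was_hit) pairs (hypothetical shape).
    """
    hits = Counter()
    totals = Counter()
    for checkpoint, was_hit in lookups:
        totals[checkpoint] += 1
        if was_hit:
            hits[checkpoint] += 1
    # Every key in totals has at least one lookup, so the division is safe.
    return {checkpoint: hits[checkpoint] / totals[checkpoint] for checkpoint in totals}
```

A checkpoint with a low hit rate and a high lookup count is a likely candidate for pinning in the cache.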

Admin User Stories

Infrastructure Management

📙 As an Admin, I want to monitor VM provisioning and deprovisioning times, so that I can optimize infrastructure scaling

📙 As an Admin, I want to see real-time CPU, GPU, RAM, and disk utilization across all VMs, so that I can prevent resource exhaustion

📙 As an Admin, I want to track VM uptime and availability metrics, so that I can ensure SLA compliance

📙 As an Admin, I want to monitor network I/O and disk I/O performance, so that I can identify infrastructure bottlenecks

📙 As an Admin, I want to see idle VM time vs active time, so that I can optimize resource allocation and reduce waste

📙 As an Admin, I want to track system health metrics and receive alerts for anomalies, so that I can proactively address issues

Capacity Planning

📙 As an Admin, I want to see historical resource usage trends broken down by model and checkpoint combinations, so that I can plan for future capacity needs based on actual usage patterns

📙 As an Admin, I want to monitor request rates per user/application AND per model/checkpoint combination, so that I can identify usage patterns and plan scaling accordingly

📙 As an Admin, I want to track VM instance types and their utilization by model/checkpoint workload, so that I can optimize instance selection for different AI workloads

📙 As an Admin, I want to monitor checkpoint storage usage and access patterns, so that I can optimize checkpoint storage infrastructure and implement efficient caching strategies

📙 As an Admin, I want to see checkpoint loading performance across different storage types and locations, so that I can optimize checkpoint distribution and storage architecture

Financial Director User Stories

Cost Tracking

📙 As a Financial Director, I want to see compute costs per model run broken down by base model AND checkpoint combination, so that I can understand the true cost of each AI operation configuration

📙 As a Financial Director, I want to track costs per user and per department with visibility into which model/checkpoint combinations they're using, so that I can implement accurate chargeback models and optimize usage

📙 As a Financial Director, I want to monitor GPU-hour and CPU-hour usage with associated costs by model/checkpoint combination, so that I can optimize our compute spending and identify cost-effective configurations

📙 As a Financial Director, I want to see storage costs per model and per checkpoint, so that I can optimize checkpoint retention policies and storage strategies

📙 As a Financial Director, I want to track idle VM costs vs productive VM costs, with context on which model/checkpoint combinations cause the most idle time, so that I can identify waste and optimize scheduling

📙 As a Financial Director, I want to monitor checkpoint loading costs and frequency, so that I can optimize checkpoint caching strategies to reduce unnecessary data transfer costs
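
One way to attribute compute cost to model/checkpoint combinations is to multiply metered GPU-hours and CPU-hours by flat hourly rates. The sketch below does that. The record keys, the `cost_by_combination` name, and the flat-rate pricing model are assumptions; real pricing may vary by instance type or commitment level.

```python
from collections import defaultdict


def cost_by_combination(usage_records, gpu_hour_rate, cpu_hour_rate):
    """Sum compute cost per (model, checkpoint) pair.

    usage_records: iterable of dicts with the hypothetical keys
    "model", "checkpoint", "gpu_hours", and "cpu_hours".
    """
    costs = defaultdict(float)
    for record in usage_records:
        key = (record["model"], record["checkpoint"])
        costs[key] += (
            record["gpu_hours"] * gpu_hour_rate
            + record["cpu_hours"] * cpu_hour_rate
        )
    return dict(costs)
```

Adding a `user` or `department` key to each record and to the grouping key would extend the same aggregation to chargeback reporting.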

Billing & Reporting

📙 As a Financial Director, I want to generate monthly cost reports by department and project with breakdowns by model/checkpoint usage, so that I can allocate costs accurately and provide detailed usage insights

📙 As a Financial Director, I want to see cost trends over time by service type AND by model/checkpoint combination, so that I can forecast future spending and optimize high-cost configurations

📙 As a Financial Director, I want to monitor cost per successful execution vs failed execution by model/checkpoint combination, so that I can understand the cost of errors and identify problematic configurations

📙 As a Financial Director, I want to track the most expensive models, checkpoints, and users, so that I can optimize high-cost operations and provide targeted cost reduction recommendations

📙 As a Financial Director, I want to see ROI analysis for different model/checkpoint combinations based on usage frequency and business value, so that I can make informed decisions about which configurations to optimize or deprecate
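
The cost-of-errors story can be made concrete with a small calculation that separates spend on failed executions and folds it into an effective cost per successful result. This is a sketch under assumptions: the `(cost, succeeded)` input shape and the `cost_of_errors` name are hypothetical, and it would be run once per model/checkpoint combination.

```python
def cost_of_errors(runs):
    """Split spend between successful and failed executions.

    runs: iterable of (cost, succeeded) pairs; this shape is an assumption.
    """
    success_cost = 0.0
    failed_cost = 0.0
    successes = 0
    for cost, succeeded in runs:
        if succeeded:
            success_cost += cost
            successes += 1
        else:
            failed_cost += cost
    total = success_cost + failed_cost
    return {
        "failed_cost": failed_cost,
        "failed_share": failed_cost / total if total else 0.0,
        # Effective cost of one good result, with wasted spend folded in.
        "cost_per_success": total / successes if successes else None,
    }
```

Returning `None` when there are no successes avoids reporting a misleading zero for a combination that has never produced a usable result.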