XUMI Example Templates

The SPI framework includes a variety of templates for different model domains:

Text Generation
Image Generation
Video Generation
Audio Processing
Multimodal Models

These templates provide starter code, manifest structures, and test cases appropriate for each domain to accelerate development of SPI-compliant models.

Example Text-to-Video (Generative Video) Model

The following example demonstrates an SPI manifest for a text-to-video generative model:

# manifest.yml for text-to-video model
version: "1.0"
model:
  name: "text-to-video-gen"
  version: "1.0.0"
  description: "Text-to-video generative model that creates high-quality videos from text prompts"
  domain: "video"
  author: "ICVR Research Team"
  organization: "ICVR"
  creation_date: "2025-04-23"
  license: "proprietary"
  tags: ["video-generation", "text-to-video", "diffusion"]
  domain_specific:
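    max_resolution: "1920x1080" # Matches the largest resolution accepted by video_config ("1080p")
    max_fps: 60 # Matches the frame_rate upper bound accepted by video_config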
    max_duration_seconds: 30
    supported_formats: ["mp4", "webm"]
    video_features:
      - generation
      - style_transfer

execution:
  entry_point: "python /app/run.py"
  runtime:
    cpu: 8
    memory: "32Gi"
    gpu: 2
    gpu_type: "nvidia-tesla-a100"
  env_variables:
    MODEL_PRECISION: "fp16"
    CACHE_DIR: "/app/cache"
    DIFFUSION_STEPS: "50"

inputs:
  - name: "prompt"
    type: "string"
    description: "Text prompt describing the video to generate"
    required: true
    constraints:
      min_length: 3
      max_length: 1000

  - name: "negative_prompt"
    type: "string"
    description: "Text description of elements to avoid in the video"
    required: false
    default: ""
    constraints:
      max_length: 1000

  - name: "video_config"
    type: "object"
    description: "Video generation configuration"
    required: false
    default:
      resolution: "720p"
      frame_rate: 30
      duration_seconds: 5.0
      motion_strength: 0.5
    properties:
      resolution:
        type: "string"
        enum: ["480p", "720p", "1080p"]
        default: "720p"
      frame_rate:
        type: "integer"
        default: 30
        constraints:
          min_value: 24
          max_value: 60
      duration_seconds:
        type: "float"
        default: 5.0
        constraints:
          min_value: 1.0
          max_value: 30.0
      motion_strength:
        type: "float"
        default: 0.5
        constraints:
          min_value: 0.1
          max_value: 1.0

  - name: "seed"
    type: "integer"
    description: "Random seed for deterministic generation; a negative value selects a random seed"
    required: false
    default: -1

  - name: "reference_image"
    type: "file"
    description: "Optional reference image to guide video style"
    required: false
    mime_types: ["image/jpeg", "image/png", "image/webp"]
    file_pattern: "*.{jpg,jpeg,png,webp}"
    mount_path: "/app/inputs/images"
    domain_specific:
      image:
        min_resolution: "256x256"
        max_resolution: "2048x2048"
        color_modes: ["RGB"]
        aspect_ratios: ["1:1", "16:9", "4:3"]

outputs:
  - name: "generated_video"
    type: "file"
    description: "Generated video output"
    file_pattern: "/app/outputs/generated_video.mp4"
    mime_types: ["video/mp4"]
    required: true
    domain_specific:
      video:
        resolution: "dynamic" # Based on input configuration
        fps: "dynamic" # Based on input configuration
        duration_seconds: "dynamic" # Based on input configuration
        codec: "h264"
        container: "mp4"

  - name: "preview_frames"
    type: "file"
    description: "Key frames from the generated video"
    file_pattern: "/app/outputs/frames/*.jpg"
    mime_types: ["image/jpeg"]
    required: false

  - name: "generation_metadata"
    type: "object"
    description: "Metadata about the generation process"
    required: true
    properties:
      actual_duration_seconds:
        type: "float"
        description: "Actual duration of generated video"
      frame_count:
        type: "integer"
        description: "Number of frames in generated video"
      generation_time_seconds:
        type: "float"
        description: "Time taken to generate the video"
      seed:
        type: "integer"
        description: "Actual seed used for generation"

validation:
  test_cases:
    - name: "basic_generation"
      inputs:
        prompt: "A serene mountain lake reflecting snow-capped peaks at sunset"
        video_config:
          resolution: "480p"
          duration_seconds: 3.0
      expected_outputs:
        generated_video:
          type: "non_empty_file"
        generation_metadata:
          type: "valid_object"

An implementation of the text-to-video model's entry point could begin as follows:

# run.py - Main entry point for model execution
import os
import json
import time
import torch
import numpy as np
from PIL import Image
import cv2
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from spi.utils import SPIRuntime

# Initialize SPI runtime
spi = SPIRuntime()

# Parse inputs
inputs = spi.parse_inputs()
prompt = inputs.get("prompt", "")
negative_prompt = inputs.get("negative_prompt", "")
video_config = inputs.get("video_config", {})
seed = inputs.get("seed", -1)

# Set random seed for reproducibility
if seed < 0:
    seed = torch.randint(0, 2**32 - 1, (1,)).item()
torch.manual_seed(seed)
np.random.seed(seed)

# Extract video configuration
resolution_map = {
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080)
}
resolution = resolution_map.get(video_config.get("resolution", "720p"), (1280, 720))
frame_rate = video_config.get("frame_rate", 30)
duration_seconds = video_config.get("duration_seconds", 5.0)
motion_strength = video_config.get("motion_strength", 0.5)
# Round rather than truncate so floating-point error cannot drop a frame
num_frames = int(round(frame_rate * duration_seconds))

# Check for reference image
reference_image_path = None
if "reference_image" in inputs and inputs["reference_image"]:
    reference_image_path = inputs["reference_image"][0]  # Take the first image if multiple

# Load text-to-video model
model