XUMI Example Templates
The SPI framework includes a variety of templates for different model domains:
- Text Generation
- Image Generation
- Video Generation
- Audio Processing
- Multimodal Models
These templates provide starter code, manifest structures, and test cases appropriate for each domain to accelerate development of SPI-compliant models.
Example Text-to-Video (Generative Video) Model
The following example demonstrates an SPI manifest for a text-to-video generative model:
# manifest.yml for text-to-video model
version: "1.0"
model:
  name: "text-to-video-gen"
  version: "1.0.0"
  description: "Text-to-video generative model that creates high-quality videos from text prompts"
  domain: "video"
  author: "ICVR Research Team"
  organization: "ICVR"
  creation_date: "2025-04-23"
  license: "proprietary"
  tags: ["video-generation", "text-to-video", "diffusion"]
  domain_specific:
    max_resolution: "1280x720"
    max_fps: 30
    max_duration_seconds: 30
    supported_formats: ["mp4", "webm"]
    video_features:
      - generation
      - style_transfer
execution:
  entry_point: "python /app/run.py"
  runtime:
    cpu: 8
    memory: "32Gi"
    gpu: 2
    gpu_type: "nvidia-tesla-a100"
  env_variables:
    MODEL_PRECISION: "fp16"
    CACHE_DIR: "/app/cache"
    DIFFUSION_STEPS: "50"
inputs:
  - name: "prompt"
    type: "string"
    description: "Text prompt describing the video to generate"
    required: true
    constraints:
      min_length: 3
      max_length: 1000
  - name: "negative_prompt"
    type: "string"
    description: "Text description of elements to avoid in the video"
    required: false
    default: ""
    constraints:
      max_length: 1000
  - name: "video_config"
    type: "object"
    description: "Video generation configuration"
    required: false
    default: {
      "resolution": "720p",
      "frame_rate": 30,
      "duration_seconds": 5.0,
      "motion_strength": 0.5
    }
    properties:
      resolution:
        type: "string"
        enum: ["480p", "720p", "1080p"]
        default: "720p"
      frame_rate:
        type: "integer"
        default: 30
        constraints:
          min_value: 24
          max_value: 60
      duration_seconds:
        type: "float"
        default: 5.0
        constraints:
          min_value: 1.0
          max_value: 30.0
      motion_strength:
        type: "float"
        default: 0.5
        constraints:
          min_value: 0.1
          max_value: 1.0
  - name: "seed"
    type: "integer"
    description: "Random seed for deterministic generation"
    required: false
    default: -1
  - name: "reference_image"
    type: "file"
    description: "Optional reference image to guide video style"
    required: false
    mime_types: ["image/jpeg", "image/png", "image/webp"]
    file_pattern: "*.{jpg,jpeg,png,webp}"
    mount_path: "/app/inputs/images"
    domain_specific:
      image:
        min_resolution: "256x256"
        max_resolution: "2048x2048"
        color_modes: ["RGB"]
        aspect_ratios: ["1:1", "16:9", "4:3"]
outputs:
  - name: "generated_video"
    type: "file"
    description: "Generated video output"
    file_pattern: "/app/outputs/generated_video.mp4"
    mime_types: ["video/mp4"]
    required: true
    domain_specific:
      video:
        resolution: "dynamic"        # Based on input configuration
        fps: "dynamic"               # Based on input configuration
        duration_seconds: "dynamic"  # Based on input configuration
        codec: "h264"
        container: "mp4"
  - name: "preview_frames"
    type: "file"
    description: "Key frames from the generated video"
    file_pattern: "/app/outputs/frames/*.jpg"
    mime_types: ["image/jpeg"]
    required: false
  - name: "generation_metadata"
    type: "object"
    description: "Metadata about the generation process"
    required: true
    properties:
      actual_duration_seconds:
        type: "float"
        description: "Actual duration of generated video"
      frame_count:
        type: "integer"
        description: "Number of frames in generated video"
      generation_time_seconds:
        type: "float"
        description: "Time taken to generate the video"
      seed:
        type: "integer"
        description: "Actual seed used for generation"
validation:
  test_cases:
    - name: "basic_generation"
      inputs:
        prompt: "A serene mountain lake reflecting snow-capped peaks at sunset"
        video_config:
          resolution: "480p"
          duration_seconds: 3.0
      expected_outputs:
        generated_video:
          type: "non_empty_file"
        generation_metadata:
          type: "valid_object"
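To make the video_config constraints in the manifest above concrete, here is a minimal sketch of how a model script might fill in defaults and check a caller-supplied configuration before generation. It is an illustration only: the helper validate_video_config and the VIDEO_CONFIG_SPEC table are hypothetical names, the constraint values are copied by hand from the manifest, and nothing here is claimed to be part of the SPI runtime, which may well perform this validation itself.

```python
from __future__ import annotations

from typing import Any

# Constraints copied by hand from the video_config properties in the manifest.
# Assumption: a real runtime would read these from manifest.yml instead.
VIDEO_CONFIG_SPEC: dict[str, dict[str, Any]] = {
    "resolution": {"type": str, "enum": ["480p", "720p", "1080p"], "default": "720p"},
    "frame_rate": {"type": int, "min_value": 24, "max_value": 60, "default": 30},
    "duration_seconds": {"type": float, "min_value": 1.0, "max_value": 30.0, "default": 5.0},
    "motion_strength": {"type": float, "min_value": 0.1, "max_value": 1.0, "default": 0.5},
}


def validate_video_config(config: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Hypothetical helper: apply defaults and collect constraint violations."""
    resolved: dict[str, Any] = {}
    errors: list[str] = []

    # Flag keys the manifest does not declare.
    for key in config:
        if key not in VIDEO_CONFIG_SPEC:
            errors.append(f"unknown property: {key}")

    for key, spec in VIDEO_CONFIG_SPEC.items():
        value = config.get(key, spec["default"])
        expected = spec["type"]

        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            errors.append(f"{key}: expected {expected.__name__}, got bool")
            continue
        # Accept integers where a float is declared (e.g. duration_seconds: 3).
        if expected is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected):
            errors.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
            continue

        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: {value!r} is not one of {spec['enum']}")
        if "min_value" in spec and value < spec["min_value"]:
            errors.append(f"{key}: {value} is below the minimum {spec['min_value']}")
        if "max_value" in spec and value > spec["max_value"]:
            errors.append(f"{key}: {value} is above the maximum {spec['max_value']}")

        resolved[key] = value

    return resolved, errors


if __name__ == "__main__":
    # The video_config from the manifest's basic_generation test case.
    resolved, errors = validate_video_config({"resolution": "480p", "duration_seconds": 3.0})
    print(resolved)
    print(errors)
```

Collecting every violation into a list, rather than raising on the first one, lets a caller report all problems with a request in a single response. Whether SPI expects that behavior is not stated in the source.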
The entry point for this model (run.py, as named in the manifest's entry_point) could begin as follows:
# run.py - Main entry point for model execution
import os
import json
import time

import torch
import numpy as np
from PIL import Image
import cv2
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from spi.utils import SPIRuntime

# Initialize SPI runtime
spi = SPIRuntime()

# Parse inputs
inputs = spi.parse_inputs()
prompt = inputs.get("prompt", "")
negative_prompt = inputs.get("negative_prompt", "")
video_config = inputs.get("video_config", {})
seed = inputs.get("seed", -1)

# Set random seed for reproducibility (a negative seed means "pick one at random")
if seed < 0:
    seed = torch.randint(0, 2**32 - 1, (1,)).item()
torch.manual_seed(seed)
np.random.seed(seed)

# Extract video configuration
resolution_map = {
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}
resolution = resolution_map.get(video_config.get("resolution", "720p"), (1280, 720))
frame_rate = video_config.get("frame_rate", 30)
duration_seconds = video_config.get("duration_seconds", 5.0)
motion_strength = video_config.get("motion_strength", 0.5)
num_frames = int(frame_rate * duration_seconds)

# Check for reference image
reference_image_path = None
if "reference_image" in inputs and inputs["reference_image"]:
    reference_image_path = inputs["reference_image"][0]  # Take the first image if multiple

# Load text-to-video model
model