Xavier Universal Model Interface (XUMI) – Quick Overview

XUMI is a standard way to bring AI models into the MLOps platform. XUMI ensures models behave the same in development and production, speeds up onboarding of new models, includes automated checks for reliability and security, and makes sharing and collaboration across teams much simpler.

For ML Engineers

ML engineers can develop and test models locally using the XUMI SDK, easily adapt existing models to the platform, and run automated security and quality checks before submission. Integration and performance tests provide clear insights, and the submission process is straightforward and repeatable.

For Workbench Engineers

Workbench engineers benefit from standardized model reviews and template-based testing tailored to different model types. XUMI helps benchmark model performance against established baselines, ensures compliance with security and quality standards, and simplifies the deployment of approved models into production.

What is XUMI?

XUMI is a Command-Line Interface (CLI) tool that makes deploying and managing machine learning models easier and more consistent.

At its core, XUMI standardizes models so that they work together smoothly, regardless of the language or framework they were built with. Developers use XUMI to "wrap" a model by creating a manifest, which acts as a contract between the model, the frontend, and agents (chatbots). The manifest defines what the model can do, what resources it needs (CPU, memory, GPU), and what inputs it expects (for example, a prompt or inputImage).

Once a model is wrapped, XUMI makes integration straightforward. The project can be pushed to the container registry (Harbor), validated through the Model Registry Service, and then made available on the Workbench or to agents.

When it’s time to run the model, the frontend triggers an API that calls the Workflow Service. If the model isn’t already running, a ModelUpWorkflow spins up a container and runs the model using XUMI Streaming Mode. This ensures models run reliably and consistently across the platform.

Who Benefits from XUMI?

Two main consumers rely on the XUMI manifest to interact with models:

Agents: Agents use the manifest to understand which models are suitable for a request (e.g., "I need to create an image from a set of pictures") and what parameters to collect for the model's execution.
Frontend Interface: The application UI uses the manifest to display the correct parameters and settings for the user to interact with the model (such as prompt or negative prompt) before generating.

The XUMI manifest acts as a common contract that lets both the frontend and the agents call every available model consistently and correctly. Models differ from one another and are launched in different ways (they may be written in C or Python, for example). Crucially, frameworks like ComfyUI lack a standard interface that can be driven programmatically. XUMI solves this by creating a manifest contract for each model, which standardizes how the model is called and how parameters are passed to it from both the frontend and the agents.
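The way an agent or the frontend might consume a manifest can be sketched as follows. This is a minimal illustration, not the platform's implementation: the helper names `find_models` and `form_fields` and the trimmed `MANIFESTS` list are hypothetical, and the manifests are assumed to be already parsed into Python dictionaries.

```python
from typing import Any

# Hypothetical in-memory manifests, trimmed to the fields used in this sketch.
MANIFESTS: list[dict[str, Any]] = [
    {
        "model": {"name": "qwen-text2image", "domain": "image",
                  "tags": ["text-to-image", "qwen"]},
        "commands": [
            {"name": "predict", "inputs": [
                {"name": "prompt", "type": "string", "required": True},
                {"name": "negative_prompt", "type": "string",
                 "required": False, "default": " "},
            ]},
        ],
    },
]


def find_models(manifests, domain, tag):
    """Agent side: return names of models whose domain and tags match a request."""
    return [
        m["model"]["name"]
        for m in manifests
        if m["model"].get("domain") == domain and tag in m["model"].get("tags", [])
    ]


def form_fields(manifest, command="predict"):
    """Frontend side: split a command's inputs into required and optional names."""
    for cmd in manifest.get("commands", []):
        if cmd["name"] == command:
            inputs = cmd.get("inputs", [])
            required = [i["name"] for i in inputs if i.get("required")]
            optional = [i["name"] for i in inputs if not i.get("required")]
            return required, optional
    raise KeyError(f"command {command!r} is not declared in the manifest")
```

Both consumers read the same structure: the agent filters on `model` metadata, while the UI renders one field per declared input.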

What is the Common Workflow for New Models?

1. Model standardization with XUMI

XUMI is used to make all models work the same way, even if they are written in different languages like Python or C. Each model is wrapped using the XUMI CLI, which prepares it to be released and run in a standard format. As part of this process, a manifest file is created. This file describes what the model does, what resources it needs (CPU, memory, GPU), and which parameters can be passed to it from the frontend or agents, such as prompts or images. The manifest acts as a contract so that all model calls are consistent.

2. Project creation, build, and publish

A new model project is created with a XUMI CLI command, then configured and filled with the model code. The model can be run locally for testing using XUMI commands.

Once the local tests succeed, the model is built into a Docker image, an appropriate tag is assigned, and the image is pushed to Harbor using the appropriate command. Harbor stores the model image and runs security scans on it.
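To illustrate what "an appropriate tag" could look like, the sketch below composes an image reference in the usual registry form `<registry>/<project>/<name>:<version>`. The function name, the registry host, and the project name are assumptions made for illustration; the actual naming scheme used by the XUMI CLI is not specified here.

```python
def image_reference(registry: str, project: str, name: str, version: str) -> str:
    """Compose '<registry>/<project>/<name>:<version>' for a model image."""
    for label, value in (("registry", registry), ("project", project),
                         ("name", name), ("version", version)):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty and contain no whitespace")
    # Docker repository names must be lowercase, so normalize those parts.
    return f"{registry}/{project.lower()}/{name.lower()}:{version}"
```

For example, `image_reference("harbor.example.com", "models", "qwen-text2image", "1.0.0")` yields `harbor.example.com/models/qwen-text2image:1.0.0`, where the host and project are placeholders.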

3. Validation and registration

After Harbor finishes its vulnerability scan, it triggers a webhook that notifies the Model Registry Service. This service starts a Temporal workflow to validate the model.

During validation, the system checks the manifest for correctness and reviews the vulnerability scan results. In the future, this step may also include running the model automatically. If validation succeeds, the Model Registry Service records the manifest. The model then appears in the Workbench and becomes available to agents and the frontend.
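The manifest correctness check can be pictured with a small structural validator. This is a sketch under assumptions: the rules below (required top-level sections, named and typed inputs, workflow steps that refer to declared commands) are inferred from the shape of the example manifest and are not the Model Registry Service's actual rule set. The name `validate_manifest` is hypothetical.

```python
REQUIRED_SECTIONS = ("version", "model", "execution", "commands", "workflow")


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing section: {key}" for key in REQUIRED_SECTIONS
              if key not in manifest]

    model = manifest.get("model", {})
    for field in ("name", "version"):
        if not model.get(field):
            errors.append(f"model.{field} is required")

    declared = set()
    for cmd in manifest.get("commands", []):
        declared.add(cmd.get("name"))
        for item in cmd.get("inputs", []):
            if not item.get("name") or not item.get("type"):
                errors.append(f"command {cmd.get('name')!r}: input needs name and type")

    # Every workflow step must refer to a declared command.
    for step in manifest.get("workflow", []):
        if step not in declared:
            errors.append(f"workflow step {step!r} is not a declared command")
    return errors
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a single validation run.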

4. Model execution

When a user or agent wants to run a model, the frontend calls an API with the model name, version, and parameters defined in the manifest. This call starts a workflow in the Workflow Service, which launches the ModelUpWorkflow.

If the model is not already running, the workflow starts a new container using the resources defined in the manifest. Inside the container, XUMI runs the model using the xumi model run --streaming command. The model stays active while it is being used.

If there are no requests for a configured period of time (for example, 60 minutes), the model is automatically stopped. It can also be stopped manually by canceling the workflow.
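The idle shutdown rule can be sketched with a small tracker. This is an illustrative assumption about how such a check might be written, not the ModelUpWorkflow's code; the class name `IdleTracker` and its methods are hypothetical. The clock is injectable so the logic can be exercised without waiting.

```python
import time


class IdleTracker:
    """Tracks the last request time and reports when the idle limit is reached."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_request = clock()

    def touch(self) -> None:
        """Record that a request was just served."""
        self._last_request = self._clock()

    def should_stop(self) -> bool:
        """True once no request has arrived for the configured period."""
        return self._clock() - self._last_request >= self.idle_timeout_s
```

With a 60-minute limit this would be created as `IdleTracker(60 * 60)`; each incoming request calls `touch()`, and a periodic check of `should_stop()` decides whether to stop the container.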

Where can I find XUMI manifests?

The XUMI Models repository manages ML models that adhere to the Xavier Universal Model Interface (XUMI). A "Model" is defined as a Docker image that contains an AI model, wrapped and ready to be executed in a configured flow with expected results. Each model must be validated as a XUMI image, and its manifest must be extracted and saved to the repository. The Model Registry Service loads these manifests, so that a XUMI manifest can later be retrieved with a simple GET request.
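A retrieval call might look like the sketch below. The URL layout (`/models/<name>/versions/<version>/manifest`), the base URL, and the JSON response format are all assumptions made for illustration; the real endpoint of the Model Registry Service is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def manifest_url(base_url: str, name: str, version: str) -> str:
    """Build the (hypothetical) manifest URL for a model name and version."""
    safe_name = quote(name, safe="")
    safe_version = quote(version, safe="")
    return f"{base_url.rstrip('/')}/models/{safe_name}/versions/{safe_version}/manifest"


def fetch_manifest(base_url: str, name: str, version: str, timeout: float = 10.0) -> dict:
    """Issue a GET request and decode the body, assuming a JSON response."""
    with urlopen(manifest_url(base_url, name, version), timeout=timeout) as resp:
        return json.load(resp)
```

Percent-encoding the name and version keeps unexpected characters from altering the request path.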

An example manifest:

XUMI Manifest

version: 2
model:
  name: "qwen-text2image"
  version: "1.0.0"
  domain: "image"
  description: "Qwen-Text2Image model for text-to-image generation with NF4 quantization and LoRA support"
  author: "User"
  organization: "Organization"
  creation_date: "2025-11-04"
  license: "proprietary"
  tags: 
    - "text-to-image"
    - "qwen"
    - "diffusion"
    - "lora"
    - "nf4-quantization"

execution:
  runtime: "python"
  entrypoint: "run.py"
  base_image: "pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime"
  resources:
    cpu: 4
    memory: "24Gi"
    gpu: 1
  env_variables:
    PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
    CUDA_VISIBLE_DEVICES: "0"

commands:
  - name: "initialize"
    description: "Initialize the Qwen-Text2Image model with NF4 quantization"
    inputs:
      - name: "checkpoint_dir"
        type: "string"
        description: "Directory containing model weights (with transformer/ and text_encoder/ subdirs)"
        required: true
        default: "model/weights/checkpoints"
      - name: "lora_weights_dir"
        type: "string"
        description: "Directory containing LoRA model weights"
        required: false
        default: "model/weights/loras"
    outputs:
      - name: "model"
        type: "object"
        description: "Initialized Qwen-Text2Image model"
        required: true

  - name: "predict"
    description: "Generate image from text using Qwen-Text2Image model"
    inputs:
      - name: "model"
        type: "object"
        description: "Model from initialization"
        required: true
      - name: "prompt"
        type: "string"
        description: "Text prompt describing the desired image"
        required: true
      - name: "negative_prompt"
        type: "string"
        description: "Negative prompt (what to avoid)"
        required: false
        default: " "
      - name: "width"
        type: "integer"
        description: "Width of the generated image"
        required: false
        default: 1024
      - name: "height"
        type: "integer"
        description: "Height of the generated image"
        required: false
        default: 1024
      - name: "num_inference_steps"
        type: "integer"
        description: "Number of denoising steps (8 with Lightning LoRA, 30 without)"
        required: false
      - name: "true_cfg_scale"
        type: "float"
        description: "CFG scale (2.0 with Lightning LoRA, 4.0 without)"
        required: false
      - name: "seed"
        type: "integer"
        description: "Random seed for generation"
        required: false
      - name: "lora_name"
        type: "string"
        description: "Name of the LoRA file to use (e.g., 'kiki_qwen_v4.safetensors')"
        required: false
        default: "kiki_qwen_v4.safetensors"
      - name: "lora_scale"
        type: "float"
        description: "Strength of LoRA effect"
        required: false
        default: 0.7
    outputs:
      - name: image
        type: file
        description: Generated image
        mime_types:
          - image/png
        file_pattern: image_*.png
        index_padding: 3
        mount_path: outputs
        domain_specific:
          image:
            min_resolution: 512x512
            max_resolution: 2048x2048
            color_modes:
              - RGB
      - name: metadata
        type: file
        description: Image metadata in JSON format
        mime_types:
          - application/json
        file_pattern: metadata*.json
        mount_path: outputs

  - name: "cleanup"
    description: "Clean up model resources"
    inputs:
      - name: "model"
        type: "object"
        description: "Model to clean up"
        required: true

workflow:
  - "initialize"
  - "predict"
  - "cleanup"

# Empty arrays for backward compatibility
inputs: []
outputs: []
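
Given a manifest like the one above, a caller has to combine user-supplied values with the declared defaults before invoking `predict`. The sketch below shows one way to do that. It is an assumption about caller-side behavior, not XUMI's own logic: `resolve_params` and `TYPE_MAP` are hypothetical names, and `PREDICT_INPUTS` is a trimmed copy of the `predict` inputs that omits the `model` object passed on from `initialize`.

```python
# Trimmed copy of the "predict" inputs from the example manifest.
PREDICT_INPUTS = [
    {"name": "prompt", "type": "string", "required": True},
    {"name": "negative_prompt", "type": "string", "required": False, "default": " "},
    {"name": "width", "type": "integer", "required": False, "default": 1024},
    {"name": "height", "type": "integer", "required": False, "default": 1024},
    {"name": "seed", "type": "integer", "required": False},
    {"name": "lora_scale", "type": "float", "required": False, "default": 0.7},
]

# Manifest type names mapped to accepted Python types (an assumption).
TYPE_MAP = {"string": (str,), "integer": (int,), "float": (int, float)}


def resolve_params(inputs, supplied):
    """Merge supplied values with manifest defaults and check basic types."""
    unknown = set(supplied) - {spec["name"] for spec in inputs}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for spec in inputs:
        name = spec["name"]
        if name in supplied:
            value = supplied[name]
        elif "default" in spec:
            value = spec["default"]
        elif spec.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        else:
            continue  # optional with no default: leave the choice to the model
        expected = TYPE_MAP.get(spec["type"])
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            raise TypeError(f"{name} must be of type {spec['type']}")
        resolved[name] = value
    return resolved
```

Calling `resolve_params(PREDICT_INPUTS, {"prompt": "a red fox", "width": 768})` keeps the supplied width, fills in the remaining defaults, and leaves out `seed`, which has no default in the manifest.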