Model Registry Service Requirements
Purpose and Background
The purpose of the Model Registry Service is to provide a centralized platform for managing ML models in the form of Docker images within a container registry. This service enables users to efficiently store, retrieve, update, and distribute their prepared and configured ML models across different environments, ensuring consistency and scalability in deployment processes.
Machine Learning workflows often involve complex model training pipelines where various iterations of models are created, optimized, and deployed. Managing these models effectively becomes crucial as organizations scale up their operations. The traditional approach of manually uploading and downloading models can be error-prone and inefficient, especially when dealing with large-scale deployments involving multiple teams and environments.
The Model Registry Service is designed to address these challenges. It acts as an intermediary between developers and engineers and the underlying container registry, which may be a cloud container registry, a local (Docker) container registry, or even the local file system. By abstracting away the low-level details of interacting directly with registries, the service simplifies storing, retrieving, and updating models, and adds version control, metadata management, access controls, and monitoring and reporting capabilities.
Key benefits include:
- Improved efficiency through automation of routine tasks such as pushing models to and pulling them from container registries.
- Enhanced collaboration among team members by maintaining consistent naming conventions and tracking changes over time.
- Better governance through role-based access controls that ensure secure handling of sensitive data and assets.
- Improved robustness and security of models through validation and vulnerability scanning with Trivy.
By leveraging modern cloud-native technologies, the proposed solution aims to let users, including software and DevOps engineers, focus on higher-value activities rather than on the routine administrative work of managing ML artifacts.
Business/Functional Requirements
Core Model Registry Services Requirements
- High-level features
  - Unified model naming and versioning conventions
  - Abstraction for transparent interaction with different container registries
  - Listing, downloading, uploading, and syncing models with the container registry
  - Separation of checkpoints from the model code
- Reference applications and other services
  - Storing custom weights via the Asset service
  - Optional use of the Asset service for caching
  - Integration with the Workflow and Benchmark services for running benchmarks and workflows
- Access and security
  - Role-based access to the managed models and related data, such as model versions and checkpoints
  - Scanning of models before they are uploaded to a model registry. Images are scanned with Trivy by default.
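The registry abstraction and the list, download, upload, and sync operations listed above can be illustrated with a minimal sketch. It is written in Python for brevity, although the service itself targets .NET. The `ContainerRegistry` interface, the `InMemoryRegistry` stand-in, the `name:version` reference format, and the `sync` helper are all hypothetical names chosen for illustration; they are not part of the specified design.

```python
from typing import Protocol


class ContainerRegistry(Protocol):
    """Operations the service needs from any backing registry (hypothetical interface)."""

    def list_models(self) -> list[str]: ...
    def upload(self, reference: str, image: bytes) -> None: ...
    def download(self, reference: str) -> bytes: ...


class InMemoryRegistry:
    """Stand-in backend for experiments; a real backend would talk to an actual registry."""

    def __init__(self) -> None:
        self._images: dict[str, bytes] = {}

    def list_models(self) -> list[str]:
        return sorted(self._images)

    def upload(self, reference: str, image: bytes) -> None:
        self._images[reference] = image

    def download(self, reference: str) -> bytes:
        return self._images[reference]


def model_reference(name: str, version: str) -> str:
    """Build a unified `name:version` reference (assumed naming convention)."""
    return f"{name.strip().lower()}:{version}"


def sync(source: ContainerRegistry, target: ContainerRegistry) -> list[str]:
    """Copy every reference present in `source` but missing from `target`."""
    existing = set(target.list_models())
    missing = [ref for ref in source.list_models() if ref not in existing]
    for ref in missing:
        target.upload(ref, source.download(ref))
    return missing
```

Because callers depend only on the three interface methods, a backend for another registry could be swapped in without changing the code that lists or syncs models.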
User Stories
The user stories below cover broader scenarios encompassing not just basic functionality but also advanced features aimed at improving productivity, reliability, and operational efficiency throughout the entire lifecycle of managing machine learning models.
As an ML Engineer, I want to:
Already Implemented
- Create and track a new model locally using XUMI (Xavier Universal Model Interface). Save it along with its dependencies and configurations in a local repository.
- Build and version a model. Each model should have a unique identifier and a version number that is updated on each model iteration. Semantic versioning makes updates easy to identify and allows rolling back to a stable release if needed.
- Specify the minimum hardware requirements. Attach metadata specifying GPU compatibility, required disk space, memory usage, and other computational resources needed to run the model.
- Push the model to the remote registry. Securely authenticate and push the built image to a designated container registry.
- Scan images. After images are pushed to the content-trust-enabled project, they must be automatically scanned for vulnerabilities using an available scanner such as Trivy or Clair. As a result, only approved (signed) and scanned images are available for download and execution.
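The versioning story above can be sketched as follows, in Python for brevity. The `ModelVersion` class and its `bump` method are hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` versions without pre-release or build tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ModelVersion:
    """A MAJOR.MINOR.PATCH version; field order gives the comparison order."""

    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        parts = text.split(".")
        if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
            raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
        return cls(*(int(p) for p in parts))

    def bump(self, part: str) -> "ModelVersion":
        """Return the next version, resetting the lower-order parts to zero."""
        if part == "major":
            return ModelVersion(self.major + 1, 0, 0)
        if part == "minor":
            return ModelVersion(self.major, self.minor + 1, 0)
        if part == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown version part: {part!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

Comparing parsed versions numerically, rather than as strings, keeps `1.10.0` ordered after `1.9.3`, which matters when choosing a stable release to roll back to.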
Implementation Postponed
- Before images are pushed to the content-trust-enabled project, they must be signed using a supported tool such as Cosign or Notation. This involves generating a signature and associating it with the image. The signed images are then pushed to the registry project, where their signatures are also stored.
- Maintain synchronization between the local development environment and remote repositories to ensure consistency during collaborative work. Synchronization means an automatic or scheduled update of an image in the local registry when the remote registry contains the same model with a higher version number.
- Sync changes across all locations.
As a Workbench Engineer, I want to:
- Pull models from the registry. Fetch specific versions of models stored in the target registry.
- Validate models against security and quality standards. Perform vulnerability scans on containers and check compliance with organizational policies. To enforce these policies, each model has to be approved (signed) before it is pushed to the model registry, and an automatic scan must be run on the model image as soon as it is pushed.
- Retrieve the scanning and signing results (logs) of a specific model version at any time to verify how it was scanned and which critical, major, or minor issues were found.
- Ensure quality compliance. Verify adherence to internal coding guidelines, API contracts, documentation completeness, and licensing terms before promoting models to production use cases.
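The rule that only signed and scanned images may be pulled can be expressed as a small policy check. The sketch below is in Python for brevity; `ScanStatus`, `ImageState`, and both functions are hypothetical names, and how the signing and scan state is actually obtained from the registry is outside its scope.

```python
from dataclasses import dataclass
from enum import Enum


class ScanStatus(Enum):
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


@dataclass(frozen=True)
class ImageState:
    """Signing and scanning state of one model image (assumed shape)."""

    signed: bool
    scan: ScanStatus


def pull_denial_reasons(state: ImageState) -> list[str]:
    """Return every reason the image may not be pulled; empty means allowed."""
    reasons = []
    if not state.signed:
        reasons.append("image is not signed")
    if state.scan is ScanStatus.PENDING:
        reasons.append("vulnerability scan has not completed")
    elif state.scan is ScanStatus.FAILED:
        reasons.append("vulnerability scan failed")
    return reasons


def can_pull(state: ImageState) -> bool:
    return not pull_denial_reasons(state)
```

Returning the list of reasons, rather than a bare boolean, gives an engineer who is refused a pull something concrete to act on.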
As a DevOps Engineer, I want to:
Implementation Postponed
- Cleanup Unused Models: Regularly scan for outdated or abandoned models that haven’t been accessed recently; provide options to archive or delete them after confirmation.
Technical Requirements
- .NET 9.0. Development of a service using the .NET 9 platform.
- C# 13. To develop the service, use the C# programming language version 13.
- PostgreSQL. For storing Model Registry service data related to models such as list of registered models, versions, references to checkpoints, hardware requirements, etc.
- Harbor registry as model storage, with support for integration with other remote registries (Azure Container Registry, Amazon ECR, Google Container Registry, GitLab, etc.).
- OpenAPI/Swagger for RESTful API.
- Sentry for service monitoring and error tracking.
Functional Requirements
The functional requirements below align closely with the goals outlined earlier: improving usability, reliability, scalability, and maintainability in managing models within distributed environments.
Model Creation and Tracking
The service should support creating new models and the other CRUD operations, with integrated checkpoint management and versioning. Users need to manage new and existing models easily, including simple and fast control over weights. This keeps workflows flexible and makes it possible to track model history, signatures, and metadata.
Model metadata should include:
- Name, version, and description
- Checkpoint references associated with model version
- References to container registries where the model can be located
- Hardware requirements info linked to a model version
- Author and organization information
- Creation and modification dates
- License and usage restrictions
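One possible shape for such a metadata record is sketched below, in Python for brevity. The class and field names are hypothetical, and the actual PostgreSQL schema may differ; the sketch only shows that every listed item maps to a field.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class HardwareRequirements:
    """Minimum resources for one model version (assumed fields and units)."""

    gpu_required: bool = False
    min_memory_gb: float = 0.0
    min_disk_gb: float = 0.0


@dataclass
class ModelMetadata:
    name: str
    version: str
    description: str
    author: str
    organization: str
    license: str
    checkpoints: list[str] = field(default_factory=list)  # checkpoint references
    registries: list[str] = field(default_factory=list)   # registries holding the image
    hardware: HardwareRequirements = field(default_factory=HardwareRequirements)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    modified_at: Optional[datetime] = None

    def touch(self) -> None:
        """Record a modification time after any change to the record."""
        self.modified_at = datetime.now(timezone.utc)
```

Calling `asdict(metadata)` turns the record, including the nested hardware requirements, into a plain dictionary that can be serialized for an API response.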
Container Registry Integration
The Model Registry service should facilitate authenticated pushes and pulls to local and remote container registries. Authentication reduces risks associated with unauthorized access and maintains data integrity.
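An authorization check that runs before a push or pull might look like the following sketch, in Python for brevity. The role names, the `ROLE_PERMISSIONS` table, and both functions are assumptions made for illustration; the source does not define concrete roles.

```python
from enum import Enum
from typing import Iterable


class Action(Enum):
    PULL = "pull"
    PUSH = "push"
    DELETE = "delete"


# Hypothetical roles and the actions each one grants.
ROLE_PERMISSIONS = {
    "viewer": {Action.PULL},
    "contributor": {Action.PULL, Action.PUSH},
    "admin": {Action.PULL, Action.PUSH, Action.DELETE},
}


def is_allowed(roles: Iterable[str], action: Action) -> bool:
    """True if any of the user's roles grants the action; unknown roles grant nothing."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def authorize(user: str, roles: Iterable[str], action: Action) -> None:
    """Raise PermissionError before the registry is contacted if the action is not granted."""
    if not is_allowed(roles, action):
        raise PermissionError(f"{user} may not {action.value} models")
```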
Synchronization
The Model Registry service should provide a mechanism that automatically syncs local model versions from remote repositories to avoid inconsistencies. Consistency across environments promotes reliable collaboration and avoids bugs caused by mismatched states. The service should provide transparent access to models regardless of whether they are stored locally or remotely.
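The decision of which local models to update can be sketched as a comparison of version maps, in Python for brevity. The function names and the `{model name: version}` dictionaries are hypothetical, and versions are assumed to be dot-separated integers. The sketch follows the synchronization rule from the user stories: a model is updated only when it already exists locally and the remote version is higher.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def plan_sync(local: dict[str, str], remote: dict[str, str]) -> dict[str, str]:
    """Return {model name: remote version} for local models that are behind."""
    updates = {}
    for name, remote_version in remote.items():
        local_version = local.get(name)
        if local_version is None:
            continue  # models absent locally are not pulled in by this rule
        if parse_version(remote_version) > parse_version(local_version):
            updates[name] = remote_version
    return updates
```

Separating the plan from its execution lets a scheduled job log or review the pending updates before any image is transferred.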
Validation Processes
The service should use external tools such as Trivy or Clair to validate models against security vulnerabilities and quality checks. Validation supports compliance with established standards, increases trust in the models, and reduces potential legal liabilities.
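A scan result can be reduced to per-severity counts and checked against a policy, as in the sketch below (Python for brevity). It assumes the report is JSON in the shape Trivy emits with `--format json`, that is, a top-level `Results` list whose entries may carry a `Vulnerabilities` list with a `Severity` field. The blocking policy of `CRITICAL` and `HIGH` is an assumption, not a stated requirement.

```python
import json
from collections import Counter

# Assumed policy: any finding at these severities blocks the model.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def summarize_report(report_json: str) -> Counter:
    """Count findings per severity in a Trivy-style JSON report."""
    report = json.loads(report_json)
    counts: Counter = Counter()
    # Either key may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vulnerability in result.get("Vulnerabilities") or []:
            counts[vulnerability.get("Severity", "UNKNOWN")] += 1
    return counts


def passes_policy(counts: Counter) -> bool:
    """True when the report contains no finding at a blocking severity."""
    return not any(counts[severity] for severity in BLOCKING_SEVERITIES)
```

The same counts could back the stored scan logs, so that the critical, major, and minor issues found for a model version remain retrievable later.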
Storage Management
The service should clean up old or unused models either automatically on a schedule or semi-automatically, that is, by suggesting a list of dormant artifacts to remove or archive, in order to reduce clutter and free up resources. This cleanup can be combined with Harbor's garbage collection, which removes untagged or orphaned artifacts, blobs, and other data. The Model Registry service can also collect model usage data and expose it in the required format through the service API.
Monitoring Features
Metrics related to pulls/pushes, errors, and resource consumption should be collected and analyzed. Proactive monitoring allows early detection of problems impacting availability and overall efficiency.
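Counting operations and their errors can be sketched with an in-process collector, in Python for brevity. The `RegistryMetrics` class is hypothetical; a production service would more likely export these values to a dedicated monitoring backend, and resource consumption is not covered here.

```python
from collections import Counter


class RegistryMetrics:
    """In-process counters for registry operations (illustrative only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, operation: str, success: bool = True) -> None:
        """Count one operation such as 'pull' or 'push', and its failure if any."""
        self._counts[(operation, "total")] += 1
        if not success:
            self._counts[(operation, "errors")] += 1

    def error_rate(self, operation: str) -> float:
        """Fraction of failed calls; 0.0 when the operation was never recorded."""
        total = self._counts[(operation, "total")]
        return self._counts[(operation, "errors")] / total if total else 0.0

    def snapshot(self) -> dict[str, int]:
        """Flatten the counters into 'operation.kind' keys for reporting."""
        return {f"{op}.{kind}": n for (op, kind), n in sorted(self._counts.items())}
```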