Craftology Technical Overview
Diagram
Infrastructure and Deployment
- Environment: The project is deployed in Kubernetes.
- Access: Entry is via Ingress.
- Environments: DEV, TEST, and STAGE. All three Kubernetes clusters run in AWS.
- Services: Deployed as Pods in Kubernetes. An exception is made for "heavy models" (e.g., Hume models), which require increased disk space.
- Status: The current state of deployed services in Kubernetes can be viewed via Argo CD.
User Layer (Next.js)
- NextAppP: The application running on the user/client side.
- Next.js App Backend:
  - Stores sessions in Redis.
  - Handles session verification and request routing to the backend services.
- Authorization: Built on Keycloak using the OAuth protocol. The user receives a token, which is used to create a session.
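The session flow described above (token received, session created, session verified before routing) can be sketched roughly as follows. This is an illustration only: the in-memory dict stands in for Redis, and the names `create_session`, `verify_session`, `route_request`, the route table, and the one-hour lifetime are all assumptions, not part of the actual Next.js backend.

```python
import secrets
import time
from typing import Optional

# Hypothetical stand-in for Redis: session_id -> session record.
SESSION_STORE: dict[str, dict] = {}
SESSION_TTL_SECONDS = 3600  # assumed lifetime; the overview does not state one

# Hypothetical routing table: path prefix -> backend service name.
ROUTES = {"/api/agents": "agents", "/api/workflows": "workflow-service"}


def create_session(access_token: str, user_id: str, now: Optional[float] = None) -> str:
    """Create a session from a token issued by the identity provider."""
    now = time.time() if now is None else now
    session_id = secrets.token_urlsafe(32)
    SESSION_STORE[session_id] = {
        "user_id": user_id,
        "access_token": access_token,
        "expires_at": now + SESSION_TTL_SECONDS,
    }
    return session_id


def verify_session(session_id: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the session if it exists and has not expired, else None."""
    now = time.time() if now is None else now
    session = SESSION_STORE.get(session_id)
    if session is None:
        return None
    if session["expires_at"] <= now:
        del SESSION_STORE[session_id]  # drop expired sessions
        return None
    return session


def route_request(session_id: str, path: str, now: Optional[float] = None) -> tuple[int, str]:
    """Verify the session first, then pick a backend service by path prefix."""
    if verify_session(session_id, now) is None:
        return 401, "unauthorized"
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return 200, service
    return 404, "no route"
```

The point of the sketch is the ordering: no request reaches a backend service unless a valid session is found first.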
Core Backend Services
- MLOps Services: Hold the core interaction, control, and management logic.
- Agents: Responsible for project creation, determining user intent, and, critically, for all data synchronization (change detection, plan creation, and triggering new generation).
- MCP-services: Used to interact with PocketBase, asset services, and generation services.
- Workflow Service (Temporal Orchestrator):
  - Orchestrates execution chains (Workflows) for long-running processes so that they can withstand failures.
  - Stores the execution context in the database so that work can resume after a failure.
  - Spins up "heavy" ML models (WAN, QNImg, QN DXL) in Kubernetes on a schedule (mornings) and keeps them in a ready state.
- Model Registry: Stores models, but without their weights.
- PBAC-agent/PBAC-center: Verify HTTP requests for access and act as a central repository for security policies.
- Gen AI Service: A service that contacts third-party models via API.
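The resume-after-failure behaviour of the Workflow Service can be illustrated with a deliberately simplified checkpoint model. This is not Temporal's API or its actual mechanism; the dict `CHECKPOINTS` stands in for the database, and `run_workflow` and the step functions are hypothetical names used only for this sketch.

```python
from typing import Callable

# Hypothetical stand-in for the orchestrator's database:
# workflow_id -> {"next_step": int, "context": dict}
CHECKPOINTS: dict[str, dict] = {}

Step = Callable[[dict], dict]


def run_workflow(workflow_id: str, steps: list[Step], initial_context: dict) -> dict:
    """Run steps in order, persisting the context after each completed step.

    If a step raises, the exception propagates. Calling run_workflow again
    with the same workflow_id resumes from the first step that did not finish.
    """
    state = CHECKPOINTS.setdefault(
        workflow_id, {"next_step": 0, "context": dict(initial_context)}
    )
    while state["next_step"] < len(steps):
        step = steps[state["next_step"]]
        new_context = step(dict(state["context"]))  # the step works on a copy
        # Checkpoint only after the step has succeeded.
        state["context"] = new_context
        state["next_step"] += 1
    return state["context"]
```

Because the checkpoint is written only after a step succeeds, a crashed step is retried on the next run while completed steps are not repeated.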
Data Storage
- PocketBase (per tenant):
  - Data Type: Relational database for storing project settings and metadata (number of shots, card attributes, etc.) as small JSON objects.
  - Status: Still stores some files because of a legacy implementation; the plan is to move all files to Asset Storage.
- Asset Storage (per tenant):
  - Data Type: Storage for large objects (blobs): model weights (e.g., 50 GB), generated images/videos, and source documents.
  - Protocol: Operates over the S3 standard (supports MinIO, Google Cloud Storage, Oracle, and Azure Blob Storage).
Multi-Tenancy Model
- Tenant: A client (company).
- Isolation: Each tenant receives an individual, isolated "sandbox" that includes its own instances of Asset Storage and PocketBase.
- Shared Components: All tenants use the same common services.
- File Sharing: Possible between Asset Storages of different tenants via a B2B service, with security policy verification in PBAC.
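As a rough illustration of the isolation and sharing rules above, the sketch below models each tenant's sandbox and gates cross-tenant copies on a policy lookup. Everything here is an assumption made for the example: the dicts stand in for PocketBase and Asset Storage, `SHARING_POLICIES` stands in for PBAC, and `share_file` is a hypothetical name for what the B2B service does.

```python
from dataclasses import dataclass, field


@dataclass
class TenantSandbox:
    """Per-tenant resources; dicts stand in for PocketBase and Asset Storage."""
    tenant_id: str
    pocketbase: dict = field(default_factory=dict)     # small JSON metadata
    asset_storage: dict = field(default_factory=dict)  # large blobs by key


# Hypothetical policy table standing in for PBAC:
# (source tenant, target tenant) pairs that are allowed to share files.
SHARING_POLICIES: set[tuple[str, str]] = set()


def is_sharing_allowed(source: str, target: str) -> bool:
    """Policies are directional: (a, b) does not imply (b, a)."""
    return (source, target) in SHARING_POLICIES


def share_file(source: TenantSandbox, target: TenantSandbox, key: str) -> None:
    """Copy one object between tenants, but only if policy permits it."""
    if not is_sharing_allowed(source.tenant_id, target.tenant_id):
        raise PermissionError(
            f"sharing {source.tenant_id} -> {target.tenant_id} denied"
        )
    target.asset_storage[key] = source.asset_storage[key]
```

Treating policies as directional is a design choice of this sketch; the overview does not say whether real PBAC policies are symmetric.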
Telemetry and Monitoring
- Events: All services generate events during operation.
- Collection: All events are sent to Kafka.
- Storage/Visualization: Some events from Kafka are written directly to ClickHouse to build charts in Grafana.
- Enrichment Service (Telemetry Service): For GPU-related generation events, queries Prometheus for RAM/GPU load data, enriches the events with it, and sends them to ClickHouse.
- Quota Telemetry: A separate service that calculates spent quotas to restrict resource access.
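The enrichment step can be sketched as a small function that adds load metrics only to GPU-related events. This is a simplification under stated assumptions: `fetch_load` stands in for the Prometheus query, the `sink` list stands in for ClickHouse, and the event field names (`type`, `pod`, `load`) and the set of GPU event types are hypothetical. In the real pipeline, some events reach ClickHouse without passing through this service at all.

```python
from typing import Callable, Iterable

# Hypothetical event types treated as GPU-related generation events.
GPU_EVENT_TYPES = {"image_generation", "video_generation"}


def enrich_event(event: dict, fetch_load: Callable[[str], dict]) -> dict:
    """Return a copy of the event, with load metrics added for GPU events.

    fetch_load is a stand-in for the Prometheus query; it takes a pod name
    and returns a dict of load metrics for that pod.
    """
    enriched = dict(event)
    if event.get("type") in GPU_EVENT_TYPES:
        enriched["load"] = fetch_load(event["pod"])
    return enriched


def process_batch(
    events: Iterable[dict], fetch_load: Callable[[str], dict], sink: list
) -> None:
    """Enrich each consumed event and append it to the sink (ClickHouse stand-in)."""
    for event in events:
        sink.append(enrich_event(event, fetch_load))
```

Passing `fetch_load` in as a parameter keeps the metrics lookup replaceable, which makes the enrichment logic easy to exercise without a running Prometheus.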
Future Development
- Gateway: The plan is to remove old services and transition to a single access gateway.
- Models: Integration of new models is actively under way.
- Workflows: The overall architecture is approved, but changes are still expected, including new Workflows and new services for specific operations.