
Quota Service & Real-Time Enforcement for AI Usage

Designing a quota system for AI workloads is less about counting credits and more about making instant, correct decisions under pressure. Every request that hits your system carries cost implications, and the quota service becomes the gatekeeper that decides—within milliseconds—whether that cost is allowed.

This document walks through a production-grade design of a hierarchical quota service with real-time enforcement, explaining not just how it works, but why each piece exists.


Why a Quota Service Exists

In AI systems, cost is not linear or predictable:

  • A single request may consume wildly different resources
  • Some models report usage only after execution
  • Latency matters—users expect immediate responses

Because of this, quota enforcement must behave like:

a payment authorization system, not a reporting system

It must:

  • decide instantly
  • prevent overspending
  • remain consistent under concurrency


Mental Model: Quotas as Nested Budgets

The system operates on three levels:

Organization Quota -> Project Quota -> User Quota

Think of quotas as nested budgets:

Organization (100,000 credits)
 ├── Project A (60,000 credits)
 │    ├── User 1 (10,000 credits)
 │    └── User 2 (20,000 credits)
 └── Project B (40,000 credits)
      └── User 3 (15,000 credits)

Key rule

A request is allowed only if ALL levels have enough quota

So every request must pass:

user_remaining >= cost
project_remaining >= cost
org_remaining >= cost
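The three comparisons above can be folded into one helper that also reports which level blocked the request. This is a minimal in-memory sketch; `Budget` and `FirstShortfall` are hypothetical names used only for illustration and are not part of the service described here.

```go
package main

// Budget is a hypothetical in-memory view of one quota level.
type Budget struct {
	Name      string
	Remaining int64
}

// FirstShortfall walks the levels (for example user, project, org) and
// returns the name of the first one that cannot cover cost.
// ok is true only when every level can afford the request.
func FirstShortfall(cost int64, levels ...Budget) (name string, ok bool) {
	for _, b := range levels {
		if b.Remaining < cost {
			return b.Name, false
		}
	}
	return "", true
}
```

Returning the name of the failing level is a design choice rather than a requirement: it lets a rejection say whether the user, the project, or the organization ran out.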

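Each level holds a portion of the credits, and each exists for a different reason:

  • Organizations control total spend
  • Projects control how that spend is distributed
  • Users prevent any single account from draining a shared budget

This is why the key rule checks every layer rather than only the user's own balance.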


System Architecture

The system is split into three distinct layers, each with a clear responsibility.

---
config:
  layout: elk
---
flowchart LR
    Client --> API[Quota Service API]
    API --> Redis[(Redis - Real-time)]
    API --> DB[(PostgreSQL - Source of Truth)]
    API --> Model[AI Model]
    Model --> API
    API --> Redis
    API --> DB

Real-Time Layer (Redis)

Handles:

  • quota checks
  • atomic deductions
  • concurrency safety

Control Layer (Quota Service)

Handles:

  • cost estimation
  • orchestration
  • reconciliation

Persistence Layer (PostgreSQL)

Handles:

  • audit logs
  • billing alignment
  • historical tracking

Real-Time Enforcement Flow

When a request arrives, the system moves quickly:

---
config:
  layout: elk
---
sequenceDiagram
    participant Client
    participant API
    participant Redis
    participant Model
    Client->>API: Request (prompt, params)
    API->>API: Estimate cost
    API->>Redis: Check + Reserve (atomic)
    Redis-->>API: Allowed / Rejected
    alt Allowed
        API->>Model: Execute request
        Model-->>API: Response + actual usage
        API->>Redis: Reconcile
    else Rejected
        API-->>Client: Reject request
    end
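The sequence above can be sketched as a single handler. The `Quota` and `Model` interfaces, the `Handle` function, and the in-memory fakes are all assumptions made for illustration; a real service would back `Quota` with Redis and `Model` with a provider client. Releasing the whole reservation when the model call fails is also an assumption, since some providers bill for failed requests.

```go
package main

import "errors"

// Quota and Model are hypothetical interfaces standing in for the
// Redis-backed quota store and the AI provider client.
type Quota interface {
	Reserve(cost int64) (bool, error)
	Reconcile(reserved, actual int64) error
}

type Model interface {
	// Run executes the prompt and reports the credits actually consumed.
	Run(prompt string) (output string, actual int64, err error)
}

var ErrQuotaExceeded = errors.New("quota exceeded")

// Handle follows the sequence: estimate, reserve, execute, reconcile.
func Handle(q Quota, m Model, estimate func(string) int64, prompt string) (string, error) {
	reserved := estimate(prompt)
	ok, err := q.Reserve(reserved)
	if err != nil {
		return "", err
	}
	if !ok {
		return "", ErrQuotaExceeded
	}
	out, actual, err := m.Run(prompt)
	if err != nil {
		// Assumption: a failed call consumed nothing, so release everything.
		_ = q.Reconcile(reserved, 0)
		return "", err
	}
	return out, q.Reconcile(reserved, actual)
}

// memQuota is an in-memory fake used only to exercise Handle.
type memQuota struct{ remaining int64 }

func (q *memQuota) Reserve(cost int64) (bool, error) {
	if q.remaining < cost {
		return false, nil
	}
	q.remaining -= cost
	return true, nil
}

func (q *memQuota) Reconcile(reserved, actual int64) error {
	q.remaining += reserved - actual
	return nil
}

// fixedModel is a fake provider that always reports the same usage.
type fixedModel struct{ usage int64 }

func (m fixedModel) Run(prompt string) (string, int64, error) {
	return "echo: " + prompt, m.usage, nil
}
```

With 150 credits available, an estimate of 120, and actual usage of 100, the first call leaves 50 credits; a second identical call is rejected because 50 is less than the 120 that would have to be reserved.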

Cost Estimation: The Hidden Backbone

Before enforcement, the system must estimate cost.

This is unavoidable because:

  • many AI providers report usage only after execution
  • waiting would break real-time enforcement

So the system uses:

predict first, reconcile later

Example:

  • estimate: 120 credits
  • actual: 100 credits

The difference is corrected after execution.

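To stay safe, estimates should err slightly on the high side, so that reconciliation usually issues a refund rather than discovering an overspend after the fact.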
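One way to produce a deliberately pessimistic estimate is to price the known prompt size plus the caller's output cap, then add a margin. The sketch below assumes token-based pricing; `EstimateCredits`, `creditsPer1K`, and `marginPct` are hypothetical names and knobs, not values taken from any provider.

```go
package main

// EstimateCredits returns a reservation that should rarely be exceeded.
// promptTokens is known up front; maxOutputTokens is the caller's cap and is
// treated as the worst case for output length. creditsPer1K and marginPct are
// hypothetical pricing parameters.
func EstimateCredits(promptTokens, maxOutputTokens, creditsPer1K, marginPct int64) int64 {
	tokens := promptTokens + maxOutputTokens
	// Ceiling division so a partial block of 1,000 tokens is still charged.
	base := (tokens*creditsPer1K + 999) / 1000
	// Add the safety margin, again rounding up.
	return base + (base*marginPct+99)/100
}
```

For example, `EstimateCredits(1500, 500, 50, 20)` prices 2,000 tokens at 50 credits per 1,000 for a base of 100, then adds 20% to reserve 120 credits.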


Redis Data Model

The real-time system stores only what it needs to decide quickly:

quota:org:{id}     -> remaining credits
quota:project:{id} -> remaining credits
quota:user:{id}    -> remaining credits

This structure is intentionally simple. Complexity belongs elsewhere.


Atomic Enforcement with Lua

Concurrency is the biggest risk. Two requests arriving at the same time must not overspend shared quota.

This is solved using a Lua script, executed atomically inside Redis.

Lua Script: Check and Reserve

-- KEYS:
-- 1 = org quota key
-- 2 = project quota key
-- 3 = user quota key

-- ARGV:
-- 1 = cost

local cost = tonumber(ARGV[1])

local org = tonumber(redis.call("GET", KEYS[1]) or "0")
local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
local user = tonumber(redis.call("GET", KEYS[3]) or "0")

if org >= cost and proj >= cost and user >= cost then
    redis.call("DECRBY", KEYS[1], cost)
    redis.call("DECRBY", KEYS[2], cost)
    redis.call("DECRBY", KEYS[3], cost)

    return {1, org - cost, proj - cost, user - cost}
else
    return {0, org, proj, user}
end

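Because Redis runs a script to completion before serving any other command, this script gives you:

  • no race between the check and the deduction
  • no double spending of the same credits
  • consistent enforcement across the hierarchy

One caveat: on Redis Cluster a script may only touch keys in a single hash slot, so the three keys would need a shared hash tag to be co-located.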
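To see what the atomic script buys you without running Redis, the sketch below models the same check-then-deduct step in memory. It is an analogy only: the mutex plays the role that single-threaded script execution plays in Redis, and `ledger`, `reserve`, and `hammer` are hypothetical names.

```go
package main

import "sync"

// ledger is an in-memory stand-in for the three Redis keys.
type ledger struct {
	mu                 sync.Mutex
	org, project, user int64
}

// reserve mirrors the Lua script: check all three levels, then deduct from
// all three, with no other caller able to interleave between the two steps.
func (l *ledger) reserve(cost int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if cost <= 0 || l.org < cost || l.project < cost || l.user < cost {
		return false
	}
	l.org -= cost
	l.project -= cost
	l.user -= cost
	return true
}

// hammer fires n concurrent reservations and returns how many were approved.
func hammer(l *ledger, n int, cost int64) int64 {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var approved int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.reserve(cost) {
				mu.Lock()
				approved++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return approved
}
```

With a user budget of 1,000 credits and 100 concurrent requests of 30 credits each, exactly 33 are approved and 10 credits remain, regardless of scheduling. Without the lock around the check and the deduction, two goroutines could both pass the check against the same balance and push it negative.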

Reconciliation After Execution

Once the AI model finishes, actual usage is known.

---
config:
  layout: elk
---
flowchart TD
    A[Reserved Credits] --> B{Compare}
    B -->|Actual < Reserved| C[Refund Difference]
    B -->|Actual > Reserved| D[Charge Extra or Flag]

Lua Script: Refund

-- KEYS:
-- 1 = org
-- 2 = project
-- 3 = user

-- ARGV:
-- 1 = refund amount

local refund = tonumber(ARGV[1])

redis.call("INCRBY", KEYS[1], refund)
redis.call("INCRBY", KEYS[2], refund)
redis.call("INCRBY", KEYS[3], refund)

return 1

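Refunding the difference keeps the real-time balances aligned with actual usage. Note that this script covers only the overestimate branch; when actual usage exceeds the reservation, the extra still has to be charged or flagged, as the diagram shows.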


Go Implementation

Below is a simplified production-style implementation.

Quota Service Structure

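// Requires "context" and a go-redis client package,
// for example "github.com/redis/go-redis/v9".

// QuotaService performs real-time quota checks against Redis.
type QuotaService struct {
    redis *redis.Client
}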

Check and Reserve

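// reserveScript checks all three levels and deducts from each in one atomic step.
// KEYS = org, project, user quota keys; ARGV[1] = cost.
// It is built once at package level rather than on every call.
var reserveScript = redis.NewScript(`
    local cost = tonumber(ARGV[1])

    -- A non-positive cost would turn DECRBY into a credit, so reject it.
    if not cost or cost <= 0 then
        return 0
    end

    local org = tonumber(redis.call("GET", KEYS[1]) or "0")
    local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
    local user = tonumber(redis.call("GET", KEYS[3]) or "0")

    if org >= cost and proj >= cost and user >= cost then
        redis.call("DECRBY", KEYS[1], cost)
        redis.call("DECRBY", KEYS[2], cost)
        redis.call("DECRBY", KEYS[3], cost)
        return 1
    else
        return 0
    end
`)

// CheckAndReserve reports whether the request was approved. On approval the
// cost has already been deducted from the org, project, and user quotas.
func (q *QuotaService) CheckAndReserve(
    ctx context.Context,
    orgID, projectID, userID string,
    cost int64,
) (bool, error) {
    keys := []string{
        "quota:org:" + orgID,
        "quota:project:" + projectID,
        "quota:user:" + userID,
    }

    result, err := reserveScript.Run(ctx, q.redis, keys, cost).Int()
    if err != nil {
        return false, err
    }

    return result == 1, nil
}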

Reconcile Usage

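// refundScript returns credits to all three levels in one atomic step.
// KEYS = org, project, user quota keys; ARGV[1] = refund amount.
var refundScript = redis.NewScript(`
    local refund = tonumber(ARGV[1])

    redis.call("INCRBY", KEYS[1], refund)
    redis.call("INCRBY", KEYS[2], refund)
    redis.call("INCRBY", KEYS[3], refund)

    return 1
`)

// Refund gives back the unused part of a reservation (reserved minus actual).
func (q *QuotaService) Refund(
    ctx context.Context,
    orgID, projectID, userID string,
    refund int64,
) error {
    // Nothing to return, and a negative amount would act as a charge.
    if refund <= 0 {
        return nil
    }

    keys := []string{
        "quota:org:" + orgID,
        "quota:project:" + projectID,
        "quota:user:" + userID,
    }

    return refundScript.Run(ctx, q.redis, keys, refund).Err()
}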
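The Refund method handles only the case where the estimate was too high. Deciding which branch of reconciliation applies is plain arithmetic, sketched below; `Settlement` and `Settle` are hypothetical names, and what to do with an overage (charge it or flag it) is left to the caller.

```go
package main

// Settlement describes what reconciliation should do for one request.
type Settlement struct {
	Refund int64 // credits to return when the estimate was too high
	Extra  int64 // credits still owed when the estimate was too low
}

// Settle compares the reserved amount with actual usage.
// At most one of Refund and Extra is non-zero.
func Settle(reserved, actual int64) Settlement {
	switch {
	case actual < reserved:
		return Settlement{Refund: reserved - actual}
	case actual > reserved:
		return Settlement{Extra: actual - reserved}
	default:
		return Settlement{}
	}
}
```

Using the earlier figures, `Settle(120, 100)` yields a refund of 20 credits, which is the amount that would be passed to Refund.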

Persistent Storage Design

Redis is fast, but it is not durable enough on its own. The system still needs a database.

---
config:
  layout: elk
---
erDiagram
    QUOTA_ALLOCATIONS {
        string entity_id
        string entity_type
        int total_credits
        int used_credits
        timestamp updated_at
    }
    USAGE_LOG {
        string id
        string org_id
        string project_id
        string user_id
        int cost
        string model
        timestamp created_at
    }

This layer ensures:

  • auditability
  • billing reconciliation
  • historical insights
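On the Go side, the two entities could be mirrored by plain structs. This is a sketch under assumptions: the field names follow the diagram, but the `Remaining` and `Apply` helpers, and the idea of seeding a Redis key from `Remaining`, are illustrative additions rather than part of the design above.

```go
package main

import "time"

// QuotaAllocation mirrors the QUOTA_ALLOCATIONS entity.
type QuotaAllocation struct {
	EntityID     string
	EntityType   string // assumed values: "org", "project", or "user"
	TotalCredits int64
	UsedCredits  int64
	UpdatedAt    time.Time
}

// UsageLog mirrors the USAGE_LOG entity: one row per completed request.
type UsageLog struct {
	ID, OrgID, ProjectID, UserID string
	Cost                         int64
	Model                        string
	CreatedAt                    time.Time
}

// Remaining is the balance that could be loaded into the matching Redis key
// when the real-time layer is rebuilt. It never goes below zero.
func (a *QuotaAllocation) Remaining() int64 {
	if a.UsedCredits >= a.TotalCredits {
		return 0
	}
	return a.TotalCredits - a.UsedCredits
}

// Apply records a finished request against the allocation.
func (a *QuotaAllocation) Apply(u UsageLog) {
	a.UsedCredits += u.Cost
	a.UpdatedAt = time.Now()
}
```

Keeping `used_credits` in the database rather than a remaining balance means the source of truth only ever grows, which makes it straightforward to cross-check against the sum of usage log rows.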

Failure Handling Philosophy

Real systems fail. The quota service must fail safely.

If Redis is unavailable

Two strategies exist:

  • fail closed → reject all requests (safe for cost)
  • fail open with limits → allow small usage buffer

The right choice depends on your business tolerance.
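Both strategies can be expressed with one small component: a local allowance that is consulted only while Redis is unreachable. The sketch below is an assumption about how that might look; `FallbackBudget` and its methods are hypothetical, and the allowance is per process, so the total exposure is the limit multiplied by the number of instances.

```go
package main

import "sync"

// FallbackBudget is a hypothetical per-process allowance used only while
// Redis is unreachable. A limit of zero behaves as fail-closed.
type FallbackBudget struct {
	mu    sync.Mutex
	limit int64 // maximum credits to approve blind during one outage
	spent int64
}

func NewFallbackBudget(limit int64) *FallbackBudget {
	return &FallbackBudget{limit: limit}
}

// Allow approves cost only while the outage allowance has room.
func (f *FallbackBudget) Allow(cost int64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if cost <= 0 || f.spent+cost > f.limit {
		return false
	}
	f.spent += cost
	return true
}

// Drain returns the credits approved blind and resets the counter, so they
// can be charged against the real quotas once Redis is reachable again.
func (f *FallbackBudget) Drain() int64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	spent := f.spent
	f.spent = 0
	return spent
}
```

Setting the limit to zero gives fail-closed behavior, while a small positive limit bounds the worst-case overspend during an outage.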


Final Insight

The most important shift in thinking is this:

Quota enforcement is not about limits—it’s about control under uncertainty

You never know exact cost upfront. You never control concurrency. And yet, the system must behave as if everything is predictable.

That’s why:

  • estimation comes before execution
  • reservation comes before approval
  • reconciliation comes after reality

When these three steps work together, the quota system becomes not just correct but trustworthy.