Quota Service & Real-Time Enforcement for AI Usage

Designing a quota system for AI workloads is less about counting credits and more about making instant, correct decisions under pressure. Every request that hits your system carries cost implications, and the quota service becomes the gatekeeper that decides—within milliseconds—whether that cost is allowed.

This document walks through a production-grade design of a hierarchical quota service with real-time enforcement, explaining not just how it works, but why each piece exists.

Why a Quota Service Exists

In AI systems, cost is not linear or predictable:

A single request may consume wildly different resources
Some models report usage only after execution
Latency matters—users expect immediate responses

Because of this, quota enforcement must behave like:

a payment authorization system, not a reporting system

It must:

decide instantly
prevent overspending
remain consistent under concurrency

Mental Model: Quotas as Nested Budgets

The system operates on three levels:

Organization Quota -> Project Quota -> User Quota

Think of quotas as nested budgets:

Organization (100,000 credits)
 ├── Project A (60,000 credits)
 │    ├── User 1 (10,000 credits)
 │    └── User 2 (20,000 credits)
 └── Project B (40,000 credits)
      └── User 3 (15,000 credits)

Key rule

A request is allowed only if ALL levels have enough quota

So every request must pass:

user_remaining >= cost
project_remaining >= cost
org_remaining >= cost

Each level holds a portion of credits.

The reasoning is simple:

Organizations control total spend
Projects control distribution
Users prevent abuse

A request is only valid if:

every layer can afford it
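The all-levels rule can be sketched as a small value type. This is only an illustration: `Budget`, `CanAfford`, and `Blocking` are hypothetical names, not part of the service described here.

```go
package main

// Budget holds the remaining credits at each level of the hierarchy.
// The type and its fields are illustrative assumptions.
type Budget struct {
	OrgRemaining     int64
	ProjectRemaining int64
	UserRemaining    int64
}

// CanAfford reports whether every level can cover the cost.
func (b Budget) CanAfford(cost int64) bool {
	return b.UserRemaining >= cost &&
		b.ProjectRemaining >= cost &&
		b.OrgRemaining >= cost
}

// Blocking lists the levels that cannot cover the cost, which helps
// when explaining a rejection to the caller.
func (b Budget) Blocking(cost int64) []string {
	var levels []string
	if b.OrgRemaining < cost {
		levels = append(levels, "org")
	}
	if b.ProjectRemaining < cost {
		levels = append(levels, "project")
	}
	if b.UserRemaining < cost {
		levels = append(levels, "user")
	}
	return levels
}
```

With the tree above, User 1 (10,000 credits) can afford a 10,000-credit request but not a 10,001-credit one, even though Project A and the organization still have room.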

System Architecture

The system is split into three distinct layers, each with a clear responsibility.

---
config:
  layout: elk
---
flowchart LR
    Client --> API[Quota Service API]
    API --> Redis[(Redis - Real-time)]
    API --> DB[(PostgreSQL - Source of Truth)]
    API --> Model[AI Model]
    Model --> API
    API --> Redis
    API --> DB

Real-Time Layer (Redis)

Handles:

quota checks
atomic deductions
concurrency safety

Control Layer (Quota Service)

Handles:

cost estimation
orchestration
reconciliation

Persistence Layer (PostgreSQL)

Handles:

audit logs
billing alignment
historical tracking

Real-Time Enforcement Flow

When a request arrives, the system moves quickly:

---
config:
  layout: elk
---
sequenceDiagram
    participant Client
    participant API
    participant Redis
    participant Model
    Client->>API: Request (prompt, params)
    API->>API: Estimate cost
    API->>Redis: Check + Reserve (atomic)
    Redis-->>API: Allowed / Rejected
    alt Allowed
        API->>Model: Execute request
        Model-->>API: Response + actual usage
        API->>Redis: Reconcile
    else Rejected
        API-->>Client: Reject request
    end
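The flow can be sketched in Go as one orchestration function over two small interfaces. Everything here is an assumption made for illustration: `Reserver`, `Model`, `Handle`, and the in-memory stand-ins are hypothetical, and releasing the full reservation when the model call fails is one possible policy, not something the flow itself prescribes.

```go
package main

import "errors"

// ErrQuotaExceeded is returned when the reservation is rejected.
var ErrQuotaExceeded = errors.New("quota exceeded")

// Reserver abstracts the real-time layer (hypothetical interface).
type Reserver interface {
	Reserve(cost int64) (bool, error)
	// Adjust returns credits when delta is positive and charges more when negative.
	Adjust(delta int64) error
}

// Model abstracts the AI call; actual is the usage in credits.
type Model interface {
	Execute(prompt string) (response string, actual int64, err error)
}

// Handle runs estimate -> reserve -> execute -> reconcile for one request.
func Handle(r Reserver, m Model, estimate func(string) int64, prompt string) (string, error) {
	reserved := estimate(prompt)

	ok, err := r.Reserve(reserved)
	if err != nil {
		return "", err
	}
	if !ok {
		return "", ErrQuotaExceeded
	}

	resp, actual, err := m.Execute(prompt)
	if err != nil {
		// Assumption: a failed call consumed nothing, so release the whole
		// reservation. Best effort; a real service would log or retry a failure.
		_ = r.Adjust(reserved)
		return "", err
	}

	// Reconcile: refund the surplus, or charge the shortfall.
	if err := r.Adjust(reserved - actual); err != nil {
		return resp, err
	}
	return resp, nil
}

// memReserver is an in-memory stand-in for the Redis layer, for demonstration only.
type memReserver struct{ remaining int64 }

func (m *memReserver) Reserve(cost int64) (bool, error) {
	if m.remaining < cost {
		return false, nil
	}
	m.remaining -= cost
	return true, nil
}

func (m *memReserver) Adjust(delta int64) error {
	m.remaining += delta
	return nil
}

// fixedModel reports the same actual usage for every call.
type fixedModel struct{ actual int64 }

func (f fixedModel) Execute(prompt string) (string, int64, error) {
	return "ok", f.actual, nil
}
```

Starting from 150 credits with an estimate of 120 and actual usage of 100, one call leaves 50 credits; a second call is rejected because 50 cannot cover the 120-credit reservation.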

Cost Estimation: The Hidden Backbone

Before enforcement, the system must estimate cost.

This is unavoidable because:

many AI providers report usage only after execution
waiting would break real-time enforcement

So the system uses:

predict first, reconcile later

Example:

estimate: 120 credits
actual: 100 credits

The difference is corrected after execution.

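To stay safe, estimates should err slightly high: an underestimate lets through a request the quota cannot actually cover, while an overestimate only holds a few extra credits until reconciliation refunds them.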
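A minimal estimator sketch follows. The heuristic (prompt length divided by a characters-per-token ratio, plus the full output cap) and every rate in it are assumptions for illustration, not provider figures; `Estimator` and its fields are hypothetical names.

```go
package main

// Estimator turns a request into a credit estimate before execution.
// All rates are illustrative assumptions.
type Estimator struct {
	CharsPerToken      int64 // rough prompt-size heuristic, must be > 0
	CreditsPer1KTokens int64 // price of 1,000 tokens in credits
	SafetyMarginPct    int64 // extra percentage so estimates err high
}

// Estimate assumes the model may use the full maxOutputTokens budget.
// len(prompt) counts bytes, which is a further approximation.
func (e Estimator) Estimate(prompt string, maxOutputTokens int64) int64 {
	promptTokens := ceilDiv(int64(len(prompt)), e.CharsPerToken)
	tokens := promptTokens + maxOutputTokens
	base := ceilDiv(tokens*e.CreditsPer1KTokens, 1000)
	return ceilDiv(base*(100+e.SafetyMarginPct), 100)
}

// ceilDiv divides non-negative a by positive b, rounding up so that
// partial units are always charged.
func ceilDiv(a, b int64) int64 {
	return (a + b - 1) / b
}
```

With 4 characters per token, 10 credits per 1,000 tokens, and a 20% margin, a 400-character prompt with a 900-token output cap comes to 1,000 tokens, a base of 10 credits, and an estimate of 12.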

Redis Data Model

The real-time system stores only what it needs to decide quickly:

quota:org:{id}     -> remaining credits
quota:project:{id} -> remaining credits
quota:user:{id}    -> remaining credits

This structure is intentionally simple. Complexity belongs elsewhere.
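Since every script call needs the same three keys in the same order, it may help to build them in one place. `QuotaKeys` is a hypothetical helper, shown only as a sketch of the key scheme above.

```go
package main

// QuotaKeys returns the Redis keys for one request, ordered org, project,
// user so they line up with KEYS[1], KEYS[2], and KEYS[3] in the Lua scripts.
func QuotaKeys(orgID, projectID, userID string) []string {
	return []string{
		"quota:org:" + orgID,
		"quota:project:" + projectID,
		"quota:user:" + userID,
	}
}
```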

Atomic Enforcement with Lua

Concurrency is the biggest risk. Two requests arriving at the same time must not overspend shared quota.

This is solved using a Lua script, executed atomically inside Redis.

Lua Script: Check and Reserve

-- KEYS:
-- 1 = org quota key
-- 2 = project quota key
-- 3 = user quota key

-- ARGV:
-- 1 = cost

local cost = tonumber(ARGV[1])

local org = tonumber(redis.call("GET", KEYS[1]) or "0")
local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
local user = tonumber(redis.call("GET", KEYS[3]) or "0")

if org >= cost and proj >= cost and user >= cost then
    redis.call("DECRBY", KEYS[1], cost)
    redis.call("DECRBY", KEYS[2], cost)
    redis.call("DECRBY", KEYS[3], cost)

    return {1, org - cost, proj - cost, user - cost}
else
    return {0, org, proj, user}
end

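Because Redis executes a script as a single atomic operation, no other command can run between the check and the deduction. This guarantees: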

no race conditions
no double spending
consistent enforcement across hierarchy
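The effect of atomic check-and-reserve can be demonstrated without Redis. The sketch below is an in-process analogue only: a `sync.Mutex` stands in for Redis's script atomicity, and `counters` and `simulate` are hypothetical names.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// counters is an in-process analogue of the three Redis keys.
type counters struct {
	mu                 sync.Mutex
	org, project, user int64
}

// checkAndReserve mirrors the Lua script: check every level,
// then deduct from all of them or from none.
func (c *counters) checkAndReserve(cost int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.org < cost || c.project < cost || c.user < cost {
		return false
	}
	c.org -= cost
	c.project -= cost
	c.user -= cost
	return true
}

// simulate fires n concurrent reservations and returns how many were allowed.
func simulate(c *counters, n int, cost int64) int {
	var wg sync.WaitGroup
	var allowed int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if c.checkAndReserve(cost) {
				atomic.AddInt64(&allowed, 1)
			}
		}()
	}
	wg.Wait()
	return int(allowed)
}
```

With 1,000 org, 500 project, and 800 user credits, 100 concurrent requests of 10 credits each result in exactly 50 approvals: the project level runs out first, and no level ever goes negative.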

Reconciliation After Execution

Once the AI model finishes, actual usage is known.

---
config:
  layout: elk
---
flowchart TD
    A[Reserved Credits] --> B{Compare}
    B -->|Actual < Reserved| C[Refund Difference]
    B -->|Actual > Reserved| D[Charge Extra or Flag]

Lua Script: Refund

-- KEYS:
-- 1 = org
-- 2 = project
-- 3 = user

-- ARGV:
-- 1 = refund amount

local refund = tonumber(ARGV[1])

redis.call("INCRBY", KEYS[1], refund)
redis.call("INCRBY", KEYS[2], refund)
redis.call("INCRBY", KEYS[3], refund)

return 1

Refunding the unused part of each reservation keeps the Redis counters in line with actual usage, so overestimates never consume quota permanently.
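Deciding whether a finished request needs a refund or an extra charge can be sketched as a small pure function. `Settlement` and `Settle` are hypothetical names, and what to do with an overage (charge it or flag it) is left to the caller.

```go
package main

// Settlement describes how a reservation should be corrected.
type Settlement struct {
	Refund  int64 // credits to return when actual < reserved
	Overage int64 // credits still owed when actual > reserved
}

// Settle compares the reserved estimate with the actual usage.
func Settle(reserved, actual int64) Settlement {
	switch {
	case actual < reserved:
		return Settlement{Refund: reserved - actual}
	case actual > reserved:
		return Settlement{Overage: actual - reserved}
	default:
		return Settlement{}
	}
}
```

For the earlier example (estimate 120, actual 100), `Settle(120, 100)` yields a refund of 20, which is the amount to pass to the refund script.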

Go Implementation

Below is a simplified production-style implementation.

Quota Service Structure

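// Assumes the go-redis client: import "github.com/redis/go-redis/v9"
// QuotaService wraps the Redis client used for real-time enforcement.
type QuotaService struct {
    redis *redis.Client
}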

Check and Reserve

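// reserveScript checks all three levels and deducts the cost from each
// only if every level can cover it. Redis runs the script atomically.
// It is defined once at package level instead of being rebuilt per call.
var reserveScript = redis.NewScript(`
    local cost = tonumber(ARGV[1])

    -- A missing key counts as zero remaining credits.
    local org = tonumber(redis.call("GET", KEYS[1]) or "0")
    local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
    local user = tonumber(redis.call("GET", KEYS[3]) or "0")

    if org >= cost and proj >= cost and user >= cost then
        redis.call("DECRBY", KEYS[1], cost)
        redis.call("DECRBY", KEYS[2], cost)
        redis.call("DECRBY", KEYS[3], cost)
        return 1
    else
        return 0
    end
`)

// CheckAndReserve reports whether the request is allowed and, if so,
// reserves cost credits at the org, project, and user levels.
// cost must be positive; a negative value would add credits.
func (q *QuotaService) CheckAndReserve(
    ctx context.Context,
    orgID, projectID, userID string,
    cost int64,
) (bool, error) {
    // Key order must match KEYS[1..3] in the script.
    keys := []string{
        "quota:org:" + orgID,
        "quota:project:" + projectID,
        "quota:user:" + userID,
    }

    // The script returns 1 when the reservation succeeded, 0 when rejected.
    result, err := reserveScript.Run(ctx, q.redis, keys, cost).Int()
    if err != nil {
        return false, err
    }

    return result == 1, nil
}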

Reconcile Usage

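// refundScript returns credits to all three levels in one atomic step.
var refundScript = redis.NewScript(`
    local refund = tonumber(ARGV[1])

    redis.call("INCRBY", KEYS[1], refund)
    redis.call("INCRBY", KEYS[2], refund)
    redis.call("INCRBY", KEYS[3], refund)

    return 1
`)

// Refund gives back the unused part of a reservation (reserved minus actual)
// at the org, project, and user levels. refund must be positive.
func (q *QuotaService) Refund(
    ctx context.Context,
    orgID, projectID, userID string,
    refund int64,
) error {
    // Key order must match KEYS[1..3] in the script.
    keys := []string{
        "quota:org:" + orgID,
        "quota:project:" + projectID,
        "quota:user:" + userID,
    }

    return refundScript.Run(ctx, q.redis, keys, refund).Err()
}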

Persistent Storage Design

Redis is fast, but it is not durable enough to be the only store. The system still needs a database.

---
config:
  layout: elk
---
erDiagram
    QUOTA_ALLOCATIONS {
        string entity_id
        string entity_type
        int total_credits
        int used_credits
        timestamp updated_at
    }
    USAGE_LOG {
        string id
        string org_id
        string project_id
        string user_id
        int cost
        string model
        timestamp created_at
    }

This layer ensures:

auditability
billing reconciliation
historical insights
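The two tables map naturally onto Go structs. The sketch below mirrors the columns in the diagram; the `entity_type` values, the `Remaining` and `Apply` methods, and the idea of reseeding a Redis counter from `Remaining` are assumptions added for illustration.

```go
package main

import "time"

// QuotaAllocation mirrors a QUOTA_ALLOCATIONS row.
type QuotaAllocation struct {
	EntityID     string
	EntityType   string // assumed values: "org", "project", "user"
	TotalCredits int64
	UsedCredits  int64
	UpdatedAt    time.Time
}

// UsageLog mirrors a USAGE_LOG row: one settled request.
type UsageLog struct {
	ID        string
	OrgID     string
	ProjectID string
	UserID    string
	Cost      int64
	Model     string
	CreatedAt time.Time
}

// Remaining is the balance implied by the durable record. Under the
// assumption that PostgreSQL is authoritative, this is the value a
// Redis counter would be reseeded with after data loss.
func (a QuotaAllocation) Remaining() int64 {
	return a.TotalCredits - a.UsedCredits
}

// Apply records one settled request against the allocation.
func (a *QuotaAllocation) Apply(entry UsageLog) {
	a.UsedCredits += entry.Cost
	a.UpdatedAt = entry.CreatedAt
}
```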

Failure Handling Philosophy

Real systems fail. The quota service must fail safely.

If Redis is unavailable

Two strategies exist:

fail closed → reject all requests (safe for cost)
fail open with limits → allow small usage buffer

The right choice depends on your business tolerance.
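Both strategies can be sketched as one local fallback that a service instance consults only while Redis is unreachable. `FailMode`, `Fallback`, and the per-instance `Buffer` are hypothetical; how large the buffer should be, and how spend granted while degraded is later reconciled, are left open here.

```go
package main

import "sync"

// FailMode selects the behavior when the real-time layer is unreachable.
type FailMode int

const (
	FailClosed        FailMode = iota // reject everything
	FailOpenWithLimit                 // allow usage up to a small local buffer
)

// Fallback decides requests locally while Redis is down.
type Fallback struct {
	mu     sync.Mutex
	Mode   FailMode
	Buffer int64 // credits this instance may still grant while degraded
}

// Allow reports whether a request of the given cost may proceed.
// In fail-open mode it draws the cost down from the local buffer.
func (f *Fallback) Allow(cost int64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.Mode == FailClosed || cost > f.Buffer {
		return false
	}
	f.Buffer -= cost
	return true
}
```

A buffer of 100 credits admits a 60-credit request, rejects a second 60-credit request, and still admits a 40-credit one, so the worst-case unmetered spend per instance is bounded by the buffer.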

Final Insight

The most important shift in thinking is this:

Quota enforcement is not about limits—it’s about control under uncertainty

You never know exact cost upfront. You never control concurrency. And yet, the system must behave as if everything is predictable.

That’s why:

estimation comes before execution
reservation comes before approval
reconciliation comes after reality

When these three steps work together, the quota system becomes not just correct but trustworthy.