Quota Service & Real-Time Enforcement for AI Usage¶
Designing a quota system for AI workloads is less about counting credits and more about making instant, correct decisions under pressure. Every request that hits your system carries cost implications, and the quota service becomes the gatekeeper that decides—within milliseconds—whether that cost is allowed.
This document walks through a production-grade design of a hierarchical quota service with real-time enforcement, explaining not just how it works, but why each piece exists.
Why a Quota Service Exists¶
In AI systems, cost is not linear or predictable:
- A single request may consume wildly different resources
- Some models report usage only after execution
- Latency matters—users expect immediate responses
Because of this, quota enforcement must behave like:
a payment authorization system, not a reporting system
It must:
- decide instantly
- prevent overspending
- remain consistent under concurrency
Mental Model: Quotas as Nested Budgets¶
The system operates on three levels:
Organization Quota -> Project Quota -> User Quota
Think of quotas as nested budgets:
Organization (100,000 credits)
├── Project A (60,000 credits)
│   ├── User 1 (10,000 credits)
│   └── User 2 (20,000 credits)
└── Project B (40,000 credits)
    └── User 3 (15,000 credits)
Key rule
A request is allowed only if ALL levels have enough quota
So every request must pass three checks, one per level: user, project, and organization.
Each level holds a portion of credits.
The reasoning is simple:
- Organizations control total spend
- Projects control distribution
- Users prevent abuse
A request is only valid if:
every layer can afford it
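The "every layer can afford it" rule can be sketched in a few lines of Go. This is a minimal in-memory illustration, not the enforcement path itself; the `Budgets` type and `CanAfford` method are names assumed for this example, and the numbers come from User 1 in the tree above.

```go
package main

import "fmt"

// Budgets holds the remaining credits at each level of the hierarchy.
// The type and field names are illustrative assumptions.
type Budgets struct {
	Org, Project, User int64
}

// CanAfford reports whether every level has at least cost credits left.
func (b Budgets) CanAfford(cost int64) bool {
	return b.Org >= cost && b.Project >= cost && b.User >= cost
}

func main() {
	// User 1 in Project A.
	b := Budgets{Org: 100000, Project: 60000, User: 10000}
	fmt.Println(b.CanAfford(5000))  // true: all three levels can pay
	fmt.Println(b.CanAfford(15000)) // false: the user budget is too small
}
```

The tightest level decides the outcome: the organization and project could both pay 15,000 credits, but the request is still rejected because the user cannot.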
System Architecture¶
The system is split into three distinct layers, each with a clear responsibility.
Real-Time Layer (Redis)¶
Handles:
- quota checks
- atomic deductions
- concurrency safety
Control Layer (Quota Service)¶
Handles:
- cost estimation
- orchestration
- reconciliation
Persistence Layer (PostgreSQL)¶
Handles:
- audit logs
- billing alignment
- historical tracking
Real-Time Enforcement Flow¶
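When a request arrives, the system moves quickly: it estimates the cost, atomically reserves that amount at every level, runs the model, and then reconciles the reservation against actual usage.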
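The flow can be sketched end to end with an in-memory stand-in for the quota store. Everything named here (`Reserver`, `memQuota`, `Handle`) is hypothetical and exists only to show the order of operations: reserve the estimate, execute, then reconcile. It tracks a single balance for brevity, whereas the real service checks three levels.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQuotaExceeded is returned when the reservation is refused.
var ErrQuotaExceeded = errors.New("quota exceeded")

// Reserver is a hypothetical interface standing in for the Redis-backed
// quota service; the method names are assumptions made for this sketch.
type Reserver interface {
	Reserve(cost int64) bool
	Refund(amount int64)
}

// memQuota is a single-level, in-memory stand-in used only for the demo.
type memQuota struct{ remaining int64 }

func (m *memQuota) Reserve(cost int64) bool {
	if m.remaining < cost {
		return false
	}
	m.remaining -= cost
	return true
}

func (m *memQuota) Refund(amount int64) { m.remaining += amount }

// Handle runs one request: reserve the estimate, execute, then reconcile.
// run returns the actual cost reported after execution.
func Handle(q Reserver, estimate int64, run func() int64) error {
	if !q.Reserve(estimate) {
		return ErrQuotaExceeded // rejected before any model cost is incurred
	}
	actual := run()
	// A positive difference returns unused credits; a negative one
	// charges the extra usage that the estimate missed.
	q.Refund(estimate - actual)
	return nil
}

func main() {
	q := &memQuota{remaining: 150}
	err := Handle(q, 120, func() int64 { return 100 })
	fmt.Println(err, q.remaining) // <nil> 50
	err = Handle(q, 120, func() int64 { return 100 })
	fmt.Println(err, q.remaining) // quota exceeded 50
}
```

The second request is refused even though its actual cost (100) is higher than the 50 credits left only by a modest amount: the decision is made on the estimate, before execution, which is what keeps the system from overspending.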
Cost Estimation: The Hidden Backbone¶
Before enforcement, the system must estimate cost.
This is unavoidable because:
- many AI providers report usage only after execution
- waiting would break real-time enforcement
So the system uses:
predict first, reconcile later
Example:
- estimate: 120 credits
- actual: 100 credits
The difference is corrected after execution.
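To stay safe, estimates should err slightly high: an underestimate lets a request spend more than it reserved, while an overestimate is simply refunded after execution.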
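One simple way to bias estimates upward is a percentage safety margin. The sketch below is an assumption for illustration, not the method the system necessarily uses: the function names, the percent-based margin, and the base value of 100 are all hypothetical, chosen so the result matches the 120 / 100 example above.

```go
package main

import "fmt"

// EstimateCost pads a base estimate by a safety margin given in percent,
// rounding up so the reservation never falls below the padded value.
// The name and the percent-based margin are assumptions.
func EstimateCost(base, marginPercent int64) int64 {
	return (base*(100+marginPercent) + 99) / 100
}

// RefundAmount is the difference handed back after execution.
func RefundAmount(estimate, actual int64) int64 {
	return estimate - actual
}

func main() {
	est := EstimateCost(100, 20)
	fmt.Println(est)                    // 120
	fmt.Println(RefundAmount(est, 100)) // 20
}
```

A larger margin reduces the chance of an underestimate but temporarily locks up more credits, so users close to their limit are rejected more often.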
Redis Data Model¶
The real-time system stores only what it needs to decide quickly:
quota:org:{id} -> remaining credits
quota:project:{id} -> remaining credits
quota:user:{id} -> remaining credits
This structure is intentionally simple. Complexity belongs elsewhere.
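Centralizing the key format in small helpers keeps every caller aligned with the scheme above. The helper names are assumptions; only the `quota:org:`, `quota:project:`, and `quota:user:` prefixes come from the data model.

```go
package main

import "fmt"

// Key builders that mirror the naming scheme of the data model.
func OrgKey(id string) string     { return "quota:org:" + id }
func ProjectKey(id string) string { return "quota:project:" + id }
func UserKey(id string) string    { return "quota:user:" + id }

// Keys returns the three keys in the order org, project, user.
func Keys(orgID, projectID, userID string) []string {
	return []string{OrgKey(orgID), ProjectKey(projectID), UserKey(userID)}
}

func main() {
	fmt.Println(Keys("acme", "chatbot", "u42"))
	// [quota:org:acme quota:project:chatbot quota:user:u42]
}
```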
Atomic Enforcement with Lua¶
Concurrency is the biggest risk. Two requests arriving at the same time must not overspend shared quota.
This is solved using a Lua script, executed atomically inside Redis.
Lua Script: Check and Reserve¶
-- KEYS:
--   1 = org quota key
--   2 = project quota key
--   3 = user quota key
-- ARGV:
--   1 = cost
local cost = tonumber(ARGV[1])
local org = tonumber(redis.call("GET", KEYS[1]) or "0")
local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
local user = tonumber(redis.call("GET", KEYS[3]) or "0")
if org >= cost and proj >= cost and user >= cost then
    redis.call("DECRBY", KEYS[1], cost)
    redis.call("DECRBY", KEYS[2], cost)
    redis.call("DECRBY", KEYS[3], cost)
    return {1, org - cost, proj - cost, user - cost}
else
    return {0, org, proj, user}
end
This script guarantees:
- no race conditions
- no double spending
- consistent enforcement across hierarchy
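To see why atomicity matters, the same check-then-deduct logic can be simulated in Go without Redis. In this sketch a mutex plays the role of Redis's atomic script execution; the `Store` type is a hypothetical stand-in, and the balances are made-up numbers.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for Redis. The mutex plays the role of
// atomic script execution: the check and the deductions happen as one
// indivisible step.
type Store struct {
	mu      sync.Mutex
	credits map[string]int64
}

// CheckAndReserve deducts cost from every key, or from none of them.
func (s *Store) CheckAndReserve(keys []string, cost int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, k := range keys {
		if s.credits[k] < cost {
			return false
		}
	}
	for _, k := range keys {
		s.credits[k] -= cost
	}
	return true
}

func main() {
	s := &Store{credits: map[string]int64{
		"quota:org:1": 1000, "quota:project:1": 500, "quota:user:1": 300,
	}}
	keys := []string{"quota:org:1", "quota:project:1", "quota:user:1"}

	var wg sync.WaitGroup
	var mu sync.Mutex
	approved := 0
	for i := 0; i < 10; i++ { // ten concurrent requests of 100 credits each
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.CheckAndReserve(keys, 100) {
				mu.Lock()
				approved++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println(approved, s.credits["quota:user:1"]) // 3 0
}
```

Exactly three requests are approved regardless of scheduling, because the user level holds 300 credits. Without the lock, two goroutines could both read a sufficient balance before either deducts, which is the overspend the Lua script prevents.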
Reconciliation After Execution¶
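Once the AI model finishes, actual usage is known. The service then returns the difference between the reserved estimate and the actual cost to all three levels.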
Lua Script: Refund¶
-- KEYS:
-- 1 = org
-- 2 = project
-- 3 = user
-- ARGV:
-- 1 = refund amount
local refund = tonumber(ARGV[1])
redis.call("INCRBY", KEYS[1], refund)
redis.call("INCRBY", KEYS[2], refund)
redis.call("INCRBY", KEYS[3], refund)
return 1
Refunding the same amount at every level keeps all three balances in line with what was actually spent.
Go Implementation¶
Below is a simplified production-style implementation.
Quota Service Structure¶
The methods below assume a QuotaService struct whose redis field holds a go-redis client.
Check and Reserve¶
// Assumes "context" and a go-redis client package imported as redis.

// CheckAndReserve atomically deducts cost from the org, project, and user
// quotas, or from none of them. It reports whether the request is allowed.
func (q *QuotaService) CheckAndReserve(
	ctx context.Context,
	orgID, projectID, userID string,
	cost int64,
) (bool, error) {
	// The script runs atomically inside Redis; a missing key counts as 0.
	script := redis.NewScript(`
		local cost = tonumber(ARGV[1])
		local org = tonumber(redis.call("GET", KEYS[1]) or "0")
		local proj = tonumber(redis.call("GET", KEYS[2]) or "0")
		local user = tonumber(redis.call("GET", KEYS[3]) or "0")
		if org >= cost and proj >= cost and user >= cost then
			redis.call("DECRBY", KEYS[1], cost)
			redis.call("DECRBY", KEYS[2], cost)
			redis.call("DECRBY", KEYS[3], cost)
			return 1
		else
			return 0
		end
	`)

	keys := []string{
		"quota:org:" + orgID,
		"quota:project:" + projectID,
		"quota:user:" + userID,
	}

	result, err := script.Run(ctx, q.redis, keys, cost).Int()
	if err != nil {
		return false, err
	}
	return result == 1, nil
}
Reconcile Usage¶
// Refund adds refund credits back to the org, project, and user quotas in
// one atomic step. A negative refund charges usage beyond the estimate.
func (q *QuotaService) Refund(
	ctx context.Context,
	orgID, projectID, userID string,
	refund int64,
) error {
	script := redis.NewScript(`
		local refund = tonumber(ARGV[1])
		redis.call("INCRBY", KEYS[1], refund)
		redis.call("INCRBY", KEYS[2], refund)
		redis.call("INCRBY", KEYS[3], refund)
		return 1
	`)

	keys := []string{
		"quota:org:" + orgID,
		"quota:project:" + projectID,
		"quota:user:" + userID,
	}

	return script.Run(ctx, q.redis, keys, refund).Err()
}
Persistent Storage Design¶
Redis is fast, but it is not durable enough on its own. The system still needs a database.
This layer ensures:
- auditability
- billing reconciliation
- historical insights
Failure Handling Philosophy¶
Real systems fail. The quota service must fail safely.
If Redis is unavailable¶
Two strategies exist:
- fail closed → reject all requests (safe for cost)
- fail open with limits → allow small usage buffer
The right choice depends on your business tolerance.
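The two strategies can be expressed as a small policy object. This is a sketch under stated assumptions: the `Policy`, `Fallback`, and `Decide` names are hypothetical, the buffer size is arbitrary, and the type is not safe for concurrent use as written (a real service would guard the buffer with a mutex or atomic counter).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStoreDown stands in for any error returned by the Redis client.
var ErrStoreDown = errors.New("redis unavailable")

// Policy selects the behaviour when the real-time store cannot be reached.
type Policy int

const (
	FailClosed      Policy = iota // reject every request
	FailOpenLimited               // allow usage up to a small local buffer
)

// Fallback tracks the emergency buffer. All names here are hypothetical.
type Fallback struct {
	Policy Policy
	Buffer int64 // credits that may be spent while the store is down
}

// Decide returns the store's own answer when there was no error.
// Otherwise the configured policy decides.
func (f *Fallback) Decide(storeAllowed bool, storeErr error, cost int64) bool {
	if storeErr == nil {
		return storeAllowed
	}
	if f.Policy == FailOpenLimited && f.Buffer >= cost {
		f.Buffer -= cost
		return true
	}
	return false
}

func main() {
	closed := &Fallback{Policy: FailClosed}
	fmt.Println(closed.Decide(false, ErrStoreDown, 10)) // false

	open := &Fallback{Policy: FailOpenLimited, Buffer: 25}
	fmt.Println(open.Decide(false, ErrStoreDown, 10)) // true
	fmt.Println(open.Decide(false, ErrStoreDown, 10)) // true
	fmt.Println(open.Decide(false, ErrStoreDown, 10)) // false: only 5 credits left
}
```

The buffer bounds the worst-case unbilled spend during an outage, which turns "business tolerance" into a concrete number that can be configured.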
Final Insight¶
The most important shift in thinking is this:
Quota enforcement is not about limits—it’s about control under uncertainty
You never know exact cost upfront. You never control concurrency. And yet, the system must behave as if everything is predictable.
That’s why:
- estimation comes before execution
- reservation comes before approval
- reconciliation comes after reality
When these three steps work together, the quota system becomes not just correct but trustworthy.