Quota Management - Current State, Issues, and Direction

What Exists Today

At a high level, the system implements a hierarchical model of quotas. Limits are not flat or global; they are meant to exist at several layers: companies, projects, and users. This gives the system flexibility. A company can define how much total usage is allowed, projects can impose their own constraints, and individual users can be limited within those boundaries.

Quotas are not tied directly to a user in isolation. Instead, they depend on context. A user may belong to several companies at once, and the quota they consume depends on where they are acting. If the project belongs to a company, the user spends that company’s quotas. If it is a personal project, the user spends their own.

This approach is flexible and powerful, but it also means the system must make decisions dynamically for every request.
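The context rule above can be sketched as a small lookup. This is a minimal illustration in Python, not the actual implementation: the Project shape, the owner_company_id field, and the resolve_quota_owner function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Project:
    # Hypothetical shape: a project is either company-owned or personal.
    project_id: str
    owner_company_id: Optional[str] = None  # None means a personal project


def resolve_quota_owner(project: Project, acting_user_id: str) -> Tuple[str, str]:
    """Return (scope, owner_id) for the quota pool a request draws from."""
    if project.owner_company_id is not None:
        # Company project: the acting user spends the company's quotas.
        return ("company", project.owner_company_id)
    # Personal project: the acting user spends their own quotas.
    return ("user", acting_user_id)
```

Because the result depends on the project rather than on the user alone, the same user can resolve to different quota pools from one request to the next.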

The Xavier Admin interface already reflects this vision. There is a dedicated quota section where limits such as credit, storage, and calls are displayed. However, that section is currently a placeholder: the interface is ahead of the logic that should back it.

How Requests Are Processed

On the backend, the system already includes a validation step before any generation happens. When a user initiates a request, it goes through the gateway and then into a service called Policy Vault. This service is responsible for deciding whether the request should be allowed.

The method used here, CanExecuteGeneration, checks several things at once. It verifies whether the user has access to the project and the model, and it also performs a quota check. The response it returns is detailed: it doesn’t just say “allowed” or “denied.” It explains what was checked, what passed, and what failed, along with information about quota usage.

This level of detail is intentional. It lets us see exactly which quota failed and why, instead of a bare true or false.

Even when everything is allowed, the system still returns a full breakdown. This is useful for debugging and, potentially, for user-facing transparency.
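One way to picture such a response is as a list of individual check results with an overall verdict derived from them. The sketch below is an assumption about the shape, not the real CanExecuteGeneration contract: CheckResult, ExecutionDecision, and the check names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckResult:
    name: str         # illustrative, e.g. "project_access" or "quota:credit"
    passed: bool
    detail: str = ""  # human-readable explanation, including quota usage


@dataclass
class ExecutionDecision:
    checks: List[CheckResult] = field(default_factory=list)

    @property
    def allowed(self) -> bool:
        # The request is allowed only if every individual check passed.
        return all(check.passed for check in self.checks)

    def failures(self) -> List[CheckResult]:
        # The failed checks explain why a request was denied.
        return [check for check in self.checks if not check.passed]


decision = ExecutionDecision([
    CheckResult("project_access", True),
    CheckResult("model_access", True),
    CheckResult("quota:credit", False, "needs 5, 2 remaining"),
])
```

Here decision.allowed is False and decision.failures() contains only the credit check, while an allowed request would still carry the full list of passed checks.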

However, there is an important limitation. The system does not actually know how much a request will cost. That means the quota check is not based on real numbers. Instead, it is based on assumptions or placeholders.

The Missing Piece: Cost Awareness

Before you can enforce quotas, you need to understand the cost of a request. And right now, that understanding does not exist.

Different models behave differently. Some depend on tokens, others on image count, others on video duration or quality settings. Without a way to translate these inputs into a consistent cost, the system cannot make meaningful decisions. Without a calculator, we cannot know how much quota is needed.

As a result, the quota system cannot reliably block or allow requests based on real usage. It has the structure of enforcement, but not the data to support it.
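To make the idea of a calculator concrete, here is a minimal sketch that maps different kinds of input to one abstract unit. Everything in it is assumed for illustration: the GenerationRequest fields, the request kinds, and above all the rates, which are made-up numbers and not real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    kind: str                        # "text", "image", or "video" (illustrative)
    tokens: int = 0
    image_count: int = 0
    video_seconds: float = 0.0
    quality_multiplier: float = 1.0  # hypothetical knob for quality settings


# Made-up rates in abstract "credits"; real values are not defined yet.
RATES = {
    "text": 1.0,   # credits per 1000 tokens
    "image": 2.0,  # credits per image
    "video": 5.0,  # credits per second of video
}


def estimate_cost(req: GenerationRequest) -> float:
    """Translate model-specific inputs into a single cost figure."""
    if req.kind == "text":
        base = req.tokens / 1000 * RATES["text"]
    elif req.kind == "image":
        base = req.image_count * RATES["image"]
    elif req.kind == "video":
        base = req.video_seconds * RATES["video"]
    else:
        raise ValueError(f"unknown request kind: {req.kind}")
    return base * req.quality_multiplier
```

The point of the sketch is the single return type: whatever the model consumes, the quota check receives one comparable number.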

What Happens After Execution

With pre-checks this weak, you might expect at least the post-execution side to be working. But that is also incomplete.

After a generation request finishes, there is currently no mechanism that updates quotas. The system does not receive actual usage data from workers or telemetry systems, and it does not subtract anything from the user’s limits.

The intended flow is clear: after generation, actual usage must be sent back and quotas must be updated.

But this step has not been implemented yet. Without it, quotas never change, and the system cannot reflect reality over time.
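The missing step amounts to applying a reported cost to a stored balance. A minimal in-memory sketch, assuming hypothetical QuotaBalance and record_usage names (a real version would also need persistence and protection against concurrent updates):

```python
from dataclasses import dataclass


@dataclass
class QuotaBalance:
    limit: float
    used: float = 0.0

    @property
    def remaining(self) -> float:
        # Never report a negative remainder, even after an overrun.
        return max(self.limit - self.used, 0.0)


def record_usage(balance: QuotaBalance, actual_cost: float) -> QuotaBalance:
    """Apply the actual cost reported after a generation finishes."""
    if actual_cost < 0:
        raise ValueError("actual_cost must be non-negative")
    # Usage is recorded even if it pushes past the limit, so the stored
    # numbers keep matching what was really consumed.
    balance.used += actual_cost
    return balance
```

Recording overruns instead of clamping them is a design choice in this sketch: it keeps the balance honest when the actual cost turns out higher than the estimate.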

Time Does Not Exist in the System

Another subtle but important gap is the absence of lifecycle management. Quotas, once defined, would remain static. There is no concept of monthly resets, rolling limits, or replenishment.

Lifecycle management is essential for any real billing or subscription model. The mechanism for resetting and replenishing quotas does not exist yet, and we need to design it.

Until this exists, quotas cannot behave like real limits.
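As one possible shape for that mechanism, the sketch below implements a simple calendar-month reset. It is only one of the options mentioned above (rolling limits would look different), and PeriodicQuota and roll_over_if_needed are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PeriodicQuota:
    limit: float
    used: float
    period_start: date  # first day of the month the usage belongs to


def roll_over_if_needed(quota: PeriodicQuota, today: date) -> PeriodicQuota:
    """Reset usage when the calendar month has changed."""
    current_period = today.replace(day=1)
    if current_period > quota.period_start:
        # A new month has begun: usage is replenished, the limit is unchanged.
        quota.used = 0.0
        quota.period_start = current_period
    return quota
```

Calling this lazily before each quota check avoids a scheduled job, at the price of stale values for quotas that are never read.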

What a Quota Actually Means

There is also a deeper question that remains unresolved: what exactly is being measured?

Right now, quotas are labeled “credit,” “storage,” and “calls,” but none of these labels is strictly defined. Because of this, the system will need its own internal unit for measuring usage.

Where the System Needs to Go

The path forward is about completing what has already been started.

The most important addition is a cost calculation layer. This component would take a generation request and estimate its cost before execution. It does not need to live inside Policy Vault (it can be a separate service), but Policy Vault must be able to call it.

Once that exists, the system can begin to make real decisions. It can compare estimated cost with available quotas and decide whether to allow a request. It can also return that estimate to the frontend, so users can see what they are about to spend.
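Given an estimate, the decision itself is a comparison against every layer of the hierarchy. A minimal sketch, assuming a hypothetical check_quota_layers helper and a plain mapping from layer name to remaining quota:

```python
from typing import Dict, List, Tuple


def check_quota_layers(
    estimated_cost: float,
    remaining_by_layer: Dict[str, float],
) -> Tuple[bool, List[str]]:
    """Return (allowed, failing_layers) for an estimated cost.

    remaining_by_layer maps a layer name ("company", "project", "user")
    to the quota still available at that layer.
    """
    failing = [
        layer
        for layer, remaining in remaining_by_layer.items()
        if estimated_cost > remaining
    ]
    # The request is allowed only if no layer is short of quota.
    return (not failing, failing)
```

Returning the failing layers alongside the verdict keeps the detailed style of the existing response, and the same estimate can be passed to the frontend unchanged.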

After execution, one more piece must be added: a way to capture actual usage and update quotas immediately. This likely needs to be integrated into the execution pipeline itself, so that quota updates happen as part of the generation flow and not as a delayed process.