Token-based Rate Limiting for Large Language Model APIs¶
This tutorial walks you through configuring token-based rate limiting using Kuadrant's TokenRateLimitPolicy
to protect Large Language Model (LLM) APIs. Unlike traditional request counting, this approach limits API usage based on actual token consumption.
Note: Currently, TokenRateLimitPolicy only supports non-streaming responses (where stream: false is set or stream is omitted in the request). Support for streaming responses is planned for future releases.
Overview¶
Traditional rate limiting counts requests, but LLM APIs have varying computational costs based on token usage. TokenRateLimitPolicy addresses this by:
- Counting actual tokens consumed from LLM responses
- Setting different limits for different user tiers when combined with AuthPolicy
- Integrating seamlessly with authentication policies
Prerequisites¶
- Kubernetes cluster with Kuadrant operator installed. See our Getting Started guide for more information.
- kubectl command line tool.
- Basic understanding of Gateway API.
You should also have an instance of Kuadrant installed:
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
EOF
Environment variables¶
Set the following environment variables used throughout this tutorial:
export KUADRANT_GATEWAY_NS=gateway-system
export KUADRANT_GATEWAY_NAME=trlp-tutorial-gateway
export KUADRANT_SYSTEM_NS=$(kubectl get kuadrant -A -o jsonpath='{.items[0].metadata.namespace}')
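The last variable is read from the cluster, so it ends up empty if no Kuadrant resource exists yet. A minimal sketch of a check you could run before continuing; `require_vars` is a hypothetical helper written for this tutorial, not part of kubectl or Kuadrant.

```shell
# Hypothetical helper: report any of the named variables that are unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars KUADRANT_GATEWAY_NS KUADRANT_GATEWAY_NAME KUADRANT_SYSTEM_NS \
  || echo "Set the variables above before continuing."
```

The function returns a non-zero status when anything is missing, so it can also guard a script with `require_vars ... || exit 1`.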
Step 1: Deploy an LLM service¶
Deploy a simulated LLM service that mimics OpenAI-compatible APIs:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trlp-tutorial-llm-sim
  labels:
    app: trlp-tutorial-llm-sim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trlp-tutorial-llm-sim
  template:
    metadata:
      labels:
        app: trlp-tutorial-llm-sim
    spec:
      containers:
      - name: simulator
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.1.1
        args:
        - --model
        - meta-llama/Llama-3.1-8B-Instruct
        - --port
        - "8000"
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: trlp-tutorial-llm-sim
spec:
  selector:
    app: trlp-tutorial-llm-sim
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
EOF
Step 2: Create a Gateway¶
Create the namespace for the gateway:
kubectl create namespace ${KUADRANT_GATEWAY_NS}
Create a gateway that will accept traffic for the LLM API:
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ${KUADRANT_GATEWAY_NAME}
  namespace: ${KUADRANT_GATEWAY_NS}
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    hostname: "trlp-tutorial.example.com"
    allowedRoutes:
      namespaces:
        from: All
EOF
Check the gateway status:
kubectl get gateway ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}{"\n"}{.status.conditions[?(@.type=="Programmed")].message}{"\n"}'
Step 3: Expose the service via HTTPRoute¶
Create an HTTPRoute to expose the LLM service:
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: trlp-tutorial-llm-sim
spec:
  hostnames:
  - trlp-tutorial.example.com
  parentRefs:
  - name: ${KUADRANT_GATEWAY_NAME}
    namespace: ${KUADRANT_GATEWAY_NS}
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: "/"
    backendRefs:
    - name: trlp-tutorial-llm-sim
      port: 80
EOF
Export the gateway URL for use in requests:
export KUADRANT_INGRESS_HOST=$(kubectl get gtw ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o jsonpath='{.status.addresses[0].value}')
export KUADRANT_INGRESS_PORT=$(kubectl get gtw ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o jsonpath='{.spec.listeners[?(@.name=="http")].port}')
export KUADRANT_GATEWAY_URL=${KUADRANT_INGRESS_HOST}:${KUADRANT_INGRESS_PORT}
Test connectivity to the service:
curl -H 'Host: trlp-tutorial.example.com' http://$KUADRANT_GATEWAY_URL/v1/models -i
# HTTP/1.1 200 OK
Note: If the command above fails to reach the service in your environment, try port-forwarding to the gateway and accessing it over localhost.
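One way to set up that fallback is sketched below. It assumes the Istio gateway class used in this tutorial, where the gateway is typically exposed through a Service named `<gateway-name>-istio`, and it assumes local port 9080 is free; both are assumptions to adjust for your environment. `use_port_forward` is a hypothetical wrapper, so nothing runs until you call it.

```shell
# Sketch: forward a local port to the gateway Service and point the tutorial URL at it.
# Assumptions: the Service is named "<gateway-name>-istio"; the local port is free.
use_port_forward() {
  local_port="${1:-9080}"
  # Run the port-forward in the background so the shell stays usable.
  kubectl port-forward -n "${KUADRANT_GATEWAY_NS}" \
    "service/${KUADRANT_GATEWAY_NAME}-istio" "${local_port}:80" >/dev/null 2>&1 &
  export KUADRANT_GATEWAY_URL="localhost:${local_port}"
}
```

After calling `use_port_forward`, rerun the connectivity test; the later curl commands pick up the new `KUADRANT_GATEWAY_URL` unchanged.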
Step 4: Set up API key authentication¶
Create API keys for different user tiers. This example creates two tiers: "free" and "gold":
# Create a free tier user
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: trlp-tutorial-api-key-free-user-1
  namespace: ${KUADRANT_SYSTEM_NS}
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: my-llm
  annotations:
    kuadrant.io/groups: free
    secret.kuadrant.io/user-id: user-1
stringData:
  api_key: iamafreeuser
type: Opaque
EOF
# Create a gold tier user
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: trlp-tutorial-api-key-gold-user-1
  namespace: ${KUADRANT_SYSTEM_NS}
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: my-llm
  annotations:
    kuadrant.io/groups: gold
    secret.kuadrant.io/user-id: user-2
stringData:
  api_key: iamagolduser
type: Opaque
EOF
Step 5: Configure authentication policy¶
Create an AuthPolicy that validates API keys and extracts user information:
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1
kind: AuthPolicy
metadata:
  name: trlp-tutorial-llm-api-keys
  namespace: ${KUADRANT_GATEWAY_NS}
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: ${KUADRANT_GATEWAY_NAME}
  rules:
    authentication:
      api-key-users:
        apiKey:
          selector:
            matchLabels:
              app: my-llm
        credentials:
          authorizationHeader:
            prefix: APIKEY
    response:
      success:
        filters:
          identity:
            json:
              properties:
                groups:
                  selector: auth.identity.metadata.annotations.kuadrant\.io/groups
                userid:
                  selector: auth.identity.metadata.annotations.secret\.kuadrant\.io/user-id
    authorization:
      allow-groups:
        opa:
          rego: |
            groups := split(object.get(input.auth.identity.metadata.annotations, "kuadrant.io/groups", ""), ",")
            allow { groups[_] == "free" }
            allow { groups[_] == "gold" }
EOF
Step 6: Apply token rate limiting¶
Create a TokenRateLimitPolicy with different token limits for each tier:
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1alpha1
kind: TokenRateLimitPolicy
metadata:
  name: trlp-tutorial-token-limits
  namespace: ${KUADRANT_GATEWAY_NS}
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: ${KUADRANT_GATEWAY_NAME}
  limits:
    free:
      rates:
      - limit: 50 # 50 tokens per minute for free users (small for testing)
        window: 1m
      when:
      - predicate: request.path == "/v1/chat/completions"
      - predicate: |
          auth.identity.groups.split(",").exists(g, g == "free")
      counters:
      - expression: auth.identity.userid
    gold:
      rates:
      - limit: 200 # 200 tokens per minute for gold users (small for testing)
        window: 1m
      when:
      - predicate: request.path == "/v1/chat/completions"
      - predicate: |
          auth.identity.groups.split(",").exists(g, g == "gold")
      counters:
      - expression: auth.identity.userid
EOF
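The group predicates in this policy split the comma-separated groups string and look for an exact match on the tier name. The sketch below mirrors that matching in plain shell, for illustration only: the policy itself evaluates the predicate expressions, the `request.path` predicate is not modeled, and `in_group` is a hypothetical helper.

```shell
# Hypothetical helper mirroring: groups.split(",").exists(g, g == "<tier>")
# Usage: in_group "<comma-separated groups>" "<tier>"
in_group() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Show which of the tutorial's limits would match a few sample group strings.
for groups in "free" "gold" "free,gold" "freemium"; do
  matched=""
  in_group "$groups" free && matched="$matched free(50/min)"
  in_group "$groups" gold && matched="$matched gold(200/min)"
  echo "groups=$groups ->${matched:- no limit matches}"
done
```

Matching whole list elements rather than substrings is what keeps a group such as "freemium" from being treated as "free". A user listed in both groups matches both limits.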
Step 7: Test the configuration¶
Test with a free user¶
Make a chat completion request. Note that stream: false is explicitly set to ensure a non-streaming response:
curl -H 'Host: trlp-tutorial.example.com' \
-H 'Authorization: APIKEY iamafreeuser' \
-H 'Content-Type: application/json' \
-X POST http://$KUADRANT_GATEWAY_URL/v1/chat/completions \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "What is Kubernetes?" }
],
"max_tokens": 100,
"stream": false,
"usage": true
}'
The response includes token usage information in its usage object.
Note: The TokenRateLimitPolicy automatically extracts the total_tokens value from this response and counts it against the user's limit.
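To make the extracted field concrete, the sketch below pulls `total_tokens` out of a sample body. The body is made up for illustration (the values are hypothetical, not output captured from the simulator), and `extract_total_tokens` is a hypothetical helper; a real JSON parser is more robust than this `sed` pattern.

```shell
# Hypothetical helper: print the first total_tokens value found on stdin.
extract_total_tokens() {
  sed -n 's/.*"total_tokens"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Made-up response body in the OpenAI-compatible shape; values are illustrative.
sample_response='{"id":"example","object":"chat.completion","usage":{"prompt_tokens":5,"completion_tokens":20,"total_tokens":25}}'

tokens=$(printf '%s\n' "$sample_response" | extract_total_tokens)
echo "total_tokens=${tokens}"
```

Piping the output of the curl command above into `extract_total_tokens` shows how many tokens a given request was charged.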
Test with a gold user¶
curl -H 'Host: trlp-tutorial.example.com' \
-H 'Authorization: APIKEY iamagolduser' \
-H 'Content-Type: application/json' \
-X POST http://$KUADRANT_GATEWAY_URL/v1/chat/completions \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "Explain cloud native architecture" }
],
"max_tokens": 200,
"stream": false,
"usage": true
}'
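To watch a limit being enforced, repeat a request until the gateway answers with HTTP 429. The sketch below does this for the free-tier key; `send_until_limited` is a hypothetical helper, the default of 10 attempts is arbitrary, and the number of requests needed depends on how many tokens the simulator reports per response.

```shell
# Sketch: repeat the free-tier request and print the HTTP status of each attempt.
# Assumes KUADRANT_GATEWAY_URL is set as earlier in this tutorial.
send_until_limited() {
  attempts="${1:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -o /dev/null discards the body; -w prints only the status code.
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      -H 'Host: trlp-tutorial.example.com' \
      -H 'Authorization: APIKEY iamafreeuser' \
      -H 'Content-Type: application/json' \
      -X POST "http://${KUADRANT_GATEWAY_URL}/v1/chat/completions" \
      -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"What is Kubernetes?"}],"max_tokens":100,"stream":false,"usage":true}')
    echo "attempt ${i}: HTTP ${status}"
    if [ "$status" = "429" ]; then
      return 0   # the limit was reached
    fi
    i=$((i + 1))
  done
  return 1       # never rate limited within the given attempts
}
```

Call it as `send_until_limited 20`. Because the window is one minute, waiting a minute and calling it again should start from a fresh counter.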
How it works¶
- Authentication: The AuthPolicy validates API keys and enriches requests with user metadata
- Token Extraction: TokenRateLimitPolicy automatically extracts usage.total_tokens from LLM responses
- Rate Limiting: Tokens are counted against user-specific limits based on their tier
- Enforcement: When limits are exceeded, requests are rejected with HTTP 429 (Too Many Requests)
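The flow above can be modeled for a single user and a single window with a few lines of shell. This is an illustrative model, not Kuadrant's implementation. It assumes a request is admitted while the counter is below the limit and that its tokens are added only once the response is known, so the last admitted request may push usage past the limit. The per-response token counts are hypothetical.

```shell
# Illustrative model: one user, one 1-minute window, free-tier limit of 50 tokens.
limit=50
used=0
results=""
for tokens in 20 20 20 20; do   # hypothetical total_tokens per response
  if [ "$used" -ge "$limit" ]; then
    code=429                    # limit already exceeded: reject without counting
  else
    code=200                    # admitted: tokens are counted after the response
    used=$((used + tokens))
  fi
  results="$results $code"
  echo "HTTP $code (tokens used in window: $used/$limit)"
done
```

In this model the third request is still admitted at 40 tokens used and takes the counter to 60, and only the fourth is rejected. The cost of a request is unknown until its response arrives, which is why a token limit behaves as a threshold for admitting further requests.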
Understanding the policy¶
The TokenRateLimitPolicy uses several key concepts:
- rates: Define the token limits and time windows
- when: Conditions that determine when a limit applies (based on user groups)
- counters: Identify what to count (user ID in this case)
- Token extraction: Automatically reads usage.total_tokens from non-streaming JSON responses
Monitoring usage¶
Check the generated rate limit configuration:
# View the generated WasmPlugin configuration
kubectl get wasmplugin -n ${KUADRANT_GATEWAY_NS} kuadrant-${KUADRANT_GATEWAY_NAME} -o yaml
Next steps¶
- Experiment with different token limits and time windows
- Add more user tiers with different limits
- Integrate with your actual LLM service
Troubleshooting¶
If requests are being rejected unexpectedly:
- Verify the API key is correct
- Check if the user has exceeded their token limit
- Ensure the LLM response includes usage.total_tokens
- Verify the request is not using streaming (stream: true), as this is not currently supported
- Review the AuthPolicy and TokenRateLimitPolicy status
# Check AuthPolicy status
kubectl get authpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-llm-api-keys -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}'
# Check TokenRateLimitPolicy status
kubectl get tokenratelimitpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-token-limits -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}'
# View full policy details if needed
kubectl get authpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-llm-api-keys -o yaml
kubectl get tokenratelimitpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-token-limits -o yaml
Cleanup¶
To remove all resources created in this tutorial:
# Delete policies
kubectl delete tokenratelimitpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-token-limits
kubectl delete authpolicy -n ${KUADRANT_GATEWAY_NS} trlp-tutorial-llm-api-keys
# Delete API key secrets
kubectl delete secret -n ${KUADRANT_SYSTEM_NS} trlp-tutorial-api-key-free-user-1 trlp-tutorial-api-key-gold-user-1
# Delete HTTPRoute
kubectl delete httproute trlp-tutorial-llm-sim
# Delete Gateway
kubectl delete gateway -n ${KUADRANT_GATEWAY_NS} ${KUADRANT_GATEWAY_NAME}
# Delete LLM service and deployment
kubectl delete service trlp-tutorial-llm-sim
kubectl delete deployment trlp-tutorial-llm-sim
# Delete the gateway namespace (if not used by other resources)
# kubectl delete namespace ${KUADRANT_GATEWAY_NS}