Token Metrics Monitoring Guide for AI Traffic¶
Overview¶
This tutorial walks you through using Kuadrant to monitor token metrics for AI workloads. LLM APIs have varying computational costs based on token usage.
The use case is as follows: As a platform engineer, I want token usage and limit metrics to be exposed in Prometheus format so that I can monitor how many tokens are being used per user or route and compare them to defined limits for alerting and capacity planning.
This guide covers exposing usage data as metrics suitable for Prometheus collection. These metrics are critical for visibility into LLM costs and usage patterns and will allow platform teams to build dashboards or alerts.
This feature is for observability only and does not affect policy enforcement, but it is critical for diagnosing policy effectiveness and model usage behavior.
Prerequisites¶
- A Kubernetes cluster with the Kuadrant operator installed. See our Getting Started guide, which lets you quickly set up a local cluster for evaluation purposes.
- A monitoring stack deployed; specifically, the Prometheus component is required for this tutorial.
  - For bare-metal Kubernetes: see the Installing Prometheus Operator guide.
  - For OpenShift: see the Configuring user workload monitoring guide.
- Observability configured for the Gateway and Kuadrant components. See our Configure Observability guide for more information.
- The kubectl command-line tool.
- [Optional] The jq command-line tool.
Set the environment variables¶
Set the following environment variables for convenience in this tutorial:
export KUADRANT_GATEWAY_NS=api-gateway # Namespace for the example Gateway
export KUADRANT_GATEWAY_NAME=external # Name for the example Gateway
export KUADRANT_DEVELOPER_NS=llm # Namespace for an example LLM app
export KUADRANT_LLM_DOMAIN=llm.example.com # Domain name for an example LLM app
Step 1: Deploy an LLM service¶
Create the namespace for the LLM service:
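kubectl create namespace ${KUADRANT_DEVELOPER_NS}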
Deploy a simulated LLM service that mimics OpenAI-compatible APIs:
kubectl apply -n ${KUADRANT_DEVELOPER_NS} -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-sim
  labels:
    app: llm-sim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-sim
  template:
    metadata:
      labels:
        app: llm-sim
    spec:
      containers:
        - name: simulator
          image: ghcr.io/llm-d/llm-d-inference-sim:v0.1.1
          args:
            - --model
            - meta-llama/Llama-3.1-8B-Instruct
            - --port
            - "8000"
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: llm-sim
spec:
  selector:
    app: llm-sim
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
EOF
Step 2: Create a Gateway¶
Create the namespace for the gateway:
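kubectl create namespace ${KUADRANT_GATEWAY_NS}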
Create a gateway that will accept traffic for the LLM API:
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ${KUADRANT_GATEWAY_NAME}
  namespace: ${KUADRANT_GATEWAY_NS}
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      hostname: ${KUADRANT_LLM_DOMAIN}
      allowedRoutes:
        namespaces:
          from: All
EOF
Check the gateway status:
kubectl get gateway ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}{"\n"}{.status.conditions[?(@.type=="Programmed")].message}{"\n"}'
Export the gateway URL for use in requests:
export KUADRANT_INGRESS_HOST=$(kubectl get gtw ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o jsonpath='{.status.addresses[0].value}')
export KUADRANT_INGRESS_PORT=$(kubectl get gtw ${KUADRANT_GATEWAY_NAME} -n ${KUADRANT_GATEWAY_NS} -o jsonpath='{.spec.listeners[?(@.name=="http")].port}')
export KUADRANT_GATEWAY_URL=${KUADRANT_INGRESS_HOST}:${KUADRANT_INGRESS_PORT}
Step 3: Expose the service via HTTPRoute¶
Create an HTTPRoute to expose the LLM service:
kubectl apply -n ${KUADRANT_DEVELOPER_NS} -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-sim
spec:
  hostnames:
    - ${KUADRANT_LLM_DOMAIN}
  parentRefs:
    - name: ${KUADRANT_GATEWAY_NAME}
      namespace: ${KUADRANT_GATEWAY_NS}
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: "/"
      backendRefs:
        - name: llm-sim
          port: 80
EOF
Test connectivity to the LLM service:
curl -i --resolve $KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT:$KUADRANT_INGRESS_HOST http://$KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT/v1/models
# HTTP/1.1 200 OK
Step 4: Set up API key authentication¶
Create API keys for different user tiers. This example creates two tiers: "free" and "gold":
# Create a free tier user
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: api-key-free-user-1
  namespace: ${KUADRANT_DEVELOPER_NS}
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: my-llm
  annotations:
    kuadrant.io/groups: free
    kuadrant.io/user-id: user-1
stringData:
  api_key: iamafreeuser
type: Opaque
EOF
# Create a gold tier user
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: api-key-gold-user-2
  namespace: ${KUADRANT_DEVELOPER_NS}
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: my-llm
  annotations:
    kuadrant.io/groups: gold
    kuadrant.io/user-id: user-2
stringData:
  api_key: iamagolduser
type: Opaque
EOF
Create an AuthPolicy that validates API keys and extracts user information:
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1
kind: AuthPolicy
metadata:
  name: llm-api-keys
  namespace: ${KUADRANT_DEVELOPER_NS}
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-sim
  rules:
    authentication:
      "api-key-users":
        apiKey:
          allNamespaces: true
          selector:
            matchLabels:
              app: my-llm
        credentials:
          authorizationHeader:
            prefix: APIKEY
    response:
      success:
        filters:
          identity:
            json:
              properties:
                groups:
                  selector: auth.identity.metadata.annotations.kuadrant\.io/groups
                userid:
                  selector: auth.identity.metadata.annotations.kuadrant\.io/user-id
    authorization:
      "allow-groups":
        opa:
          rego: |
            groups := split(object.get(input.auth.identity.metadata.annotations, "kuadrant.io/groups", ""), ",")
            allow { groups[_] == "free" }
            allow { groups[_] == "gold" }
EOF
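To verify the policy, send a request without an API key and one with a valid key. Assuming the policy has been accepted and reconciled, the first should return a 401 and the second a 200:
# Without a key: expect 401
curl --resolve $KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT:$KUADRANT_INGRESS_HOST -s -o /dev/null -w '%{http_code}\n' http://$KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT/v1/models
# With the free tier key: expect 200
curl --resolve $KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT:$KUADRANT_INGRESS_HOST -s -o /dev/null -w '%{http_code}\n' -H 'Authorization: APIKEY iamafreeuser' http://$KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT/v1/models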
Step 5: Apply token rate limiting¶
Create a TokenRateLimitPolicy with different token limits for each tier:
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1alpha1
kind: TokenRateLimitPolicy
metadata:
  name: token-limits
  namespace: ${KUADRANT_DEVELOPER_NS}
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-sim
  limits:
    free:
      rates:
        - limit: 100 # 100 tokens per minute for free users (small for testing)
          window: 1m
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: |
            auth.identity.groups.split(",").exists(g, g == "free")
      counters:
        - expression: auth.identity.userid
    gold:
      rates:
        - limit: 500 # 500 tokens per minute for gold users (small for testing)
          window: 1m
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: |
            auth.identity.groups.split(",").exists(g, g == "gold")
      counters:
        - expression: auth.identity.userid
EOF
Step 6: Apply TelemetryPolicy to expose metrics usage per user and tier¶
Create a TelemetryPolicy:
kubectl apply -f - <<EOF
apiVersion: extensions.kuadrant.io/v1alpha1
kind: TelemetryPolicy
metadata:
  name: user-group
  namespace: ${KUADRANT_GATEWAY_NS}
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: ${KUADRANT_GATEWAY_NAME}
  metrics:
    default:
      labels:
        user: auth.identity.userid
        group: auth.identity.groups
EOF
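Check that the policy has been accepted; the Accepted condition follows the same pattern as the other Kuadrant policies:
kubectl get telemetrypolicy user-group -n ${KUADRANT_GATEWAY_NS} -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}'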
Step 7: Generate traffic to populate token metrics¶
Define a bash function that sends a chat completion request on behalf of a user, identified by an API key:
function call_llm {
  curl --resolve $KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT:$KUADRANT_INGRESS_HOST \
    --write-out 'Status: %{http_code}\n' --silent --output /dev/null \
    -H 'Content-Type: application/json' \
    -H "Authorization: APIKEY $1" \
    -X POST \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        { "role": "user", "content": "What is Kubernetes?" }
      ],
      "max_tokens": 100,
      "stream": false,
      "stream_options": {
        "include_usage": true
      }
    }' \
    http://$KUADRANT_LLM_DOMAIN:$KUADRANT_INGRESS_PORT/v1/chat/completions
}
The function accepts one parameter: the API key. For example, call_llm iamafreeuser.
Send some requests as user-1, which belongs to the free tier, for example:
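for i in {1..3}; do call_llm iamafreeuser; done
With max_tokens set to 100, a few requests exhaust the free tier's 100-token-per-minute limit, after which you should see Status: 429 responses.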
Send some requests as user-2, which belongs to the gold tier, for example:
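for i in {1..3}; do call_llm iamagolduser; done
Gold users have a larger 500-token budget, so these requests should all return Status: 200 within the same window.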
Step 8: View token metrics in Prometheus¶
The metrics exposed by the rate-limiting service are described in the Limitador Metrics Monitoring guide. In the current context, authorized_hits represents token usage per user and tier.
Using the Web UI:¶
- Visit http://localhost:9090/graph
- Create three panels:
  - PromQL: authorized_hits
  - PromQL: authorized_calls
  - PromQL: limited_calls
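If Prometheus is not already reachable at localhost:9090, you can port-forward it first. The service name and namespace below are assumptions; adjust them to match your monitoring stack:
# prometheus-k8s in the monitoring namespace is assumed here
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090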
Using the API:¶
authorized_hits
curl -s "http://${PROMETHEUS_HOST}:${PROMETHEUS_PORT}/api/v1/query?query=authorized_hits" | jq '.data.result'
authorized_calls
curl -s "http://${PROMETHEUS_HOST}:${PROMETHEUS_PORT}/api/v1/query?query=authorized_calls" | jq '.data.result'
limited_calls
curl -s "http://${PROMETHEUS_HOST}:${PROMETHEUS_PORT}/api/v1/query?query=limited_calls" | jq '.data.result'
Note: Set the PROMETHEUS_HOST and PROMETHEUS_PORT variables according to your Prometheus installation (see the monitoring stack guides in the prerequisites).
Next steps¶
- Experiment with different custom labels in the TelemetryPolicy specification.
- Integrate with your actual LLM service.
Troubleshooting¶
If requests are being rejected unexpectedly:
- Verify that the API key is correct.
- Ensure the LLM response includes usage.total_tokens.
- Review the AuthPolicy and TokenRateLimitPolicy statuses.
# Check AuthPolicy status
kubectl get authpolicy llm-api-keys -n ${KUADRANT_DEVELOPER_NS} -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}'
# Check TokenRateLimitPolicy status
kubectl get tokenratelimitpolicy token-limits -n ${KUADRANT_DEVELOPER_NS} -o=jsonpath='{.status.conditions[?(@.type=="Accepted")].message}'
Cleanup¶
To remove all resources created in this tutorial:
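kubectl delete namespace ${KUADRANT_DEVELOPER_NS}
kubectl delete namespace ${KUADRANT_GATEWAY_NS}
Deleting the namespaces removes the gateway, HTTPRoute, policies, and API key secrets created above.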