Benchmarks
How Sabre performs on real Kubernetes tasks with open-source models via Ollama.
Results are from k8s-ai-bench, a benchmark suite of 24 real-world Kubernetes tasks spanning debugging (8 tasks), configuration (10 tasks), and deployments (6 tasks). All models run locally via Ollama, with no API keys and no cloud services.
Pass@1
Each task is attempted once. The score is the share of tasks the model solves on its first try, which makes it the most realistic measure of day-to-day usage.
Pass@5
Each task is attempted up to 5 times, and it counts as a pass if any attempt succeeds. This measures the model's capability ceiling when retries are allowed.
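As a rough illustration of the two definitions above, the sketch below scores a set of attempt logs. The `pass_at_k` helper and the attempt data are hypothetical and are not taken from k8s-ai-bench; the only assumption is that each task records an ordered list of success/failure outcomes, and that a task passes at k if any of its first k attempts succeeded.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no tasks to score")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical attempt logs: one list of outcomes per task, in attempt order.
# A run may stop early once a task succeeds, so lists can be shorter than 5.
example = {
    "fix-crashloop": [True],
    "fix-probes": [False, False, True],
    "setup-dev-cluster": [False, False, False, False, False],
    "create-pod": [True],
}

print(f"Pass@1: {pass_at_k(example, 1):.1%}")  # Pass@1: 50.0%
print(f"Pass@5: {pass_at_k(example, 5):.1%}")  # Pass@5: 75.0%
```

In this made-up example, `fix-probes` fails on the first try but succeeds on the third, so it counts toward Pass@5 only. This is the plain "any of k attempts" reading described above, not a statistical estimator.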
devstral-small-2:24b
Open Source
Pass@1: 70.8%
Pass@5: 91.7%
Debugging (8 tasks)
✓ fix-crashloop
✓ fix-image-pull
✓ fix-pending-pod
✗ fix-probes
✓ fix-service-routing
✓ fix-service-with-no-endpoints
✓ fix-rbac-wrong-resource
✓ debug-app-logs
Configuration (10 tasks)
✗ create-network-policy
✓ create-pod
✓ create-pod-mount-configmaps
✓ create-pod-resources-limits
✗ create-simple-rbac
✓ horizontal-pod-autoscaler
✓ list-images-for-pods
✗ multi-container-pod-communication
✓ resize-pvc
✗ setup-dev-cluster
Deployments (6 tasks)
✗ create-canary-deployment
✓ deployment-traffic-switch
✓ rolling-update-deployment
✓ scale-deployment
✓ scale-down-deployment
✗ statefulset-lifecycle
qwen3-coder-30b
Open Source
Pass@1: 83.3%
Pass@5: 91.7%
Debugging (8 tasks)
✓ fix-crashloop
✓ fix-image-pull
✓ fix-pending-pod
✓ fix-probes
✗ fix-service-routing
✓ fix-service-with-no-endpoints
✓ fix-rbac-wrong-resource
✓ debug-app-logs
Configuration (10 tasks)
✓ create-network-policy
✓ create-pod
✗ create-pod-mount-configmaps
✓ create-pod-resources-limits
✓ create-simple-rbac
✓ horizontal-pod-autoscaler
✓ list-images-for-pods
✗ multi-container-pod-communication
✓ resize-pvc
✗ setup-dev-cluster
Deployments (6 tasks)
✓ create-canary-deployment
✓ deployment-traffic-switch
✓ rolling-update-deployment
✓ scale-deployment
✓ scale-down-deployment
✓ statefulset-lifecycle
qwen3.6:35b-a3b
Open Source
Pass@1: 83.3%
Pass@5: 95.8%
Debugging (8 tasks)
✓ fix-crashloop
✓ fix-image-pull
✓ fix-pending-pod
✓ fix-probes
✓ fix-service-routing
✗ fix-service-with-no-endpoints
✗ fix-rbac-wrong-resource
✓ debug-app-logs
Configuration (10 tasks)
✓ create-network-policy
✓ create-pod
✓ create-pod-mount-configmaps
✓ create-pod-resources-limits
✓ create-simple-rbac
✓ horizontal-pod-autoscaler
✓ list-images-for-pods
✗ multi-container-pod-communication
✓ resize-pvc
✗ setup-dev-cluster
Deployments (6 tasks)
✓ create-canary-deployment
✓ deployment-traffic-switch
✓ rolling-update-deployment
✓ scale-deployment
✓ scale-down-deployment
✓ statefulset-lifecycle
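To show how the per-task marks relate to the headline percentages, here is a small sketch that tallies the marks from the devstral-small-2:24b card. The marks are copied from the listing above; the `summarize` helper is hypothetical. The page does not say which run the marks come from, but the tally reproduces the Pass@1 figure, which suggests they reflect the single-attempt run.

```python
from typing import Dict, Tuple

# Per-task marks copied from the devstral-small-2:24b card (True = ✓, False = ✗).
DEVSTRAL: Dict[str, Dict[str, bool]] = {
    "Debugging": {
        "fix-crashloop": True,
        "fix-image-pull": True,
        "fix-pending-pod": True,
        "fix-probes": False,
        "fix-service-routing": True,
        "fix-service-with-no-endpoints": True,
        "fix-rbac-wrong-resource": True,
        "debug-app-logs": True,
    },
    "Configuration": {
        "create-network-policy": False,
        "create-pod": True,
        "create-pod-mount-configmaps": True,
        "create-pod-resources-limits": True,
        "create-simple-rbac": False,
        "horizontal-pod-autoscaler": True,
        "list-images-for-pods": True,
        "multi-container-pod-communication": False,
        "resize-pvc": True,
        "setup-dev-cluster": False,
    },
    "Deployments": {
        "create-canary-deployment": False,
        "deployment-traffic-switch": True,
        "rolling-update-deployment": True,
        "scale-deployment": True,
        "scale-down-deployment": True,
        "statefulset-lifecycle": False,
    },
}


def summarize(card: Dict[str, Dict[str, bool]]) -> Tuple[Dict[str, Tuple[int, int]], int, int]:
    """Return (passed, total) per category plus the overall passed and total counts."""
    per_category = {name: (sum(tasks.values()), len(tasks)) for name, tasks in card.items()}
    passed = sum(p for p, _ in per_category.values())
    total = sum(t for _, t in per_category.values())
    return per_category, passed, total


per_category, passed, total = summarize(DEVSTRAL)
for name, (p, t) in per_category.items():
    print(f"{name}: {p}/{t}")
print(f"Overall: {passed}/{total} = {passed / total:.1%}")  # Overall: 17/24 = 70.8%
```

The same arithmetic holds for the other cards: 20 of 24 marks is 83.3%, and the Pass@5 figures of 91.7% and 95.8% correspond to 22 and 23 of 24 tasks.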
Run Your Own Benchmarks
Want to test a different model or validate results on your hardware? The benchmark suite is open source and easy to run.