Operational Playbook

This playbook covers the full lifecycle of chant-produced Kubernetes manifests — from build through production debugging. The same content is available to AI agents via the /chant-k8s skill.

Build & validate

Step	Command	What it catches
Lint source	`chant lint src/`	Hardcoded namespaces (WK8001)
Build manifests	`chant build src/ --output manifests.yaml`	Post-synth: secrets in env (WK8005), latest tags (WK8006), API keys (WK8041), missing probes (WK8301), no resource limits (WK8201), privileged containers (WK8202), and more
Server dry-run	`kubectl apply -f manifests.yaml --dry-run=server`	K8s API validation: schema errors, admission webhooks

Run lint on every edit. Run the build and the server dry-run before every apply.
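
The three steps in the table can be chained so that a failure in one stops the next. The following is a minimal sketch, assuming `chant` and `kubectl` are on the PATH; the function name `preflight` and its default arguments are illustrative and not part of chant.

```shell
# Sketch: run lint, build, and server dry-run in order, stopping at the first failure.
# `preflight` is a hypothetical helper name; the commands are the ones from the table.
preflight() {
  local src="${1:-src/}"
  local out="${2:-manifests.yaml}"
  chant lint "$src" || return 1                     # source-level checks
  chant build "$src" --output "$out" || return 1    # post-synth checks
  kubectl apply -f "$out" --dry-run=server          # API validation, nothing is persisted
}
```

Called as `preflight src/ manifests.yaml`, it returns non-zero as soon as any step fails, so it can gate a later `kubectl apply`.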

Deploy to Kubernetes

# Build
chant build src/ --output manifests.yaml

# Diff before applying
kubectl diff -f manifests.yaml

# Dry run (validates with admission webhooks)
kubectl apply -f manifests.yaml --dry-run=server

# Apply
kubectl apply -f manifests.yaml
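
The diff, dry-run, and apply sequence can be wrapped so that the apply is skipped when nothing changed. This sketch relies on the documented exit codes of `kubectl diff` (0 for no differences, 1 for differences, greater than 1 for an error); the function name `safe_apply` is hypothetical.

```shell
# Sketch: apply only when `kubectl diff` reports changes and the server dry-run passes.
# `safe_apply` is a hypothetical helper name, not a kubectl or chant command.
safe_apply() {
  local file="${1:-manifests.yaml}"
  local rc=0
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes to apply"; return 0 ;;
    1) ;;  # differences found, continue to dry-run and apply
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
  kubectl apply -f "$file" --dry-run=server || return 1
  kubectl apply -f "$file"
}
```

A typical call would be `safe_apply manifests.yaml` right after `chant build`.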

Rollout & rollback

# Watch rollout progress
kubectl rollout status deployment/my-app --timeout=300s

# Check rollout history
kubectl rollout history deployment/my-app

# Undo last rollout
kubectl rollout undo deployment/my-app

# Roll back to a specific revision
kubectl rollout undo deployment/my-app --to-revision=2
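
The status and undo commands above can be combined into an automatic rollback. This is a sketch under the assumption that a rollout which does not finish within the timeout should be reverted; `rollout_or_undo` is a hypothetical helper name.

```shell
# Sketch: wait for a rollout and undo it if it does not complete in time.
# `kubectl rollout status` exits non-zero when the timeout is exceeded.
rollout_or_undo() {
  local deploy="$1"
  local timeout="${2:-300s}"
  if kubectl rollout status "deployment/$deploy" --timeout="$timeout"; then
    return 0
  fi
  echo "rollout of $deploy did not complete; rolling back" >&2
  kubectl rollout undo "deployment/$deploy"
  return 1
}
```

For example, `rollout_or_undo my-app 300s` returns 0 on a healthy rollout and 1 after triggering the rollback.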

Debugging strategies

Pod status and events

# Overview
kubectl get pods -l app.kubernetes.io/name=my-app
kubectl get events --sort-by=.lastTimestamp -n <namespace>

# Deep dive into a specific pod
kubectl describe pod <pod-name>

# Logs (current and previous crash)
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name>  # specific container
kubectl logs deployment/my-app --all-containers

# Debug containers (K8s 1.25+)
kubectl debug <pod-name> -it --image=busybox --target=<container>

# Port-forwarding for local testing
kubectl port-forward svc/my-app 8080:80
kubectl port-forward pod/<pod-name> 8080:8080

Resource inspection

# Get all resources in namespace
kubectl get all -n <namespace>

# YAML output for debugging
kubectl get deployment/my-app -o yaml

# Check resource usage
kubectl top pods -l app.kubernetes.io/name=my-app
kubectl top nodes

Common error patterns

Status	Meaning	Diagnostic command	Typical fix
Pending	Not scheduled	`kubectl describe pod` → Events	Check resource requests, node selectors, taints, PVC binding
CrashLoopBackOff	App crashing on start	`kubectl logs <pod-name> --previous`	Fix app startup, check probe config, increase initialDelaySeconds
ImagePullBackOff	Image cannot be pulled	`kubectl describe pod` → Events	Verify image name/tag, check imagePullSecrets, registry auth
OOMKilled	Out of memory	`kubectl describe pod` → Last State	Increase memory limit, profile app memory usage
Evicted	Node disk/memory pressure	`kubectl describe node`	Set requests to match actual usage, add node capacity, check for log/tmp bloat
CreateContainerError	Container config issue	`kubectl describe pod` → Events	Check volume mounts, configmap/secret refs, security context
Init:CrashLoopBackOff	Init container failing	`kubectl logs <pod-name> -c <init-container>`	Fix init container command, check dependencies
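
The table can also be read as a lookup from status to first diagnostic command. The sketch below encodes it as a shell function; `diag_cmd` is a hypothetical helper that only prints the suggested command and does not run it.

```shell
# Sketch: print the first diagnostic command from the table for a given pod status.
# Arguments: status, pod name, optional init container name (placeholder if omitted).
diag_cmd() {
  local status="$1"
  local pod="$2"
  local init="${3:-<init-container>}"
  case "$status" in
    CrashLoopBackOff)      echo "kubectl logs $pod --previous" ;;
    Init:CrashLoopBackOff) echo "kubectl logs $pod -c $init" ;;
    Evicted)               echo "kubectl describe node" ;;
    *)                     echo "kubectl describe pod $pod" ;;  # Pending, ImagePullBackOff, OOMKilled, CreateContainerError
  esac
}
```

For instance, `diag_cmd CrashLoopBackOff web-0` prints the `kubectl logs` command with `--previous`, where `web-0` is a made-up pod name.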

Deployment strategies

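RollingUpdate (default): Gradually replaces old pods with new ones. Set maxSurge and maxUnavailable to control how many extra and how many unavailable pods are allowed during the update.
Recreate: All existing pods are terminated before new ones are created. Use for stateful apps that cannot run multiple versions at the same time.
Canary: Run a second Deployment with 1 replica whose pods carry the labels the Service selects on, so it receives a share of traffic. For a precise traffic percentage, route via Ingress annotations or a service mesh.
Blue/Green: Run two full Deployments (blue and green) and switch the Service selector between them.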
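
The Blue/Green switch can be sketched as a single `kubectl patch` on the Service. This assumes the pods of the two Deployments are labelled `slot: blue` and `slot: green`; the label key `slot` and the function name `switch_service_to` are illustrative assumptions, not chant conventions.

```shell
# Sketch: point a Service at the blue or green Deployment by patching its selector.
# Assumes pods carry a hypothetical `slot` label with the value blue or green.
switch_service_to() {
  local svc="$1"
  local color="$2"
  case "$color" in
    blue|green) ;;
    *) echo "color must be blue or green" >&2; return 2 ;;
  esac
  # The default strategic merge patch only changes the `slot` key; other selector keys are kept.
  kubectl patch service "$svc" -p "{\"spec\":{\"selector\":{\"slot\":\"$color\"}}}"
}
```

Running `switch_service_to my-app green` moves traffic to green; calling it again with `blue` switches back.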

Production safety

Pre-apply validation

# Always diff before applying
kubectl diff -f manifests.yaml

# Server-side dry run (validates with admission webhooks)
kubectl apply -f manifests.yaml --dry-run=server

# Client-side dry run (fast, but no webhook validation)
kubectl apply -f manifests.yaml --dry-run=client

Use server-side dry-run before production applies — it catches schema errors and runs admission webhooks. Client-side dry-run is faster but only validates locally.

Troubleshooting reference

Symptom	Likely cause	Resolution
Pod stuck in Pending	Insufficient CPU/memory on nodes	Scale up cluster or reduce resource requests
Pod stuck in Pending	PVC not bound	Check that the StorageClass exists and a PV is available
Pod stuck in Pending	Node selector/affinity mismatch	Verify node labels match selectors
Pod stuck in ContainerCreating	ConfigMap/Secret not found	Ensure referenced ConfigMaps/Secrets exist
Pod stuck in ContainerCreating	Volume mount failure	Check PVC status, CSI driver health
Service returns 503	No ready endpoints	Check pod readiness probes, selector match
Service returns 503	Wrong port configuration	Verify targetPort matches containerPort
Ingress returns 404	Backend service not found	Check Ingress rules, service name/port
Ingress returns 404	Wrong path matching	Check pathType (Prefix vs Exact)
HPA not scaling	Metrics server not installed	Install metrics-server
HPA not scaling	Resource requests not set	Add CPU/memory requests to containers
CronJob not running	Invalid cron expression	Validate cron syntax (5-field format)
NetworkPolicy blocking traffic	Default deny applied	Add explicit allow rules for required traffic
RBAC permission denied	Missing Role/RoleBinding	Check ServiceAccount bindings and verb permissions
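
The 5-field check from the CronJob row can be approximated by counting fields before the manifest is applied. This is a rough sketch only: it does not validate ranges or step values, and it does not cover macros such as `@hourly`. The function name `has_five_cron_fields` is hypothetical.

```shell
# Sketch: succeed when a schedule string has exactly five whitespace-separated fields.
# Field contents are not validated; only the count on the first line is checked.
has_five_cron_fields() {
  local n
  n=$(printf '%s\n' "$1" | awk 'NR == 1 { print NF }')
  [ "$n" -eq 5 ]
}
```

For example, `has_five_cron_fields "*/5 * * * *"` succeeds, while a six-field schedule with a leading seconds column fails.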