| name | k8s |
| description | Kubernetes ops skill for deploying, operating, and troubleshooting services on Kubernetes. Use for tasks like writing manifests/Helm, configuring deployments/services/ingress, autoscaling, observability, RBAC, secrets/configmaps, rollout/rollback, incident debugging, and production readiness checks. |
k8s
Use this skill for Kubernetes operations and release work.
Defaults / assumptions to confirm
- Cluster type: managed (EKS/GKE/ACK) vs self-hosted
- Packaging: raw YAML vs Helm vs Kustomize
- Ingress: NGINX/ALB/APISIX/Istio
- Observability stack: Prometheus/Grafana, Loki/ELK, tracing
Workflow
- Understand service requirements
- Ports, protocols, health checks, resources (CPU/mem), storage needs.
- SLOs: latency, availability, RPO/RTO.
- Dependencies: DB, cache, MQ, external APIs.
- Deployment design
- Use `Deployment` for stateless workloads; `StatefulSet` for stable identities/storage.
- Define `readinessProbe` and `livenessProbe` (and `startupProbe` if needed).
- Set `resources.requests/limits` and choose an appropriate QoS class.
- Use `PodDisruptionBudget` for availability during maintenance.
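
As a rough illustration of the deployment design points above, a minimal sketch of a stateless service might look like the following. The name `web`, the image, the port, and the `/healthz` and `/readyz` paths are all hypothetical placeholders, and the probe timings and resource sizes are starting points to tune per service, not recommendations.

```yaml
# Hypothetical stateless service; names, image, port, and paths are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0
          ports:
            - name: http
              containerPort: 8080
          # Holds off liveness/readiness checks until the app has started.
          startupProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 5
            failureThreshold: 30   # up to 30 * 5s = 150s to start
          # Failing readiness removes the pod from Service endpoints.
          readinessProbe:
            httpGet:
              path: /readyz
              port: http
            periodSeconds: 5
            failureThreshold: 3
          # Failing liveness restarts the container.
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 10
            failureThreshold: 3
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 256Mi      # no CPU limit, so the pod is Burstable QoS
---
# Keeps at least 2 pods running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

With these values the pod lands in the Burstable QoS class. Setting requests equal to limits for both CPU and memory on every container would make it Guaranteed instead.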
- Config & secrets
- Config: `ConfigMap` (non-sensitive), mounted or env.
- Secrets: `Secret` (sensitive) + external secret manager if available.
- Never commit plaintext secrets; prefer sealed/external secrets.
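
The config and secrets guidance above could be sketched as follows. All names (`web-config`, `web-db`, `DB_PASSWORD`, the mount path) are hypothetical. The `Secret` named `web-db` is assumed to be created out of band, for example by an external secret manager, so no secret value appears in the manifest.

```yaml
# Non-sensitive settings only; keys and values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
data:
  LOG_LEVEL: info
  CACHE_TTL_SECONDS: "300"
---
# A bare Pod is used for brevity; the same fields apply inside a Deployment's pod template.
apiVersion: v1
kind: Pod
metadata:
  name: web-config-demo
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0.0
      # Every key in the ConfigMap becomes an environment variable.
      envFrom:
        - configMapRef:
            name: web-config
      env:
        # Assumes a Secret "web-db" with key "password" already exists in the namespace.
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: web-db
              key: password
      volumeMounts:
        # The same ConfigMap is also mounted as files under /etc/web.
        - name: config
          mountPath: /etc/web
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: web-config
```

Environment variables are read once at container start, whereas mounted ConfigMap files are eventually refreshed in place, which may matter when deciding between the two.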
- Networking
- `Service` types and DNS.
- `Ingress`/Gateway routing, TLS termination, timeouts.
- NetworkPolicy if cluster enforces it.
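
A minimal sketch of the networking pieces, assuming the ingress-nginx controller running in a namespace named `ingress-nginx`. The host `web.example.com`, the TLS secret `web-tls`, and the `app: web` labels are hypothetical. The timeout annotation is specific to ingress-nginx; other controllers use different mechanisms.

```yaml
# ClusterIP Service; with the default cluster domain it is reachable in-cluster
# at web.<namespace>.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
    - name: http
      port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    # ingress-nginx only: upstream read timeout in seconds.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
spec:
  ingressClassName: nginx
  tls:
    # TLS terminates at the ingress; "web-tls" must be a kubernetes.io/tls Secret.
    - hosts:
        - web.example.com
      secretName: web-tls
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  name: http
---
# Only allows traffic to the web pods from the ingress controller's namespace.
# Has no effect unless the cluster's CNI plugin enforces NetworkPolicy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow-ingress
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```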
- Scaling & resilience
- `HPA` based on CPU/memory/custom metrics.
- Graceful shutdown (`preStop`, `terminationGracePeriodSeconds`).
- Retry/backoff at client; avoid retry storms.
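
The autoscaling and graceful shutdown points might be sketched like this. The target `Deployment` name `web`, the replica bounds, the 70% CPU target, and the sleep and grace-period durations are assumptions for illustration. CPU utilization targets are computed against `resources.requests`, so the target pods need a CPU request, and the `preStop` command assumes the image ships a `sleep` binary.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU requests
  behavior:
    scaleDown:
      # Wait 5 minutes of lower load before scaling down, to reduce flapping.
      stabilizationWindowSeconds: 300
---
# A bare Pod is used for brevity; the same fields apply inside a Deployment's pod template.
apiVersion: v1
kind: Pod
metadata:
  name: web-shutdown-demo
spec:
  # Total time allowed for preStop plus the app's own SIGTERM handling.
  terminationGracePeriodSeconds: 45
  containers:
    - name: web
      image: registry.example.com/web:1.0.0
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM so endpoint removal can propagate before the app stops accepting traffic.
            command: ["sleep", "10"]
```

The grace period should exceed the `preStop` delay plus the longest in-flight request the service is expected to finish, otherwise the container is killed mid-drain.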
- Observability
- Standard logs with correlation IDs.
- Metrics: RPS, p95 latency, error rate, saturation.
- Alerts and dashboards; runbook links.
- Release operations
- Rolling updates, canary/blue-green if needed.
- `kubectl rollout status` + rollback plan.
- Post-deploy verification checks and smoke tests.
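
For rolling updates, a conservative strategy could be expressed roughly as below. The values are illustrative assumptions, and the `web` name and image are hypothetical. A rollout like this would typically be watched with `kubectl rollout status deployment/web` and reverted with `kubectl rollout undo deployment/web` if verification fails.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  revisionHistoryLimit: 5        # old ReplicaSets kept as rollback targets
  progressDeadlineSeconds: 300   # rollout is reported as failed if it stalls this long
  minReadySeconds: 10            # a pod must stay ready this long to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # add at most one extra pod during the update
      maxUnavailable: 0          # never drop below the desired replica count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.1
```

With `maxUnavailable: 0`, the rollout only progresses as new pods pass their readiness probe, so accurate readiness checks are what make this strategy safe.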
- Troubleshooting checklist
- `kubectl get/describe` pods, events, and `logs`.
- Check probes, image pull, env/config, DNS, network, and resource throttling.
- For performance: node pressure, HPA behavior, GC/heap, connection pool limits.
Output expectations when making changes
- Provide manifests (or Helm values/templates) + brief deployment notes.
- Include resource sizing rationale and probe settings.
- Include rollback instructions and verification steps.