# Platform Onboarding Workflow

This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.

## Onboarding Phases

| Phase | Focus | Output |
|-------|-------|--------|
| 1. Requirements | Gather service requirements | Requirements doc |
| 2. Golden Path Selection | Choose deployment pattern | Selected template |
| 3. Infrastructure Provisioning | Create service resources | Infrastructure ready |
| 4. Observability Setup | Configure monitoring | Dashboards/alerts |
| 5. Security Configuration | Apply security controls | Security validated |
| 6. Documentation | Complete service docs | Runbook ready |
| 7. Handoff | Transfer to service team | Ownership confirmed |

## Phase 1: Requirements Gathering

### Service Requirements Checklist

```markdown
## Service Onboarding Request
**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD
### Service Information
| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |
### Resource Requirements
| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |
### Dependencies
| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |
### Compliance Requirements
- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________
```
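
Before moving to Phase 2, it can help to check that a request is complete. The following Python sketch shows one way to validate the fields in the request template above. The class name `OnboardingRequest`, its field names, and the `validate` method are hypothetical and not part of any platform tooling; the allowed values are copied from the template.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values taken from the Service Information table in the template.
SERVICE_TYPES = {"API", "Worker", "Batch", "Frontend"}
CRITICALITY_TIERS = {1, 2, 3, 4}
DATA_SENSITIVITY = {"PII", "Financial", "Public"}


@dataclass
class OnboardingRequest:
    """Hypothetical in-memory form of the onboarding request template."""
    service_name: str
    team: str
    service_type: str
    criticality: int
    external_traffic: bool
    data_sensitivity: str
    compliance: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is complete."""
        problems = []
        if not self.service_name.strip():
            problems.append("service name is required")
        if not self.team.strip():
            problems.append("owning team is required")
        if self.service_type not in SERVICE_TYPES:
            problems.append(f"unknown service type: {self.service_type}")
        if self.criticality not in CRITICALITY_TIERS:
            problems.append(f"criticality must be a tier from 1 to 4, got {self.criticality}")
        if self.data_sensitivity not in DATA_SENSITIVITY:
            problems.append(f"unknown data sensitivity: {self.data_sensitivity}")
        return problems
```

A request such as `OnboardingRequest("payments", "billing", "API", 1, True, "PII", ["PCI-DSS"])` validates cleanly, while a request with a blank name or an unlisted service type returns one message per problem.
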

## Phase 2: Golden Path Selection

### Available Golden Paths

| Golden Path | Use Case | Includes |
|-------------|----------|----------|
| api-service | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| worker-service | Background processing | SQS, EKS, auto-scaling |
| batch-job | Scheduled jobs | EventBridge, Lambda/Fargate |
| frontend-app | Static sites, SPAs | CloudFront, S3, API Gateway |
| data-pipeline | ETL, streaming | Kinesis, Glue, S3 |

### Golden Path Selection Matrix

| Requirement | api-service | worker-service | batch-job |
|-------------|-------------|----------------|-----------|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |
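
The matrix above can be read as a small decision procedure. The Python sketch below is one possible encoding of it. The function name `select_golden_path` and the precedence order (scheduled runs first, then HTTP traffic, then queue processing) are assumptions made for illustration. The paths that do not appear in the matrix (`frontend-app`, `data-pipeline`) are not covered.

```python
from typing import Optional


def select_golden_path(http_traffic: bool, queue_processing: bool,
                       scheduled_runs: bool) -> Optional[str]:
    """Suggest a golden path from the selection matrix, or None if none fits.

    Hypothetical helper: the precedence of the checks is an assumption.
    """
    # Only batch-job supports scheduled runs, and it does not serve HTTP traffic.
    if scheduled_runs:
        return None if http_traffic else "batch-job"
    # Only api-service serves HTTP traffic; queue processing is optional there.
    if http_traffic:
        return "api-service"
    # Queue consumers without HTTP traffic map to worker-service.
    if queue_processing:
        return "worker-service"
    # No column in the matrix matches; treat as a non-standard request.
    return None
```

A `None` result means the requirements do not match a single column of the matrix, which is the case where a deviation needs to be justified and reviewed.
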

### Selection Template

```markdown
## Golden Path Selection
**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]
### Rationale
1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]
### Customizations Required
| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |
### Approval
- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)
```

## Phase 3: Infrastructure Provisioning

### Provisioning Checklist

### Terraform/IaC Template

```hcl
# Example service provisioning
module "service" {
  source = "platform/service-template"

  service_name = var.service_name
  team         = var.team
  environment  = var.environment
  golden_path  = "api-service"

  # Compute
  cpu_request    = "500m"
  memory_request = "512Mi"
  replicas_min   = 2
  replicas_max   = 10

  # Database
  database_enabled = true
  database_class   = "db.t3.medium"

  # Tags
  tags = {
    Team        = var.team
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}
```

### Provisioning Verification

```bash
# Verify namespace
kubectl get namespace [service-name]

# Verify compute
kubectl get deployment -n [service-name]

# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]

# Verify DNS
dig [service-name].internal.example.com
```
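
Running these checks by hand means repeating the same placeholder substitutions for every service. The Python sketch below fills in the placeholders and builds the four commands as argument lists. It only constructs the commands and does not run them. The function names, the `db_identifier` parameter, and the default domain (taken from the `dig` example above) are assumptions for illustration.

```python
import shlex
from typing import List


def verification_commands(service_name: str, db_identifier: str,
                          domain: str = "internal.example.com") -> List[List[str]]:
    """Build the provisioning checks above as argument lists.

    Hypothetical helper: it constructs commands only and never executes them.
    """
    return [
        ["kubectl", "get", "namespace", service_name],
        ["kubectl", "get", "deployment", "-n", service_name],
        ["aws", "rds", "describe-db-instances",
         "--db-instance-identifier", db_identifier],
        ["dig", f"{service_name}.{domain}"],
    ]


def render(commands: List[List[str]]) -> str:
    """Render the commands as shell-quoted lines for review or logging."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

To execute the checks, each list can be passed to `subprocess.run(cmd, check=True)`. Passing argument lists instead of a shell string avoids shell interpretation of the service name.
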

## Phase 4: Observability Setup

### Observability Checklist

### Dashboard Template

Standard service dashboard includes:

| Panel | Metrics |
|-------|---------|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |

### Alert Configuration

| Alert | Condition | Severity | Response |
|-------|-----------|----------|----------|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |
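
Most of the conditions above pair a threshold with a duration, for example "5xx > 1% for 5m". The Python sketch below illustrates what a sustained-breach condition means: the alert fires only when every sample in the trailing window is above the threshold. In practice this logic lives in the monitoring system; the names `AlertRule` and `is_firing` and the one-sample-per-minute granularity are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical representation of one row of the alert table."""
    name: str
    threshold: float   # a sample breaches when it is strictly above this value
    for_minutes: int   # how long the breach must be sustained
    severity: str


def is_firing(rule: AlertRule, samples_per_minute: Sequence[float]) -> bool:
    """Return True if the last `for_minutes` one-minute samples all breach.

    `samples_per_minute` is ordered oldest to newest, one value per minute.
    """
    if rule.for_minutes <= 0:
        raise ValueError("for_minutes must be positive")
    if len(samples_per_minute) < rule.for_minutes:
        return False  # not enough history to claim a sustained breach
    window = samples_per_minute[-rule.for_minutes:]
    return all(value > rule.threshold for value in window)


# Two rows from the table above, with thresholds expressed in percent.
HIGH_ERROR_RATE = AlertRule("High error rate", threshold=1.0, for_minutes=5, severity="Critical")
RESOURCE_SATURATION = AlertRule("Resource saturation", threshold=85.0, for_minutes=10, severity="Warning")
```

Requiring the whole window to breach keeps a single spike from paging on-call, at the cost of delaying the alert by the length of the window.
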

### SLI/SLO Definition

```markdown
## Service Level Objectives
**Service:** [name]
**SLO Version:** 1.0
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |
### Error Budget
- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
```
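
The 43.2-minute monthly budget follows from the 99.9% availability target over a 30-day month: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes. The Python sketch below reproduces that arithmetic and the "current consumption" percentage. The function names and the 30-day default window are assumptions for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    `slo_target` is a fraction, for example 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed_percent(downtime_minutes: float, slo_target: float,
                            window_days: int = 30) -> float:
    """Share of the error budget already spent, as a percentage."""
    return 100.0 * downtime_minutes / error_budget_minutes(slo_target, window_days)
```

For example, 21.6 minutes of downtime against a 99.9% target consumes half of the monthly budget.
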

## Phase 5: Security Configuration

### Security Checklist

### Network Policy Template

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: service-policy
  namespace: [service-name]
spec:
  podSelector:
    matchLabels:
      app: [service-name]
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - port: 5432
```
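
When many services are onboarded, the policy can be generated per service instead of edited by hand. The Python sketch below builds the same NetworkPolicy as a plain dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The function names and the port parameters are hypothetical. As in the template, the selectors assume that the `istio-system` and `database` namespaces carry a `name` label.

```python
import json
from typing import Any, Dict


def build_network_policy(service_name: str, ingress_port: int = 8080,
                         database_port: int = 5432) -> Dict[str, Any]:
    """Build the NetworkPolicy from the template above as a plain dict.

    Hypothetical helper: mirrors the template field by field.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "service-policy", "namespace": service_name},
        "spec": {
            "podSelector": {"matchLabels": {"app": service_name}},
            "policyTypes": ["Ingress", "Egress"],
            # Allow inbound traffic only from the mesh namespace.
            "ingress": [{
                "from": [{"namespaceSelector": {"matchLabels": {"name": "istio-system"}}}],
                "ports": [{"port": ingress_port}],
            }],
            # Allow outbound traffic only to the database namespace.
            "egress": [{
                "to": [{"namespaceSelector": {"matchLabels": {"name": "database"}}}],
                "ports": [{"port": database_port}],
            }],
        },
    }


def to_manifest(policy: Dict[str, Any]) -> str:
    """Serialize the policy as JSON text suitable for writing to a file."""
    return json.dumps(policy, indent=2)
```
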

### Security Review

```markdown
## Security Configuration Review
**Service:** [name]
**Reviewer:** @security-team
| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |
```

## Phase 6: Documentation

### Required Documentation

| Document | Purpose | Template |
|----------|---------|----------|
| Service Overview | What the service does | README.md |
| Runbook | Operational procedures | runbook.md |
| Architecture | Design decisions | architecture.md |
| API Docs | Interface documentation | OpenAPI spec |
| On-call Guide | Incident handling | oncall.md |

### Runbook Template

````markdown
## [Service Name] Runbook
### Service Overview
[Brief description of what the service does]
### Quick Reference
| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |
### Common Operations
#### Restart Service
```bash
kubectl rollout restart deployment/[service] -n [namespace]
```
#### Scale Service
```bash
kubectl scale deployment/[service] -n [namespace] --replicas=X
```
#### Check Logs
```bash
kubectl logs -l app=[service] -n [namespace] --tail=100
```
### Troubleshooting
| Symptom | Possible Cause | Resolution |
|---------|----------------|------------|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |
### Escalation
| Level | Contact | When |
|-------|---------|------|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |
````

---
## Phase 7: Handoff
### Handoff Checklist
- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred
### Handoff Template
```markdown
## Service Handoff Confirmation
**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]
### Completed Items
- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed
### Outstanding Items
| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |
### Acknowledgment
By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities
**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______
```

## Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Skip documentation, code is self-explanatory" | On-call != developers | Complete runbook |
| "We'll add observability later" | Blind deployments = incidents | Observability on day 1 |
| "Golden path doesn't fit exactly" | Customizations add complexity | Justify every deviation |
| "Security can come later" | Later = never for security | Security from start |
| "Team can figure it out" | Assumptions cause outages | Complete handoff process |

## Dispatch Specialist

For platform onboarding tasks, dispatch:

```yaml
Task tool:
  subagent_type: "platform-engineer"
  model: "opus"
  prompt: |
    SERVICE ONBOARDING REQUEST
    Service: [name]
    Team: [team]
    Type: [API/Worker/Batch]
    Requirements: [summary]
    Golden Path: [if known]
```