ops-platform-onboarding

@LerianStudio/ring

SKILL.md

name: ops-platform-onboarding
description: Structured workflow for onboarding new services to the internal platform including infrastructure provisioning, observability setup, and documentation.
trigger:
  - New service deployment to platform
  - Service migration to platform
  - Platform capability adoption
  - Team onboarding to platform
skip_when:
  - Application development -> use ring-dev-team specialists
  - Existing service configuration changes -> standard change management
  - Non-platform infrastructure -> use ops-infrastructure-architect

Platform Onboarding Workflow

This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.


Onboarding Phases

| Phase | Focus | Output |
|-------|-------|--------|
| 1. Requirements | Gather service requirements | Requirements doc |
| 2. Golden Path Selection | Choose deployment pattern | Selected template |
| 3. Infrastructure Provisioning | Create service resources | Infrastructure ready |
| 4. Observability Setup | Configure monitoring | Dashboards/alerts |
| 5. Security Configuration | Apply security controls | Security validated |
| 6. Documentation | Complete service docs | Runbook ready |
| 7. Handoff | Transfer to service team | Ownership confirmed |

Phase 1: Requirements Gathering

Service Requirements Checklist

## Service Onboarding Request

**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD

### Service Information

| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |

### Resource Requirements

| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |

### Dependencies

| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |

### Compliance Requirements

- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________

Phase 2: Golden Path Selection

Available Golden Paths

| Golden Path | Use Case | Includes |
|-------------|----------|----------|
| api-service | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| worker-service | Background processing | SQS, EKS, auto-scaling |
| batch-job | Scheduled jobs | EventBridge, Lambda/Fargate |
| frontend-app | Static sites, SPAs | CloudFront, S3, API Gateway |
| data-pipeline | ETL, streaming | Kinesis, Glue, S3 |

Golden Path Selection Matrix

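| Requirement | api-service | worker-service | batch-job |
|-------------|-------------|----------------|-----------|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |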
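
The selection matrix above can be read mechanically. The following is a minimal sketch, not part of the skill itself: it encodes only the first three rows (the "Real-time" and "Auto-scaling" rows are not plain yes/no values), and the capability keys, the `candidate_paths` helper, and the ranking rule (prefer paths where a requirement is "Yes" over "Optional") are illustrative assumptions.

```python
# Support levels transcribed from the selection matrix (first three rows only).
# Key names such as "http_traffic" are hypothetical labels for the table rows.
SUPPORT = {
    "api-service": {
        "http_traffic": "yes", "queue_processing": "optional", "scheduled_runs": "no",
    },
    "worker-service": {
        "http_traffic": "no", "queue_processing": "yes", "scheduled_runs": "no",
    },
    "batch-job": {
        "http_traffic": "no", "queue_processing": "optional", "scheduled_runs": "yes",
    },
}


def candidate_paths(required: set[str]) -> list[str]:
    """Return golden paths that support every required capability, best match first."""
    scored = []
    for path, cells in SUPPORT.items():
        # A single "no" for a required capability rules the path out.
        if any(cells[req] == "no" for req in required):
            continue
        # Count capabilities the path supports natively ("yes" rather than "optional").
        native = sum(cells[req] == "yes" for req in required)
        scored.append((native, path))
    # More native matches first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored]


print(candidate_paths({"queue_processing"}))
print(candidate_paths({"http_traffic", "scheduled_runs"}))
```

An empty result (as for a service that needs both HTTP traffic and scheduled runs) suggests the service may need to be split across paths or reviewed as a non-standard deployment.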

Selection Template

## Golden Path Selection

**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]

### Rationale

1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]

### Customizations Required

| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |

### Approval

- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)

Phase 3: Infrastructure Provisioning

Provisioning Checklist

  • Namespace/project created
  • Compute resources provisioned
  • Database provisioned (if required)
  • Cache provisioned (if required)
  • Load balancer configured
  • DNS entries created
  • SSL certificates provisioned
  • Secrets stored in vault
  • IAM roles/service accounts created

Terraform/IaC Template

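# Example service provisioning (illustrative; adapt to your platform's module)
module "service" {
  # Placeholder source: replace with the platform module's real registry address, Git URL, or local path
  source = "platform/service-template"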

  service_name    = var.service_name
  team            = var.team
  environment     = var.environment
  golden_path     = "api-service"

  # Compute
  cpu_request     = "500m"
  memory_request  = "512Mi"
  replicas_min    = 2
  replicas_max    = 10

  # Database
  database_enabled = true
  database_class   = "db.t3.medium"

  # Tags
  tags = {
    Team        = var.team
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

Provisioning Verification

# Verify namespace
kubectl get namespace [service-name]

# Verify compute
kubectl get deployment -n [service-name]

# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]

# Verify DNS
dig [service-name].internal.example.com
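
The verification commands above can be scripted so that each check reports pass or fail. This is a rough sketch under stated assumptions: the helper names (`verification_commands`, `exit_code`, `run_checks`) are hypothetical, the domain default mirrors the example hostname above, and a check counts as passed only on exit status 0.

```python
import subprocess
from typing import Callable, Sequence


def verification_commands(
    service_name: str,
    db_identifier: str,
    domain: str = "internal.example.com",
) -> list[list[str]]:
    """Build the four checks listed above as argv lists (no shell quoting needed)."""
    return [
        ["kubectl", "get", "namespace", service_name],
        ["kubectl", "get", "deployment", "-n", service_name],
        ["aws", "rds", "describe-db-instances", "--db-instance-identifier", db_identifier],
        ["dig", f"{service_name}.{domain}"],
    ]


def exit_code(argv: Sequence[str]) -> int:
    """Run one command and return its exit code; 127 if the binary is missing."""
    try:
        return subprocess.run(list(argv), capture_output=True, check=False).returncode
    except FileNotFoundError:
        return 127


def run_checks(
    commands: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = exit_code,
) -> dict[str, bool]:
    """Map each command line to True when it exited with status 0."""
    return {" ".join(argv): run(argv) == 0 for argv in commands}
```

Note that `dig` exits with status 0 even when the name does not resolve, so the DNS check still needs its output inspected; the exit code alone is not sufficient there.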

Phase 4: Observability Setup

Observability Checklist

  • Structured logging configured
  • Tracing instrumentation added
  • Metrics endpoints exposed
  • Service dashboard created
  • SLI/SLO defined
  • Alerts configured
  • On-call integration set up

Dashboard Template

Standard service dashboard includes:

| Panel | Metrics |
|-------|---------|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |

Alert Configuration

| Alert | Condition | Severity | Response |
|-------|-----------|----------|----------|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |
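
To illustrate how the "for 5m" and "for 10m" conditions behave, here is a small sketch that evaluates the four rules against per-minute samples. It is not tied to any particular monitoring system: the metric names, the one-sample-per-minute assumption, and the choice to treat "Low availability" as a single-sample check (the table gives it no duration) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str                       # hypothetical metric key
    breached: Callable[[float], bool]  # threshold test for one sample
    for_minutes: int                   # consecutive breaching samples required
    severity: str


RULES = [
    AlertRule("High error rate", "error_rate_5xx_pct", lambda v: v > 1.0, 5, "Critical"),
    AlertRule("High latency", "latency_p99_s", lambda v: v > 1.0, 5, "Warning"),
    AlertRule("Low availability", "uptime_pct", lambda v: v < 99.9, 1, "Critical"),
    AlertRule("Resource saturation", "cpu_pct", lambda v: v > 85.0, 10, "Warning"),
]


def firing(samples: dict[str, list[float]], rules: list[AlertRule] = RULES) -> list[tuple[str, str]]:
    """Return (alert name, severity) for rules breached across their whole window.

    `samples` maps a metric to per-minute values, oldest first.
    """
    fired = []
    for rule in rules:
        window = samples.get(rule.metric, [])[-rule.for_minutes:]
        # Require a full window: too little data never fires an alert.
        if len(window) == rule.for_minutes and all(rule.breached(v) for v in window):
            fired.append((rule.name, rule.severity))
    return fired
```

A single healthy sample inside the window resets the condition, which is why a brief latency spike does not page anyone under these rules.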

SLI/SLO Definition

## Service Level Objectives

**Service:** [name]
**SLO Version:** 1.0

| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |

### Error Budget

- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
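
The 43.2-minute monthly budget in the template follows from the 99.9% availability target over a 30-day window: 30 × 24 × 60 minutes × 0.1% = 43.2 minutes. A minimal sketch of that arithmetic, with hypothetical function names and the 30-day window as an assumption:

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_target_pct) / 100


def budget_consumed_pct(downtime_minutes: float, slo_target_pct: float, window_days: int = 30) -> float:
    """Share of the error budget already used, as a percentage."""
    return 100 * downtime_minutes / error_budget_minutes(slo_target_pct, window_days)


print(round(error_budget_minutes(99.9), 1))       # 43.2
print(round(budget_consumed_pct(10.8, 99.9), 1))  # 25.0
```

The second value is what would go into the "Current consumption" line of the template.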

Phase 5: Security Configuration

Security Checklist

  • Network policies applied
  • Service mesh mTLS configured
  • Secrets management configured
  • IAM permissions follow least privilege
  • Security scanning in CI/CD
  • Dependency scanning enabled
  • WAF rules applied (if external)

Network Policy Template

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: service-policy
  namespace: [service-name]
spec:
  podSelector:
    matchLabels:
      app: [service-name]
  policyTypes:
    - Ingress
    - Egress
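  ingress:
    # Allow inbound traffic only from the Istio namespace.
    # kubernetes.io/metadata.name is set automatically on every namespace
    # (Kubernetes 1.21+); a plain "name" label matches only if added manually.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow outbound traffic to the database namespace (PostgreSQL port)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: database
      ports:
        - protocol: TCP
          port: 5432
    # Allow DNS lookups; an Egress policy without this blocks name resolution.
    # Assumes cluster DNS runs in kube-system; adjust for your cluster.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53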

Security Review

## Security Configuration Review

**Service:** [name]
**Reviewer:** @security-team

| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |

Phase 6: Documentation

Required Documentation

| Document | Purpose | Template |
|----------|---------|----------|
| Service Overview | What the service does | README.md |
| Runbook | Operational procedures | runbook.md |
| Architecture | Design decisions | architecture.md |
| API Docs | Interface documentation | OpenAPI spec |
| On-call Guide | Incident handling | oncall.md |

Runbook Template

## [Service Name] Runbook

### Service Overview

[Brief description of what the service does]

### Quick Reference

| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |

### Common Operations

#### Restart Service

```bash
kubectl rollout restart deployment/[service] -n [namespace]
```

#### Scale Service

```bash
kubectl scale deployment/[service] -n [namespace] --replicas=X
```

#### Check Logs

```bash
kubectl logs -l app=[service] -n [namespace] --tail=100
```

### Troubleshooting

| Symptom | Possible Cause | Resolution |
|---------|----------------|------------|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |

### Escalation

| Level | Contact | When |
|-------|---------|------|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |

Phase 7: Handoff

Handoff Checklist

  • Service owner identified and trained
  • On-call rotation set up
  • Access provisioned to team
  • Documentation reviewed by team
  • Shadowing session completed
  • Ownership officially transferred

Handoff Template

## Service Handoff Confirmation

**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]

### Completed Items

- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed

### Outstanding Items

| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |

### Acknowledgment

By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities

**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______

Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Skip documentation, code is self-explanatory" | On-call != developers | Complete runbook |
| "We'll add observability later" | Blind deployments = incidents | Observability on day 1 |
| "Golden path doesn't fit exactly" | Customizations add complexity | Justify every deviation |
| "Security can come later" | Later = never for security | Security from start |
| "Team can figure it out" | Assumptions cause outages | Complete handoff process |

Dispatch Specialist

For platform onboarding tasks, dispatch:

Task tool:
  subagent_type: "platform-engineer"
  model: "opus"
  prompt: |
    SERVICE ONBOARDING REQUEST
    Service: [name]
    Team: [team]
    Type: [API/Worker/Batch]
    Requirements: [summary]
    Golden Path: [if known]