Rank
70
AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents
Traction
No public download signal
Freshness
Updated 2d ago
Xpersona Agent
Complete DevOps & Platform Engineering system. CI/CD pipelines, infrastructure as code, container orchestration, observability, incident response, and SRE practices — all platforms, all clouds.

---
name: afrexai-devops-engine
description: Complete DevOps & Platform Engineering system. CI/CD pipelines, infrastructure as code, container orchestration, observability, incident response, and SRE practices — all platforms, all clouds.
metadata: {"clawdbot":{"emoji":"🔧","os":["linux","darwin","win32"]}}
---

DevOps & Platform Engineering Engine
Complete system for building, deploying, operating, and observing production software.
clawhub skill install skills:1kalin:afrexai-devops-engine

Overall rank
#62
Adoption
No public adoption signal
Trust
Unknown
Freshness
Feb 25, 2026
Freshness
Last checked Feb 25, 2026
Best For
afrexai-devops-engine is best for DevOps and platform engineering workflows where OpenClaw compatibility matters.
Not Ideal For
Workflows that require deterministic execution, since contract metadata is missing or unavailable.
Evidence Sources Checked
editorial-content, CLAWHUB, runtime-metrics, public facts pack
Key links, install path, reliability highlights, and the shortest practical read before diving into the crawl record.
Overview
Complete DevOps & Platform Engineering system covering CI/CD pipelines, infrastructure as code, container orchestration, observability, incident response, and SRE practices — all platforms, all clouds. A complete system for building, deploying, operating, and observing production software. Capability contract not published. No trust telemetry is available yet. Last updated 4/15/2026.
Trust score
Unknown
Compatibility
OpenClaw
Freshness
Feb 25, 2026
Vendor
Openclaw
Artifacts
0
Benchmarks
0
Last release
Unpublished
Install & run
clawhub skill install skills:1kalin:afrexai-devops-engine

Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.
Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.
Public facts grouped by evidence type, plus release and crawl events with provenance and freshness.
Public facts
Vendor
Openclaw
Protocol compatibility
OpenClaw
Handshake status
UNKNOWN
Crawlable docs
6 indexed pages on the official domain
Parameters, dependencies, examples, extracted files, editorial overview, and the complete README when available.
Captured outputs
Extracted files
0
Examples
6
Snippets
0
Languages
typescript
Parameters
yaml
# branch-protection.yml — document your rules
main:
required_reviews: 2
dismiss_stale_reviews: true
require_codeowners: true
require_status_checks:
- ci/test
- ci/lint
- ci/security
require_linear_history: true # No merge commits
restrict_pushes: true # Only via PR
require_signed_commits: false # Enable for regulated
develop:
required_reviews: 1
require_status_checks:
- ci/test

# pipeline-stages.yml — adapt to your CI system
stages:
# Stage 1: Quality Gate (parallel, <2 min)
lint:
run: lint
parallel: true
timeout: 2m
typecheck:
run: tsc --noEmit
parallel: true
timeout: 2m
security_scan:
run: trivy, snyk, or semgrep
parallel: true
timeout: 3m
# Stage 2: Test (parallel by type, <10 min)
unit_tests:
run: test --unit
parallel: true
coverage_threshold: 80%
timeout: 5m
integration_tests:
run: test --integration
parallel: true
needs: [database_service]
timeout: 10m
# Stage 3: Build (<5 min)
build:
needs: [lint, typecheck, unit_tests]
outputs: [docker_image, release_artifact]
tag: "${GIT_SHA}"
cache: [node_modules, .next/cache, target/]
# Stage 4: Deploy Staging (auto)
deploy_staging:
needs: [build]
environment: staging
strategy: rolling
smoke_test: true
auto: true
# Stage 5: E2E on Staging (<15 min)
e2e_tests:
needs: [deploy_staging]
timeout: 15m
retry: 1
artifacts: [screenshots, videos]
# Stage 6: Deploy Production (manual gate or auto)
deploy_prod:
needs: [e2e_tests]
environment: production
strategy: canary # or blue-green
approval: required # manual gate
rollback_on_failure: true
monitoring_window: 15m
# Cache key patterns (ordered by specificity)
cache_keys:
# Exact match first
- "deps-{{ runner.os }}-{{ hashFiles('**/lockfile') }}"
# Partial match fallback
- "deps-{{ runner.os }}-"
# What to cache by stack
node: [node_modules, .next/cache, .turbo]
python: [.venv, .mypy_cache, .pytest_cache]
rust: [target/, ~/.cargo/registry]
go: [~/go/pkg/mod, ~/.cache/go-build]
docker: [/tmp/.buildx-cache] # BuildKit layer cache
# Reusable workflow (DRY across repos)
# .github/workflows/reusable-deploy.yml
on:
workflow_call:
inputs:
environment:
required: true
type: string
secrets:
DEPLOY_KEY:
required: true
# Caller workflow
jobs:
deploy:
uses: ./.github/workflows/reusable-deploy.yml
with:
environment: production
secrets: inherit
# Path-based triggers (monorepo)
on:
push:
paths:
- 'packages/api/**'
- 'shared/**'
# Skip CI for docs-only changes
pull_request:
paths-ignore:
- '**.md'
- 'docs/**'
# Concurrency (cancel in-progress on new push)
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

Editorial read
Docs source
CLAWHUB
Editorial quality
ready
Complete system for building, deploying, operating, and observing production software. Covers the entire DevOps lifecycle β not just CI/CD, not just one cloud.
| Team Size | Release Cadence | Strategy | Branches |
|-----------|----------------|----------|----------|
| 1-3 | Continuous | Trunk-based | main + short-lived feature/ |
| 4-15 | Weekly/biweekly | GitHub Flow | main + feature/ + PR |
| 15+ | Scheduled releases | Git Flow | main + develop + feature/ + release/ + hotfix/ |
| Regulated | Audited releases | Git Flow + tags | Above + signed tags + audit trail |
# branch-protection.yml — document your rules
main:
required_reviews: 2
dismiss_stale_reviews: true
require_codeowners: true
require_status_checks:
- ci/test
- ci/lint
- ci/security
require_linear_history: true # No merge commits
restrict_pushes: true # Only via PR
require_signed_commits: false # Enable for regulated
develop:
required_reviews: 1
require_status_checks:
- ci/test
Format: <type>(<scope>): <description>
Types: feat, fix, docs, style, refactor, perf, test, build, ci, chore
Breaking changes: feat!: remove legacy API, or a BREAKING CHANGE: description footer
Enforce with commitlint + husky (Node) or pre-commit hooks.
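As a lightweight stand-in for commitlint, the format rule above can be checked in a plain shell commit-msg hook. The regex below is illustrative and validates only the first line of the message.

```shell
# Illustrative Conventional Commits check (minimal stand-in for commitlint).
# Accepts: <type>(<scope>)?!?: <description>, using the types listed above.
check_commit_msg() {
  printf '%s\n' "$1" | head -n 1 | grep -Eq \
    '^(feat|fix|docs|style|refactor|perf|test|build|ci|chore)(\([a-z0-9-]+\))?!?: .+' \
    && echo ok || echo bad
}

check_commit_msg 'feat(api): add pagination'   # -> ok
check_commit_msg 'added some stuff'            # -> bad
```

Wired into .git/hooks/commit-msg, the hook would run this on the contents of the message file and exit non-zero on "bad".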
# pipeline-stages.yml — adapt to your CI system
stages:
# Stage 1: Quality Gate (parallel, <2 min)
lint:
run: lint
parallel: true
timeout: 2m
typecheck:
run: tsc --noEmit
parallel: true
timeout: 2m
security_scan:
run: trivy, snyk, or semgrep
parallel: true
timeout: 3m
# Stage 2: Test (parallel by type, <10 min)
unit_tests:
run: test --unit
parallel: true
coverage_threshold: 80%
timeout: 5m
integration_tests:
run: test --integration
parallel: true
needs: [database_service]
timeout: 10m
# Stage 3: Build (<5 min)
build:
needs: [lint, typecheck, unit_tests]
outputs: [docker_image, release_artifact]
tag: "${GIT_SHA}"
cache: [node_modules, .next/cache, target/]
# Stage 4: Deploy Staging (auto)
deploy_staging:
needs: [build]
environment: staging
strategy: rolling
smoke_test: true
auto: true
# Stage 5: E2E on Staging (<15 min)
e2e_tests:
needs: [deploy_staging]
timeout: 15m
retry: 1
artifacts: [screenshots, videos]
# Stage 6: Deploy Production (manual gate or auto)
deploy_prod:
needs: [e2e_tests]
environment: production
strategy: canary # or blue-green
approval: required # manual gate
rollback_on_failure: true
monitoring_window: 15m
| Feature | GitHub Actions | GitLab CI | CircleCI | Jenkins |
|---------|---------------|-----------|----------|---------|
| Config file | .github/workflows/*.yml | .gitlab-ci.yml | .circleci/config.yml | Jenkinsfile |
| Parallelism | jobs.<id> (automatic) | stages + parallel | workflows | parallel step |
| Caching | actions/cache | cache: key | save_cache/restore_cache | Stash/unstash |
| Secrets | Settings → Secrets | Settings → CI/CD → Variables | Project Settings → Env | Credentials plugin |
| Matrix builds | strategy.matrix | parallel:matrix | matrix in workflows | matrix in pipeline |
| Self-hosted | runs-on: self-hosted | GitLab Runner | resource_class | Default |
| OIDC/Keyless | permissions: id-token: write | id_tokens: | OIDC context | Plugin |
# Cache key patterns (ordered by specificity)
cache_keys:
# Exact match first
- "deps-{{ runner.os }}-{{ hashFiles('**/lockfile') }}"
# Partial match fallback
- "deps-{{ runner.os }}-"
# What to cache by stack
node: [node_modules, .next/cache, .turbo]
python: [.venv, .mypy_cache, .pytest_cache]
rust: [target/, ~/.cargo/registry]
go: [~/go/pkg/mod, ~/.cache/go-build]
docker: [/tmp/.buildx-cache] # BuildKit layer cache
# Reusable workflow (DRY across repos)
# .github/workflows/reusable-deploy.yml
on:
workflow_call:
inputs:
environment:
required: true
type: string
secrets:
DEPLOY_KEY:
required: true
# Caller workflow
jobs:
deploy:
uses: ./.github/workflows/reusable-deploy.yml
with:
environment: production
secrets: inherit
# Path-based triggers (monorepo)
on:
push:
paths:
- 'packages/api/**'
- 'shared/**'
# Skip CI for docs-only changes
pull_request:
paths-ignore:
- '**.md'
- 'docs/**'
# Concurrency (cancel in-progress on new push)
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
# Multi-stage build template
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false # Install all deps for build
COPY . .
RUN npm run build
# Stage 2: Production
FROM node:20-alpine AS production
RUN addgroup -g 1001 app && adduser -u 1001 -G app -s /bin/sh -D app
WORKDIR /app
COPY --from=builder --chown=app:app /app/dist ./dist
COPY --from=builder --chown=app:app /app/node_modules ./node_modules
COPY --from=builder --chown=app:app /app/package.json ./
USER app
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/index.js"]
.dockerignore excludes: .git, node_modules, *.md, tests, docs
Clean package caches in the same layer (e.g. rm -rf /var/cache/apk/*)
Pin the base image by digest: FROM node:20-alpine@sha256:abc123...
# Trivy (recommended — free, fast)
trivy image myapp:latest --severity HIGH,CRITICAL
trivy fs . --security-checks vuln,secret,config
# Scan in CI before push
# Fail pipeline if CRITICAL vulnerabilities found
trivy image --exit-code 1 --severity CRITICAL myapp:${GIT_SHA}
# docker-compose.yml β local development stack
services:
app:
build:
context: .
target: builder # Use build stage for hot reload
volumes:
- .:/app
- /app/node_modules # Don't override node_modules
ports:
- "3000:3000"
environment:
- DATABASE_URL=postgres://user:pass@db:5432/app
- REDIS_URL=redis://cache:6379
depends_on:
db:
condition: service_healthy
db:
image: postgres:16-alpine
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_USER: user
POSTGRES_PASSWORD: pass
POSTGRES_DB: app
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user"]
interval: 5s
timeout: 3s
retries: 5
cache:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
pgdata:
| Tool | Best For | State | Language | Learning Curve |
|------|----------|-------|----------|----------------|
| Terraform/OpenTofu | Multi-cloud, cloud-agnostic | Remote (S3, GCS) | HCL | Medium |
| Pulumi | Devs who prefer real code | Remote | TS/Python/Go | Low (if you code) |
| AWS CDK | AWS-only shops | CloudFormation | TS/Python | Medium |
| Ansible | Config management, server setup | Stateless | YAML | Low |
| Helm | Kubernetes deployments | Tiller/OCI | YAML+Go templates | Medium |
infrastructure/
├── modules/                  # Reusable components
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ecs-service/
│   └── rds/
├── environments/
│   ├── dev/
│   │   ├── main.tf           # Calls modules with dev params
│   │   ├── terraform.tfvars
│   │   └── backend.tf        # Dev state bucket
│   ├── staging/
│   └── prod/
├── .terraform-version        # Pin terraform version
└── .tflint.hcl
- plan before apply — review every change
- prevent_destroy on critical resources (databases, S3 buckets)
- Tag everything: environment, team, cost-center, managed-by: terraform
- terraform fmt in CI — consistent formatting

# backend.tf — remote state with locking
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "prod/main.tfstate"
region = "eu-west-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
# Protect critical resources
resource "aws_rds_instance" "main" {
# ...
lifecycle {
prevent_destroy = true
}
}
                   ┌──────────────────┐
terraform plan ───►│   Review in PR   │
                   └────────┬─────────┘
                            │ merge
                   ┌────────▼─────────┐
auto-apply ───────►│       Dev        │───► smoke tests
                   └────────┬─────────┘
                            │ promote
                   ┌────────▼─────────┐
manual approve ───►│     Staging      │───► integration tests
                   └────────┬─────────┘
                            │ promote (manual gate)
                   ┌────────▼─────────┐
manual approve ───►│    Production    │───► monitoring window
                   └──────────────────┘
# deployment.yml β production-ready template
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
labels:
app: myapp
version: "1.0.0"
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
containers:
- name: myapp
image: myregistry/myapp:abc123 # Git SHA tag
ports:
- containerPort: 3000
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: myapp-secrets
key: database-url
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
# hpa.yml β autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # 5 min cooldown
policies:
- type: Pods
value: 1
periodSeconds: 60 # Scale down 1 pod per minute max
- values.yaml with sensible defaults (works out of the box)
- helm lint and helm template in CI

# Debugging
kubectl get pods -l app=myapp -o wide # Pod status + node
kubectl describe pod <pod> # Events, conditions
kubectl logs <pod> --tail=100 -f # Stream logs
kubectl logs <pod> --previous # Crashed container logs
kubectl exec -it <pod> -- /bin/sh # Shell into pod
kubectl top pods -l app=myapp # Resource usage
# Rollouts
kubectl rollout status deployment/myapp # Watch rollout
kubectl rollout history deployment/myapp # Revision history
kubectl rollout undo deployment/myapp # Rollback to previous
kubectl rollout undo deployment/myapp --to-revision=3 # Specific
# Scaling
kubectl scale deployment/myapp --replicas=5 # Manual scale
kubectl autoscale deployment/myapp --min=3 --max=10 --cpu-percent=70
# Context management
kubectl config get-contexts # List clusters
kubectl config use-context prod-cluster # Switch
kubectl config set-context --current --namespace=myapp # Set namespace
| Strategy | Risk | Speed | Rollback | Cost | Best For |
|----------|------|-------|----------|------|----------|
| Rolling | Low-Med | Fast | Slow (re-roll) | None | Standard deployments |
| Blue-Green | Low | Instant | Instant (switch) | 2x infra | Critical services, zero-downtime |
| Canary | Very Low | Slow | Instant (route 0%) | Minimal | High-traffic, risky changes |
| Feature Flag | Very Low | Instant | Instant (toggle) | None | Gradual rollout, A/B testing |
| Recreate | High | Fast | Slow | None | Dev/staging, stateful apps |
1. Deploy canary (1 pod with new version)
2. Route 5% traffic → canary
3. Monitor for 5 minutes:
   - Error rate < baseline + 0.1%?
   - p99 latency < baseline + 50ms?
   - No new error types?
4. If healthy → 25% → monitor 10 min
5. If healthy → 50% → monitor 10 min
6. If healthy → 100% (full rollout)
7. If ANY check fails → route 0% to canary → rollback → alert
Automation: Argo Rollouts, Flagger, or Istio + custom controller
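The health checks in step 3 can be sketched as a tiny gate script. The thresholds mirror the ones above; in practice the metric values would come from your metrics API, so the inputs here are plain numbers and the function name is illustrative.

```shell
# Canary gate sketch: promote only if error rate and p99 stay within the
# budgets from step 3 (baseline + 0.1% errors, baseline + 50ms p99).
canary_healthy() {
  # $1 baseline error %, $2 canary error %, $3 baseline p99 ms, $4 canary p99 ms
  awk -v be="$1" -v ce="$2" -v bp="$3" -v cp="$4" \
    'BEGIN { exit !(ce <= be + 0.1 && cp <= bp + 50) }' \
    && echo promote || echo rollback
}

canary_healthy 0.2 0.25 420 450   # within budget -> promote
canary_healthy 0.2 1.50 420 450   # error spike   -> rollback
```

Tools like Argo Rollouts and Flagger implement exactly this loop against Prometheus queries, with the traffic shifting handled for you.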
When a deployment goes wrong:
kubectl rollout undo or redeploy previous SHA

RULE: Migrations must be backward-compatible with the PREVIOUS version.
(Because during rolling deploy, both versions run simultaneously)
Safe migration pattern:
v1: Add new column (nullable, with default)
v2: Backfill data, start writing to new column
v3: Make new column required, stop writing old column
v4: Drop old column (after v3 is fully deployed)
NEVER in one deploy:
✗ Rename column
✗ Change column type
✗ Drop column still read by current version
✗ Add NOT NULL without default
| Pillar | What | Tools | Priority |
|--------|------|-------|----------|
| Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch | 1 (start here) |
| Logs | Event records | ELK, Loki, CloudWatch Logs | 2 |
| Traces | Request flow across services | Jaeger, Tempo, X-Ray, Honeycomb | 3 |
| Profiling | CPU/memory hot paths | Pyroscope, Parca | 4 (when optimizing) |
# RED Method (request-driven services)
rate: # Requests per second
errors: # Failed requests per second
duration: # Latency distribution (p50, p95, p99)
# USE Method (infrastructure/resources)
utilization: # % of resource in use (CPU, memory, disk)
saturation: # Queue depth, pending work
errors: # Resource errors (OOM, disk full)
# Business Metrics (most important!)
signups_per_hour:
checkout_completion_rate:
api_calls_by_customer:
revenue_per_minute:
# alerting-rules.yml
alerts:
# Symptom-based (good — tells you users are impacted)
- name: HighErrorRate
condition: "error_rate_5xx > 1% for 5m"
severity: critical
runbook: docs/runbooks/high-error-rate.md
notify: [pagerduty, slack-incidents]
- name: HighLatency
condition: "p99_latency > 2s for 5m"
severity: warning
runbook: docs/runbooks/high-latency.md
notify: [slack-incidents]
# Cause-based (supplementary — helps diagnose)
- name: PodCrashLooping
condition: "pod_restart_count increase > 3 in 10m"
severity: warning
notify: [slack-platform]
- name: DiskSpaceWarning
condition: "disk_usage > 80%"
severity: warning
notify: [slack-platform]
- name: CertificateExpiring
condition: "cert_expiry_days < 14"
severity: warning
notify: [slack-platform]
# Alert rules:
# 1. Every alert must have a runbook link
# 2. Every alert must be actionable (if you can't do anything, remove it)
# 3. Critical = wake someone up. Warning = check next business day.
# 4. Review alerts monthly — archive unused, tune noisy ones
{
"timestamp": "2026-02-16T05:00:00.000Z",
"level": "error",
"service": "api",
"trace_id": "abc123",
"span_id": "def456",
"method": "POST",
"path": "/api/orders",
"status": 500,
"duration_ms": 342,
"user_id": "usr_789",
"error": {
"type": "DatabaseError",
"message": "connection timeout",
"stack": "..."
},
"context": {
"order_id": "ord_123",
"payment_method": "card"
}
}
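One payoff of structured logs like the example above: error triage becomes a text pipeline. The grep/cut sketch below assumes compact (no-space) JSON keys; jq is the more robust choice when it is available.

```shell
# Pull trace IDs of error-level lines from compact JSON logs (illustrative;
# prefer jq for anything beyond flat, predictable key layouts).
filter_error_traces() {
  grep '"level":"error"' | grep -o '"trace_id":"[^"]*"' | cut -d'"' -f4
}

printf '%s\n' \
  '{"level":"error","trace_id":"abc123"}' \
  '{"level":"info","trace_id":"zzz999"}' | filter_error_traces   # -> abc123
```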
Log level guide:
- error: Something failed, needs attention
- warn: Unexpected but handled (retry succeeded, fallback used)
- info: Business events (order placed, user signed up, deploy started)
- debug: Technical detail (query executed, cache hit/miss) — OFF in prod

Every service dashboard should have:
Row 1: Traffic Overview
- Request rate (per endpoint)
- Error rate (4xx, 5xx separate)
- Active users / connections
Row 2: Performance
- p50, p95, p99 latency
- Throughput
- Apdex score
Row 3: Resources
- CPU utilization (per pod/instance)
- Memory usage (vs limit)
- Disk I/O / Network I/O
Row 4: Business
- Revenue per minute (if applicable)
- Conversion funnel
- Queue depth / processing lag
Row 5: Dependencies
- Database query latency + connection pool
- External API latency + error rate
- Cache hit rate
| Level | Definition | Response Time | Example |
|-------|-----------|---------------|---------|
| SEV-1 | Complete outage, revenue impact | 15 min | Site down, payments failing |
| SEV-2 | Major feature broken, workaround exists | 30 min | Search broken, checkout slow |
| SEV-3 | Minor feature broken, low impact | 4 hours | Admin panel bug, non-critical API |
| SEV-4 | Cosmetic / no user impact | Next sprint | Typo, minor UI glitch |
1. DETECT (automated or reported)
   → Alert fires / user reports issue
   → Create incident channel: #inc-YYYY-MM-DD-description
2. TRIAGE (first 5 minutes)
   → Assign Incident Commander (IC)
   → Determine severity level
   → Post initial assessment in channel
   → Update status page (if customer-facing)
3. MITIGATE (focus on stopping the bleeding)
   → Can we rollback? → Do it
   → Can we scale up? → Do it
   → Can we feature-flag disable? → Do it
   → DON'T debug root cause yet — restore service first
4. RESOLVE
   → Confirm service restored (metrics, customer reports)
   → Communicate resolution to stakeholders
   → Update status page
5. POST-MORTEM (within 48 hours)
   → Blameless — focus on systems, not people
   → Timeline of events
   → Root cause analysis (5 Whys)
   → Action items with owners and deadlines
   → Share with team
# Incident Post-Mortem: [Title]
**Date:** YYYY-MM-DD
**Duration:** Xh Ym
**Severity:** SEV-X
**Incident Commander:** [name]
**Author:** [name]
## Summary
[1-2 sentence summary of what happened and impact]
## Impact
- Users affected: [number/percentage]
- Revenue impact: [if applicable]
- Duration: [start to full resolution]
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:00 | Deploy v2.3.1 begins |
| 14:05 | Error rate spikes to 15% |
| 14:07 | Alert fires, IC paged |
| 14:12 | Rollback initiated |
| 14:15 | Service restored |
## Root Cause
[Technical explanation — what actually broke and why]
## Contributing Factors
- [Factor 1 — e.g., migration not tested with production data volume]
- [Factor 2 — e.g., canary deployment not configured for this service]
## What Went Well
- [Fast detection — alert fired within 2 minutes]
- [Clear runbook — IC knew rollback procedure]
## What Went Wrong
- [No canary — went straight to 100% rollout]
- [Migration was not backward-compatible]
## Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add canary to deployment | @engineer | YYYY-MM-DD | P1 |
| Add migration backward-compat check | @engineer | YYYY-MM-DD | P1 |
| Update runbook for this service | @sre | YYYY-MM-DD | P2 |
## Lessons Learned
[Key takeaways for the team]
on_call:
rotation: weekly
handoff: Monday 10:00 (overlap 1h with previous)
escalation:
- primary: respond within 15 min
- secondary: auto-page if no ack in 15 min
- manager: auto-page if no ack in 30 min
expectations:
- Laptop + internet within reach
- Respond to page within 15 minutes
- Follow runbook first, improvise second
- Escalate early — "I don't know" is fine
- Update incident channel every 15 min during active incident
wellness:
- No more than 1 week in 4 on-call
- Comp time after major incidents
- Toil budget: <30% of on-call time should be toil
- Quarterly review: are we paging too much?
security_gates:
# Pre-commit
- tool: gitleaks / trufflehog
what: Secret detection in code
block: true
# Build
- tool: semgrep / CodeQL
what: Static analysis (SAST)
block: critical findings
- tool: npm audit / pip audit / cargo audit
what: Dependency vulnerabilities (SCA)
block: critical/high
# Container
- tool: trivy / grype
what: Image vulnerability scan
block: critical
- tool: hadolint
what: Dockerfile best practices
block: error level
# Deploy
- tool: checkov / tfsec
what: IaC security scan
block: high findings
# Runtime
- tool: falco / sysdig
what: Runtime anomaly detection
alert: true
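The pre-commit secret gate can be approximated with a grep over obvious token shapes. Real scanners (gitleaks, trufflehog) ship hundreds of tuned rules, so treat this regex as illustrative only.

```shell
# Crude secret scan sketch: flags AWS access key IDs and PEM private key
# headers in the given files. Not a substitute for gitleaks/trufflehog.
scan_for_secrets() {
  grep -En '(AKIA[0-9A-Z]{16}|-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----)' "$@" \
    && echo "BLOCK: potential secret found" || echo clean
}
```

Run it over staged files in a pre-commit hook and fail the commit whenever it prints BLOCK.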
| Method | Security | Complexity | Best For |
|--------|----------|------------|----------|
| CI/CD env vars | Basic | Low | Small teams, non-critical |
| AWS Secrets Manager / GCP Secret Manager | High | Medium | Cloud-native apps |
| HashiCorp Vault | Very High | High | Multi-cloud, strict compliance |
| SOPS + git | Good | Low | GitOps workflows |
| External Secrets Operator | High | Medium | Kubernetes + cloud secrets |
Rules:
1. Default deny all — explicitly allow what's needed
2. TLS everywhere — including internal service-to-service
3. No public IPs on internal services — use load balancers / API gateways
4. WAF on public endpoints — OWASP Top 10 rules minimum
5. Rate limiting on all APIs — prevent abuse and DDoS
6. DNS for service discovery — never hardcode IPs
7. VPN or zero-trust for admin access — no SSH from internet
8. Network policies in K8s — pods can't talk to everything
9. Egress control — services should only reach what they need
10. Certificate auto-renewal — cert-manager or ACM
# Define SLOs for every user-facing service
service: checkout-api
slos:
availability:
target: 99.95% # 4.38 hours downtime/year
window: 30d rolling
measurement: "successful_requests / total_requests"
latency:
target: 99% # 99% of requests under threshold
threshold: 500ms # p99 < 500ms
window: 30d rolling
freshness:
target: 99.9% # Data updated within SLA
threshold: 5m
window: 30d rolling
error_budget:
monthly_budget: 0.05% # ~21.6 minutes
burn_rate_alert:
fast: 14.4x # Budget consumed in 1 hour → page
slow: 3x # Budget consumed in 10 hours → ticket
policy:
budget_exhausted:
- freeze non-critical deploys
- redirect eng effort to reliability
- review in weekly SRE sync
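The burn-rate thresholds above follow directly from the 0.05% monthly budget: burn rate is the observed error rate divided by the budgeted error rate. A one-liner makes the arithmetic concrete (the 0.05 constant is this service's budget, per the config above).

```shell
# Burn rate = observed error % / budgeted error % (0.05% for checkout-api).
# 14.4x means the whole monthly budget would burn in roughly 1/14.4 of the window.
burn_rate() {
  awk -v err="$1" -v budget=0.05 'BEGIN { printf "%.1f\n", err / budget }'
}

burn_rate 0.72   # -> 14.4  (fast burn: page)
burn_rate 0.15   # -> 3.0   (slow burn: ticket)
```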
Toil = manual, repetitive, automatable, reactive, no lasting value
Track toil:
- Log manual interventions for 2 weeks
- Categorize: deployment, scaling, cert renewal, data fixes, permissions
- Prioritize: frequency × time × frustration
Target: <30% of engineering time on toil
If toil > 50%: stop feature work, automate the top 3 toil items
Common toil automation:
Manual deploys → CI/CD pipeline
Certificate renewal → cert-manager / ACM
Scaling up/down → HPA / auto-scaling groups
Permission requests → Self-service IAM with approval
Data fixes → Admin API / scripts
Dependency updates → Renovate / Dependabot
Flaky test management → Auto-quarantine + ticket
capacity_review:
frequency: monthly
inputs:
- current_utilization: "CPU, memory, disk, network per service"
- growth_rate: "request rate trend over 90 days"
- planned_events: "launches, marketing campaigns, seasonal peaks"
- headroom_target: 30% # Don't run above 70% sustained
formula:
needed_capacity: "current_usage × (1 + growth_rate) × (1 + headroom)"
lead_time: "14 days for cloud, 60+ days for hardware"
actions:
- "If utilization > 70%: plan scaling within 2 weeks"
- "If utilization > 85%: emergency scaling NOW"
- "If utilization < 30%: rightsize down (save money)"
1. Right-size first — most instances are overprovisioned
Check: actual CPU/memory usage vs provisioned (CloudWatch, Datadog)
Action: downsize to next tier that maintains 70% headroom
2. Reserved capacity for baseline — spot/preemptible for burst
Pattern: 60% reserved + 30% on-demand + 10% spot
Savings: 40-70% on reserved vs on-demand
3. Auto-scale to zero when possible
- Dev/staging environments: scale down nights + weekends
- Serverless for bursty workloads (Lambda, Cloud Functions)
4. Delete zombie resources monthly
- Unattached EBS volumes
- Old snapshots (>90 days, not tagged for retention)
- Unused load balancers
- Orphaned Elastic IPs
5. Storage tiering
- Hot: SSD (frequently accessed)
- Warm: HDD (monthly access)
- Cold: S3 Glacier / Archive (yearly access)
- Auto-lifecycle policies on S3 buckets
6. Tag everything — untagged = untracked = wasted
Required tags: environment, team, service, cost-center
Weekly report: cost by tag, highlight untagged resources
## Cloud Cost Review β [Month YYYY]
### Summary
- Total spend: $X,XXX (vs budget: $X,XXX)
- MoM change: +X% ($XXX)
- Top 3 cost drivers: [service1, service2, service3]
### By Service
| Service | Cost | % of Total | MoM Change | Action |
|---------|------|-----------|------------|--------|
| EKS | $XXX | XX% | +X% | Right-size node group |
| RDS | $XXX | XX% | 0% | Consider reserved |
| S3 | $XXX | XX% | +X% | Add lifecycle rules |
### Optimization Actions Taken
- [Action 1]: Saved $XXX/mo
- [Action 2]: Saved $XXX/mo
### Next Month Actions
- [ ] [Action with estimated savings]
Score your team (1-5 per dimension):
| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) |
|-----------|-----------|-------------|----------------|
| CI/CD | Manual deploy | Automated pipeline, manual gate | Full auto with canary, <15 min to prod |
| IaC | Click-ops console | Some Terraform, manual tweaks | 100% IaC, GitOps, drift detection |
| Monitoring | Check when broken | Dashboards + basic alerts | SLOs, error budgets, auto-remediation |
| Incident | Panic + SSH | Runbooks, on-call rotation | Blameless postmortems, chaos engineering |
| Security | Annual audit | CI scanning, secret manager | Shift-left, runtime detection, zero-trust |
| Cost | Surprise bills | Monthly review, some reservations | Real-time tracking, auto-optimization |
Score interpretation:
Say things like:
Machine endpoints, contract coverage, trust signals, runtime metrics, benchmarks, and guardrails for agent-to-agent use.
Machine interfaces
Contract coverage
Status
missing
Auth
None
Streaming
No
Data region
Unspecified
Protocol support
Requires: none
Forbidden: none
Guardrails
Operational confidence: low
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/snapshot"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/contract"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/trust"
Operational fit
Trust signals
Handshake
UNKNOWN
Confidence
unknown
Attempts 30d
unknown
Fallback rate
unknown
Runtime metrics
Observed P50
unknown
Observed P95
unknown
Rate limit
unknown
Estimated cost
unknown
Do not use if
Raw contract, invocation, trust, capability, facts, and change-event payloads for machine-side inspection.
Contract JSON
{
"contractStatus": "missing",
"authModes": [],
"requires": [],
"forbidden": [],
"supportsMcp": false,
"supportsA2a": false,
"supportsStreaming": false,
"inputSchemaRef": null,
"outputSchemaRef": null,
"dataRegion": null,
"contractUpdatedAt": null,
"sourceUpdatedAt": null,
"freshnessSeconds": null
}

Invocation Guide
{
"preferredApi": {
"snapshotUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/snapshot",
"contractUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/contract",
"trustUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/trust"
},
"curlExamples": [
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/snapshot\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/contract\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/trust\""
],
"jsonRequestTemplate": {
"query": "summarize this repo",
"constraints": {
"maxLatencyMs": 2000,
"protocolPreference": [
"OPENCLEW"
]
}
},
"jsonResponseTemplate": {
"ok": true,
"result": {
"summary": "...",
"confidence": 0.9
},
"meta": {
"source": "CLAWHUB",
"generatedAt": "2026-04-17T05:40:38.921Z"
}
},
"retryPolicy": {
"maxAttempts": 3,
"backoffMs": [
500,
1500,
3500
],
"retryableConditions": [
"HTTP_429",
"HTTP_503",
"NETWORK_TIMEOUT"
]
}
}

Trust JSON
{
"status": "unavailable",
"handshakeStatus": "UNKNOWN",
"verificationFreshnessHours": null,
"reputationScore": null,
"p95LatencyMs": null,
"successRate30d": null,
"fallbackRate": null,
"attempts30d": null,
"trustUpdatedAt": null,
"trustConfidence": "unknown",
"sourceUpdatedAt": null,
"freshnessSeconds": null
}

Capability Matrix
{
"rows": [
{
"key": "OPENCLEW",
"type": "protocol",
"support": "unknown",
"confidenceSource": "profile",
"notes": "Listed on profile"
},
{
"key": "simultaneously",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "in",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "we",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "block",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
}
],
"flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:simultaneously|supported|profile capability:in|supported|profile capability:we|supported|profile capability:block|supported|profile"
}

Facts JSON
[
{
"factKey": "docs_crawl",
"category": "integration",
"label": "Crawlable docs",
"value": "6 indexed pages on the official domain",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
},
{
"factKey": "vendor",
"category": "vendor",
"label": "Vendor",
"value": "Openclaw",
"href": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-devops-engine",
"sourceUrl": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-devops-engine",
"sourceType": "profile",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "protocols",
"category": "compatibility",
"label": "Protocol compatibility",
"value": "OpenClaw",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/contract",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/contract",
"sourceType": "contract",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "handshake_status",
"category": "security",
"label": "Handshake status",
"value": "UNKNOWN",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/trust",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-devops-engine/trust",
"sourceType": "trust",
"confidence": "medium",
"observedAt": null,
"isPublic": true
}
]

Change Events JSON
[
{
"eventType": "docs_update",
"title": "Docs refreshed: Sign in to GitHub Β· GitHub",
"description": "Fresh crawlable documentation was indexed for the official domain.",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
}
]