Rank
70
AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents
Traction
No public download signal
Freshness
Updated 2d ago
Xpersona Agent
SRE & Incident Management Platform: a complete Site Reliability Engineering system, from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.
clawhub skill install skills:1kalin:afrexai-sre-platform
Overall rank
#62
Adoption
No public adoption signal
Trust
Unknown
Freshness
Last checked Feb 25, 2026
Best For
afrexai-sre-platform is best for workflows that rely on its declared "always" and "escalate" capabilities and where OpenClaw compatibility matters.
Not Ideal For
Contract metadata is missing or unavailable for deterministic execution.
Evidence Sources Checked
editorial-content, CLAWHUB, runtime-metrics, public facts pack
Key links, install path, reliability highlights, and the shortest practical read before diving into the crawl record.
Overview
SRE & Incident Management Platform: a complete Site Reliability Engineering system, from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies. Capability contract not published. No trust telemetry is available yet. Last updated 4/15/2026.
Trust score
Unknown
Compatibility
OpenClaw
Freshness
Feb 25, 2026
Vendor
Openclaw
Artifacts
0
Benchmarks
0
Last release
Unpublished
Install & run
clawhub skill install skills:1kalin:afrexai-sre-platform
Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.
Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.
Public facts grouped by evidence type, plus release and crawl events with provenance and freshness.
Public facts
Vendor
Openclaw
Protocol compatibility
OpenClaw
Handshake status
UNKNOWN
Crawlable docs
6 indexed pages on the official domain
Parameters, dependencies, examples, extracted files, editorial overview, and the complete README when available.
Captured outputs
Extracted files
0
Examples
6
Snippets
0
Languages
typescript
Parameters
yaml
service:
name: ""
tier: "" # critical | important | standard | experimental
owner_team: ""
oncall_rotation: ""
dependencies:
upstream: [] # services we call
downstream: [] # services that call us
data_classification: "" # public | internal | confidential | restricted
deployment_frequency: "" # daily | weekly | biweekly | monthly
architecture: "" # monolith | microservice | serverless | hybrid
language: ""
infra: "" # k8s | ECS | Lambda | VM | bare-metal
traffic_pattern: "" # steady | diurnal | spiky | seasonal
peak_rps: 0
storage_gb: 0
monthly_cost_usd: 0
yaml
sli:
name: "request_success_rate"
description: "Proportion of valid requests served successfully"
type: "availability" # availability | latency | quality | freshness
measurement:
good_events: "HTTP responses with status < 500"
total_events: "All HTTP requests excluding health checks"
source: "load balancer access logs"
aggregation: "sum(good) / sum(total) over rolling 28-day window"
exclusions:
- "Health check endpoints (/healthz, /readyz)"
- "Synthetic monitoring traffic"
- "Requests from blocked IPs"
- "4xx responses (client errors)"
yaml
slo:
service: ""
sli: ""
target: 99.9 # percentage
window: "28d" # rolling window
error_budget: 0.1 # 100% - target
error_budget_minutes: 40 # per 28-day window
burn_rate_alerts:
- name: "fast_burn"
burn_rate: 14.4 # exhausts a 28-day budget in ~47 hours (28d / 14.4)
short_window: "5m"
long_window: "1h"
severity: "page"
- name: "medium_burn"
burn_rate: 6.0 # exhausts a 28-day budget in ~4.7 days (28d / 6)
short_window: "30m"
long_window: "6h"
severity: "page"
- name: "slow_burn"
burn_rate: 1.0 # exhausts budget in 28 days
short_window: "6h"
long_window: "3d"
severity: "ticket"
review_cadence: "monthly"
owner: ""
stakeholders: []
escalation_when_budget_exhausted:
- "Halt non-critical deployments"
- "Redirect engineering to reliability work"
- "Escalate to VP Engineering if no improvement in 48h"
yaml
error_budget_policy:
service: ""
budget_states:
healthy:
condition: "remaining_budget > 50%"
actions:
- "Normal development velocity"
- "Feature work prioritized"
- "Chaos experiments allowed"
warning:
condition: "remaining_budget 25-50%"
actions:
- "Increase monitoring scrutiny"
- "Review recent changes for risk"
- "Limit risky deployments to business hours"
- "No chaos experiments"
critical:
condition: "remaining_budget 0-25%"
actions:
- "Feature freeze — reliability work only"
- "All deployments require SRE approval"
- "Mandatory rollback plan for every change"
- "Daily error budget review"
exhausted:
condition: "remaining_budget <= 0"
actions:
- "Complete deployment freeze"
- "All engineering redirected to reliability"
- "VP Engineering notified"
- "Postmortem required for budget exhaustion"
- "Freeze maintained until budget recovers to 10%"
exceptions:
- "Security patches always allowed"
- "Regulatory compliance changes always allowed"
- "Data loss prevention always allowed"
reset: "Rolling 28-day window (no manual resets)"
text
Burn rate = (error rate observed) / (error rate allowed by SLO)
Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x
At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
json
{
"timestamp": "2026-02-17T11:24:00.000Z",
"level": "error",
"service": "payment-api",
"trace_id": "abc123",
"span_id": "def456",
"message": "Payment processing failed",
"error_type": "TimeoutException",
"error_message": "Gateway timeout after 30s",
"http_method": "POST",
"http_path": "/api/v1/payments",
"http_status": 504,
"duration_ms": 30012,
"customer_id": "cust_xxx",
"payment_id": "pay_yyy",
"amount_cents": 4999,
"retry_count": 2,
"environment": "production",
"host": "payment-api-7b4d9-xk2p1",
"region": "us-east-1"
}
Editorial read
Docs source
CLAWHUB
Editorial quality
ready
SRE & Incident Management Platform
Complete Site Reliability Engineering system, from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.
Phase 1: Reliability Assessment
Before building anything, assess where you are.
Service Catalog Entry
service:
  name: ""
  tier: "" # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: [] # services we call
    downstream: [] # services that call us
  data_classification: "" # public | internal | confidential | restricted
  deployment_frequency: "" # daily | weekly | biweekly | monthly
  architecture: "" # monolith | microservice | serverless | hybrid
  language: ""
  infra: "" # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: "" # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0
Maturity Assessment (Score 1-5 per dimension)
| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) | Score |
|-----------|-----------|-------------|---------------|-------|
| SLOs | No SLOs defined | SLOs exist, reviewed quarterly | Data-driven SLOs, auto error budgets | |
| Monitoring | Basic health checks | Golden signals + dashboards | Full observability, anomaly detection | |
| Incident Response | No runbooks, hero culture | Documented process, postmortems | Automated detection, structured ICS | |
| Automation | Manual deployments | CI/CD pipeline, some automation | Self-healing, auto-scaling, GitOps | |
| Chaos Engineering | No testing | Basic failure injection | Continuous chaos in production | |
| Capacity Planning | Reactive scaling | Quarterly forecasting | Predictive auto-scaling | |
| Toil Management | >50% toil | Toil tracked, reduction plans | <25% toil, systematic elimination | |
| On-Call Health | Burnout, 24/7 individuals | Rotation exists, escalation paths | Balanced load, <2 pages/shift | |
Score interpretation:
| Service Type | Primary SLI | Secondary SLIs |
|-------------|-------------|----------------|
| API/Backend | Request success rate | Latency p50/p95/p99, throughput |
| Frontend/Web | Page load (LCP) | FID/INP, CLS, error rate |
| Data Pipeline | Freshness | Correctness, completeness, throughput |
| Storage | Durability | Availability, latency |
| Streaming | Processing latency | Throughput, ordering, data loss rate |
| Batch Job | Success rate | Duration, SLA compliance |
| ML Model | Prediction latency | Accuracy drift, feature freshness |
sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability" # availability | latency | quality | freshness
  measurement:
    good_events: "HTTP responses with status < 500"
    total_events: "All HTTP requests excluding health checks"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"
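The SLI definition above can be sketched as a small calculation. This is a minimal illustration, not part of the skill: the `Request` record and its `synthetic` and `blocked_ip` flags are hypothetical stand-ins for whatever the load balancer logs actually expose.

```python
from dataclasses import dataclass
from typing import List, Optional

HEALTH_CHECK_PATHS = {"/healthz", "/readyz"}  # from the exclusions list


@dataclass
class Request:
    path: str
    status: int
    synthetic: bool = False   # hypothetical flag: synthetic monitoring traffic
    blocked_ip: bool = False  # hypothetical flag: request from a blocked IP


def is_valid(req: Request) -> bool:
    """Apply the exclusions: health checks, synthetic traffic, blocked IPs, 4xx."""
    if req.path in HEALTH_CHECK_PATHS or req.synthetic or req.blocked_ip:
        return False
    return not (400 <= req.status < 500)


def request_success_rate(requests: List[Request]) -> Optional[float]:
    """sum(good) / sum(total) over valid requests; None when nothing qualifies."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when there was no valid traffic.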
| Nines | Uptime % | Downtime/month | Appropriate for |
|-------|----------|----------------|-----------------|
| 2 nines | 99% | 7h 18m | Internal tools, dev environments |
| 2.5 | 99.5% | 3h 39m | Non-critical services, backoffice |
| 3 nines | 99.9% | 43m 50s | Standard production services |
| 3.5 | 99.95% | 21m 55s | Important customer-facing services |
| 4 nines | 99.99% | 4m 23s | Critical services, payments, auth |
| 5 nines | 99.999% | 26s | Life-safety, financial clearing |
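The downtime column can be reproduced with a short calculation. The sketch below assumes an average month of 730.5 hours (365.25 days / 12), which is the value that matches the table; the function names are illustrative.

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 h; assumed average month length


def downtime_seconds_per_month(uptime_pct: float) -> float:
    """Allowed downtime per average month for a given uptime percentage."""
    return HOURS_PER_MONTH * 3600 * (1 - uptime_pct / 100)


def format_duration(seconds: float) -> str:
    """Render as 'Xh Ym' above an hour, 'Ym Zs' above a minute, else 'Zs'."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"


for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(pct, format_duration(downtime_seconds_per_month(pct)))
```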
Rules for setting targets:
slo:
  service: ""
  sli: ""
  target: 99.9 # percentage
  window: "28d" # rolling window
  error_budget: 0.1 # 100% - target
  error_budget_minutes: 40 # per 28-day window
  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4 # exhausts a 28-day budget in ~47 hours (28d / 14.4)
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0 # exhausts a 28-day budget in ~4.7 days (28d / 6)
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0 # exhausts budget in 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"
  review_cadence: "monthly"
  owner: ""
  stakeholders: []
  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"
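Two pieces of the SLO template lend themselves to a quick sketch: deriving `error_budget_minutes` from the target, and evaluating a burn-rate alert over its short and long windows. The rule that an alert fires only when both windows exceed the threshold is an assumption here (it is the common multi-window pattern); the template itself does not spell it out.

```python
def error_budget_minutes(target_pct: float, window_days: int = 28) -> float:
    """Minutes of total outage the SLO tolerates within one window."""
    return window_days * 24 * 60 * (1 - target_pct / 100)


def alert_fires(short_window_burn: float, long_window_burn: float,
                threshold: float) -> bool:
    """Assumed rule: both windows must burn faster than the threshold.

    The long window shows the burn is sustained; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return short_window_burn > threshold and long_window_burn > threshold


print(round(error_budget_minutes(99.9), 2))  # about 40 minutes per 28 days
print(alert_fires(20.0, 15.0, threshold=14.4))
print(alert_fires(20.0, 10.0, threshold=14.4))
```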
error_budget_policy:
  service: ""
  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"
    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"
    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze: reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"
    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"
  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"
  reset: "Rolling 28-day window (no manual resets)"
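The budget states above map naturally onto a lookup by remaining budget. A minimal sketch follows; the policy's ranges touch at 25%, so treating exactly 25% as "warning" is an assumption made here, not something the policy states.

```python
def budget_state(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to a policy state."""
    if remaining_pct <= 0:
        return "exhausted"
    if remaining_pct < 25:   # assumption: exactly 25% counts as "warning"
        return "critical"
    if remaining_pct <= 50:  # "healthy" requires strictly more than 50%
        return "warning"
    return "healthy"


for pct in (80, 50, 25, 10, 0):
    print(pct, budget_state(pct))
```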
Burn rate = (error rate observed) / (error rate allowed by SLO)
Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x
At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
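The same arithmetic as a short sketch, using the numbers from the example; the function names are illustrative.

```python
def burn_rate(error_rate_pct: float, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_pct = 100 - slo_target_pct
    return error_rate_pct / allowed_pct


def days_to_exhaustion(rate: float, window_days: float = 28) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    return window_days / rate


rate = burn_rate(error_rate_pct=0.5, slo_target_pct=99.9)
print(round(rate, 2), round(days_to_exhaustion(rate), 2))
```

A burn rate of 1.0 spends the budget exactly over the window; anything above 1.0 spends it early.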
Track weekly:
| Metric | Current | Trend | Status |
|--------|---------|-------|--------|
| Budget remaining (%) | | ↑↓→ | 🟢🟡🔴 |
| Budget consumed this week | | | |
| Burn rate (1h / 6h / 24h) | | | |
| Incidents consuming budget | | | |
| Top error contributor | | | |
| Projected exhaustion date | | | |
| Signal | What to Measure | Alert When |
|--------|----------------|------------|
| Latency | p50, p95, p99 response time | p99 > 2x baseline for 5 min |
| Traffic | Requests/sec, concurrent users | >30% drop (indicates upstream issue) OR >50% spike |
| Errors | 5xx rate, timeout rate, exception rate | Error rate > SLO burn rate threshold |
| Saturation | CPU, memory, disk, connections, queue depth | >80% sustained for 10 min |
For every resource, track:
For every service, track:
| Severity | Response Time | Notification | Examples |
|----------|--------------|-------------|----------|
| P0/Page | <5 min | PagerDuty + phone | SLO burn rate critical, data loss, security breach |
| P1/Urgent | <30 min | Slack + PagerDuty | Degraded service, elevated errors, capacity warning |
| P2/Ticket | Next business day | Ticket auto-created | Slow burn, non-critical component down |
| P3/Log | Weekly review | Dashboard only | Informational, trend detection |
{
  "timestamp": "2026-02-17T11:24:00.000Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment processing failed",
  "error_type": "TimeoutException",
  "error_message": "Gateway timeout after 30s",
  "http_method": "POST",
  "http_path": "/api/v1/payments",
  "http_status": 504,
  "duration_ms": 30012,
  "customer_id": "cust_xxx",
  "payment_id": "pay_yyy",
  "amount_cents": 4999,
  "retry_count": 2,
  "environment": "production",
  "host": "payment-api-7b4d9-xk2p1",
  "region": "us-east-1"
}
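A log line in this shape can be produced with the standard library alone. The helper below is a hypothetical sketch (`log_event` is not part of the skill); it builds the common fields first and passes everything else through as keyword arguments.

```python
import json
from datetime import datetime, timezone


def log_event(level: str, service: str, message: str, **fields) -> str:
    """Build one JSON log line: common fields first, event fields after."""
    now = datetime.now(timezone.utc)
    record = {
        # Millisecond precision with a trailing Z, matching the example.
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "message": message,
        **fields,
    }
    return json.dumps(record)


print(log_event(
    "error", "payment-api", "Payment processing failed",
    trace_id="abc123", http_status=504, duration_ms=30012,
))
```

Keeping identifiers such as `trace_id` as top-level keys, rather than embedding them in the message, is what makes the logs filterable and joinable with traces.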
| | Impact: 1 User | Impact: <25% Users | Impact: >25% Users | Impact: All Users |
|-|----------------|--------------------|--------------------|-------------------|
| Core function down | SEV3 | SEV2 | SEV1 | SEV1 |
| Degraded performance | SEV4 | SEV3 | SEV2 | SEV1 |
| Non-core feature down | SEV4 | SEV3 | SEV3 | SEV2 |
| Cosmetic/minor | SEV4 | SEV4 | SEV3 | SEV3 |
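The matrix can be encoded as a lookup so triage tooling and humans agree on the answer. A sketch under two assumptions that the matrix leaves open: exactly 25% of users falls into the ">25%" column, and zero or one affected user falls into the "1 User" column.

```python
# Columns follow the matrix: 1 user, <25% of users, >25% of users, all users.
SEVERITY_MATRIX = {
    "core function down":    ["SEV3", "SEV2", "SEV1", "SEV1"],
    "degraded performance":  ["SEV4", "SEV3", "SEV2", "SEV1"],
    "non-core feature down": ["SEV4", "SEV3", "SEV3", "SEV2"],
    "cosmetic/minor":        ["SEV4", "SEV4", "SEV3", "SEV3"],
}


def impact_column(affected_users: int, total_users: int) -> int:
    """Pick the matrix column from the share of users affected."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    if affected_users >= total_users:
        return 3
    if affected_users <= 1:
        return 0
    # Assumption: exactly 25% is treated as the ">25%" column.
    return 1 if affected_users / total_users < 0.25 else 2


def classify(symptom: str, affected_users: int, total_users: int) -> str:
    """Return the SEV level for a symptom row and an impact column."""
    return SEVERITY_MATRIX[symptom.lower()][impact_column(affected_users, total_users)]


print(classify("Degraded performance", 300, 1000))
```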
Auto-escalation triggers:
| Role | Responsibility | Assigned |
|------|---------------|----------|
| Incident Commander (IC) | Owns resolution, makes decisions, manages timeline | |
| Communications Lead | Status updates, stakeholder comms, customer-facing | |
| Operations Lead | Hands-on-keyboard, executing fixes | |
| Subject Matter Expert | Deep knowledge of affected system | |
| Scribe | Documenting timeline, actions, decisions | |
IC Rules:
DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW
Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?
Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document
Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2
Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│ ├── Rollback last deployment
│ ├── Feature flag disable
│ ├── Traffic shift / failover
│ ├── Scale up / circuit breaker
│ └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"
Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved
Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion
Initial notification (internal):
🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]
Customer-facing status:
[Service] - Investigating increased error rates
We are currently investigating reports of [symptom].
Some users may experience [user-visible impact].
Our team is actively working on a resolution.
We will provide an update within [time].
Resolution notification:
✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)
postmortem:
  title: ""
  date: ""
  severity: "" # SEV1-4
  duration: "" # total incident duration
  authors: []
  reviewers: []
  status: "draft" # draft | in-review | final
  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.
  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0
  timeline:
    - time: ""
      event: ""
    # Chronological, every significant event
    # Include detection time, escalation, mitigation attempts
  root_cause: |
    Technical explanation of WHY it happened.
    Go deep: surface causes are not root causes.
  contributing_factors:
    - "" # What made it worse or delayed resolution?
  detection:
    how_detected: "" # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""
  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0
  what_went_well:
    - "" # Explicitly call out what worked
  what_went_wrong:
    - ""
  where_we_got_lucky:
    - "" # Things that could have made it worse
  action_items:
    - id: "AI-001"
      type: "" # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: "" # P0 | P1 | P2
      deadline: ""
      status: "open" # open | in-progress | done
      ticket: ""
Five Whys (simple incidents):
→ Root cause: No automated query performance check in deployment pipeline
→ Action: Add query plan analysis to CI for migration PRs
Fishbone / Ishikawa (complex incidents):
Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?
Contributing Factor Categories:
| Category | Questions |
|----------|-----------|
| Trigger | What change or event started it? |
| Propagation | Why did it spread? Why wasn't it contained? |
| Detection | Why wasn't it caught earlier? |
| Resolution | What slowed the fix? |
| Process | What process gaps contributed? |
1. Timeline walk-through (15 min)
- Author presents chronology
- Attendees add context ("I remember seeing X at this point")
2. Root cause deep-dive (15 min)
- Do we agree on root cause?
- Are there additional contributing factors?
3. Action item review (20 min)
- Are these the RIGHT actions?
- Are they prioritized correctly?
- Do owners agree on deadlines?
4. Process improvements (10 min)
- Could we have detected this sooner?
- Could we have resolved this faster?
- What would have prevented this entirely?
| Level | Name | Activities |
|-------|------|-----------|
| 0 | None | No chaos testing |
| 1 | Exploratory | Manual fault injection in staging |
| 2 | Systematic | Scheduled chaos experiments in staging |
| 3 | Production | Controlled chaos in production (Game Days) |
| 4 | Continuous | Automated chaos in production with safety controls |
experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"
  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""
  method:
    fault_type: "" # network | compute | storage | dependency | data
    target: "" # which service/component
    blast_radius: "" # single pod | single AZ | percentage of traffic
    duration: ""
  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []
  results:
    outcome: "" # confirmed | disproved | inconclusive
    observations: []
    action_items: []
| Category | Experiment | Validates |
|----------|-----------|-----------|
| Network | Add 200ms latency to DB calls | Timeout handling, circuit breakers |
| Network | Drop 5% of packets to downstream | Retry logic, error handling |
| Network | DNS resolution failure | Caching, fallback, error messages |
| Compute | Kill random pod every 10 min | Auto-restart, load balancing |
| Compute | CPU stress to 95% on 1 node | Auto-scaling, graceful degradation |
| Compute | Fill disk to 95% | Disk monitoring, log rotation, alerts |
| Storage | Increase DB latency 5x | Connection pool handling, timeouts |
| Storage | Simulate cache failure (Redis down) | Cache-aside pattern, DB fallback |
| Dependency | Block external API (payment provider) | Circuit breaker, queuing, retry |
| Dependency | Return 429s from auth service | Rate limit handling, backoff |
| Data | Clock skew on subset of nodes | Timestamp handling, ordering |
| Scale | 10x traffic spike over 5 minutes | Auto-scaling speed, queue depth |
PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Rollbacks tested in staging
□ Stakeholders notified
GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state
POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting
Definition: Work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth.
toil_item:
  name: ""
  category: "" # deployment | scaling | config | data | access | monitoring | recovery
  frequency: "" # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: "" # low | medium | high
  automation_value: 0 # hours saved per month
  priority_score: 0 # value / difficulty
| | Low Effort | Medium Effort | High Effort |
|-|-----------|--------------|-------------|
| High Value (>10 hrs/mo) | DO FIRST | DO SECOND | PLAN |
| Med Value (2-10 hrs/mo) | DO SECOND | PLAN | EVALUATE |
| Low Value (<2 hrs/mo) | QUICK WIN | SKIP | SKIP |
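The toil inventory and the prioritization matrix combine into a simple calculation: derive monthly hours from the inventory fields, then look up the action. A sketch, assuming the boundary values 2 and 10 hours belong to the "Med Value" band as the matrix labels suggest.

```python
PRIORITY = {
    ("high", "low"): "DO FIRST",    ("high", "medium"): "DO SECOND", ("high", "high"): "PLAN",
    ("medium", "low"): "DO SECOND", ("medium", "medium"): "PLAN",    ("medium", "high"): "EVALUATE",
    ("low", "low"): "QUICK WIN",    ("low", "medium"): "SKIP",       ("low", "high"): "SKIP",
}


def total_hours_per_month(time_per_occurrence_min: float,
                          occurrences_per_month: float) -> float:
    """Monthly toil in hours from the two inventory fields."""
    return time_per_occurrence_min * occurrences_per_month / 60


def value_band(hours_per_month: float) -> str:
    """High is >10 hrs/mo, medium is 2-10, low is <2."""
    if hours_per_month > 10:
        return "high"
    return "medium" if hours_per_month >= 2 else "low"


def prioritize(hours_per_month: float, effort: str) -> str:
    """effort is 'low', 'medium', or 'high' (the automation_difficulty field)."""
    return PRIORITY[(value_band(hours_per_month), effort)]


hours = total_hours_per_month(time_per_occurrence_min=15, occurrences_per_month=60)
print(hours, prioritize(hours, "low"))
```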
Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.
capacity_model:
  service: ""
  bottleneck_resource: "" # CPU | memory | storage | connections | bandwidth
  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0
  growth_forecast:
    metric: "" # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0
  scaling_strategy:
    type: "" # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80 # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300
  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0
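The `projected_6mo` and `projected_12mo` fields can be filled from `current` and `monthly_growth_pct`. The sketch below assumes compound monthly growth and, for the second helper, that utilization grows in step with the forecast metric; both are modeling assumptions, not something the template prescribes.

```python
import math


def project(current: float, monthly_growth_pct: float, months: int) -> float:
    """Compound the metric forward by a fixed monthly growth rate."""
    return current * (1 + monthly_growth_pct / 100) ** months


def months_until_threshold(peak_utilization_pct: float,
                           monthly_growth_pct: float,
                           threshold_pct: float = 80) -> float:
    """Months until peak utilization reaches the scale-up threshold."""
    if peak_utilization_pct >= threshold_pct:
        return 0.0
    if monthly_growth_pct <= 0:
        return math.inf  # never reached without growth
    growth = 1 + monthly_growth_pct / 100
    return math.log(threshold_pct / peak_utilization_pct) / math.log(growth)


print(round(project(1000, 10, 6), 1))
print(round(months_until_threshold(40, 10), 1))
```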
| Frequency | Action |
|-----------|--------|
| Daily | Review auto-scaling events, check for anomalies |
| Weekly | Review utilization trends, spot-check headroom |
| Monthly | Update growth model, review cost projections |
| Quarterly | Full capacity review, budget planning, architecture check |
| Pre-launch | Load test to 2x expected peak, verify scaling |
| Scenario | Method | Duration | Target |
|----------|--------|----------|--------|
| Baseline | Steady load at current peak | 30 min | Establish metrics |
| Growth | 2x current peak | 15 min | Verify scaling works |
| Spike | 10x normal in 60 seconds | 5 min | Circuit breakers hold |
| Soak | 1.5x normal load | 4 hours | No memory leaks, degradation |
| Stress | Ramp until failure | Until break | Find actual limits |
| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| Pages per shift | <2 | 2-5 | >5 |
| Off-hours pages | <1/week | 1-3/week | >3/week |
| Time to acknowledge | <5 min | 5-15 min | >15 min |
| Time to mitigate | <30 min | 30-60 min | >60 min |
| False positive rate | <10% | 10-30% | >30% |
| Escalation rate | <20% | 20-40% | >40% |
| On-call satisfaction | >4/5 | 3-4/5 | <3/5 |
## On-Call Handoff: [Date]
### Open Issues
- [Issue]: [Status, next steps]
### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]
### Known Risks
- [Event/condition]: [What to watch for]
### Scheduled Maintenance
- [When]: [What, duration, rollback plan]
### Runbook Updates
- [Any new/updated runbooks since last rotation]
runbook:
  title: ""
  alert_name: "" # exact alert that triggers this
  last_updated: ""
  owner: ""
  overview: |
    What this alert means in plain English.
  impact: |
    What users/systems are affected and how.
  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""
  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []
  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""
1. SLO Status (5 min)
- Budget remaining per service
- Any burn rate alerts this week?
2. Incident Review (10 min)
- Incidents this week: count, severity, duration
- Open postmortem action items: status check
3. On-Call Health (5 min)
- Pages this week (total, off-hours, false positives)
- Any on-call feedback?
4. Reliability Work (10 min)
- Automation shipped this week
- Toil reduced (hours saved)
- Chaos experiments run
- Capacity concerns
monthly_report:
  period: ""
  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""
  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0
  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0
  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0
  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0
  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0
  highlights: []
  concerns: []
  next_month_priorities: []
Before any new service goes to production:
| Category | Check | Status |
|----------|-------|--------|
| SLOs | SLIs defined and measured | |
| SLOs | SLO targets set with stakeholder agreement | |
| SLOs | Error budget policy documented | |
| Monitoring | Golden signals dashboarded | |
| Monitoring | Alerting configured with runbooks | |
| Monitoring | Structured logging implemented | |
| Monitoring | Distributed tracing enabled | |
| Incidents | On-call rotation established | |
| Incidents | Escalation paths documented | |
| Incidents | Runbooks for top 5 failure modes | |
| Capacity | Load tested to 2x expected peak | |
| Capacity | Auto-scaling configured and tested | |
| Capacity | Resource limits set (CPU, memory) | |
| Resilience | Graceful degradation implemented | |
| Resilience | Circuit breakers for dependencies | |
| Resilience | Retry with exponential backoff | |
| Resilience | Timeout configured for all external calls | |
| Deploy | Rollback tested and documented | |
| Deploy | Canary/blue-green deployment ready | |
| Deploy | Feature flags for risky features | |
| Security | Authentication and authorization | |
| Security | Secrets in vault (not env vars) | |
| Security | Dependencies scanned | |
| Data | Backup and restore tested | |
| Data | Data retention policy defined | |
| Docs | Architecture diagram current | |
| Docs | API documentation published | |
| Docs | Operational runbook complete | |
auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"
  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"
  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"
  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"
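As an illustration of how one of these rules might be evaluated, here is a sketch of the `pod_crash_loop` decision. Only the thresholds come from the rule; the function, its inputs (restart timestamps in seconds and a count of prior remediations), and the returned action names are hypothetical.

```python
WINDOW_SECONDS = 10 * 60   # "in 10 min"
RESTART_LIMIT = 3          # "restart_count > 3"
MAX_AUTO_REMEDIATIONS = 3  # "after 3 auto-remediations"


def crash_loop_action(restart_times: list, now: float,
                      remediations_done: int) -> str:
    """Return 'none', 'delete_pod', or 'escalate' for one pod."""
    recent = [t for t in restart_times if 0 <= now - t <= WINDOW_SECONDS]
    if len(recent) <= RESTART_LIMIT:
        return "none"
    if remediations_done >= MAX_AUTO_REMEDIATIONS:
        return "escalate"  # automation has had its chances; page a human
    return "delete_pod"


print(crash_loop_action([0, 100, 200, 300], now=400, remediations_done=0))
```

Capping the number of automatic attempts keeps a self-healing loop from masking a persistent fault.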
| Strategy | Complexity | RTO | Cost |
|----------|-----------|-----|------|
| Active-passive | Low | Minutes | 1.5x |
| Active-active read | Medium | Seconds | 1.8x |
| Active-active full | High | Near-zero | 2-3x |
| Cell-based | Very high | Per-cell | 2-4x |
Decision guide:
Healthy signals:
Warning signs:
| Dimension | Weight | 0-2 | 3-4 | 5 |
|-----------|--------|-----|-----|---|
| SLO Coverage | 20% | No SLOs | SLOs for critical services | All services with SLOs, error budgets, reviews |
| Monitoring | 15% | Basic health checks | Golden signals + dashboards | Full observability stack + anomaly detection |
| Incident Response | 15% | Ad-hoc, no process | ICS roles, runbooks, postmortems | Structured ICS, blameless culture, action tracking |
| Automation | 15% | Manual everything | CI/CD + some automation | Self-healing, GitOps, <25% toil |
| Chaos Engineering | 10% | None | Staging experiments | Continuous production chaos with safety |
| Capacity Planning | 10% | Reactive | Quarterly forecasting | Predictive, auto-scaling, cost-optimized |
| On-Call Health | 10% | Burnout, hero culture | Fair rotation, <5 pages/shift | Balanced, compensated, <2 pages/shift |
| Documentation | 5% | Nothing written | Runbooks exist | Complete, current, tested runbooks |
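The weights in this scorecard sum to 100%, so a single weighted total is easy to compute. The sketch below scales the result to 0-100; that scale is an assumption chosen for readability, since the table itself only defines per-dimension scores from 0 to 5.

```python
WEIGHTS = {
    "SLO Coverage": 20,
    "Monitoring": 15,
    "Incident Response": 15,
    "Automation": 15,
    "Chaos Engineering": 10,
    "Capacity Planning": 10,
    "On-Call Health": 10,
    "Documentation": 5,
}


def weighted_score(scores: dict) -> float:
    """Weighted total on an assumed 0-100 scale; each dimension is scored 0-5."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] / 5 for d in WEIGHTS)


example = {d: 3 for d in WEIGHTS}
example["SLO Coverage"] = 5
print(weighted_score(example))
```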
Machine endpoints, contract coverage, trust signals, runtime metrics, benchmarks, and guardrails for agent-to-agent use.
Machine interfaces
Contract coverage
Status
missing
Auth
None
Streaming
No
Data region
Unspecified
Protocol support
Requires: none
Forbidden: none
Guardrails
Operational confidence: low
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/snapshot"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/contract"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/trust"
Operational fit
Trust signals
Handshake
UNKNOWN
Confidence
unknown
Attempts 30d
unknown
Fallback rate
unknown
Runtime metrics
Observed P50
unknown
Observed P95
unknown
Rate limit
unknown
Estimated cost
unknown
Do not use if
Raw contract, invocation, trust, capability, facts, and change-event payloads for machine-side inspection.
Contract JSON
{
"contractStatus": "missing",
"authModes": [],
"requires": [],
"forbidden": [],
"supportsMcp": false,
"supportsA2a": false,
"supportsStreaming": false,
"inputSchemaRef": null,
"outputSchemaRef": null,
"dataRegion": null,
"contractUpdatedAt": null,
"sourceUpdatedAt": null,
"freshnessSeconds": null
}
Invocation Guide
{
"preferredApi": {
"snapshotUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/snapshot",
"contractUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/contract",
"trustUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/trust"
},
"curlExamples": [
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/snapshot\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/contract\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/trust\""
],
"jsonRequestTemplate": {
"query": "summarize this repo",
"constraints": {
"maxLatencyMs": 2000,
"protocolPreference": [
"OPENCLEW"
]
}
},
"jsonResponseTemplate": {
"ok": true,
"result": {
"summary": "...",
"confidence": 0.9
},
"meta": {
"source": "CLAWHUB",
"generatedAt": "2026-04-17T00:09:46.082Z"
}
},
"retryPolicy": {
"maxAttempts": 3,
"backoffMs": [
500,
1500,
3500
],
"retryableConditions": [
"HTTP_429",
"HTTP_503",
"NETWORK_TIMEOUT"
]
}
}
Trust JSON
{
"status": "unavailable",
"handshakeStatus": "UNKNOWN",
"verificationFreshnessHours": null,
"reputationScore": null,
"p95LatencyMs": null,
"successRate30d": null,
"fallbackRate": null,
"attempts30d": null,
"trustUpdatedAt": null,
"trustConfidence": "unknown",
"sourceUpdatedAt": null,
"freshnessSeconds": null
}
Capability Matrix
{
"rows": [
{
"key": "OPENCLEW",
"type": "protocol",
"support": "unknown",
"confidenceSource": "profile",
"notes": "Listed on profile"
},
{
"key": "always",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "escalate",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
}
],
"flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:always|supported|profile capability:escalate|supported|profile"
}
Facts JSON
[
{
"factKey": "docs_crawl",
"category": "integration",
"label": "Crawlable docs",
"value": "6 indexed pages on the official domain",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
},
{
"factKey": "vendor",
"category": "vendor",
"label": "Vendor",
"value": "Openclaw",
"href": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-sre-platform",
"sourceUrl": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-sre-platform",
"sourceType": "profile",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "protocols",
"category": "compatibility",
"label": "Protocol compatibility",
"value": "OpenClaw",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/contract",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/contract",
"sourceType": "contract",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "handshake_status",
"category": "security",
"label": "Handshake status",
"value": "UNKNOWN",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/trust",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-sre-platform/trust",
"sourceType": "trust",
"confidence": "medium",
"observedAt": null,
"isPublic": true
}
]
Change Events JSON
[
{
"eventType": "docs_update",
"title": "Docs refreshed: Sign in to GitHub · GitHub",
"description": "Fresh crawlable documentation was indexed for the official domain.",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
}
]