Rank
70
AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents
Traction
No public download signal
Freshness
Updated 2d ago
Xpersona Agent
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.
clawhub skill install skills:1kalin:afrexai-observability-engine
Overall rank
#62
Adoption
No public adoption signal
Trust
Unknown
Freshness
Last checked Feb 25, 2026
Best For
afrexai-observability-engine is best for observability and reliability engineering workflows where OpenClaw compatibility matters.
Not Ideal For
Workflows that require a published capability contract: contract metadata is missing or unavailable, so deterministic execution cannot be verified.
Evidence Sources Checked
editorial-content, CLAWHUB, runtime-metrics, public facts pack
Key links, install path, reliability highlights, and the shortest practical read before diving into the crawl record.
Overview
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization. Capability contract not published. No trust telemetry is available yet. Last updated 4/15/2026.
Trust score
Unknown
Compatibility
OpenClaw
Freshness
Feb 25, 2026
Vendor
Openclaw
Artifacts
0
Benchmarks
0
Last release
Unpublished
Install & run
clawhub skill install skills:1kalin:afrexai-observability-engine
Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.
Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.
Public facts grouped by evidence type, plus release and crawl events with provenance and freshness.
Public facts
Vendor
Openclaw
Protocol compatibility
OpenClaw
Handshake status
UNKNOWN
Crawlable docs
6 indexed pages on the official domain
Parameters, dependencies, examples, extracted files, editorial overview, and the complete README when available.
Captured outputs
Extracted files
0
Examples
6
Snippets
0
Languages
typescript
Parameters
text
Editorial read
Docs source
CLAWHUB
Editorial quality
ready
Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.
Score your current observability posture:
| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |
- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.
Application → Structured JSON → Log Router → Storage → Query Engine
↓
Alert Pipeline
| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
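As a sketch, the canonical schema above can be expressed as a TypeScript type. Field names follow the table; the sample values are illustrative, not part of the skill itself.

```typescript
// Canonical log entry, mirroring the field table above.
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

interface LogEntry {
  timestamp: string;    // ISO-8601 UTC
  level: LogLevel;
  service: string;      // which service
  version: string;      // which deploy
  environment: string;  // which env
  message: string;
  trace_id?: string;    // request correlation
  span_id?: string;     // operation within trace
  duration_ms?: number; // how long
}

const entry: LogEntry = {
  timestamp: new Date().toISOString(),
  level: 'info',
  service: 'payment-api',
  version: 'v2.3.1',
  environment: 'production',
  message: 'Payment processed successfully',
  trace_id: 'abc123def456',
  duration_ms: 142,
};
```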
# HTTP request context
http:
method: POST
path: /api/v1/orders
status: 201
client_ip: 203.0.113.42 # Anonymize in logs if needed
user_agent: "Mozilla/5.0..."
request_id: "req_abc123"
# Business context
business:
user_id: "usr_456"
tenant_id: "tenant_789"
order_id: "ord_012"
action: "checkout"
amount_cents: 4999
currency: "USD"
# Error context
error:
type: "PaymentDeclinedError"
message: "Card declined: insufficient funds"
code: "CARD_DECLINED"
stack: "..." # Only in non-production or DEBUG level
retry_count: 2
retryable: true
Is the process about to crash?
→ FATAL (exit after logging)
Did an operation fail that needs human attention?
→ ERROR (page someone or create ticket)
Did something unexpected happen but we recovered?
→ WARN (review in daily triage)
Is this a normal business event worth recording?
→ INFO (audit trail, business metrics)
Is this useful for debugging but noisy in production?
→ DEBUG (off in prod, on in staging)
Is this only useful when stepping through code?
→ TRACE (never in production)
scrub_patterns:
# Always redact
- field_patterns: ["password", "secret", "token", "api_key", "authorization"]
action: replace_with_redacted
# Hash for correlation without exposure
- field_patterns: ["email", "phone", "ssn", "national_id"]
action: sha256_hash
# Mask partially
- field_patterns: ["credit_card", "card_number"]
action: mask_last_4 # "****-****-****-1234"
# IP anonymization
- field_patterns: ["client_ip", "ip_address"]
action: zero_last_octet # 203.0.113.0
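A minimal sketch of this scrub policy applied at the logger boundary. The field lists mirror the YAML above; the `scrub` helper and its match-by-substring behavior are assumptions for illustration, not part of the skill.

```typescript
import { createHash } from 'node:crypto';

// Field lists mirror the scrub_patterns config above.
const REDACT = ['password', 'secret', 'token', 'api_key', 'authorization'];
const HASH = ['email', 'phone', 'ssn', 'national_id'];
const MASK = ['credit_card', 'card_number'];
const IP = ['client_ip', 'ip_address'];

function scrub(fields: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (REDACT.some((p) => key.includes(p))) {
      out[key] = '[REDACTED]';                  // always redact
    } else if (HASH.some((p) => key.includes(p))) {
      // Hash for correlation without exposure
      out[key] = createHash('sha256').update(value).digest('hex');
    } else if (MASK.some((p) => key.includes(p))) {
      out[key] = '****-****-****-' + value.slice(-4); // mask partially
    } else if (IP.some((p) => key.includes(p))) {
      out[key] = value.replace(/\.\d+$/, '.0'); // zero last octet
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

For example, `scrub({ password: 'hunter2', client_ip: '203.0.113.42' })` yields a redacted password and `203.0.113.0`.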
Node.js (Pino):
import pino from 'pino';
import { AsyncLocalStorage } from 'node:async_hooks';
const als = new AsyncLocalStorage<Record<string, string>>();
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
mixin: () => als.getStore() ?? {},
redact: ['req.headers.authorization', '*.password', '*.token'],
timestamp: pino.stdTimeFunctions.isoTime,
});
// Middleware: inject context
app.use((req, res, next) => {
const ctx = {
trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
request_id: crypto.randomUUID(),
service: 'payment-api',
version: process.env.APP_VERSION,
};
als.run(ctx, () => next());
});
Python (structlog):
import structlog
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso", utc=True),
structlog.processors.JSONRenderer(),
],
)
log = structlog.get_logger()
# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
Go (zerolog):
log := zerolog.New(os.Stdout).With().
Timestamp().
Str("service", "payment-api").
Str("version", version).
Logger()
// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()
| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |
| # | Anti-Pattern | Fix |
|---|-------------|-----|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |
For every service endpoint, track:
| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |
For every resource (CPU, memory, disk, network):
| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |
| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |
| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |
Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
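Why histograms aggregate: Prometheus-style buckets are cumulative counters, so merging instances is element-wise addition, which is what `sum(rate(..._bucket)) by (le)` relies on; pre-computed percentiles cannot be combined this way. A sketch (bucket bounds are illustrative):

```typescript
// Minimal cumulative-bucket histogram, Prometheus-style.
const BOUNDS = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds; +Inf is implicit

function observe(buckets: number[], value: number): void {
  // Cumulative: a value increments every bucket whose bound >= value.
  BOUNDS.forEach((bound, i) => {
    if (value <= bound) buckets[i] += 1;
  });
  buckets[BOUNDS.length] += 1; // +Inf bucket counts everything
}

function merge(a: number[], b: number[]): number[] {
  // Fleet-wide aggregation is just element-wise addition.
  return a.map((count, i) => count + b[i]);
}

// Two "instances" observing latencies independently:
const inst1 = new Array(BOUNDS.length + 1).fill(0);
const inst2 = new Array(BOUNDS.length + 1).fill(0);
[0.04, 0.3, 1.8].forEach((v) => observe(inst1, v));
[0.09, 0.6].forEach((v) => observe(inst2, v));
const fleet = merge(inst1, inst2); // valid fleet-wide distribution
```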
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio
# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
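A partial check for rules 1 and 3 can be sketched as follows. The helper is illustrative, not a real linter:

```typescript
// Checks rule 1 (snake_case) and rule 3 (_total suffix for counters).
function isValidMetricName(
  name: string,
  type: 'counter' | 'gauge' | 'histogram',
): boolean {
  const snakeCase = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(name);
  const counterSuffix = type !== 'counter' || name.endsWith('_total');
  return snakeCase && counterSuffix;
}
```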
| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
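Path normalization from the third rule can be sketched as below. The numeric and UUID patterns are assumptions; extend them for your own ID formats:

```typescript
// Replace ID-like path segments before using the path as a metric label.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePath(path: string): string {
  return path
    .split('/')
    .map((seg) => (/^\d+$/.test(seg) || UUID.test(seg) ? ':id' : seg))
    .join('/');
}
```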
application_metrics:
# HTTP layer
- http_request_duration_seconds: histogram {method, path, status}
- http_request_size_bytes: histogram {method, path}
- http_response_size_bytes: histogram {method, path}
- http_requests_in_flight: gauge
# Business logic
- orders_processed_total: counter {status, payment_method}
- order_value_dollars: histogram {payment_method}
- user_signups_total: counter {source}
# Dependencies
- db_query_duration_seconds: histogram {query_type, table}
- db_connections_active: gauge {pool}
- db_connections_idle: gauge {pool}
- cache_requests_total: counter {result: hit|miss}
- external_api_duration_seconds: histogram {service, endpoint}
- external_api_errors_total: counter {service, error_type}
# Queue / async
- queue_messages_published_total: counter {queue}
- queue_messages_consumed_total: counter {queue, status}
- queue_processing_duration_seconds: histogram {queue}
- queue_depth: gauge {queue}
- queue_consumer_lag: gauge {queue, consumer_group}
infrastructure_metrics:
# Node exporter / cAdvisor provides these automatically
- cpu_usage_percent: gauge {instance}
- memory_usage_bytes: gauge {instance}
- disk_usage_bytes: gauge {instance, mount}
- disk_io_seconds: counter {instance, device}
- network_bytes: counter {instance, direction}
- container_cpu_usage: gauge {pod, container}
- container_memory_usage: gauge {pod, container}
| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |
Client Request
→ API Gateway (root span)
→ Auth Service (child span)
→ Order Service (child span)
→ Database Query (child span)
→ Payment Service (child span)
→ Stripe API (child span)
→ Notification Service (child span)
→ Email Provider (child span)
Auto-instrumentation (Node.js):
// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
'@opentelemetry/instrumentation-express': { enabled: true },
})],
serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();
Custom spans for business logic:
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service');
async function processPayment(order: Order) {
return tracer.startActiveSpan('process-payment', async (span) => {
span.setAttributes({
'order.id': order.id,
'order.amount_cents': order.amountCents,
'payment.method': order.paymentMethod,
});
try {
const result = await chargeCard(order);
span.setAttributes({ 'payment.status': result.status });
return result;
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
span.recordException(err);
throw err;
} finally {
span.end();
}
});
}
| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |
Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
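A head-sampling sketch of that recommendation: hashing the trace_id instead of rolling a die means every service reaches the same decision for the same trace. The hash function and default ratio here are illustrative:

```typescript
// Parent-based + probabilistic head sampler.
function shouldSample(
  traceId: string,
  parentSampled: boolean | null, // null = no upstream decision
  ratio = 0.1,                   // 10% probabilistic default
): boolean {
  if (parentSampled !== null) return parentSampled; // respect upstream
  // Fold the trace_id into a deterministic number in [0, 1)
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h / 0xffffffff < ratio;
}
```

Because the decision is a pure function of the trace_id, retries and sibling services agree without coordination.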
| Header | Standard | Format |
|--------|----------|--------|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |
Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
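A sketch of parsing the traceparent format above (version 00 only; returning null on malformed or all-zero IDs lets the caller start a fresh trace):

```typescript
interface TraceContext {
  traceId: string; // 32 hex chars
  spanId: string;  // 16 hex chars
  sampled: boolean;
}

// Parses "00-{trace_id}-{span_id}-{flags}" per W3C Trace Context.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero is invalid
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```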
| Volume | Solution | Retention | |--------|----------|-----------| | <50 GB/day | Jaeger + Elasticsearch | 7 days | | 50-500 GB/day | Tempo + S3 | 14 days | | 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days | | Budget-constrained | Jaeger + Badger (local disk) | 3 days |
| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |
slo:
name: "Payment API Availability"
service: payment-api
owner: payments-team
sli:
type: availability
definition: "Proportion of non-5xx responses"
measurement: |
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
target: 99.95% # 21.9 min downtime/month
window: rolling_30d
error_budget:
total_minutes: 21.9 # per 30 days
burn_rate_alerts:
- severity: critical
burn_rate: 14.4x # Budget consumed in ~2 days
short_window: 5m
long_window: 1h
- severity: warning
burn_rate: 6x # Budget consumed in 5 days
short_window: 30m
long_window: 6h
- severity: ticket
burn_rate: 1x # Budget consumed in 30 days
short_window: 6h
long_window: 3d
consequences:
budget_remaining_above_50pct: "Normal development velocity"
budget_remaining_20_to_50pct: "Prioritize reliability work"
budget_remaining_below_20pct: "Feature freeze; reliability only"
budget_exhausted: "All hands on reliability until budget recovers"
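The arithmetic behind these burn-rate thresholds: burn rate is the observed error rate divided by the error budget (1 − target), and at burn rate B a 30-day budget lasts 30/B days. A sketch (function names are illustrative):

```typescript
// Burn rate: how fast the error budget is being consumed.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget); // budget = 1 - target
}

// At a sustained burn rate, the budget window is exhausted in window/rate.
function daysToExhaustion(rate: number, windowDays = 30): number {
  return windowDays / rate;
}
```

For the 99.95% target above, a sustained 0.72% error rate is a 14.4x burn, exhausting the 30-day budget in roughly two days.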
| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|-------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
# Weekly error budget review template
error_budget_review:
week: "2026-W08"
service: payment-api
slo_target: 99.95%
budget:
total_minutes_this_period: 21.9
consumed_minutes: 8.2
remaining_minutes: 13.7
remaining_percent: 62.6%
incidents_consuming_budget:
- date: "2026-02-18"
duration_minutes: 5.1
cause: "Database connection pool exhaustion"
preventable: true
action: "Increase pool size + add saturation alert"
- date: "2026-02-20"
duration_minutes: 3.1
cause: "Upstream payment provider timeout"
preventable: false
action: "Add circuit breaker with fallback"
velocity_decision: "Normal — 62.6% budget remaining"
reliability_work_this_week:
- "Add connection pool saturation alert"
- "Implement circuit breaker for payment provider"
| Severity | Response Time | Channel | Who | Example |
|----------|--------------|---------|-----|---------|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use for: 5m to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |
groups:
- name: payment-api-slo
rules:
- alert: PaymentAPIHighErrorRate
expr: |
(
sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
) > 0.01
for: 5m
labels:
severity: critical
service: payment-api
team: payments
annotations:
summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
description: "5xx error rate has exceeded 1% for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-errors"
dashboard: "https://grafana.internal/d/payment-api"
- alert: PaymentAPINoTraffic
expr: |
sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
for: 5m
labels:
severity: critical
service: payment-api
annotations:
summary: "Payment API receiving zero traffic for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"
- alert: PaymentAPILatencyHigh
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
runbook: "https://wiki.internal/runbooks/payment-api-latency"
# Runbook: PaymentAPIHighErrorRate
## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.
## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)
## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
- Database: [dashboard link]
- Stripe API: [status page]
- Redis cache: [dashboard link]
4. Check application logs:
kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'
## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |
## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging
## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min
L1: Executive / Business Dashboard (non-technical stakeholders)
↓
L2: Service Overview Dashboard (on-call, quick triage)
↓
L3: Service Deep-Dive Dashboard (debugging specific service)
↓
L4: Infrastructure Dashboard (resource-level details)
panels:
- title: "Revenue per Minute"
type: stat
query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
- title: "Active Users (5min)"
type: stat
query: "count(count by (user_id) (http_requests_total{...}[5m]))"
- title: "Checkout Success Rate"
type: gauge
query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
thresholds: [95, 98, 99.5]
- title: "Error Budget Remaining"
type: gauge
query: "1 - (error_budget_consumed / error_budget_total)"
Every service gets one of these with identical layout:
row_1_traffic:
- "Request Rate (rps)" — timeseries, by status code
- "Error Rate (%)" — timeseries, threshold line at SLO
- "Active Requests" — gauge
row_2_latency:
- "Latency Distribution" — heatmap
- "p50 / p95 / p99" — timeseries, threshold lines
- "Latency by Endpoint" — table, sorted by p99
row_3_dependencies:
- "Downstream Latency" — timeseries per dependency
- "Downstream Error Rate" — timeseries per dependency
- "Database Query Duration" — timeseries by query type
row_4_resources:
- "CPU Usage" — timeseries per pod
- "Memory Usage" — timeseries per pod
- "Pod Restarts" — stat
row_5_business:
- "Business Metric 1" — service-specific
- "Business Metric 2" — service-specific
| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |
| Role | Responsibility | Who |
|------|---------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |
1. DETECT
- Alert fires → on-call paged
- Customer report → support escalates
- Internal discovery → engineer reports
2. TRIAGE (first 5 minutes)
- Confirm the issue is real (not false alert)
- Classify severity (SEV-1 through SEV-4)
- Open incident channel: #inc-YYYY-MM-DD-short-description
- Assign roles (IC, Tech Lead, Comms)
3. MITIGATE (next 5-30 minutes)
- Goal: STOP THE BLEEDING, not find root cause
- Options (try in order):
a. Rollback last deploy
b. Scale up / restart pods
c. Toggle feature flag off
d. Redirect traffic / enable fallback
e. Manual data fix
- Document every action with timestamp
4. STABILIZE
- Confirm mitigation is working (metrics back to normal)
- Monitor for 15-30 min for recurrence
- Update status page: "Monitoring fix"
5. RESOLVE
- Confirm all metrics healthy for 30+ min
- Update status page: "Resolved"
- Schedule post-mortem (within 48 hours for SEV-1/2)
- Send internal summary to stakeholders
📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie
Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes
Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved
post_mortem:
title: "Payment API Connection Pool Exhaustion"
date: "2026-02-22"
severity: SEV-2
duration: 27 minutes (14:23 — 14:50 UTC)
authors: ["@alice", "@bob"]
reviewers: ["@engineering-leads"]
status: action_items_in_progress
summary: |
A deployment at 14:15 introduced a connection leak in the payment API.
Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
checkout requests. Rolled back at 14:31; recovered by 14:50.
impact:
user_impact: "~340 users saw checkout failures over 27 minutes"
revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
data_impact: "No data loss. 12 orders failed; users could retry successfully."
timeline:
- time: "14:15"
event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
- time: "14:23"
event: "PaymentAPIHighErrorRate alert fired"
- time: "14:25"
event: "IC assigned, confirmed via dashboard"
- time: "14:28"
event: "Root cause identified: new ORM query not releasing connections"
- time: "14:31"
event: "Rollback initiated: v2.3.1 → v2.3.0"
- time: "14:35"
event: "Error rate declining"
- time: "14:50"
event: "Resolved: error rate <0.1% sustained"
root_cause: |
The v2.3.1 deploy introduced a new database query in the order validation
path. The query used a raw connection instead of the pool's managed client,
so connections were acquired but never released. Under load, the pool
exhausted within 8 minutes.
contributing_factors:
- "No integration test for connection pool behavior under load"
- "Connection pool saturation metric existed but had no alert"
- "Code review didn't catch raw connection usage"
what_went_well:
- "Alert fired within 8 minutes of deploy"
- "IC assigned in 2 minutes"
- "Root cause identified in 3 minutes (clear in logs)"
- "Rollback executed cleanly"
what_went_wrong:
- "8-minute detection gap after deploy"
- "No canary deployment to catch before full rollout"
- "Connection pool saturation had no alert"
action_items:
- action: "Add connection pool saturation alert (>80% for 2 min)"
owner: "@bob"
priority: P1
due: "2026-02-25"
status: in_progress
ticket: "ENG-1234"
- action: "Enable canary deployments for payment-api"
owner: "@alice"
priority: P1
due: "2026-03-01"
ticket: "ENG-1235"
- action: "Add linting rule: no raw DB connections in application code"
owner: "@charlie"
priority: P2
due: "2026-03-07"
ticket: "ENG-1236"
- action: "Load test payment-api connection pool in staging"
owner: "@bob"
priority: P2
due: "2026-03-07"
ticket: "ENG-1237"
lessons_learned:
- "Resource saturation metrics need alerts, not just dashboards"
- "Canary deployments are mandatory for Tier 0 services"
- "ORM abstractions don't guarantee connection safety — review raw queries"
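The slo_impact numbers in the post-mortem above follow from simple error-budget arithmetic. A minimal sketch, assuming a 99.95% availability SLO and an average month of 30.42 days (the exact budget-accounting convention your SLO tooling uses may differ):

```python
def monthly_error_budget_minutes(slo: float, days: float = 30.42) -> float:
    """Minutes of full unavailability allowed per month for a given SLO."""
    return days * 24 * 60 * (1 - slo)

def budget_consumed_fraction(minutes_burned: float, slo: float) -> float:
    """Fraction of the monthly error budget an incident consumed."""
    return minutes_burned / monthly_error_budget_minutes(slo)

budget = monthly_error_budget_minutes(0.9995)       # ~21.9 minutes
consumed = budget_consumed_fraction(5.1, 0.9995)    # ~0.23, i.e. the 23% above
```

With these assumptions, burning 5.1 minutes of a 21.9-minute budget is about 23%, matching the post-mortem's slo_impact line.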
1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in
Problem: 5xx errors in payment API
Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this
Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
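The failure mode in Why 2 through Why 4, and the safe pattern the proposed linting rule should enforce, can be sketched with a toy pool. This is illustrative only: the class and method names are hypothetical, not the ORM from the incident.

```python
from contextlib import contextmanager

class ConnectionPool:
    """Toy connection pool illustrating the leak from the 5 Whys above."""
    def __init__(self, size: int):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")  # the 5xx trigger
        self.in_use += 1
        return object()

    def release(self, conn) -> None:
        self.in_use -= 1

    @contextmanager
    def connection(self):
        # Managed client: release is guaranteed even if the query raises.
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)

pool = ConnectionPool(size=2)

# Buggy pattern (Why 3): a raw acquire with no release leaks a connection.
leaked = pool.acquire()

# Safe pattern the linting rule should mandate: always use the manager.
with pool.connection() as conn:
    pass  # run the query here; the connection is released on exit
```

After the `with` block exits, only the raw acquire is still held, which is exactly why the raw API exhausts the pool under sustained load while the managed client does not.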
on_call:
rotation: weekly
handoff_day: Monday 10:00 UTC
primary:
response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
escalation_after: 15 minutes no-ack
secondary:
response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
escalation_after: 30 minutes no-ack
manager_escalation:
trigger: SEV-1 unresolved after 30 minutes
handoff_checklist:
- Review open incidents and active alerts
- Check error budget status for all services
- Read post-mortems from previous week
- Verify PagerDuty schedule and contact info
- Test alert routing (send test page)
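One reading of the escalation chain above, for an unacknowledged page, can be expressed as a small routing function. This is a simplification under stated assumptions: it folds the primary's 15-minute and secondary's 30-minute no-ack windows into a single timeline and ignores severity-specific response times, overrides, and paging retries that a real scheduler handles.

```python
def current_pager(minutes_unacked: float) -> str:
    """Who should be paged for a still-unacknowledged alert:
    primary for the first 15 minutes, then secondary for the next 30,
    then the engineering manager."""
    if minutes_unacked < 15:
        return "primary"
    if minutes_unacked < 15 + 30:
        return "secondary"
    return "manager"
```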
| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
on_call_review:
week: "2026-W08"
engineer: "@bob"
incidents:
total: 7
sev_1: 0
sev_2: 1
sev_3: 4
false_positives: 2
after_hours: 3
time_spent:
incident_response: "4.5 hours"
toil_automation: "2 hours"
runbook_updates: "1 hour"
improvements_made:
- "Silenced noisy disk alert on dev servers"
- "Added auto-remediation for pod restart threshold"
improvements_needed:
- "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
- "Payment retry logic needs circuit breaker (caused 3 alerts)"
handoff_notes: |
Watch payment-api p99 latency — it's been creeping up since Wednesday.
Stripe changed their sandbox endpoints; staging may throw errors.
chaos_experiment:
name: "Payment DB failover"
hypothesis: "If the primary database becomes unavailable, traffic should
failover to the replica within 30 seconds with <1% error rate spike"
steady_state:
- metric: "checkout_success_rate"
expected: ">99.5%"
- metric: "db_query_duration_p99"
expected: "<200ms"
injection:
type: "network_partition"
target: "payment-db-primary"
duration: "5 minutes"
blast_radius: "single AZ"
abort_conditions:
- "checkout_success_rate < 95% for > 60 seconds"
- "revenue_per_minute drops > 50%"
- "any SEV-1 incident declared"
results:
failover_time: "22 seconds"
error_spike: "0.3% for 25 seconds"
hypothesis_confirmed: true
follow_up_actions:
- "Document failover behavior in runbook"
- "Add failover time as SLI (target: <30s)"
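An experiment runner needs to evaluate the abort_conditions above continuously during injection. A minimal sketch, with metric names mirroring the spec (the dict shape and the pre-aggregated `revenue_drop` field are assumptions about your metrics pipeline):

```python
def should_abort(metrics: dict, seconds_below_success_threshold: float) -> bool:
    """Return True if any abort condition from the experiment spec holds."""
    if metrics["checkout_success_rate"] < 0.95 and seconds_below_success_threshold > 60:
        return True
    if metrics["revenue_drop"] > 0.50:          # revenue_per_minute down >50%
        return True
    if metrics["sev1_declared"]:                # any SEV-1 incident
        return True
    return False

healthy = {"checkout_success_rate": 0.997, "revenue_drop": 0.02, "sev1_declared": False}
degraded = {"checkout_success_rate": 0.93, "revenue_drop": 0.10, "sev1_declared": False}
```

The key design point is that abort checks run on a tight loop independent of the injection itself, so the experiment halts even if the injection tooling is the thing misbehaving.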
| Level | What You Test | Tools |
|-------|--------------|-------|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |
| # | Driver | Typical % of Bill | Optimization |
|---|--------|-------------------|--------------|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |
cost_optimization:
logs:
- action: "Drop DEBUG/TRACE in production"
savings: "30-50% of log volume"
- action: "Sample health check logs (1:100)"
savings: "5-15% of log volume"
- action: "Deduplicate identical error bursts"
savings: "10-20% during incidents"
- action: "Move logs older than 7 days to S3/cold storage"
savings: "60-80% of storage cost"
- action: "Drop request/response body logging"
savings: "20-40% of log volume"
metrics:
- action: "Audit unused metrics (no dashboard, no alert)"
savings: "10-30% of series"
- action: "Reduce histogram bucket count (default 11 → 8)"
savings: "~27% of histogram series"
- action: "Remove high-cardinality labels"
savings: "Variable — can be massive"
- action: "Increase scrape interval for non-critical metrics (15s → 60s)"
savings: "75% of data points for those metrics"
traces:
- action: "Implement tail-based sampling"
savings: "80-95% of trace volume"
- action: "Drop internal health check traces"
savings: "5-20% of trace volume"
- action: "Reduce span attribute size (truncate long strings)"
savings: "10-30% of trace storage"
general:
- action: "Review and right-size retention policies quarterly"
- action: "Set query timeouts and result limits on dashboards"
- action: "Use recording rules for expensive queries"
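The first two log actions above (drop DEBUG, sample health-check logs 1:100) can be implemented at the emitter with a standard-library logging filter. A minimal sketch; matching health checks on the message substring `/health` is a simplifying assumption about your log format:

```python
import logging
import random

class CostFilter(logging.Filter):
    """Drop DEBUG/TRACE-level records in production and keep roughly
    1 in 100 health-check log lines."""
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno <= logging.DEBUG:
            return False                      # drop verbose records entirely
        if "/health" in record.getMessage():
            return random.random() < 0.01     # sample ~1:100
        return True                           # keep everything else

handler = logging.StreamHandler()
handler.addFilter(CostFilter())
```

Filtering at the emitter saves ingestion cost directly; the same rules can instead live in a collector (e.g. an OpenTelemetry or vector-style pipeline) if you want them centrally managed.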
observability_cost_review:
month: "February 2026"
total_cost: "$X,XXX"
breakdown:
logs: { volume: "X TB", cost: "$X", pct: "X%" }
metrics: { series: "X million", cost: "$X", pct: "X%" }
traces: { volume: "X TB", cost: "$X", pct: "X%" }
infrastructure: { instances: X, cost: "$X", pct: "X%" }
cost_per:
request: "$0.000X"
service: "$X average"
engineer: "$X per engineer"
optimizations_applied: []
optimizations_planned: []
budget_status: "on_track | over_budget | under_budget"
Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label
Correlation paths:
Alert fires (metric) → Click → Dashboard (metric) → Filter by time window
→ Trace search (same service + time) → Find failing trace
→ Logs (filter by trace_id) → See exact error
Support ticket (user report) → Find request_id in logs
→ Extract trace_id → View full trace → Identify slow span
→ Check span's service metrics → Confirm pattern
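The correlation paths above only work if every log line actually carries trace_id and span_id. A minimal structured-logging sketch: it assumes your tracing middleware attaches `trace_id`/`span_id` attributes to each log record (the attribute names and the hardcoded service name are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the correlation fields
    described above, so logs can be filtered by trace_id."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })
```

With OpenTelemetry, the equivalent wiring is usually a logging instrumentation hook that injects the active span context, but the output contract is the same: one trace_id per line, always present.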
synthetic_checks:
- name: "Checkout flow"
type: browser
frequency: 5m
locations: [us-east, eu-west, ap-southeast]
steps:
- navigate: "https://app.example.com/products"
- click: "Add to Cart"
- click: "Checkout"
- assert: "Order confirmation page loads in <3s"
alert_on: "2 consecutive failures from same location"
- name: "API health"
type: api
frequency: 1m
endpoints:
- url: "https://api.example.com/health"
expected_status: 200
max_latency_ms: 500
- url: "https://api.example.com/v1/products?limit=1"
expected_status: 200
max_latency_ms: 1000
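The pass/fail and alerting logic behind the synthetic check spec above is straightforward to state in code. A sketch of the evaluation side only (the probing itself would be an HTTP client loop; function names here are illustrative):

```python
def evaluate_check(status: int, latency_ms: float,
                   expected_status: int, max_latency_ms: float) -> bool:
    """A single probe passes only if both status and latency are in spec."""
    return status == expected_status and latency_ms <= max_latency_ms

def should_alert(recent_results: list, consecutive: int = 2) -> bool:
    """Alert on N consecutive failures from the same location,
    mirroring alert_on above."""
    return (len(recent_results) >= consecutive
            and not any(recent_results[-consecutive:]))
```

Requiring consecutive failures from the same location is what keeps a single transient network blip in one region from paging anyone.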
# Correlate feature flags with metrics
feature_flag_monitoring:
- flag: "new_checkout_flow"
metrics_to_compare:
- "checkout_conversion_rate" # by flag variant
- "checkout_error_rate"
- "checkout_latency_p99"
alerts:
- "If error rate for new variant > 2x control, auto-disable flag"
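The auto-disable rule above reduces to one comparison. A minimal sketch; the floor on the control rate is an added assumption to avoid tripping the kill switch (or dividing by zero) when the control arm has seen essentially no errors:

```python
def should_disable_flag(variant_error_rate: float,
                        control_error_rate: float,
                        min_control: float = 0.001) -> bool:
    """Auto-disable the new variant when its error rate exceeds
    2x the control variant's error rate."""
    return variant_error_rate > 2 * max(control_error_rate, min_control)
```

In practice you would also require a minimum sample size per variant before acting, so a handful of early requests cannot flip the flag.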
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |
| Dimension | Weight | 0 | 5 | 10 |
|-----------|--------|---|---|----|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |
- 90-100: World-class. Teach others.
- 70-89: Production-ready. Fill specific gaps.
- 50-69: Functional but fragile.
- <50: Significant reliability risk.
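Computing the overall score from the rubric is a weighted sum: each dimension is scored 0-10, multiplied by its weight, and the total scaled to 0-100. A minimal sketch using the weights from the table (the dict keys are illustrative names for the eight dimensions):

```python
WEIGHTS = {  # rubric weights; they sum to 1.0
    "logging_quality": 0.15, "metrics_coverage": 0.15,
    "tracing_completeness": 0.10, "slo_maturity": 0.15,
    "alert_quality": 0.15, "incident_response": 0.10,
    "dashboard_design": 0.10, "cost_efficiency": 0.10,
}

def observability_score(scores: dict) -> float:
    """Weighted 0-100 score from per-dimension 0-10 scores."""
    return sum(scores[k] * w for k, w in WEIGHTS.items()) * 10
```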
| Command | What It Does |
|---------|--------------|
| "Audit our observability" | Run the full health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |
This skill gives you the methodology. For industry-specific implementation patterns:
- afrexai-devops-engine — CI/CD, infrastructure, deployment strategies
- afrexai-api-architect — API design, security, versioning
- afrexai-database-engineering — Schema design, query optimization, migrations
- afrexai-code-reviewer — Code review methodology with SPEAR framework
- afrexai-prompt-engineering — System prompt design, testing, optimization

Browse all AfrexAI skills: clawhub.com | Full storefront
Machine endpoints, contract coverage, trust signals, runtime metrics, benchmarks, and guardrails for agent-to-agent use.
Machine interfaces
Contract coverage
Status
missing
Auth
None
Streaming
No
Data region
Unspecified
Protocol support
Requires: none
Forbidden: none
Guardrails
Operational confidence: low
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust"
Operational fit
Trust signals
Handshake
UNKNOWN
Confidence
unknown
Attempts 30d
unknown
Fallback rate
unknown
Runtime metrics
Observed P50
unknown
Observed P95
unknown
Rate limit
unknown
Estimated cost
unknown
Do not use if
Raw contract, invocation, trust, capability, facts, and change-event payloads for machine-side inspection.
Contract JSON
{
"contractStatus": "missing",
"authModes": [],
"requires": [],
"forbidden": [],
"supportsMcp": false,
"supportsA2a": false,
"supportsStreaming": false,
"inputSchemaRef": null,
"outputSchemaRef": null,
"dataRegion": null,
"contractUpdatedAt": null,
"sourceUpdatedAt": null,
"freshnessSeconds": null
}
Invocation Guide
{
"preferredApi": {
"snapshotUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot",
"contractUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
"trustUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust"
},
"curlExamples": [
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract\"",
"curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust\""
],
"jsonRequestTemplate": {
"query": "summarize this repo",
"constraints": {
"maxLatencyMs": 2000,
"protocolPreference": [
"OPENCLEW"
]
}
},
"jsonResponseTemplate": {
"ok": true,
"result": {
"summary": "...",
"confidence": 0.9
},
"meta": {
"source": "CLAWHUB",
"generatedAt": "2026-04-17T02:24:42.113Z"
}
},
"retryPolicy": {
"maxAttempts": 3,
"backoffMs": [
500,
1500,
3500
],
"retryableConditions": [
"HTTP_429",
"HTTP_503",
"NETWORK_TIMEOUT"
]
}
}
Trust JSON
{
"status": "unavailable",
"handshakeStatus": "UNKNOWN",
"verificationFreshnessHours": null,
"reputationScore": null,
"p95LatencyMs": null,
"successRate30d": null,
"fallbackRate": null,
"attempts30d": null,
"trustUpdatedAt": null,
"trustConfidence": "unknown",
"sourceUpdatedAt": null,
"freshnessSeconds": null
}
Capability Matrix
{
"rows": [
{
"key": "OPENCLEW",
"type": "protocol",
"support": "unknown",
"confidenceSource": "profile",
"notes": "Listed on profile"
},
{
"key": "limits",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "be",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "b3",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "team",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "escalates",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "ticket",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
}
],
"flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:limits|supported|profile capability:be|supported|profile capability:b3|supported|profile capability:team|supported|profile capability:escalates|supported|profile capability:ticket|supported|profile"
}
Facts JSON
[
{
"factKey": "docs_crawl",
"category": "integration",
"label": "Crawlable docs",
"value": "6 indexed pages on the official domain",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
},
{
"factKey": "vendor",
"category": "vendor",
"label": "Vendor",
"value": "Openclaw",
"href": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-observability-engine",
"sourceUrl": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-observability-engine",
"sourceType": "profile",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "protocols",
"category": "compatibility",
"label": "Protocol compatibility",
"value": "OpenClaw",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
"sourceType": "contract",
"confidence": "medium",
"observedAt": "2026-04-15T00:45:39.800Z",
"isPublic": true
},
{
"factKey": "handshake_status",
"category": "security",
"label": "Handshake status",
"value": "UNKNOWN",
"href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust",
"sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust",
"sourceType": "trust",
"confidence": "medium",
"observedAt": null,
"isPublic": true
}
]
Change Events JSON
[
{
"eventType": "docs_update",
"title": "Docs refreshed: Sign in to GitHub · GitHub",
"description": "Fresh crawlable documentation was indexed for the official domain.",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
}
]