Agent Dossier · CLAWHUB · Safety 84/100

Xpersona Agent

afrexai-observability-engine

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.

OpenClaw · self-declared
Trust evidence available
clawhub skill install skills:1kalin:afrexai-observability-engine

Overall rank

#62

Adoption

No public adoption signal

Trust

Unknown

Freshness

Last checked Feb 25, 2026

Best For

afrexai-observability-engine is best for observability and reliability workflows (logging, metrics, tracing, alerting, SLOs, incident response) where OpenClaw compatibility matters.

Not Ideal For

Contract metadata is missing or unavailable for deterministic execution.

Evidence Sources Checked

editorial-content, CLAWHUB, runtime-metrics, public facts pack

Overview

Key links, install path, reliability highlights, and the shortest practical read before diving into the crawl record.

Verified · editorial-content

Overview

Executive Summary

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization. Capability contract not published. No trust telemetry is available yet. Last updated 4/15/2026.

No verified compatibility signals

Trust score

Unknown

Compatibility

OpenClaw

Freshness

Feb 25, 2026

Vendor

Openclaw

Artifacts

0

Benchmarks

0

Last release

Unpublished

Install & run

Setup Snapshot

clawhub skill install skills:1kalin:afrexai-observability-engine
  1. Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.

  2. Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.

Evidence & Timeline

Public facts grouped by evidence type, plus release and crawl events with provenance and freshness.

Verified · editorial-content

Public facts

Evidence Ledger

Vendor (1)

Vendor

Openclaw

profile · medium
Observed Apr 15, 2026
Compatibility (1)

Protocol compatibility

OpenClaw

contract · medium
Observed Apr 15, 2026
Security (1)

Handshake status

UNKNOWN

trust · medium
Observed unknown
Integration (1)

Crawlable docs

6 indexed pages on the official domain

search_document · medium
Observed Apr 15, 2026

Artifacts & Docs

Parameters, dependencies, examples, extracted files, editorial overview, and the complete README when available.

Self-declared · CLAWHUB

Captured outputs

Artifacts Archive

Extracted files

0

Examples

6

Snippets

0

Languages

typescript

Parameters

Editorial read

Docs & README

Docs source

CLAWHUB

Editorial quality

ready


Full README

name: afrexai-observability-engine
model: standard
description: Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.
version: 1.0.0
tags: observability, monitoring, logging, tracing, alerting, SRE, incident-response, SLO, metrics, devops, reliability, on-call, post-mortem, dashboards

Observability & Reliability Engineering

Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.


Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.
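The scoring above can be sketched as a small function. This is an illustrative sketch only: the signal keys and the `health_check` name are hypothetical, chosen to mirror the eight rows of the table, and the band labels are taken from the ranges listed above.

```python
# Hypothetical keys, one per row of the health-check table.
SIGNALS = [
    "structured_logging", "metrics_collection", "distributed_tracing",
    "alerting", "incident_response", "slos_defined",
    "on_call_rotation", "cost_management",
]


def health_check(scores: dict) -> tuple:
    """Sum per-signal scores (0, 1, or 2) and map the total to a band."""
    missing = [s for s in SIGNALS if s not in scores]
    if missing:
        raise ValueError(f"missing signals: {missing}")
    if any(scores[s] not in (0, 1, 2) for s in SIGNALS):
        raise ValueError("each score must be 0, 1, or 2")

    total = sum(scores[s] for s in SIGNALS)
    if total >= 12:
        band = "Production-grade"
    elif total >= 8:
        band = "Foundation exists"
    elif total >= 4:
        band = "Significant risk"
    else:
        band = "Flying blind"
    return total, band
```

A team scoring 1 on every signal lands at 8 out of 16, the bottom of the "Foundation exists" band.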


Phase 1: Structured Logging

Log Architecture

Application → Structured JSON → Log Router → Storage → Query Engine
                                    ↓
                              Alert Pipeline

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
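As one possible way to emit these fields without a third-party logger, here is a minimal sketch built on Python's standard `logging` module. The `JsonFormatter` class name is hypothetical, and reading `trace_id`, `span_id`, and `duration_ms` from record attributes is an assumption of this sketch, not something the fields table prescribes.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the enum used in the required-fields table.
_LEVELS = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def __init__(self, service: str, version: str, environment: str):
        super().__init__()
        self.static = {"service": service, "version": version, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVELS.get(record.levelname, record.levelname.lower()),
            **self.static,
            "message": record.getMessage(),
            # Per-request values are expected as record attributes (e.g. via `extra=`).
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter("payment-api", "v2.3.1", "production"))` and pass request values through `logger.info("...", extra={"trace_id": ...})`.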

Contextual Fields (Add Per Domain)

# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42  # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..." # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true

Log Level Decision Tree

Is the process about to crash?
  → FATAL (exit after logging)

Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)

Did something unexpected happen but we recovered?
  → WARN (review in daily triage)

Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)

Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)

Is this only useful when stepping through code?
  → TRACE (never in production)

Log Level Rules

  1. ERROR means action required — if no one needs to act on it, it's WARN
  2. INFO is for business events — not internal implementation details
  3. No logging inside tight loops — aggregate and log summary
  4. Log at boundaries — API entry/exit, queue consume/publish, DB calls
  5. Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
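Rule 3 (aggregate instead of logging inside the loop) can be illustrated with a short sketch. The `process_batch` helper and its summary field names are hypothetical; the point is that one summary line replaces N per-item lines.

```python
import json
import logging

logger = logging.getLogger("batch")


def process_batch(items, handler):
    """Run handler on every item and emit a single summary log line."""
    succeeded = failed = 0
    for item in items:
        try:
            handler(item)
            succeeded += 1
        except Exception:
            # Count the failure here; details belong at the boundary that handles it.
            failed += 1

    summary = {
        "message": "batch processed",
        "total": succeeded + failed,
        "succeeded": succeeded,
        "failed": failed,
    }
    # Recovered-from failures are WARN; a clean run is a normal INFO event.
    logger.log(logging.WARNING if failed else logging.INFO, json.dumps(summary))
    return summary
```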

PII & Secret Scrubbing

scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted
  
  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash
  
  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4  # "****-****-****-1234"
  
  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet  # 203.0.113.0
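The four actions in this config could be implemented roughly as follows. This is a sketch under stated assumptions: field patterns are treated as exact, case-insensitive key names (the config does not say how patterns match), non-IPv4 addresses are redacted outright, and the function names are hypothetical. Note also that an unsalted SHA-256 of a low-entropy value such as a phone number can be reversed by brute force, so a keyed hash may be preferable in practice.

```python
import hashlib

REDACT = {"password", "secret", "token", "api_key", "authorization"}
HASH = {"email", "phone", "ssn", "national_id"}
MASK = {"credit_card", "card_number"}
IP = {"client_ip", "ip_address"}
SENSITIVE = REDACT | HASH | MASK | IP


def scrub_value(key: str, value):
    """Apply the action configured for a sensitive key."""
    k = key.lower()
    if k in REDACT:
        return "[REDACTED]"
    if k in HASH:
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if k in MASK:
        digits = [c for c in str(value) if c.isdigit()]
        return "****-****-****-" + "".join(digits[-4:])
    if k in IP:
        parts = str(value).split(".")
        if len(parts) == 4:
            return ".".join(parts[:3] + ["0"])
        return "[REDACTED]"  # assumption: anything that is not IPv4 is dropped
    return value


def scrub(event):
    """Return a scrubbed copy of a nested log event; the input is not mutated."""
    if isinstance(event, dict):
        return {
            k: scrub_value(k, v) if k.lower() in SENSITIVE else scrub(v)
            for k, v in event.items()
        }
    if isinstance(event, list):
        return [scrub(v) for v in event]
    return event
```

Running the scrubber inside the logger (a processor or formatter hook) rather than at call sites means a forgotten call cannot leak a field.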

Logger Setup (By Language)

Node.js (Pino):

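import { randomUUID } from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';
import express from 'express'; // assumption: the middleware below targets Express
import pino from 'pino';

// Per-request context store; its contents are merged into every log line via `mixin`.
const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric default.
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

const app = express();

// Middleware: inject context
app.use((req, res, next) => {
  // Header values can be string | string[] | undefined; accept only a single string.
  const incoming = req.headers['x-trace-id'];
  const ctx = {
    trace_id: typeof incoming === 'string' ? incoming : randomUUID(),
    request_id: randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION ?? 'unknown',
  };
  als.run(ctx, () => next());
});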

Python (structlog):

import structlog
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()
# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)

Go (zerolog):

log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()
// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()

Log Storage Decision

| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |

10 Logging Anti-Patterns

| # | Anti-Pattern | Fix |
|---|-------------|-----|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |


Phase 2: Metrics Collection

The RED Method (Request-Driven Services)

For every service endpoint, track:

| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |
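To make the three RED numbers concrete, the sketch below derives them from a list of raw request samples over a fixed window. In production these come from counters and histograms rather than raw samples; the `Request` type and `red_summary` name are hypothetical, and percentiles use the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_s: float


def red_summary(requests, window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for one window of request samples."""
    total = len(requests)
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    durations = sorted(r.duration_s for r in requests)

    def percentile(p: float):
        if not durations:
            return None
        rank = max(1, math.ceil(p * len(durations)))  # nearest-rank method
        return durations[rank - 1]

    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p50_s": percentile(0.50),
        "p99_s": percentile(0.99),
    }
```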

The USE Method (Infrastructure Resources)

For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |

Golden Signals (Google SRE)

| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |

Metric Types & When to Use Each

| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
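The reason histograms aggregate is that bucket counts from different instances can simply be added, and a quantile can then be estimated from the merged buckets; pre-computed percentiles cannot be combined this way. The sketch below assumes every instance uses the same bucket boundaries and cumulative counts keyed by upper bound (with `float("inf")` as the last bucket). The interpolation is in the spirit of Prometheus's `histogram_quantile`, not a reimplementation of it, and the function names are hypothetical.

```python
INF = float("inf")


def merge_buckets(*instances: dict) -> dict:
    """Add cumulative bucket counts from several instances."""
    merged = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return dict(sorted(merged.items()))


def estimate_quantile(q: float, buckets: dict):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == INF:
                return prev_bound  # cannot interpolate into +Inf; use last finite bound
            if count == prev_count:
                return le
            return prev_bound + (le - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = le, count
    return prev_bound
```

For example, merging `{0.1: 50, 0.5: 90, 1.0: 100, INF: 100}` with `{0.1: 10, 0.5: 60, 1.0: 95, INF: 100}` gives a fleet-wide histogram of 200 observations from which a p50 can be estimated directly.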

Naming Conventions

# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
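Some of these rules are mechanical enough to check in CI. The following is a rough sketch of such a lint; `lint_metric_name` is a hypothetical helper, the list of non-base unit suffixes is an illustrative assumption rather than an exhaustive one, and rule 4 (label names in the metric name) is not checked because it needs knowledge of the labels.

```python
import re

_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Illustrative, not exhaustive: suffixes that suggest a non-base unit.
_NON_BASE_UNITS = ("_milliseconds", "_ms", "_kilobytes", "_kb", "_megabytes", "_mb", "_minutes")


def lint_metric_name(name: str, metric_type: str) -> list:
    """Return a list of naming-rule violations (empty when the name looks fine)."""
    problems = []
    if not _SNAKE_CASE.match(name):
        problems.append("use snake_case")
    if any(name.endswith(s) or (s + "_") in name for s in _NON_BASE_UNITS):
        problems.append("use base units (seconds, bytes)")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total is reserved for counters")
    return problems
```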

Label Design Rules

| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
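Path normalization is the rule most often missed in practice. A minimal sketch, assuming identifiers are either all-digit segments or UUIDs (other ID schemes would need their own patterns); when a web framework exposes the matched route template, using that is usually more robust than pattern matching. The `normalize_path` name is hypothetical.

```python
import re

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays low-cardinality."""
    path = path.split("?", 1)[0]  # query strings never belong in a label
    segments = [
        ":id" if _NUMERIC.match(s) or _UUID.match(s) else s
        for s in path.split("/")
    ]
    return "/".join(segments)
```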

Instrumentation Checklist

application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge
  
  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}
  
  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}
  
  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}

Stack Recommendations

| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |


Phase 3: Distributed Tracing

Trace Architecture

Client Request
  → API Gateway (root span)
    → Auth Service (child span)
    → Order Service (child span)
      → Database Query (child span)
      → Payment Service (child span)
        → Stripe API (child span)
    → Notification Service (child span)
      → Email Provider (child span)

OpenTelemetry Setup

Auto-instrumentation (Node.js):

// tracing.ts: import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // `url` is used as-is, so it must be the full traces endpoint including /v1/traces.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
      // Skip health probes. Recent releases use this hook; older ones accepted `ignoreIncomingPaths`.
      ignoreIncomingRequestHook: (req) => ['/health', '/ready'].includes(req.url ?? ''),
    },
    '@opentelemetry/instrumentation-express': { enabled: true },
  })],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();

Custom spans for business logic:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
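    } catch (err) {
      // `err` is `unknown` under strict TypeScript; narrow it before reading `.message`.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;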
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
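The parent-based + probabilistic combination can be sketched in a few lines. This is an illustration of the idea, not the OpenTelemetry SDK's sampler: the function names are hypothetical, and using the low 64 bits of the trace ID as the random source is an assumption of this sketch. What matters is that the decision is a pure function of the trace ID, so every service reaches the same verdict for the same trace.

```python
from typing import Optional


def ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic probabilistic sampling keyed on the trace ID."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    low64 = int(trace_id_hex[-16:], 16)  # treat the low 64 bits as a uniform number
    return low64 < int(ratio * (1 << 64))


def should_sample(trace_id_hex: str, parent_sampled: Optional[bool], ratio: float = 0.1) -> bool:
    """Parent-based: honor the upstream decision; fall back to ratio at the root."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id_hex, ratio)
```

Tail-based sampling is not shown here because it needs the finished trace and therefore lives in the collector, not in the service.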

Context Propagation

| Header | Standard | Format |
|--------|----------|--------|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |

Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
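For services that cannot use an SDK propagator, the `traceparent` header is simple enough to handle by hand. The sketch below handles only version `00` of the format shown in the table; `TraceContext`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and in most cases the OpenTelemetry propagators should be preferred.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace_id(32 hex) - span_id(16 hex) - flags(2 hex), lowercase only
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context for an outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

When a header fails to parse, the usual behavior is to start a new trace rather than reject the request.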

Trace Storage

| Volume | Solution | Retention |
|--------|----------|-----------|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |


Phase 4: SLOs, SLIs & Error Budgets

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |

SLO Definition Template

slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team
  
  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))
    
  target: 99.95%  # about 21.6 min of downtime per 30 days
  window: rolling_30d
  
  error_budget:
    total_minutes: 21.6  # (1 - 0.9995) x 30 days x 1440 min/day
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x  # 2% of budget consumed in 1 hour; exhausted in ~50 hours
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x    # Budget consumed in 5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x    # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d
  
  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"
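The arithmetic behind the budget and burn-rate figures is worth spelling out, since it is easy to get wrong. A minimal sketch with hypothetical function names: the budget is the allowed failure fraction times the window, a burn rate of N means the budget is being spent N times faster than the rate that would exactly exhaust it at the end of the window, and the fraction of budget consumed during an alert window follows directly.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the window, e.g. target=0.9995."""
    return (1 - target) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """How long the full budget lasts if this burn rate is sustained."""
    return window_days * 24 / burn_rate


def budget_fraction_consumed(burn_rate: float, alert_window_hours: float,
                             window_days: int = 30) -> float:
    """Share of the total budget spent during one alert window at this burn rate."""
    return burn_rate * alert_window_hours / (window_days * 24)
```

With a 30-day window, a 99.95% target allows about 21.6 minutes. A 14.4x burn sustained for 1 hour consumes 2% of the budget, 6x for 6 hours consumes 5%, and 1x for 3 days consumes 10%.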

Common SLO Targets

| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|-------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |

Error Budget Tracking

# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%
  
  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%
    
  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"
  
  velocity_decision: "Normal — 62.6% budget remaining"
  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"
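The weekly review reduces to one calculation plus a lookup against the consequence bands from the SLO definition. A sketch, with `velocity_decision` as a hypothetical helper and the band boundaries (50% and 20%) taken from that definition:

```python
def velocity_decision(total_minutes: float, consumed_minutes: float) -> tuple:
    """Return (remaining_percent, policy) for an error budget review."""
    remaining = max(total_minutes - consumed_minutes, 0.0)
    remaining_pct = remaining / total_minutes * 100

    if remaining <= 0:
        policy = "All hands on reliability until budget recovers"
    elif remaining_pct < 20:
        policy = "Feature freeze; reliability only"
    elif remaining_pct <= 50:
        policy = "Prioritize reliability work"
    else:
        policy = "Normal development velocity"
    return round(remaining_pct, 1), policy
```

For instance, `velocity_decision(20.0, 5.0)` reports 75.0% remaining and normal velocity, while `velocity_decision(20.0, 17.0)` reports 15.0% and a feature freeze.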

Phase 5: Alert Design

Alert Quality Principles

  1. Every alert must be actionable — if no one needs to act, it's not an alert
  2. Every alert needs a runbook — linked directly in the alert annotation
  3. Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
  4. Multi-window burn rate — not static thresholds (see SLO alerts above)
  5. Alert on absence, not just presence — "no orders in 15 min" catches silent failures
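Principle 4 deserves a concrete illustration. In a multi-window burn-rate alert, the long window shows that a meaningful share of budget has been spent and the short window shows that the problem is still happening; the alert fires only when both exceed the threshold. A minimal sketch with hypothetical function names, taking error ratios that would in practice come from two PromQL queries:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def should_alert(short_window_error_ratio: float, long_window_error_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both windows burn faster than the threshold."""
    return (
        burn_rate(short_window_error_ratio, slo_target) >= threshold
        and burn_rate(long_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.95% target and a 14.4x threshold, both windows must show an error ratio of at least 0.72%. A brief spike trips only the short window, and an incident that has already recovered trips only the long one, so neither pages anyone.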

Alert Severity Levels

| Severity | Response Time | Channel | Who | Example |
|----------|--------------|---------|-----|---------|
| P0 - Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 - High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 - Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 - Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use for: 5m to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |
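The deduplication fix (group by service + alert name, with a repeat interval) is normally configured in the alert router rather than written by hand, but the logic is small enough to sketch. `AlertDeduper` is a hypothetical class used only to show the behavior:

```python
class AlertDeduper:
    """Suppress repeat notifications for the same (service, alert) group."""

    def __init__(self, repeat_interval_s: float):
        self.repeat_interval_s = repeat_interval_s
        self._last_sent = {}  # (service, alertname) -> time of last notification

    def should_notify(self, service: str, alertname: str, now_s: float) -> bool:
        key = (service, alertname)
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.repeat_interval_s:
            return False  # same group notified recently; stay quiet
        self._last_sent[key] = now_s
        return True
```

Fifty instances firing the same alert within the interval produce one notification, because the instance is deliberately not part of the grouping key.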

Alert Template (Prometheus Alerting Rules)

groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"
          
      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"

Runbook Template

# Runbook: PaymentAPIHighErrorRate

## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.

## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)

## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
   - Database: [dashboard link]
   - Stripe API: [status page]
   - Redis cache: [dashboard link]
4. Check application logs: `kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'`


## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging

## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min

Phase 6: Dashboard Architecture

Dashboard Hierarchy

L1: Executive / Business Dashboard (non-technical stakeholders)
  ↓
L2: Service Overview Dashboard (on-call, quick triage)
  ↓
L3: Service Deep-Dive Dashboard (debugging specific service)
  ↓
L4: Infrastructure Dashboard (resource-level details)

L1: Business Dashboard

panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(order_value_dollars_sum[5m])) * 60"  # histogram _sum yields dollars per second; x60 for per minute
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "100 * sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"  # percent, to match thresholds
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"

L2: Service Overview Dashboard

Every service gets one of these with identical layout:

row_1_traffic:
  - "Request Rate (rps)" — timeseries, by status code
  - "Error Rate (%)" — timeseries, threshold line at SLO
  - "Active Requests" — gauge

row_2_latency:
  - "Latency Distribution" — heatmap
  - "p50 / p95 / p99" — timeseries, threshold lines
  - "Latency by Endpoint" — table, sorted by p99

row_3_dependencies:
  - "Downstream Latency" — timeseries per dependency
  - "Downstream Error Rate" — timeseries per dependency
  - "Database Query Duration" — timeseries by query type

row_4_resources:
  - "CPU Usage" — timeseries per pod
  - "Memory Usage" — timeseries per pod
  - "Pod Restarts" — stat

row_5_business:
  - "Business Metric 1" — service-specific
  - "Business Metric 2" — service-specific

Dashboard Rules

  1. Time range default: last 1 hour — most debugging happens in recent time
  2. Variable selectors at top: environment, service, instance
  3. Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
  4. Link alerts to dashboards — every alert annotation includes dashboard URL
  5. No more than 15 panels per dashboard — split into L3 if needed
  6. Include "as of" timestamp — so screenshots in incidents are unambiguous
  7. Dashboard as code — store Grafana JSON in git, provision via API
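
Because rule 7 keeps dashboards in git, several of the other rules can be checked automatically in CI. Below is a rough sketch of such a check, assuming the Grafana dashboard JSON has already been loaded into a dict with the usual top-level `panels`, `templating.list`, and `time` keys. `lint_dashboard` is a hypothetical helper, and it does not descend into panels nested inside collapsed rows.

```python
MAX_PANELS = 15
REQUIRED_VARIABLES = {"environment", "service", "instance"}


def lint_dashboard(dashboard: dict) -> list[str]:
    """Return a list of rule violations for one dashboard (empty if clean)."""
    problems = []

    # Rule 5: no more than 15 panels per dashboard.
    panels = dashboard.get("panels", [])
    if len(panels) > MAX_PANELS:
        problems.append(f"{len(panels)} panels (max {MAX_PANELS}); split into an L3 dashboard")

    # Rule 2: environment, service, and instance selectors must exist.
    variables = {v.get("name") for v in dashboard.get("templating", {}).get("list", [])}
    missing = REQUIRED_VARIABLES - variables
    if missing:
        problems.append("missing variable selectors: " + ", ".join(sorted(missing)))

    # Rule 1: default time range is the last hour.
    default_from = dashboard.get("time", {}).get("from")
    if default_from != "now-1h":
        problems.append(f"default time range starts at {default_from!r}, expected 'now-1h'")

    return problems
```

A CI job could run this over every JSON file in the dashboards directory and fail the build when any list is non-empty.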

Phase 7: Incident Response

Incident Severity Classification

| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |

Incident Roles

| Role | Responsibility | Who |
|------|---------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |

Incident Response Workflow

1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports
   
2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)
   
3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp
   
4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"
   
5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders

Incident Channel Template

📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved

Phase 8: Post-Mortem Framework

Blameless Post-Mortem Template

post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: 27 minutes (14:23 — 14:50 UTC)
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress
  
  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.
  
  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."
  
  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"
  
  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation
    path. The query used a raw connection instead of the pool's managed client,
    so connections were acquired but never released. Under load, the pool
    exhausted within 8 minutes.
  
  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"
  
  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"
  
  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"
  
  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"
  
  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety — review raw queries"
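
The `slo_impact` figure in the template can be sanity-checked with a few lines of arithmetic. This sketch assumes the 99.95% availability target from the payment API runbook and an average month of 365.25 / 12 days, which is the month length that reproduces the 21.9-minute budget; a flat 30-day month would give 21.6 minutes instead. The function name is hypothetical.

```python
def monthly_error_budget_minutes(slo_target: float, days_in_month: float = 365.25 / 12) -> float:
    """Minutes of allowed unavailability per month for a given availability SLO."""
    return (1 - slo_target) * days_in_month * 24 * 60


budget = monthly_error_budget_minutes(0.9995)  # roughly 21.9 minutes
consumed_fraction = 5.1 / budget               # roughly 0.23, i.e. 23% of the budget
```

Running the same calculation when filling in a real post-mortem keeps the budget figures consistent with the SLO definition.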

Post-Mortem Meeting Agenda (60 minutes)

1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in

5 Whys Exercise

Problem: 5xx errors in payment API

Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting

Phase 9: On-Call Operations

On-Call Structure

on_call:
  rotation: weekly
  handoff_day: Monday 10:00 UTC
  
  primary:
    response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
    escalation_after: 15 minutes no-ack
    
  secondary:
    response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
    escalation_after: 30 minutes no-ack
    
  manager_escalation:
    trigger: SEV-1 unresolved after 30 minutes
    
  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)

On-Call Health Metrics

| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |

Weekly On-Call Review Template

on_call_review:
  week: "2026-W08"
  engineer: "@bob"
  
  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3
    
  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"
    
  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"
    
  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"
    
  handoff_notes: |
    Watch payment-api p99 latency — it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.
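
The weekly review numbers can be scored against the On-Call Health Metrics thresholds automatically. The sketch below is one possible encoding: the metric keys and the `classify` helper are hypothetical, and it assumes each of the 7 incidents in the sample review corresponds to one page.

```python
# (healthy_below, unhealthy_above) taken from the On-Call Health Metrics table.
THRESHOLDS = {
    "pages_per_week": (5, 15),
    "after_hours_pages_per_week": (2, 5),
    "false_positive_pct": (10, 30),
    "mtta_minutes": (5, 15),
    "mttr_minutes": (30, 120),
    "toil_pct": (30, 60),
}


def classify(metric: str, value: float) -> str:
    """Map a metric value to healthy / needs attention / unhealthy."""
    healthy_below, unhealthy_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > unhealthy_above:
        return "unhealthy"
    return "needs attention"


# Values from the sample 2026-W08 review: 7 incidents, 3 after hours, 2 false positives.
review = {
    "pages_per_week": 7,
    "after_hours_pages_per_week": 3,
    "false_positive_pct": 100 * 2 / 7,
}
report = {metric: classify(metric, value) for metric, value in review.items()}
```

Under these assumptions all three metrics in the sample week land in "needs attention", which matches the review's list of improvements still needed.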

Phase 10: Chaos Engineering & Reliability Testing

Chaos Principles

  1. Start with a hypothesis: "If X fails, the system should Y"
  2. Run in production (start small — one instance, one AZ)
  3. Minimize blast radius with automatic rollback
  4. Build confidence incrementally: staging → canary → production

Chaos Experiment Template

chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should
    failover to the replica within 30 seconds with <1% error rate spike"
  
  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"
  
  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"
  
  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"
  
  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true
    
  follow_up_actions:
    - "Document failover behavior in runbook"
    - "Add failover time as SLI (target: <30s)"
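
Abort conditions such as "checkout_success_rate < 95% for > 60 seconds" need a sustained-breach check, not a single-sample comparison, or one noisy data point would end the experiment. A minimal sketch of that logic follows; `should_abort` is a hypothetical helper that expects time-ordered `(timestamp_seconds, success_rate_pct)` samples.

```python
def should_abort(samples, threshold=95.0, max_breach_seconds=60.0):
    """Return True if the rate stayed below threshold for longer than the limit."""
    breach_start = None
    for timestamp, rate in samples:
        if rate < threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start > max_breach_seconds:
                return True
        else:
            breach_start = None  # recovery resets the clock
    return False
```

In a real experiment this check would run inside the injection loop so that the rollback fires as soon as it returns `True`.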

Chaos Engineering Maturity Levels

| Level | What You Test | Tools |
|-------|--------------|-------|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |


Phase 11: Observability Cost Optimization

Cost Drivers (Ranked)

| # | Driver | Typical % of Bill | Optimization |
|---|--------|-------------------|-------------|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |

Cost Reduction Checklist

cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"
  
  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"
  
  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"
  
  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
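
Two of the metric savings in the checklist follow directly from arithmetic, which is worth making explicit before promising a number to finance. The helpers below are hypothetical and only illustrate the calculation. The `fixed_series` parameter is an assumption added here to account for the extra series a Prometheus-style histogram exposes alongside its configured buckets (`+Inf`, `_sum`, `_count`), which lowers the real saving somewhat.

```python
def scrape_interval_savings(old_seconds: float, new_seconds: float) -> float:
    """Fraction of data points saved by scraping less often."""
    return 1 - old_seconds / new_seconds


def bucket_reduction_savings(old_buckets: int, new_buckets: int, fixed_series: int = 0) -> float:
    """Fraction of histogram series saved by reducing the bucket count."""
    return 1 - (new_buckets + fixed_series) / (old_buckets + fixed_series)


interval_saving = scrape_interval_savings(15, 60)                   # 0.75
bucket_saving = bucket_reduction_savings(11, 8)                     # about 0.27
bucket_saving_with_fixed = bucket_reduction_savings(11, 8, fixed_series=3)  # about 0.21
```

The 75% and ~27% figures in the checklist correspond to the first two results; the third shows what the bucket reduction looks like once the fixed per-histogram series are counted.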

Monthly Cost Review Template

observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"
  
  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }
  
  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"
  
  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"

Phase 12: Advanced Patterns

Correlation: Connecting the Three Pillars

Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label

Correlation paths:
  Alert fires (metric) → Click → Dashboard (metric) → Filter by time window
    → Trace search (same service + time) → Find failing trace
    → Logs (filter by trace_id) → See exact error
    
  Support ticket (user report) → Find request_id in logs
    → Extract trace_id → View full trace → Identify slow span
    → Check span's service metrics → Confirm pattern
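
The first step of both correlation paths depends on every log line carrying `trace_id` and `span_id`. The following sketch shows one way to do that with Python's standard `logging` module. The `service`, `trace_id`, and `span_id` field names come from the list above, while the `level` and `message` keys, `JsonFormatter`, and `logs_for_trace` are assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, including correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # These attributes are attached via logger calls with extra={...}.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def logs_for_trace(lines: list[str], trace_id: str) -> list[dict]:
    """Filter JSON log lines down to those belonging to one trace."""
    return [entry for entry in map(json.loads, lines) if entry.get("trace_id") == trace_id]
```

With a handler using this formatter, a call such as `logger.error("pool exhausted", extra={"service": "payment-api", "trace_id": tid, "span_id": sid})` produces a line that the "filter by trace_id" step can match.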

Synthetic Monitoring

synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"
    
  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
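
The `alert_on` rule for the checkout flow ("2 consecutive failures from same location") requires tracking a failure streak per location, so that a single blip in one region or alternating failures across regions do not page anyone. A minimal sketch, with `ConsecutiveFailureAlert` as a hypothetical class name:

```python
from collections import defaultdict


class ConsecutiveFailureAlert:
    """Tracks per-location failure streaks for a synthetic check."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.streaks = defaultdict(int)

    def observe(self, location: str, passed: bool) -> bool:
        """Record one check result; return True if this location should alert."""
        if passed:
            self.streaks[location] = 0  # any success resets the streak
            return False
        self.streaks[location] += 1
        return self.streaks[location] >= self.threshold
```

One failure each from us-east and eu-west does not alert; a second consecutive failure from us-east does.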

Feature Flag Observability

# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate" # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
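
The auto-disable rule compares the new variant's error rate with the control's. A sketch of that comparison is below. `should_disable_flag` is a hypothetical function, and the `min_requests` guard is an assumption added here so that a handful of early requests cannot trip the rule. Note that if the control has zero errors, any error in the variant satisfies the condition.

```python
def should_disable_flag(
    control_errors: int,
    control_total: int,
    variant_errors: int,
    variant_total: int,
    ratio: float = 2.0,
    min_requests: int = 100,
) -> bool:
    """True when the variant error rate exceeds `ratio` times the control rate."""
    # Not enough traffic on either side to make a meaningful comparison.
    if control_total < min_requests or variant_total < min_requests:
        return False
    control_rate = control_errors / control_total
    variant_rate = variant_errors / variant_total
    return variant_rate > ratio * control_rate
```

With 1% errors on control and 2.5% on the new variant, the function returns `True` and the flag would be switched off.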

Observability Maturity Model

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |


Quality Scoring Rubric (0-100)

| Dimension | Weight | 0 | 5 | 10 |
|-----------|--------|---|---|-----|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |

  • 90-100: World-class. Teach others.
  • 70-89: Production-ready. Fill specific gaps.
  • 50-69: Functional but fragile.
  • <50: Significant reliability risk.
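
The rubric's weights sum to 100%, so rating each dimension from 0 to 10 and taking the weighted sum yields a score on the 0-100 scale used by the bands above. The sketch below shows the calculation; the dictionary keys, function names, and the sample ratings are illustrative assumptions rather than part of the rubric.

```python
# Weights in percent, copied from the rubric; they sum to 100.
WEIGHTS = {
    "logging_quality": 15,
    "metrics_coverage": 15,
    "tracing_completeness": 10,
    "slo_maturity": 15,
    "alert_quality": 15,
    "incident_response": 10,
    "dashboard_design": 10,
    "cost_efficiency": 10,
}


def overall_score(ratings: dict[str, float]) -> float:
    """Combine per-dimension ratings (0-10) into a 0-100 score."""
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS) / 10


def band(score: float) -> str:
    """Map a 0-100 score to the rubric's four bands."""
    if score >= 90:
        return "World-class"
    if score >= 70:
        return "Production-ready"
    if score >= 50:
        return "Functional but fragile"
    return "Significant reliability risk"


# Hypothetical team: strong logging and alerting, no cost tracking.
sample = {dim: 5 for dim in WEIGHTS}
sample.update({"logging_quality": 10, "alert_quality": 10, "cost_efficiency": 0})
sample_score = overall_score(sample)
```

For this sample the weighted sum is 60, which falls in the "Functional but fragile" band.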


10 Observability Commandments

  1. Structured or it didn't happen — unstructured logs are technical debt
  2. Correlate everything — trace_id connects logs, traces, and metrics
  3. Alert on symptoms, not causes — users don't care about CPU, they care about latency
  4. Every alert gets a runbook — no runbook = no alert
  5. SLOs drive velocity — error budgets decide when to ship vs stabilize
  6. Dashboards have hierarchy — executives don't need pod CPU graphs
  7. Blameless post-mortems always — blame prevents learning
  8. Cost is a feature — observability that bankrupts you isn't observability
  9. You build it, you run it — the team that ships code owns its observability
  10. Practice failure — chaos engineering builds confidence

12 Natural Language Commands

| Command | What It Does |
|---------|-------------|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |



API & Reliability

Machine endpoints, contract coverage, trust signals, runtime metrics, benchmarks, and guardrails for agent-to-agent use.

Missing · CLAWHUB

Machine interfaces

Contract & API

Contract coverage

Status: missing

Auth: None

Streaming: No

Data region: Unspecified

Protocol support

OpenClaw: self-declared

Requires: none

Forbidden: none

Guardrails

Operational confidence: low

No positive guardrails captured.

Invocation examples

curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract"
curl -s "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust"

Operational fit

Reliability & Benchmarks

Trust signals

Handshake: UNKNOWN

Confidence: unknown

Attempts 30d: unknown

Fallback rate: unknown

Runtime metrics

Observed P50: unknown

Observed P95: unknown

Rate limit: unknown

Estimated cost: unknown

Do not use if

Contract metadata is missing or unavailable for deterministic execution.
No benchmark suites or observed failure patterns are available.

Machine Appendix

Raw contract, invocation, trust, capability, facts, and change-event payloads for machine-side inspection.

Missing · CLAWHUB

Contract JSON

{
  "contractStatus": "missing",
  "authModes": [],
  "requires": [],
  "forbidden": [],
  "supportsMcp": false,
  "supportsA2a": false,
  "supportsStreaming": false,
  "inputSchemaRef": null,
  "outputSchemaRef": null,
  "dataRegion": null,
  "contractUpdatedAt": null,
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Invocation Guide

{
  "preferredApi": {
    "snapshotUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot",
    "contractUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
    "trustUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust"
  },
  "curlExamples": [
    "curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/snapshot\"",
    "curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract\"",
    "curl -s \"https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust\""
  ],
  "jsonRequestTemplate": {
    "query": "summarize this repo",
    "constraints": {
      "maxLatencyMs": 2000,
      "protocolPreference": [
        "OPENCLEW"
      ]
    }
  },
  "jsonResponseTemplate": {
    "ok": true,
    "result": {
      "summary": "...",
      "confidence": 0.9
    },
    "meta": {
      "source": "CLAWHUB",
      "generatedAt": "2026-04-17T02:24:42.113Z"
    }
  },
  "retryPolicy": {
    "maxAttempts": 3,
    "backoffMs": [
      500,
      1500,
      3500
    ],
    "retryableConditions": [
      "HTTP_429",
      "HTTP_503",
      "NETWORK_TIMEOUT"
    ]
  }
}

Trust JSON

{
  "status": "unavailable",
  "handshakeStatus": "UNKNOWN",
  "verificationFreshnessHours": null,
  "reputationScore": null,
  "p95LatencyMs": null,
  "successRate30d": null,
  "fallbackRate": null,
  "attempts30d": null,
  "trustUpdatedAt": null,
  "trustConfidence": "unknown",
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Capability Matrix

{
  "rows": [
    {
      "key": "OPENCLEW",
      "type": "protocol",
      "support": "unknown",
      "confidenceSource": "profile",
      "notes": "Listed on profile"
    },
    {
      "key": "limits",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "be",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "b3",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "team",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "escalates",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "ticket",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    }
  ],
  "flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:limits|supported|profile capability:be|supported|profile capability:b3|supported|profile capability:team|supported|profile capability:escalates|supported|profile capability:ticket|supported|profile"
}

Facts JSON

[
  {
    "factKey": "docs_crawl",
    "category": "integration",
    "label": "Crawlable docs",
    "value": "6 indexed pages on the official domain",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  },
  {
    "factKey": "vendor",
    "category": "vendor",
    "label": "Vendor",
    "value": "Openclaw",
    "href": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-observability-engine",
    "sourceUrl": "https://github.com/openclaw/skills/tree/main/skills/1kalin/afrexai-observability-engine",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-15T00:45:39.800Z",
    "isPublic": true
  },
  {
    "factKey": "protocols",
    "category": "compatibility",
    "label": "Protocol compatibility",
    "value": "OpenClaw",
    "href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
    "sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/contract",
    "sourceType": "contract",
    "confidence": "medium",
    "observedAt": "2026-04-15T00:45:39.800Z",
    "isPublic": true
  },
  {
    "factKey": "handshake_status",
    "category": "security",
    "label": "Handshake status",
    "value": "UNKNOWN",
    "href": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust",
    "sourceUrl": "https://xpersona.co/api/v1/agents/clawhub-skills-1kalin-afrexai-observability-engine/trust",
    "sourceType": "trust",
    "confidence": "medium",
    "observedAt": null,
    "isPublic": true
  }
]

Change Events JSON

[
  {
    "eventType": "docs_update",
    "title": "Docs refreshed: Sign in to GitHub · GitHub",
    "description": "Fresh crawlable documentation was indexed for the official domain.",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  }
]
