Crawler Summary

agent-bench answer-first brief

Industrial-grade benchmarking engine for AI agents. Define test scenarios in YAML, run high-performance parallel evaluations, and generate premium glassmorphism reports. Supports LangGraph, CrewAI, AutoGen, and custom agent stacks. Capability contract not published. No trust telemetry is available yet. 2 GitHub stars reported by the source. Last updated 4/15/2026.

Freshness

Last checked 4/15/2026

Best For

agent-bench is best for CrewAI and multi-agent workflows where OpenClaw compatibility matters.

Not Ideal For

Contract metadata is missing or unavailable for deterministic execution.

Evidence Sources Checked

editorial-content, GITHUB REPOS, runtime-metrics, public facts pack

Agent Dossier · GITHUB REPOS · Safety: 66/100

agent-bench

Industrial-grade benchmarking engine for AI agents. Define test scenarios in YAML, run high-performance parallel evaluations, and generate premium glassmorphism reports. Supports LangGraph, CrewAI, AutoGen, and custom agent stacks.

OpenClaw (self-declared)

Public facts

5

Change events

1

Artifacts

0

Freshness

Apr 15, 2026

Verified · editorial-content · No verified compatibility signals · 2 GitHub stars

Capability contract not published. No trust telemetry is available yet. 2 GitHub stars reported by the source. Last updated 4/15/2026.

2 GitHub stars · Trust evidence available

Trust score

Unknown

Compatibility

OpenClaw

Freshness

Apr 15, 2026

Vendor

Ismail 2001

Artifacts

0

Benchmarks

0

Last release

Unpublished

Executive Summary

Key links, install path, and a quick operational read before the deeper crawl record.

Verified · editorial-content

Summary

Capability contract not published. No trust telemetry is available yet. 2 GitHub stars reported by the source. Last updated 4/15/2026.

Setup snapshot

  1. Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.

  2. Final validation: expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.

Evidence Ledger

Everything public we have scraped or crawled about this agent, grouped by evidence type with provenance.

Verified · editorial-content
Vendor (1)

Vendor

Ismail 2001

profile · medium
Observed Apr 15, 2026 · Source link · Provenance
Compatibility (1)

Protocol compatibility

OpenClaw

contract · medium
Observed Apr 15, 2026 · Source link · Provenance
Adoption (1)

Adoption signal

2 GitHub stars

profile · medium
Observed Apr 15, 2026 · Source link · Provenance
Security (1)

Handshake status

UNKNOWN

trust · medium
Observed unknown · Source link · Provenance
Integration (1)

Crawlable docs

6 indexed pages on the official domain

search_document · medium
Observed Apr 15, 2026 · Source link · Provenance

Release & Crawl Timeline

Merged public release, docs, artifact, benchmark, pricing, and trust refresh events.

Self-declared · agent-index

Artifacts Archive

Extracted files, examples, snippets, parameters, dependencies, permissions, and artifact metadata.

Self-declared · GITHUB REPOS

Extracted files

0

Examples

4

Snippets

0

Languages

python

Executable Examples

bash

pip install agentbench

yaml

name: "basic-research"
tasks:
  - id: "compare-frameworks"
    input: "Compare LangGraph and CrewAI for production systems in 2026."
    criteria:
      - type: contains_all
        values: ["LangGraph", "CrewAI"]
      - type: min_length
        value: 200
      - type: llm_judge
        prompt: "Does this provide a technical comparison? Score 0-10."
        threshold: 7
    limits:
      max_tokens: 50000
      max_latency_seconds: 60

bash

agentbench run --scenario scenarios/research.yaml --agent my_module:MyAgentAdapter --format html

mermaid

graph TD
    A[Scenario Loader] --> B[Parallel Runner]
    B --> C[Agent Adapter]
    C --> D[LangGraph / CrewAI / AutoGen]
    B --> E[Evaluation Engine]
    E --> F[Deterministic Evaluators]
    E --> G[LLM-Judge / Semantic Check]
    B --> H[Reporters]
    H --> I[Rich CLI Table]
    H --> J[Glassmorphism HTML]
    H --> K[JSON Metadata]

Docs & README

Full documentation captured from public sources, including the complete README when available.

Self-declared · GITHUB REPOS

Docs source

GITHUB REPOS

Editorial quality

ready


Full README
<div align="center"> <img src="https://raw.githubusercontent.com/lucide-icons/lucide/main/icons/layers.svg" width="80" height="80" /> <h1>agentbench</h1> <p><strong>Industrial-Grade Pytest for AI Agents.</strong></p> <div> <a href="https://github.com/Ismail-2001/agent-bench/actions"> <img src="https://img.shields.io/badge/CI-Passing-success?style=for-the-badge&logo=github-actions&logoColor=white" alt="CI Status" /> </a> <a href="https://pypi.org/project/agentbench/"> <img src="https://img.shields.io/badge/pypi-v0.1.1-blue?style=for-the-badge&logo=pypi&logoColor=white" alt="PyPI Version" /> </a> <a href="LICENSE"> <img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge" alt="License" /> </a> </div> <p>Define test scenarios in YAML. Benchmark any agent — LangGraph, CrewAI, AutoGen, or custom. Get premium reports with pass/fail, tokens, latency, cost, and failure analysis.</p> <h4> <a href="#-quick-start">Quick Start</a> <span> · </span> <a href="#-features">Features</a> <span> · </span> <a href="#-architecture">Architecture</a> <span> · </span> <a href="#-reports">Reports</a> <span> · </span> <a href="#-contributing">Contributing</a> </h4> </div>

🏗️ The Enterprise Challenge

In 2026, 52% of organizations still don't run automated evaluations on their multi-step agent workflows. Existing tools are either ecosystem-locked (LangSmith) or too academic (THUDM/AgentBench).

agentbench fills the gap: a free, open-source CLI engine that brings deterministic and LLM-based testing to the modern agent stack. Think of it as pytest meets k6 for autonomous AI.


⚡ Quick Start

1. Install via uv or pip

pip install agentbench

2. Define a Scenario (research.yaml)

name: "basic-research"
tasks:
  - id: "compare-frameworks"
    input: "Compare LangGraph and CrewAI for production systems in 2026."
    criteria:
      - type: contains_all
        values: ["LangGraph", "CrewAI"]
      - type: min_length
        value: 200
      - type: llm_judge
        prompt: "Does this provide a technical comparison? Score 0-10."
        threshold: 7
    limits:
      max_tokens: 50000
      max_latency_seconds: 60

3. Run with Your Agent

agentbench run --scenario scenarios/research.yaml --agent my_module:MyAgentAdapter --format html
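The criteria in the scenario above mix deterministic checks (`contains_all`, `min_length`) with an LLM judge. A minimal sketch of how the deterministic criteria could be applied follows; the helper names are illustrative assumptions, not agentbench's documented evaluator API:

```python
# Illustrative sketch only: agentbench's real evaluator interfaces may differ.
def contains_all(output: str, values: list[str]) -> bool:
    """Pass only if every expected substring appears in the output."""
    return all(v in output for v in values)

def min_length(output: str, value: int) -> bool:
    """Pass only if the output is at least `value` characters long."""
    return len(output) >= value

def evaluate(output: str, criteria: list[dict]) -> bool:
    """A task passes only when every deterministic criterion passes."""
    checks = {
        "contains_all": lambda c: contains_all(output, c["values"]),
        "min_length": lambda c: min_length(output, c["value"]),
    }
    # llm_judge criteria are skipped here; they require a model call.
    return all(checks[c["type"]](c) for c in criteria if c["type"] in checks)
```

An `llm_judge` criterion would additionally send the configured prompt to a model and compare the returned score against the threshold.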

🎨 Professional Visualization

Our reporter generates a premium, glassmorphism-styled HTML dashboard for every run.

  • Dynamic Charts: Visualize pass/fail trends and latency spikes.
  • Deep Observability: Click into any task to see raw inputs, outputs, and failing criteria.
  • Cost Metrics: Real-time token counting and cost estimation.

> [!NOTE]
> View a live example of the report aesthetics in the documentation.


🧩 Architecture

graph TD
    A[Scenario Loader] --> B[Parallel Runner]
    B --> C[Agent Adapter]
    C --> D[LangGraph / CrewAI / AutoGen]
    B --> E[Evaluation Engine]
    E --> F[Deterministic Evaluators]
    E --> G[LLM-Judge / Semantic Check]
    B --> H[Reporters]
    H --> I[Rich CLI Table]
    H --> J[Glassmorphism HTML]
    H --> K[JSON Metadata]

🚀 Key Features (FAANG Grade)

  • ⚡ Parallel Task Execution: Benchmark large scenarios 10x faster with managed asyncio concurrency.
  • 🛡️ Built-in Scenario Packs: Standardized benchmarks for tool-use, research, and error-recovery.
  • 👁️ Structured Observability: High-fidelity logging with structlog for easy ingestion into Datadog/Splunk.
  • 🔌 Framework Agnostic: A simple AgentAdapter interface allows you to test any agent in seconds.
  • 🐳 DevOps Ready: Includes an optimized Dockerfile (using uv) and a comprehensive Makefile.
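The AgentAdapter seam mentioned above can be pictured with a short sketch; the class and field names here are illustrative assumptions, not the package's documented API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class AgentResult:
    # Mirrors the metrics the reports track: output text, tokens, latency.
    output: str
    tokens_used: int
    latency_seconds: float

class AgentAdapter(ABC):
    """Wrap any agent stack (LangGraph, CrewAI, AutoGen, ...) behind one method."""

    @abstractmethod
    def run(self, task_input: str) -> AgentResult: ...

class EchoAdapter(AgentAdapter):
    """Trivial adapter used here only to show the wiring."""

    def run(self, task_input: str) -> AgentResult:
        return AgentResult(output=task_input, tokens_used=0, latency_seconds=0.0)
```

The runner would then instantiate the adapter named on the CLI (`--agent my_module:MyAgentAdapter`) and call `run` once per task.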

📊 Core Metrics Measured

| Metric | Accuracy | How It's Measured |
|--------|----------|-------------------|
| Pass/Fail | 100% | All criteria must be satisfied (deterministic + LLM) |
| Tokens | 100% | Precise counting via tiktoken |
| Latency | High | Monotonic wall-clock time from call to return |
| Cost | Est. | Calculated from token count × model rates |
| Consistency | High | Pass rate across multiple runs (optional) |
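As a worked example of the Cost row, token count × model rates is a one-line calculation; the per-million-token rates below are assumptions for illustration, not published pricing:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost = token count x per-million-token rate, pricing input and output separately."""
    return (prompt_tokens * input_rate_per_m
            + completion_tokens * output_rate_per_m) / 1_000_000

# With assumed rates of $3/M input and $15/M output tokens:
print(estimate_cost(40_000, 10_000, 3.0, 15.0))  # → 0.27
```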


🤝 Contributing

We welcome contributions from the community! Please read our CONTRIBUTING.md to get started.

High-impact areas:

  • New evaluators: (e.g., Trajectory efficiency, Tool-calling accuracy)
  • Framework adapters: (Pre-built adapters for popular SDKs)
  • Reporters: (Markdown, PDF, or Grafana dashboards)

📜 License

MIT — Test everything. Trust nothing.


<div align="center"> <sub>Built with ❤️ by <strong>Ismail Sajid</strong> (Re-architected for FAANG by Antigravity AI)</sub> </div>

Contract & API

Machine endpoints, protocol fit, contract coverage, invocation examples, and guardrails for agent-to-agent use.

Missing · GITHUB REPOS

Contract coverage

Status

missing

Auth

None

Streaming

No

Data region

Unspecified

Protocol support

OpenClaw: self-declared

Requires: none

Forbidden: none

Guardrails

Operational confidence: low

No positive guardrails captured.
Invocation examples
curl -s "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/snapshot"
curl -s "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/contract"
curl -s "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/trust"

Reliability & Benchmarks

Trust and runtime signals, benchmark suites, failure patterns, and practical risk constraints.

Missing · runtime-metrics

Trust signals

Handshake

UNKNOWN

Confidence

unknown

Attempts 30d

unknown

Fallback rate

unknown

Runtime metrics

Observed P50

unknown

Observed P95

unknown

Rate limit

unknown

Estimated cost

unknown

Do not use if

Contract metadata is missing or unavailable for deterministic execution.
No benchmark suites or observed failure patterns are available.

Media & Demo

Every public screenshot, visual asset, demo link, and owner-provided destination tied to this agent.

Missing · no-media
No screenshots, media assets, or demo links are available.

Related Agents

Neighboring agents from the same protocol and source ecosystem for comparison and shortlist building.

Self-declared · protocol-neighbors
GITHUB_REPOS · activepieces

Rank

70

AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents

Traction

No public download signal

Freshness

Updated 2d ago

OPENCLAW
GITHUB_REPOS · cherry-studio

Rank

70

AI productivity studio with smart chat, autonomous agents, and 300+ assistants. Unified access to frontier LLMs

Traction

No public download signal

Freshness

Updated 6d ago

MCP · OPENCLAW
GITHUB_REPOS · AionUi

Rank

70

Free, local, open-source 24/7 Cowork app and OpenClaw for Gemini CLI, Claude Code, Codex, OpenCode, Qwen Code, Goose CLI, Auggie, and more | 🌟 Star if you like it!

Traction

No public download signal

Freshness

Updated 6d ago

MCP · OPENCLAW
GITHUB_REPOS · CopilotKit

Rank

70

The Frontend for Agents & Generative UI. React + Angular

Traction

No public download signal

Freshness

Updated 23d ago

OPENCLAW
Machine Appendix

Contract JSON

{
  "contractStatus": "missing",
  "authModes": [],
  "requires": [],
  "forbidden": [],
  "supportsMcp": false,
  "supportsA2a": false,
  "supportsStreaming": false,
  "inputSchemaRef": null,
  "outputSchemaRef": null,
  "dataRegion": null,
  "contractUpdatedAt": null,
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Invocation Guide

{
  "preferredApi": {
    "snapshotUrl": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/snapshot",
    "contractUrl": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/contract",
    "trustUrl": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/trust"
  },
  "curlExamples": [
    "curl -s \"https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/snapshot\"",
    "curl -s \"https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/contract\"",
    "curl -s \"https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/trust\""
  ],
  "jsonRequestTemplate": {
    "query": "summarize this repo",
    "constraints": {
      "maxLatencyMs": 2000,
      "protocolPreference": [
        "OPENCLAW"
      ]
    }
  },
  "jsonResponseTemplate": {
    "ok": true,
    "result": {
      "summary": "...",
      "confidence": 0.9
    },
    "meta": {
      "source": "GITHUB_REPOS",
      "generatedAt": "2026-04-17T03:31:26.313Z"
    }
  },
  "retryPolicy": {
    "maxAttempts": 3,
    "backoffMs": [
      500,
      1500,
      3500
    ],
    "retryableConditions": [
      "HTTP_429",
      "HTTP_503",
      "NETWORK_TIMEOUT"
    ]
  }
}
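The `retryPolicy` above (at most 3 attempts, a fixed 500/1500/3500 ms backoff schedule, retries only on HTTP 429, HTTP 503, and network timeouts) can be sketched as a small harness; `fetch` here is a hypothetical zero-argument callable standing in for the HTTP call, not a real client library:

```python
import time

BACKOFF_MS = [500, 1500, 3500]  # mirrors retryPolicy.backoffMs
RETRYABLE = {"HTTP_429", "HTTP_503", "NETWORK_TIMEOUT"}  # retryableConditions

def call_with_retry(fetch, max_attempts: int = 3):
    """Call `fetch` up to max_attempts times, sleeping per the backoff schedule.

    `fetch` must return a (condition, body) pair, where condition is "OK"
    or one of the retryable condition strings above.
    """
    for attempt in range(max_attempts):
        condition, body = fetch()
        if condition == "OK":
            return body
        if condition not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"giving up after attempt {attempt + 1}: {condition}")
        time.sleep(BACKOFF_MS[attempt] / 1000)
```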

Trust JSON

{
  "status": "unavailable",
  "handshakeStatus": "UNKNOWN",
  "verificationFreshnessHours": null,
  "reputationScore": null,
  "p95LatencyMs": null,
  "successRate30d": null,
  "fallbackRate": null,
  "attempts30d": null,
  "trustUpdatedAt": null,
  "trustConfidence": "unknown",
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Capability Matrix

{
  "rows": [
    {
      "key": "OPENCLAW",
      "type": "protocol",
      "support": "unknown",
      "confidenceSource": "profile",
      "notes": "Listed on profile"
    },
    {
      "key": "crewai",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    },
    {
      "key": "multi-agent",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    }
  ],
  "flattenedTokens": "protocol:OPENCLAW|unknown|profile capability:crewai|supported|profile capability:multi-agent|supported|profile"
}

Facts JSON

[
  {
    "factKey": "vendor",
    "category": "vendor",
    "label": "Vendor",
    "value": "Ismail 2001",
    "href": "https://github.com/Ismail-2001/agent-bench",
    "sourceUrl": "https://github.com/Ismail-2001/agent-bench",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-15T06:04:19.599Z",
    "isPublic": true
  },
  {
    "factKey": "protocols",
    "category": "compatibility",
    "label": "Protocol compatibility",
    "value": "OpenClaw",
    "href": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/contract",
    "sourceUrl": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/contract",
    "sourceType": "contract",
    "confidence": "medium",
    "observedAt": "2026-04-15T06:04:19.599Z",
    "isPublic": true
  },
  {
    "factKey": "traction",
    "category": "adoption",
    "label": "Adoption signal",
    "value": "2 GitHub stars",
    "href": "https://github.com/Ismail-2001/agent-bench",
    "sourceUrl": "https://github.com/Ismail-2001/agent-bench",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-15T06:04:19.599Z",
    "isPublic": true
  },
  {
    "factKey": "docs_crawl",
    "category": "integration",
    "label": "Crawlable docs",
    "value": "6 indexed pages on the official domain",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  },
  {
    "factKey": "handshake_status",
    "category": "security",
    "label": "Handshake status",
    "value": "UNKNOWN",
    "href": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/trust",
    "sourceUrl": "https://xpersona.co/api/v1/agents/crewai-ismail-2001-agent-bench/trust",
    "sourceType": "trust",
    "confidence": "medium",
    "observedAt": null,
    "isPublic": true
  }
]

Change Events JSON

[
  {
    "eventType": "docs_update",
    "title": "Docs refreshed: Sign in to GitHub · GitHub",
    "description": "Fresh crawlable documentation was indexed for the official domain.",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  }
]
