Crawler Summary

evaluation answer-first brief

A general-purpose methodology and toolkit for evaluating product features. It is used to (1) design an evaluation criteria system for a new feature, (2) create scoring prompts for LLM-as-a-Judge, (3) analyze agreement between human and model scores, and (4) iteratively refine the evaluation criteria and prompts. It applies to AI feature evaluation, conversation quality assessment, and software or hardware product experience evaluation. Use this skill when a user needs to design an evaluation system, create scoring criteria, analyze evaluation data, or optimize an evaluation workflow. Capability contract not published. No trust telemetry is available yet. 4 GitHub stars reported by the source. Last updated 4/14/2026.

Freshness

Last checked 4/14/2026

Best For

evaluation is best for general automation workflows where OpenClaw compatibility matters.

Not Ideal For

Contract metadata is missing or unavailable for deterministic execution.

Evidence Sources Checked

editorial-content, GITHUB OPENCLEW, runtime-metrics, public facts pack

Agent Dossier, GitHub, Safety: 94/100

evaluation


OpenClaw (self-declared)

Public facts

5

Change events

1

Artifacts

0

Freshness

Apr 14, 2026

Verified (editorial-content). No verified compatibility signals. 4 GitHub stars.


4 GitHub stars. Trust evidence available.

Trust score

Unknown

Compatibility

OpenClaw

Freshness

Apr 14, 2026

Vendor

Fangmenglin918 Web

Artifacts

0

Benchmarks

0

Last release

Unpublished

Executive Summary

Key links, install path, and a quick operational read before the deeper crawl record.

Verified (editorial-content)

Summary

Capability contract not published. No trust telemetry is available yet. 4 GitHub stars reported by the source. Last updated 4/14/2026.

Setup snapshot

git clone https://github.com/fangmenglin918-web/evaluation-skill.git
  1. Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.

  2. Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.

Evidence Ledger

Everything public we have scraped or crawled about this agent, grouped by evidence type with provenance.

Verified (editorial-content)
Vendor (1)

Vendor

Fangmenglin918 Web

profile (medium confidence)
Observed Apr 14, 2026
Compatibility (1)

Protocol compatibility

OpenClaw

contract (medium confidence)
Observed Apr 14, 2026
Adoption (1)

Adoption signal

4 GitHub stars

profile (medium confidence)
Observed Apr 14, 2026
Security (1)

Handshake status

UNKNOWN

trust (medium confidence)
Observed unknown
Integration (1)

Crawlable docs

6 indexed pages on the official domain

search_document (medium confidence)
Observed Apr 15, 2026

Release & Crawl Timeline

Merged public release, docs, artifact, benchmark, pricing, and trust refresh events.

Self-declared (agent-index)

Artifacts Archive

Extracted files, examples, snippets, parameters, dependencies, permissions, and artifact metadata.

Self-declared (GITHUB OPENCLEW)

Extracted files

0

Examples

2

Snippets

0

Languages

typescript

Parameters

Executable Examples

text

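Define the feature → Break down dimensions → Set scoring criteria → Build human-scored baseline
    ↑                                                                          ↓
Iterate ← Compare consistency ← Run model scoring ← Refine prompt ← Refine criteria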

text

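1. Role (a strict auditor)
2. Task background (evaluation scenario and goal)
3. Score band definitions (detailed rules for 0-4)
4. Scoring procedure (inverted-pyramid screening)
5. Input format description
6. Output format specification (JSON)
7. Reference cases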

Docs & README

Full documentation captured from public sources, including the complete README when available.

Self-declared (GITHUB OPENCLEW)

Docs source

GITHUB OPENCLEW

Editorial quality

ready


Full README

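name: evaluation
description: A general-purpose methodology and toolkit for evaluating product features. It is used to (1) design an evaluation criteria system for a new feature, (2) create scoring prompts for LLM-as-a-Judge, (3) analyze agreement between human and model scores, and (4) iteratively refine the evaluation criteria and prompts. It applies to AI feature evaluation, conversation quality assessment, and software or hardware product experience evaluation. Use this skill when a user needs to design an evaluation system, create scoring criteria, analyze evaluation data, or optimize an evaluation workflow.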

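General Evaluation Methodology Skill

Core idea

Evaluation = a means of measurement that turns the "uncertainty" of model or product output into something "engineered and controllable"

Three main benefits of an evaluation system:

  1. From gut feeling to numbers: turn a vague "it feels better" into concrete metrics
  2. No more whack-a-mole: use regression tests to make sure new features do not break existing capabilities
  3. Support for LLM-as-a-Judge: use a strong model to evaluate a weaker one, enabling iteration in minutes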

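Methodology workflow (PDCA cycle)

Define the feature → Break down dimensions → Set scoring criteria → Build human-scored baseline
    ↑                                                                          ↓
Iterate ← Compare consistency ← Run model scoring ← Refine prompt ← Refine criteria

Quick reference

| Task | Action |
|------|------|
| Design an evaluation system | See "Step 1-3" below and references/dimension_design.md |
| Create a scoring prompt | See "Step 4" and use assets/prompt_template.md |
| Compute consistency metrics | Run scripts/calculate_consistency.py |
| Analyze score discrepancies | Run scripts/analyze_discrepancy.py |
| Detect red-line issues | See assets/redline_template.md |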


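Step 1: Define the feature

Pin down what is being evaluated:

  • What it is: the core definition and boundaries of the feature
  • Who it serves: the target users and usage scenarios
  • What counts as good: the intuitive criteria for good and bad

Deliverable: a feature definition document (1-2 paragraphs)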


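Step 2: Break down the evaluation dimensions

Break "is it good?" into dimensions that can each be evaluated independently. A hierarchical structure is recommended.

General dimension framework (choose according to product type):

| Level-1 dimension | When it applies | Example level-2 dimensions |
|---------|---------|-------------|
| Baseline quality (bottom line) | Required for all products | Safety, factual accuracy, relevance |
| Content quality (core value) | Information-output products | Depth, understandability, information density |
| Interaction experience (user perception) | Interactive products | Fluency, response strategy, context management |
| Emotional connection (emotional value) | Companion or service products | Empathy, stance, persona consistency |
| Expression style (presentation) | Content-generation products | Naturalness, expressive power, formatting conventions |

Dimension design principles

  • Bottom-line dimensions: a veto; any violation gives the lowest score
  • Core dimensions: determine most of the score
  • Bonus dimensions: separate the excellent from the outstanding

See references/dimension_design.md for the detailed guide.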
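The three tiers above can be sketched as an aggregation rule. This is a minimal illustration, not the repository's implementation: the tier labels, the `DimensionScore` type, and the choice to take the weakest core score are all assumptions made for the example. The source only states that bottom-line dimensions veto, core dimensions decide most of the score, and bonus dimensions separate excellent from outstanding.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    name: str
    tier: str   # "bottom_line", "core", or "bonus" (hypothetical labels)
    score: int  # 0-4


def aggregate_score(scores: list[DimensionScore]) -> int:
    """Combine per-dimension scores into one 0-4 score (illustrative rule)."""
    # Bottom-line dimensions veto: any violation (score 0) forces the minimum.
    if any(s.tier == "bottom_line" and s.score == 0 for s in scores):
        return 0
    core = [s.score for s in scores if s.tier == "core"]
    if not core:
        raise ValueError("at least one core dimension is required")
    # Assumption: the weakest core dimension sets the base score, capped at 3.
    base = min(min(core), 3)
    # Bonus dimensions can only lift a fully passing answer (3) to a 4.
    bonus = [s.score for s in scores if s.tier == "bonus"]
    if base == 3 and bonus and all(b == 4 for b in bonus):
        return 4
    return base
```

Capping the core score at 3 keeps the top score dependent on the bonus tier, which is one way to keep 4s rare.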


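Step 3: Set the scoring criteria

3.1 Score band design

A 5-level scale (0-4) is recommended, balancing discrimination against ease of use:

| Score | Definition | Expected share |
|-----|------|---------|
| 0 | Red-line or fatal issue | <5% |
| 1 | No value or severe defect | <10% |
| 2 | Flawed, with obvious problems | ~15% |
| 3 | Acceptable, the standard answer | ~60% |
| 4 | Impressive, outstanding performance | <10% |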
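One way to use the expected shares in the table above is to compare them with the observed distribution of a scored batch. The sketch below assumes scores are plain integers from 0 to 4. The function names and the `TOLERANCE` applied to the approximate ("~") targets are hypothetical, since the source gives no tolerance.

```python
from collections import Counter

# Upper bounds from the table (scores 0, 1 and 4 are listed with "<").
UPPER_BOUNDS = {0: 0.05, 1: 0.10, 4: 0.10}
# Approximate targets from the table (scores 2 and 3 are listed with "~").
TARGETS = {2: 0.15, 3: 0.60}
TOLERANCE = 0.10  # hypothetical allowed deviation from an approximate target


def score_shares(scores):
    """Return the share of each score level 0-4 in the batch."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    return {level: counts.get(level, 0) / len(scores) for level in range(5)}


def distribution_warnings(scores):
    """List the score levels whose share departs from the expected range."""
    shares = score_shares(scores)
    warnings = []
    for level, bound in UPPER_BOUNDS.items():
        if shares[level] >= bound:
            warnings.append(f"score {level}: {shares[level]:.0%} is not below {bound:.0%}")
    for level, target in TARGETS.items():
        if abs(shares[level] - target) > TOLERANCE:
            warnings.append(f"score {level}: {shares[level]:.0%} is far from ~{target:.0%}")
    return warnings
```

A warning for score 4 is the signal for score inflation; a warning for score 0 suggests either a real quality problem or red-line rules that are too broad.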

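3.2 Key points for writing the criteria

Each score band should include:

  • Conditions: the specific behaviors that trigger this score
  • Anchor examples: 2-3 typical cases
  • Boundary notes: what separates it from the adjacent scores

Key principles

  • Check red-line issues first; they are a veto
  • Decide in "inverted pyramid" order: 0 → 1 → 2 → 3 → 4
  • Curb score inflation: a 4 should be extremely rare

See assets/rubric_template.md for the detailed template.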
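The "inverted pyramid" order can be sketched as a chain of checks that runs from 0 upward and stops at the first match. The keys in `findings` and the conditions in `EXAMPLE_CHECKS` are hypothetical placeholders; a real rubric would supply its own conditions per band.

```python
from typing import Callable

# Each check returns True when the response falls into that score band.
Check = Callable[[dict], bool]


def inverted_pyramid_score(findings: dict, checks: list[tuple[int, Check]]) -> int:
    """Walk the bands from 0 upward and return the first one that matches."""
    for score, matches in sorted(checks, key=lambda item: item[0]):
        if matches(findings):
            return score
    raise ValueError("no band matched; the rubric should cover every case")


# Hypothetical band conditions, for illustration only.
EXAMPLE_CHECKS = [
    (0, lambda f: f["redline"]),          # red-line issue: veto
    (1, lambda f: not f["useful"]),       # no value
    (2, lambda f: f["flaws"] > 0),        # obvious flaws
    (3, lambda f: not f["outstanding"]),  # acceptable but not exceptional
    (4, lambda f: True),                  # reached only if every lower band is ruled out
]
```

Because 4 is the last band tested, a response earns it only after every lower band has been excluded, which supports the scarcity principle.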


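Step 4: Create the scoring prompt

4.1 Prompt design principles (Andrew Ng's methodology)

Principle 1: Give clear, specific instructions

  • Use delimiters to separate the parts of the input
  • Require structured output (JSON)
  • Provide few-shot examples
  • State how to handle edge cases

Principle 2: Give the model time to think

  • Require analysis before scoring
  • Lay out the scoring steps (step by step)
  • Enforce a self-check checklist

4.2 Core prompt structure

1. Role (a strict auditor)
2. Task background (evaluation scenario and goal)
3. Score band definitions (detailed rules for 0-4)
4. Scoring procedure (inverted-pyramid screening)
5. Input format description
6. Output format specification (JSON)
7. Reference cases

See assets/prompt_template.md for the full template.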
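The seven-part structure above can be assembled programmatically. The sketch below is not the content of assets/prompt_template.md, which is not reproduced here. The section wording, the `<sample>` delimiter, and the JSON field names `analysis` and `score` are assumptions chosen to illustrate delimiters, analysis before scoring, and structured output.

```python
import json


def build_judge_prompt(task_background, rubric, examples, sample):
    """Assemble a scoring prompt following the seven-part structure."""
    bands = "\n".join(f"{score}: {rule}" for score, rule in sorted(rubric.items()))
    sections = [
        "## Role\nYou are a strict auditor.",
        f"## Task background\n{task_background}",
        f"## Score bands (0-4)\n{bands}",
        "## Scoring procedure\nCheck the bands in the order 0, 1, 2, 3, 4 and stop at "
        "the first match. Write your analysis before giving the score.",
        "## Input format\nThe sample to score is enclosed in <sample> tags.",
        '## Output format\nReturn only JSON: {"analysis": "<string>", "score": <integer 0-4>}',
        "## Reference cases\n" + "\n".join(examples),
        f"<sample>\n{sample}\n</sample>",
    ]
    return "\n\n".join(sections)


def parse_judge_reply(reply: str) -> int:
    """Extract and validate the score from the judge's JSON reply."""
    score = json.loads(reply)["score"]
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 4:
        raise ValueError(f"score out of range: {score!r}")
    return score
```

Validating the reply before using it keeps malformed judge output from silently entering the consistency statistics.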


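Step 5: Build the human-scored baseline

5.1 Execution points

  • Independent rating: at least 2 people score each sample independently
  • Flag differences: a disagreement of ≥2 points marks the sample as disputed
  • Resolve disputes: reach consensus through discussion, or decide by majority
  • Record reasons: log the cause of each dispute and the basis for the final score
  • Annotate dimensions: record which dimension triggered the score, to support later analysis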
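The dispute rule above (a gap of 2 or more points between raters) is simple to automate. This is a minimal sketch that assumes ratings are stored as a mapping from a sample id to the list of independent scores; the function name and data layout are illustrative.

```python
def flag_disputed(ratings: dict[str, list[int]], threshold: int = 2) -> list[str]:
    """Return the ids of samples whose ratings differ by at least `threshold`."""
    disputed = []
    for sample_id, scores in ratings.items():
        if len(scores) < 2:
            raise ValueError(f"{sample_id}: need at least 2 independent ratings")
        # Compare the most lenient and the strictest rater.
        if max(scores) - min(scores) >= threshold:
            disputed.append(sample_id)
    return disputed
```

For example, `flag_disputed({"a": [3, 3], "b": [1, 3]})` returns `["b"]`, which would then go to discussion or a majority decision.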

5.2 Baseline quality requirements

  • Inter-rater Kappa > 0.6
  • Samples cover the typical scenarios and edge cases
  • Every score has a clearly recorded rationale

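Step 6: Consistency analysis and iteration

6.1 Core metrics

| Metric | How it is computed | Target |
|-----|---------|---------|
| Exact agreement rate | Share of samples with identical scores | ≥70% |
| Adjacent agreement rate | Share of samples with a difference of ≤1 point | ≥90% |
| Cohen's Kappa | Agreement corrected for chance | ≥0.6 |
| MAE | Mean absolute error | ≤0.5 |

Use scripts/calculate_consistency.py to compute them.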
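The four metrics in the table above can be computed directly from two aligned score lists. The sketch below is not the repository's scripts/calculate_consistency.py, whose contents are not shown here; the function name and return keys are illustrative. Cohen's Kappa is computed in its standard unweighted form, (p_o - p_e) / (1 - p_e), where p_o is the observed exact agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def consistency_metrics(human, model):
    """Compare human and model scores given as equal-length lists of integers."""
    if not human or len(human) != len(model):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(human)
    diffs = [abs(h - m) for h, m in zip(human, model)]
    exact = sum(d == 0 for d in diffs) / n      # identical scores
    adjacent = sum(d <= 1 for d in diffs) / n   # within one point
    mae = sum(diffs) / n                        # mean absolute error
    # Chance agreement: product of the two marginal distributions per score.
    h_counts, m_counts = Counter(human), Counter(model)
    expected = sum(h_counts[k] * m_counts[k] for k in h_counts) / (n * n)
    # Kappa is undefined when chance agreement is already 1.
    kappa = (exact - expected) / (1 - expected) if expected < 1 else float("nan")
    return {"exact": exact, "adjacent": adjacent, "kappa": kappa, "mae": mae}
```

The result can be compared with the targets in the table, for instance `metrics["exact"] >= 0.70 and metrics["mae"] <= 0.5`.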

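6.2 Discrepancy analysis

Look for the following patterns:

  • Systematically high or low scores: adjust how strict the prompt is
  • Misjudgment of a specific type: add targeted instructions
  • Inaccurate boundary calls: strengthen the boundary cases

Use scripts/analyze_discrepancy.py for the analysis.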
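Two of the patterns above can be surfaced with simple aggregates: a signed mean difference reveals systematic leniency or strictness, and a per-category error highlights sample types that are misjudged. This sketch is an illustration, not the repository's scripts/analyze_discrepancy.py; the record layout `(category, human_score, model_score)` and the function name are assumptions.

```python
from collections import defaultdict


def discrepancy_report(records: list[tuple[str, int, int]]) -> dict:
    """Summarize model-vs-human differences overall and per category."""
    if not records:
        raise ValueError("records must not be empty")
    # Positive bias means the model scores higher than humans on average.
    signed = [model - human for _, human, model in records]
    by_category = defaultdict(list)
    for category, human, model in records:
        by_category[category].append(abs(model - human))
    return {
        "mean_bias": sum(signed) / len(signed),
        "mae_by_category": {c: sum(v) / len(v) for c, v in by_category.items()},
    }
```

A clearly positive `mean_bias` points to tightening the prompt, while one category with a high error points to adding targeted instructions for that type.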

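6.3 Iteration strategy

  • Change only 1-2 problem points per round
  • Set a time box (1-2 weeks per round)
  • Consider the process converged when 3 consecutive rounds improve by <1%
  • Keep an independent validation set to avoid overfitting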
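The convergence rule above can be checked against a per-round metric history. The sketch below assumes the metric is a fraction such as the exact agreement rate and reads "<1%" as an absolute gain below 0.01 per round. That reading, and the function name, are assumptions; the source does not say whether the threshold is absolute or relative.

```python
def has_converged(history, min_gain=0.01, patience=3):
    """Return True when the last `patience` rounds each gained less than `min_gain`.

    `history` holds one metric value per round, oldest first.
    """
    if len(history) < patience + 1:
        return False  # not enough rounds to judge
    recent = history[-(patience + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)
```

The metric should be measured on the independent validation set, so that convergence reflects generalization instead of fitting to the tuning samples.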

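Practical lessons

  1. Iterate in small steps: fix only 1-2 problems at a time so effects can be attributed
  2. Let numbers drive: argue from data and set explicit targets
  3. Edge cases are key: 80% of disagreements come from 20% of edge cases
  4. Human-machine collaboration: humans set the criteria and review; the model executes in bulk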

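Resource index

  • references/dimension_design.md - detailed guide to breaking down dimensions
  • references/consistency_metrics.md - consistency metrics in detail
  • assets/rubric_template.md - scoring rubric template
  • assets/prompt_template.md - scoring prompt template
  • assets/redline_template.md - red-line detection prompt template
  • scripts/calculate_consistency.py - consistency calculation
  • scripts/analyze_discrepancy.py - discrepancy analysis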

Contract & API

Machine endpoints, protocol fit, contract coverage, invocation examples, and guardrails for agent-to-agent use.

Missing (GITHUB OPENCLEW)

Contract coverage

Status

missing

Auth

None

Streaming

No

Data region

Unspecified

Protocol support

OpenClaw: self-declared

Requires: none

Forbidden: none

Guardrails

Operational confidence: low

No positive guardrails captured.

Invocation examples
curl -s "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/snapshot"
curl -s "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/contract"
curl -s "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/trust"

Reliability & Benchmarks

Trust and runtime signals, benchmark suites, failure patterns, and practical risk constraints.

Missing (runtime-metrics)

Trust signals

Handshake

UNKNOWN

Confidence

unknown

Attempts 30d

unknown

Fallback rate

unknown

Runtime metrics

Observed P50

unknown

Observed P95

unknown

Rate limit

unknown

Estimated cost

unknown

Do not use if

Contract metadata is missing or unavailable for deterministic execution.
No benchmark suites or observed failure patterns are available.

Media & Demo

Every public screenshot, visual asset, demo link, and owner-provided destination tied to this agent.

Missing (no-media)
No screenshots, media assets, or demo links are available.

Related Agents

Neighboring agents from the same protocol and source ecosystem for comparison and shortlist building.

Self-declared (protocol-neighbors)

GITHUB_REPOS: activepieces

Rank

70

AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents

Traction

No public download signal

Freshness

Updated 2d ago

OPENCLAW

GITHUB_REPOS: cherry-studio

Rank

70

AI productivity studio with smart chat, autonomous agents, and 300+ assistants. Unified access to frontier LLMs

Traction

No public download signal

Freshness

Updated 5d ago

MCP, OPENCLAW

GITHUB_REPOS: AionUi

Rank

70

Free, local, open-source 24/7 Cowork app and OpenClaw for Gemini CLI, Claude Code, Codex, OpenCode, Qwen Code, Goose CLI, Auggie, and more | 🌟 Star if you like it!

Traction

No public download signal

Freshness

Updated 6d ago

MCP, OPENCLAW

GITHUB_REPOS: CopilotKit

Rank

70

The Frontend for Agents & Generative UI. React + Angular

Traction

No public download signal

Freshness

Updated 23d ago

OPENCLAW

Machine Appendix

Contract JSON

{
  "contractStatus": "missing",
  "authModes": [],
  "requires": [],
  "forbidden": [],
  "supportsMcp": false,
  "supportsA2a": false,
  "supportsStreaming": false,
  "inputSchemaRef": null,
  "outputSchemaRef": null,
  "dataRegion": null,
  "contractUpdatedAt": null,
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Invocation Guide

{
  "preferredApi": {
    "snapshotUrl": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/snapshot",
    "contractUrl": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/contract",
    "trustUrl": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/trust"
  },
  "curlExamples": [
    "curl -s \"https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/snapshot\"",
    "curl -s \"https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/contract\"",
    "curl -s \"https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/trust\""
  ],
  "jsonRequestTemplate": {
    "query": "summarize this repo",
    "constraints": {
      "maxLatencyMs": 2000,
      "protocolPreference": [
        "OPENCLEW"
      ]
    }
  },
  "jsonResponseTemplate": {
    "ok": true,
    "result": {
      "summary": "...",
      "confidence": 0.9
    },
    "meta": {
      "source": "GITHUB_OPENCLEW",
      "generatedAt": "2026-04-17T00:49:17.337Z"
    }
  },
  "retryPolicy": {
    "maxAttempts": 3,
    "backoffMs": [
      500,
      1500,
      3500
    ],
    "retryableConditions": [
      "HTTP_429",
      "HTTP_503",
      "NETWORK_TIMEOUT"
    ]
  }
}
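The `retryPolicy` in the guide above can be read as a bounded retry loop with a fixed backoff schedule. The sketch below is one possible interpretation, not a documented client. The `request` callable is a hypothetical stand-in for whatever HTTP client is used, and how the three `backoffMs` values map onto `maxAttempts` of 3 is an assumption (only the waits between attempts are used).

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3                # "maxAttempts"
BACKOFF_MS = [500, 1500, 3500]  # "backoffMs"
RETRYABLE_STATUS = {429, 503}   # "HTTP_429", "HTTP_503"


def call_with_retry(
    request: Callable[[], tuple[int, bytes]],
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bytes]:
    """Run `request` up to MAX_ATTEMPTS times, backing off between attempts.

    `request` returns (status_code, body) and raises TimeoutError on a
    network timeout ("NETWORK_TIMEOUT").
    """
    for attempt in range(MAX_ATTEMPTS):
        last_attempt = attempt == MAX_ATTEMPTS - 1
        try:
            status, body = request()
        except TimeoutError:
            if last_attempt:
                raise
        else:
            # Return on success, on a non-retryable status, or when out of attempts.
            if status not in RETRYABLE_STATUS or last_attempt:
                return status, body
        sleep(BACKOFF_MS[attempt] / 1000)
    raise AssertionError("unreachable: the last attempt always returns or raises")
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.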

Trust JSON

{
  "status": "unavailable",
  "handshakeStatus": "UNKNOWN",
  "verificationFreshnessHours": null,
  "reputationScore": null,
  "p95LatencyMs": null,
  "successRate30d": null,
  "fallbackRate": null,
  "attempts30d": null,
  "trustUpdatedAt": null,
  "trustConfidence": "unknown",
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Capability Matrix

{
  "rows": [
    {
      "key": "OPENCLEW",
      "type": "protocol",
      "support": "unknown",
      "confidenceSource": "profile",
      "notes": "Listed on profile"
    }
  ],
  "flattenedTokens": "protocol:OPENCLEW|unknown|profile"
}

Facts JSON

[
  {
    "factKey": "docs_crawl",
    "category": "integration",
    "label": "Crawlable docs",
    "value": "6 indexed pages on the official domain",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  },
  {
    "factKey": "vendor",
    "category": "vendor",
    "label": "Vendor",
    "value": "Fangmenglin918 Web",
    "href": "https://github.com/fangmenglin918-web/evaluation-skill",
    "sourceUrl": "https://github.com/fangmenglin918-web/evaluation-skill",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-14T22:27:18.522Z",
    "isPublic": true
  },
  {
    "factKey": "protocols",
    "category": "compatibility",
    "label": "Protocol compatibility",
    "value": "OpenClaw",
    "href": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/contract",
    "sourceUrl": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/contract",
    "sourceType": "contract",
    "confidence": "medium",
    "observedAt": "2026-04-14T22:27:18.522Z",
    "isPublic": true
  },
  {
    "factKey": "traction",
    "category": "adoption",
    "label": "Adoption signal",
    "value": "4 GitHub stars",
    "href": "https://github.com/fangmenglin918-web/evaluation-skill",
    "sourceUrl": "https://github.com/fangmenglin918-web/evaluation-skill",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-14T22:27:18.522Z",
    "isPublic": true
  },
  {
    "factKey": "handshake_status",
    "category": "security",
    "label": "Handshake status",
    "value": "UNKNOWN",
    "href": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/trust",
    "sourceUrl": "https://xpersona.co/api/v1/agents/fangmenglin918-web-evaluation-skill/trust",
    "sourceType": "trust",
    "confidence": "medium",
    "observedAt": null,
    "isPublic": true
  }
]

Change Events JSON

[
  {
    "eventType": "docs_update",
    "title": "Docs refreshed: Sign in to GitHub · GitHub",
    "description": "Fresh crawlable documentation was indexed for the official domain.",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  }
]
