Agent Dossier · GITHUB OPENCLEW · Safety 80/100

Xpersona Agent

context-tuning

Systematic tuning loop for any AI system. Use when asked to: (1) tune/optimize prompts, tools, or agent behavior, (2) improve system performance iteratively, (3) set up evaluation criteria for a system, (4) run optimization experiments. Collaboratively defines objectives and scoring with the user, then iterates with git checkpointing.

OpenClaw · self-declared
Trust evidence available
git clone https://github.com/vinceyyy/context-tuning-skill.git

Overall rank

#23

Adoption

No public adoption signal

Trust

Unknown

Freshness

Last checked Apr 15, 2026

Best For

context-tuning is best for workflows where OpenClaw compatibility matters.

Not Ideal For

Contract metadata is missing or unavailable for deterministic execution.

Evidence Sources Checked

editorial-content, GITHUB OPENCLEW, runtime-metrics, public facts pack

Overview

Key links, install path, reliability highlights, and the shortest practical read before diving into the crawl record.

Verified · editorial-content

Overview

Executive Summary

Systematic tuning loop for any AI system. Use when asked to: (1) tune/optimize prompts, tools, or agent behavior, (2) improve system performance iteratively, (3) set up evaluation criteria for a system, (4) run optimization experiments. Collaboratively defines objectives and scoring with the user, then iterates with git checkpointing. Capability contract not published. No trust telemetry is available yet. Last updated Apr 15, 2026.

No verified compatibility signals

Trust score

Unknown

Compatibility

OpenClaw

Freshness

Apr 15, 2026

Vendor

Vinceyyy

Artifacts

0

Benchmarks

0

Last release

Unpublished

Install & run

Setup Snapshot

git clone https://github.com/vinceyyy/context-tuning-skill.git
  1. Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.

  2. Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.

Evidence & Timeline

Public facts grouped by evidence type, plus release and crawl events with provenance and freshness.

Verified · editorial-content

Public facts

Evidence Ledger

Vendor (1)

Vendor

Vinceyyy

profile · medium confidence
Observed Apr 15, 2026 · Source link · Provenance
Compatibility (1)

Protocol compatibility

OpenClaw

contract · medium confidence
Observed Apr 15, 2026 · Source link · Provenance
Security (1)

Handshake status

UNKNOWN

trust · medium confidence
Observed unknown · Source link · Provenance
Integration (1)

Crawlable docs

6 indexed pages on the official domain

search_document · medium confidence
Observed Apr 15, 2026 · Source link · Provenance

Artifacts & Docs

Parameters, dependencies, examples, extracted files, editorial overview, and the complete README when available.

Self-declared · GITHUB OPENCLEW

Captured outputs

Artifacts Archive

Extracted files

0

Examples

6

Snippets

0

Languages

typescript

Parameters

Executable Examples

text

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: OBJECTIVE DISCOVERY                                   │
│  Understand what user wants to optimize → Refine through dialog │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: SCORING SYSTEM DESIGN                                 │
│  Propose dimensions & rubric → Refine with user feedback        │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: BASELINE & VALIDATION                                 │
│  Run system once → Score with rubric → Validate with user       │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 4: CODEBASE ANALYSIS                                     │
│  Map tunable components → Compare to best practices             │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 5: ITERATION LOOP                                        │
│  Evaluate → Identify weakness → Apply ONE fix → Checkpoint      │
└─────────────────────────────────────────────────────────────────┘

text

"So if I understand correctly, you want to optimize [system] to:
- [Primary goal]
- [Secondary goal]
- While avoiding [failure mode]

Is that right? Anything to add or adjust?"

markdown

# Tuning Session

**Started**: {timestamp}
**System**: {description of what's being tuned}
**Status**: Defining objectives

## Objectives

### Primary Goal
{what success looks like}

### Secondary Goals
- {goal 2}
- {goal 3}

### Known Issues
- {current problem 1}
- {current problem 2}

## Scoring System
(to be defined)

## Iteration Log
(to be added)

text

Based on your objectives, I propose evaluating on these dimensions:

1. **[Dimension Name]** (weight: X%)
   - What it measures: [description]
   - Why it matters: [maps to objective X]
   
2. **[Dimension Name]** (weight: X%)
   - What it measures: [description]
   - Why it matters: [maps to objective Y]

Does this capture what matters? Should we add, remove, or adjust anything?

text

For **[Dimension]**, I'd score like this:

| Score | Criteria |
|-------|----------|
| 9-10  | [excellent - specific description] |
| 7-8   | [good - specific description] |
| 4-6   | [needs work - specific description] |
| 1-3   | [poor - specific description] |
| 0     | [failure - specific description] |

Does this match your intuition? Any criteria to adjust?

markdown

## Scoring System

**Threshold**: {N.N}
**Max Iterations**: {N}

### Dimensions

#### {Dimension 1} ({weight}%)
{description}

| Score | Criteria |
|-------|----------|
| 9-10  | ... |
| 7-8   | ... |
| 4-6   | ... |
| 1-3   | ... |

#### {Dimension 2} ({weight}%)
...

Editorial read

Docs & README

Docs source

GITHUB OPENCLEW

Editorial quality

ready

Systematic tuning loop for any AI system. Use when asked to: (1) tune/optimize prompts, tools, or agent behavior, (2) improve system performance iteratively, (3) set up evaluation criteria for a system, (4) run optimization experiments. Collaboratively defines objectives and scoring with the user, then iterates with git checkpointing.

Full README

---
name: context-tuning
description: >
  Systematic tuning loop for any AI system. Use when asked to: (1) tune/optimize
  prompts, tools, or agent behavior, (2) improve system performance iteratively,
  (3) set up evaluation criteria for a system, (4) run optimization experiments.
  Collaboratively defines objectives and scoring with the user, then iterates
  with git checkpointing.
---

Context Tuning Skill

Systematic, evaluation-driven optimization for AI systems. Collaboratively define what "good" means, then iteratively improve until you get there.

Process Overview

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: OBJECTIVE DISCOVERY                                   │
│  Understand what user wants to optimize → Refine through dialog │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: SCORING SYSTEM DESIGN                                 │
│  Propose dimensions & rubric → Refine with user feedback        │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: BASELINE & VALIDATION                                 │
│  Run system once → Score with rubric → Validate with user       │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 4: CODEBASE ANALYSIS                                     │
│  Map tunable components → Compare to best practices             │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 5: ITERATION LOOP                                        │
│  Evaluate → Identify weakness → Apply ONE fix → Checkpoint      │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Objective Discovery

Goal: Understand what the user wants to optimize.

1.1 Initial questions

Ask the user (one or two at a time, not all at once):

  1. "What system are you trying to improve?" (agent, prompt, pipeline, etc.)
  2. "What does success look like? What should it do well?"
  3. "What's currently not working or could be better?"
  4. "Do you have examples of good vs. bad outputs?"

1.2 Clarify and refine

Based on answers, reflect back understanding:

"So if I understand correctly, you want to optimize [system] to:
- [Primary goal]
- [Secondary goal]
- While avoiding [failure mode]

Is that right? Anything to add or adjust?"

1.3 Document objective

Once confirmed, create initial session notes at docs/tuning/{date}-session.md:

# Tuning Session

**Started**: {timestamp}
**System**: {description of what's being tuned}
**Status**: Defining objectives

## Objectives

### Primary Goal
{what success looks like}

### Secondary Goals
- {goal 2}
- {goal 3}

### Known Issues
- {current problem 1}
- {current problem 2}

## Scoring System
(to be defined)

## Iteration Log
(to be added)

Phase 2: Scoring System Design

Goal: Create a custom rubric tailored to the user's objectives.

2.1 Propose dimensions

Based on objectives, propose 2-4 evaluation dimensions. Each dimension should:

  • Map to a stated objective or known issue
  • Be observable in system output
  • Be scorable on a 0-10 scale

Example proposal format:

Based on your objectives, I propose evaluating on these dimensions:

1. **[Dimension Name]** (weight: X%)
   - What it measures: [description]
   - Why it matters: [maps to objective X]
   
2. **[Dimension Name]** (weight: X%)
   - What it measures: [description]
   - Why it matters: [maps to objective Y]

Does this capture what matters? Should we add, remove, or adjust anything?

See references/rubric-templates.md for common dimension patterns.

2.2 Define scoring criteria

For each dimension, propose specific scoring criteria:

For **[Dimension]**, I'd score like this:

| Score | Criteria |
|-------|----------|
| 9-10  | [excellent - specific description] |
| 7-8   | [good - specific description] |
| 4-6   | [needs work - specific description] |
| 1-3   | [poor - specific description] |
| 0     | [failure - specific description] |

Does this match your intuition? Any criteria to adjust?

2.3 Set threshold and weights

Confirm with user:

  • Pass threshold (default: 7.0 for "good enough", 8.0 for "high quality")
  • Dimension weights (should sum to 100%)
  • Max iterations (default: 5)
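
The weighted overall score implied by 2.3 can be sketched as a small helper. This is a minimal sketch; the dimension names, scores, and weights below are illustrative, not taken from any real session:

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted overall score. Weights are percentages and must sum to 100."""
    if abs(sum(weights.values()) - 100.0) > 1e-9:
        raise ValueError("dimension weights must sum to 100%")
    return sum(scores[dim] * weights[dim] / 100.0 for dim in weights)

# Illustrative dimensions (hypothetical names):
weights = {"accuracy": 60.0, "conciseness": 40.0}
scores = {"accuracy": 8.0, "conciseness": 6.0}
overall = overall_score(scores, weights)  # 8.0*0.6 + 6.0*0.4 = 7.2
```

Against the 7.0 default threshold, this illustrative baseline would pass only marginally, so the weakest dimension (conciseness at 6.0) would still be the iteration target.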

2.4 Document scoring system

Add to session notes:

## Scoring System

**Threshold**: {N.N}
**Max Iterations**: {N}

### Dimensions

#### {Dimension 1} ({weight}%)
{description}

| Score | Criteria |
|-------|----------|
| 9-10  | ... |
| 7-8   | ... |
| 4-6   | ... |
| 1-3   | ... |

#### {Dimension 2} ({weight}%)
...

Phase 3: Baseline & Validation

Goal: Verify the scoring system works and establish baseline.

3.1 Verify git state

Run git status --porcelain. If dirty, ask user to commit or stash first.
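
This check can be sketched as a pure helper over the `git status --porcelain` output, which prints one line per staged, unstaged, or untracked change and nothing when the worktree is clean. The helper name is an illustration, not part of the skill:

```python
def is_worktree_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` printed nothing at all,
    i.e. no staged, unstaged, or untracked changes."""
    return porcelain_output.strip() == ""

# Obtaining the output (sketch, not executed here):
#   import subprocess
#   out = subprocess.run(["git", "status", "--porcelain"],
#                        capture_output=True, text=True, check=True).stdout
#   If is_worktree_clean(out) is False, ask the user to commit or stash first.
```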

3.2 Run system once

Execute the system with a representative input. Capture full output/trace.

3.3 Score with new rubric

Apply the scoring system. Show work:

**Baseline Evaluation**

Input: {what was tested}

**{Dimension 1}**: {score}/10
- Evidence: {specific observation}
- Reasoning: {why this score}

**{Dimension 2}**: {score}/10
- Evidence: {specific observation}
- Reasoning: {why this score}

**Overall**: {weighted score}

3.4 Validate with user

Ask for confirmation:

"Does this scoring feel right? 

- Does a {X}/10 on {Dimension 1} match your intuition?
- Is there anything the rubric missed or misjudged?
- Should we adjust the criteria before proceeding?"

If adjustments needed, return to Phase 2. Otherwise, proceed.

3.5 Commit baseline

git add docs/tuning/{date}-session.md
git commit -m "tune: begin session - baseline {overall_score}"

Phase 4: Codebase Analysis

Goal: Understand what can be tuned and identify opportunities.

4.1 Map tunable components

Explore the codebase to identify:

| Component Type | What to Look For |
|----------------|------------------|
| System prompts | Main instructions, role definitions |
| Tool definitions | Names, descriptions, parameters |
| Tool implementations | Return values, error handling |
| Orchestration | Agent loops, routing logic, handoffs |
| Context management | What's included, summarization, memory |

Document findings:

## Tunable Components

### Prompts
- `path/to/prompt.py`: Main system prompt (~200 lines)
- `path/to/agent.py`: Agent instructions

### Tools
- `tool_name`: {purpose} - description could be clearer
- `other_tool`: {purpose} - parameters ambiguous

### Orchestration
- Single agent / Multi-agent with {pattern}
- Loop exits when: {conditions}

4.2 Compare to best practices

See references/component-checklist.md for what good looks like.

Identify gaps:

## Improvement Opportunities

### High Priority (likely impact on failing dimensions)
- [ ] {Specific issue}: {maps to Dimension X}
- [ ] {Specific issue}: {maps to Dimension Y}

### Medium Priority
- [ ] {Issue}

### Low Priority / Nice to Have
- [ ] {Issue}

4.3 Propose iteration plan

Based on the baseline score and codebase analysis:

**Weakest dimension**: {dimension} at {score}
**Root cause hypothesis**: {what I think is causing it}
**Proposed first fix**: {specific change}

Does this plan make sense? Ready to start iterating?

Phase 5: Iteration Loop

Goal: Systematically improve until threshold met or plateau reached.

5.1 Evaluate

Run system 3x for stability. Score each dimension. Report:

**Iteration {N}**

| Dimension | Score | vs Threshold | Δ from Last |
|-----------|-------|--------------|-------------|
| {Dim 1}   | X.X   | {pass/fail}  | +/-X.X      |
| {Dim 2}   | X.X   | {pass/fail}  | +/-X.X      |
| **Overall** | X.X | {pass/fail}  | +/-X.X      |

5.2 Check convergence

| Condition | Criteria | Action |
|-----------|----------|--------|
| SUCCESS | All dimensions ≥ threshold | Go to Completion |
| PLATEAU | <0.3 improvement over 3 iterations | Go to Completion |
| MAX_ITER | Reached limit | Go to Completion |
| REGRESSION | Score dropped significantly | Revert and try different fix |
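
The convergence rules in 5.2 can be sketched as a single check over the history of overall scores. This simplifies the SUCCESS condition to the overall score (the rules check every dimension), and the 0.5 regression cutoff is an assumption the rules leave unspecified:

```python
def check_convergence(history: list[float], threshold: float, max_iters: int,
                      plateau_delta: float = 0.3, plateau_window: int = 3,
                      regression_delta: float = 0.5) -> str:
    """Map the overall-score history (history[0] is the baseline,
    history[-1] the latest iteration) to the next action."""
    latest = history[-1]
    if latest >= threshold:
        return "SUCCESS"      # threshold met -> go to Completion
    if len(history) > plateau_window and \
            latest - history[-1 - plateau_window] < plateau_delta:
        return "PLATEAU"      # <0.3 improvement over 3 iterations
    if len(history) - 1 >= max_iters:
        return "MAX_ITER"     # iteration limit reached
    if len(history) >= 2 and latest < history[-2] - regression_delta:
        return "REGRESSION"   # significant drop -> revert, try another fix
    return "CONTINUE"
```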

5.3 Identify target and pattern

Find lowest dimension below threshold. Analyze evidence for failure pattern.

See references/failure-patterns.md for pattern catalog.

5.4 Select and apply fix

ONE change per iteration to isolate effects.

See references/fix-techniques.md for technique selection.

Before applying, self-review:

  1. Does this directly address the observed failure?
  2. Could it break something currently passing?
  3. Is this the minimal change that could work?

5.5 Checkpoint

Update session notes with iteration entry:

### Iteration {N} - {timestamp}

**Scores**: {dim1}={X.X}, {dim2}={X.X}
**Target**: {dimension} (at {X.X})
**Pattern**: {what went wrong}
**Evidence**: {specific example}

**Change**:
- File: {path}
- Technique: {from fix-techniques}
```diff
- {old}
+ {new}
```

**Result**: {improved/no change/regression}

Commit:

```bash
git add -A
git commit -m "tune(iter-{N}): {description} [{dim}: {before}→{after}]"
```

5.6 Return to 5.1


Completion

Final summary

## Summary

**Status**: {success/plateau/max_iterations}
**Iterations**: {N}
**Improvement**: {baseline} → {final} (+{delta})

### Score Progression
| Iter | {Dim1} | {Dim2} | Overall |
|------|--------|--------|---------|
| 0    | X.X    | X.X    | X.X     |
| ...  | ...    | ...    | ...     |

### What Worked
- {technique}: {dimension} {before}→{after}

### What Didn't Work
- {technique}: {result}

### Recommendations
- {any remaining improvements to consider}

Final commit:

git commit -m "tune: complete - {status} [overall: {baseline}→{final}]"

Recovery

Regression

git revert HEAD --no-edit

Record: **Result**: REGRESSION - reverted. Try a different technique.

Resume (--continue)

Read session notes, find last iteration, resume from Phase 5.


Key Principles

  1. Collaborate on objectives - User defines what "good" means
  2. Validate the rubric - Test scoring before iterating
  3. ONE change per iteration - Isolate effects
  4. Evidence-based fixes - Only address observed failures
  5. Checkpoint everything - Git commit each iteration

API & Reliability

Machine endpoints, contract coverage, trust signals, runtime metrics, benchmarks, and guardrails for agent-to-agent use.

Missing · GITHUB OPENCLEW

Machine interfaces

Contract & API

Contract coverage

Status

missing

Auth

None

Streaming

No

Data region

Unspecified

Protocol support

OpenClaw: self-declared

Requires: none

Forbidden: none

Guardrails

Operational confidence: low

No positive guardrails captured.
Invocation examples
curl -s "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/snapshot"
curl -s "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/contract"
curl -s "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/trust"

Operational fit

Reliability & Benchmarks

Trust signals

Handshake

UNKNOWN

Confidence

unknown

Attempts 30d

unknown

Fallback rate

unknown

Runtime metrics

Observed P50

unknown

Observed P95

unknown

Rate limit

unknown

Estimated cost

unknown

Do not use if

Contract metadata is missing or unavailable for deterministic execution.
No benchmark suites or observed failure patterns are available.

Machine Appendix

Raw contract, invocation, trust, capability, facts, and change-event payloads for machine-side inspection.

Missing · GITHUB OPENCLEW

Contract JSON

{
  "contractStatus": "missing",
  "authModes": [],
  "requires": [],
  "forbidden": [],
  "supportsMcp": false,
  "supportsA2a": false,
  "supportsStreaming": false,
  "inputSchemaRef": null,
  "outputSchemaRef": null,
  "dataRegion": null,
  "contractUpdatedAt": null,
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Invocation Guide

{
  "preferredApi": {
    "snapshotUrl": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/snapshot",
    "contractUrl": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/contract",
    "trustUrl": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/trust"
  },
  "curlExamples": [
    "curl -s \"https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/snapshot\"",
    "curl -s \"https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/contract\"",
    "curl -s \"https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/trust\""
  ],
  "jsonRequestTemplate": {
    "query": "summarize this repo",
    "constraints": {
      "maxLatencyMs": 2000,
      "protocolPreference": [
        "OPENCLEW"
      ]
    }
  },
  "jsonResponseTemplate": {
    "ok": true,
    "result": {
      "summary": "...",
      "confidence": 0.9
    },
    "meta": {
      "source": "GITHUB_OPENCLEW",
      "generatedAt": "2026-04-17T04:47:33.639Z"
    }
  },
  "retryPolicy": {
    "maxAttempts": 3,
    "backoffMs": [
      500,
      1500,
      3500
    ],
    "retryableConditions": [
      "HTTP_429",
      "HTTP_503",
      "NETWORK_TIMEOUT"
    ]
  }
}
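
The declared `retryPolicy` can be honored with a small client-side wrapper. This is a sketch: the `{"ok": ..., "error": ...}` result shape mirrors the JSON templates above but is otherwise an assumption, and `sleep` is injectable so the backoff can be exercised without waiting:

```python
import time
from typing import Callable

# Retryable conditions from the declared retryPolicy.
RETRYABLE = {"HTTP_429", "HTTP_503", "NETWORK_TIMEOUT"}

def call_with_retry(call: Callable[[], dict], max_attempts: int = 3,
                    backoff_ms: tuple = (500, 1500, 3500),
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `call` per the declared policy: up to max_attempts tries,
    sleeping the matching backoff before each retry; anything outside
    the retryable conditions is returned to the caller immediately."""
    result: dict = {}
    for attempt in range(max_attempts):
        result = call()
        if result.get("ok"):
            return result
        if result.get("error") not in RETRYABLE or attempt == max_attempts - 1:
            return result
        sleep(backoff_ms[attempt] / 1000.0)
    return result
```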

Trust JSON

{
  "status": "unavailable",
  "handshakeStatus": "UNKNOWN",
  "verificationFreshnessHours": null,
  "reputationScore": null,
  "p95LatencyMs": null,
  "successRate30d": null,
  "fallbackRate": null,
  "attempts30d": null,
  "trustUpdatedAt": null,
  "trustConfidence": "unknown",
  "sourceUpdatedAt": null,
  "freshnessSeconds": null
}

Capability Matrix

{
  "rows": [
    {
      "key": "OPENCLEW",
      "type": "protocol",
      "support": "unknown",
      "confidenceSource": "profile",
      "notes": "Listed on profile"
    },
    {
      "key": "be",
      "type": "capability",
      "support": "supported",
      "confidenceSource": "profile",
      "notes": "Declared in agent profile metadata"
    }
  ],
  "flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:be|supported|profile"
}

Facts JSON

[
  {
    "factKey": "vendor",
    "category": "vendor",
    "label": "Vendor",
    "value": "Vinceyyy",
    "href": "https://github.com/vinceyyy/context-tuning-skill",
    "sourceUrl": "https://github.com/vinceyyy/context-tuning-skill",
    "sourceType": "profile",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:21:22.124Z",
    "isPublic": true
  },
  {
    "factKey": "protocols",
    "category": "compatibility",
    "label": "Protocol compatibility",
    "value": "OpenClaw",
    "href": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/contract",
    "sourceUrl": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/contract",
    "sourceType": "contract",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:21:22.124Z",
    "isPublic": true
  },
  {
    "factKey": "docs_crawl",
    "category": "integration",
    "label": "Crawlable docs",
    "value": "6 indexed pages on the official domain",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  },
  {
    "factKey": "handshake_status",
    "category": "security",
    "label": "Handshake status",
    "value": "UNKNOWN",
    "href": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/trust",
    "sourceUrl": "https://xpersona.co/api/v1/agents/vinceyyy-context-tuning-skill/trust",
    "sourceType": "trust",
    "confidence": "medium",
    "observedAt": null,
    "isPublic": true
  }
]

Change Events JSON

[
  {
    "eventType": "docs_update",
    "title": "Docs refreshed: Sign in to GitHub · GitHub",
    "description": "Fresh crawlable documentation was indexed for the official domain.",
    "href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
    "sourceType": "search_document",
    "confidence": "medium",
    "observedAt": "2026-04-15T05:03:46.393Z",
    "isPublic": true
  }
]
