Rank
70
AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents
Traction
No public download signal
Freshness
Updated 2d ago
Crawler Summary
Speech-to-text with word-level timestamps, speaker diarization, and forced alignment using WhisperX. Built on faster-whisper with batched inference for 70x realtime speed. Skill metadata: whisperx v1.0.0 by Sarah Mak; tags: audio, transcription, whisperx, speech-to-text, diarization, alignment, subtitles, ml, cuda, gpu; homepage: https://github.com/ThePlasmak/whisperx. Published capability contract available. No trust telemetry is available yet. Last updated 2/25/2026.
Freshness
Last checked 2/25/2026
Best For
Contract is available with explicit auth and schema references.
Not Ideal For
whisperx is not ideal for teams that need stronger public trust telemetry, lower setup complexity, or more explicit contract coverage before production rollout.
Evidence Sources Checked
editorial-content, capability-contract, runtime-metrics, public facts pack
Public facts
6
Change events
1
Artifacts
0
Freshness
Feb 25, 2026
Published capability contract available. No trust telemetry is available yet. Last updated 2/25/2026.
Trust score
Unknown
Compatibility
OpenClaw
Freshness
Feb 25, 2026
Vendor
Theplasmak
Artifacts
0
Benchmarks
0
Last release
Unpublished
Key links, install path, and a quick operational read before the deeper crawl record.
Summary
Published capability contract available. No trust telemetry is available yet. Last updated 2/25/2026.
Setup snapshot
git clone https://github.com/ThePlasmak/whisperx.git
Setup complexity is LOW. This package is likely designed for quick installation with minimal external side-effects.
Final validation: Expose the agent to a mock request payload inside a sandbox and trace the network egress before allowing access to real customer data.
Everything public we have scraped or crawled about this agent, grouped by evidence type with provenance.
Vendor
Theplasmak
Protocol compatibility
OpenClaw
Auth modes
api_key, oauth
Machine-readable schemas
OpenAPI or schema references published
Handshake status
UNKNOWN
Crawlable docs
6 indexed pages on the official domain
Merged public release, docs, artifact, benchmark, pricing, and trust refresh events.
Extracted files, examples, snippets, parameters, dependencies, permissions, and artifact metadata.
Extracted files
0
Examples
6
Snippets
0
Languages
typescript
Full documentation captured from public sources, including the complete README when available.
Docs source
GITHUB OPENCLEW
Editorial quality
ready
---
name: whisperx
description: Speech-to-text with word-level timestamps, speaker diarization, and forced alignment using WhisperX. Built on faster-whisper with batched inference for 70x realtime speed.
version: 1.0.0
author: Sarah Mak
tags: ["audio", "transcription", "whisperx", "speech-to-text", "diarization", "alignment", "subtitles", "ml", "cuda", "gpu"]
homepage: https://github.com/ThePlasmak/whisperx
platforms
Speech-to-text with word-level timestamps, speaker diarization, and forced alignment — built on faster-whisper with batched inference for up to 70x realtime transcription speed.
WhisperX extends Whisper with three key capabilities that faster-whisper alone doesn't provide: precise word-level timestamps via forced alignment, speaker diarization, and batched inference (up to 70x realtime on GPU).
Use this skill when you need to: identify who said what in a recording (diarization), generate subtitles (SRT/VTT), extract precise word-level timestamps, or produce meeting and interview transcripts with speaker labels.
Trigger phrases: "transcribe with speakers", "who said what", "diarize", "make subtitles", "word timestamps", "speaker identification", "meeting transcript", "karaoke subtitles"
When NOT to use: plain transcription with no need for speaker labels or word timing — faster-whisper alone is lighter to set up (see the comparison below).
WhisperX vs faster-whisper:
| Feature | faster-whisper | WhisperX |
|---------|---------------|----------|
| Basic transcription | ✅ | ✅ |
| Word timestamps | ✅ (approximate) | ✅ (precise, aligned) |
| Speaker diarization | ❌ | ✅ |
| Forced alignment | ❌ | ✅ |
| Batched inference | ❌ | ✅ |
| Word-level subtitles | ❌ | ✅ (karaoke-style) |
| Subtitle generation | Manual | Built-in (SRT/VTT/TSV) |
| Time range trimming | ❌ | ✅ (--start/--end) |
| Hotwords | ❌ | ✅ (boost specific terms) |
| Initial prompt | ❌ | ✅ (domain terms) |
| Speaker renaming | ❌ | ✅ (--speaker-names) |
| Stdin pipe | ❌ | ✅ (read from -) |
| Setup complexity | Simple | Requires HF token for diarization |
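Under the hood, the extra stages in the table above correspond to a transcribe → align → diarize pipeline in the WhisperX Python API. A minimal sketch for orientation — module paths like whisperx.DiarizationPipeline have moved between whisperx releases (newer versions house it in whisperx.diarize), so verify the names against your installed version, and note that the torch.load caveat discussed under "PyTorch 2.6+ Compatibility" below applies here too:
import whisperx
device = "cuda"
audio = whisperx.load_audio("audio.mp3")
# 1. Batched transcription (faster-whisper backend)
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
result = model.transcribe(audio, batch_size=8)
# 2. Forced alignment: refine approximate timestamps down to word level
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
# 3. Diarization: label segments/words with speakers (needs a HF token for the gated models)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_YOUR_TOKEN", device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)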
All commands use ./scripts/transcribe (the skill wrapper), not the whisperx CLI directly. The wrapper applies a required PyTorch compatibility patch — see "PyTorch 2.6+ Compatibility" below.
| Task | Command | Notes |
|------|---------|-------|
| Basic transcription | ./scripts/transcribe audio.mp3 | Word-aligned by default |
| With speakers | ./scripts/transcribe audio.mp3 --diarize | Auto-reads ~/.cache/huggingface/token |
| Clean speaker output | ./scripts/transcribe audio.mp3 --diarize --merge-speakers | Merges consecutive same-speaker segments |
| Named speakers | ./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob" | Replaces SPEAKER_00, SPEAKER_01 |
| SRT subtitles | ./scripts/transcribe audio.mp3 --srt -o subs.srt | Ready for video players |
| Word-level SRT | ./scripts/transcribe audio.mp3 --srt --word-level | Karaoke-style, one word per cue |
| Wrapped subtitles | ./scripts/transcribe audio.mp3 --srt --max-line-width 42 | Standard TV subtitle width |
| VTT subtitles | ./scripts/transcribe audio.mp3 --vtt -o subs.vtt | Web-compatible with <v> speaker tags |
| JSON output | ./scripts/transcribe audio.mp3 --json | Full data with word timestamps |
| TSV output | ./scripts/transcribe audio.mp3 --tsv | Spreadsheet-friendly |
| Translate to English | ./scripts/transcribe audio.mp3 --translate | Any language → English |
| Fast, no alignment | ./scripts/transcribe audio.mp3 --no-align | Skip forced alignment |
| Specific language | ./scripts/transcribe audio.mp3 -l en | Faster than auto-detect |
| Partial transcription | ./scripts/transcribe audio.mp3 --start 1:30 --end 5:00 | Only a section |
| Boost terms | ./scripts/transcribe audio.mp3 --hotwords "Kubernetes gRPC" | Improve rare term recognition |
| Domain accuracy | ./scripts/transcribe audio.mp3 --initial-prompt "OpenAI, GPT-4" | Condition the model |
| From stdin | cat audio.mp3 \| ./scripts/transcribe - | Pipe from other tools |
| Auto-detect format | ./scripts/transcribe audio.mp3 -o out.srt | Format from extension |
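When driving the wrapper from code instead of a shell, a small sketch (it assumes, per the quick test below, that --json without -o writes the transcript to stdout):
import json
import subprocess
# Invoke the skill wrapper and parse its JSON output; -q suppresses progress noise
proc = subprocess.run(
    ["./scripts/transcribe", "meeting.mp3", "--json", "--diarize", "-q"],
    capture_output=True, text=True, check=True,
)
for seg in json.loads(proc.stdout)["segments"]:
    print(seg.get("speaker", "?"), seg["text"])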
⚠️ Do NOT run whisperx CLI directly — it will crash on PyTorch 2.6+ with pyannote models. Always use this skill's ./scripts/transcribe wrapper.
| Model | Size | Speed | Accuracy | Use Case |
|-------|------|-------|----------|----------|
| tiny | 39M | Fastest | Basic | Quick drafts, testing |
| base | 74M | Very fast | Good | General use |
| small | 244M | Fast | Better | Default for whisperx CLI |
| medium | 769M | Moderate | High | Quality transcription |
| large-v2 | 1.5GB | Slower | Excellent | Best diarization compat |
| large-v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | Recommended (default) |
Note: WhisperX defaults to small, but this skill defaults to large-v3-turbo for the best speed/accuracy balance on GPU.
Prerequisites: Python 3.10+, ffmpeg, NVIDIA GPU with CUDA (strongly recommended)
Step 1: Install whisperx
# Option A: Run the setup script (auto-detects GPU, creates venv if needed)
./setup.sh
# Option B: Install globally (if you prefer)
pip install whisperx
Step 2 (optional, for diarization): Set up Hugging Face token
Speaker diarization requires a free Hugging Face account and access to gated models. Skip this if you only need transcription/alignment.
mkdir -p ~/.cache/huggingface && echo -n "hf_YOUR_TOKEN" > ~/.cache/huggingface/token && chmod 600 ~/.cache/huggingface/token
Alternatively: set the HF_TOKEN env var, or pass --hf-token per command.
Step 3 (optional, for diarization): Accept the gated model agreements on Hugging Face. For whisperx ≥3.8.0, accept speaker-diarization-community-1; for earlier versions, accept both pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
Note: The token and model access are completely free. The models are just gated behind a click-to-agree license. Without step 3, you'll get a 403 error even with a valid token.
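To sanity-check the token before a long run, one option is to ask the Hub who it belongs to (huggingface_hub is typically installed alongside whisperx; a hedged sketch, not part of this skill):
from pathlib import Path
from huggingface_hub import HfApi
# Read the token from the same file the wrapper checks; whoami() raises if it's invalid
token = Path("~/.cache/huggingface/token").expanduser().read_text().strip()
print(HfApi(token=token).whoami()["name"])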
No setup needed — just run ./scripts/transcribe. The wrapper script:
- auto-reads ~/.cache/huggingface/token for diarization
- caches models in ~/.cache/huggingface/ (one-time per model)
- applies the torch.load compatibility patch (see "PyTorch 2.6+ Compatibility" below)
# Quick test — should print transcript to stdout
./scripts/transcribe some_audio.mp3
# Test diarization — should show [SPEAKER_00], [SPEAKER_01], etc.
./scripts/transcribe some_audio.mp3 --diarize
# Check version
./scripts/transcribe --version
# If diarization fails with 403: model agreements not accepted (see step 3 above)
# If it crashes with pickle/weights_only error: you're running `whisperx` CLI directly instead of the wrapper
| Platform | Acceleration | Speed |
|----------|-------------|-------|
| Linux + NVIDIA GPU | CUDA (batched) | ~70x realtime 🚀 |
| WSL2 + NVIDIA GPU | CUDA (batched) | ~70x realtime 🚀 |
| macOS Apple Silicon | CPU | ~3-5x realtime |
| macOS Intel | CPU | ~1-2x realtime |
| Linux (no GPU) | CPU | ~1x realtime |
All commands use ./scripts/transcribe — resolve the path relative to this skill's directory.
# Basic transcription (word-aligned)
./scripts/transcribe audio.mp3
# With speaker diarization (auto-reads ~/.cache/huggingface/token)
./scripts/transcribe audio.mp3 --diarize
# Diarize with merged same-speaker segments (cleaner output)
./scripts/transcribe audio.mp3 --diarize --merge-speakers
# Rename speakers to real names
./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob,Charlie"
# Generate SRT subtitles
./scripts/transcribe audio.mp3 --srt -o subtitles.srt
# SRT with line wrapping (standard TV width)
./scripts/transcribe audio.mp3 --srt --max-line-width 42 -o subtitles.srt
# Word-level karaoke subtitles (one word per cue with precise timing)
./scripts/transcribe audio.mp3 --srt --word-level -o karaoke.srt
# VTT with speaker voice tags (<v Speaker>text</v>)
./scripts/transcribe audio.mp3 --vtt --diarize -o subtitles.vtt
# Auto-detect format from output filename
./scripts/transcribe audio.mp3 -o transcript.json
./scripts/transcribe audio.mp3 -o subtitles.vtt
# TSV for spreadsheets/data analysis
./scripts/transcribe audio.mp3 --tsv -o transcript.tsv
# Transcribe only a section (useful for long recordings)
./scripts/transcribe podcast.mp3 --start 10:30 --end 15:00
# Boost recognition of specific terms (hotwords)
./scripts/transcribe meeting.mp3 --hotwords "Kubernetes gRPC OAuth2"
# Improve accuracy for domain context (initial prompt)
./scripts/transcribe meeting.mp3 --initial-prompt "Attendees: Alice, Bob. Topics: Kubernetes, gRPC, OAuth2"
# Maximum accuracy
./scripts/transcribe audio.mp3 --model large-v3
# Translate non-English audio to English
./scripts/transcribe audio.mp3 --translate -l ja
# Fast mode (skip alignment)
./scripts/transcribe audio.mp3 --no-align
# JSON with full metadata and word timestamps
./scripts/transcribe audio.mp3 --json -o transcript.json
# Specify known speaker count for better diarization
./scripts/transcribe audio.mp3 --diarize --min-speakers 2 --max-speakers 4
# Read audio from stdin (pipe from other tools)
cat audio.mp3 | ./scripts/transcribe -
ffmpeg -i video.mp4 -f wav - | ./scripts/transcribe - --diarize
AUDIO_FILE Path to audio/video file, or '-' to read from stdin
Model options:
-m, --model NAME Whisper model (default: large-v3-turbo)
--batch-size N Batch size for inference (default: 8, lower if OOM)
--beam-size N Beam search size (higher = slower but more accurate)
--initial-prompt TEXT Condition the model with domain terms, names, acronyms
--hotwords TEXT Space-separated hotwords to boost recognition of rare terms
Device options:
--device cpu, cuda, or auto (default: auto)
--compute-type int8, float16, float32, or auto (default: auto)
--threads N CPU threads for CTranslate2 inference (default: 4)
Language options:
-l, --language CODE Language code (auto-detects if omitted)
--translate Translate to English
Time range:
--start TIME Start time — seconds (90), MM:SS (1:30), HH:MM:SS
--end TIME End time — same formats as --start
Alignment options:
--no-align Skip forced alignment (no word timestamps)
--align-model MODEL Custom phoneme ASR model for alignment
Speaker diarization:
--diarize Enable speaker labels
--hf-token TOKEN Hugging Face access token (also reads ~/.cache/huggingface/token or HF_TOKEN env)
--min-speakers N Minimum speaker count hint
--max-speakers N Maximum speaker count hint
--merge-speakers Merge consecutive segments from same speaker (cleaner output)
--speaker-names NAMES Comma-separated names to replace SPEAKER_00, SPEAKER_01, etc.
Output options:
-j, --json JSON output with segments and word timestamps
--srt SRT subtitle format
--vtt WebVTT subtitle format (uses <v> voice tags for speakers)
--tsv TSV (tab-separated values) for data analysis
--word-level Word-level subtitles (SRT/VTT only) — karaoke-style
--max-line-width N Maximum characters per subtitle line (wraps at word boundaries)
--output-format FMT Explicit format (srt, vtt, txt, json, tsv)
-o, --output FILE Save to file (format auto-detected from extension)
Miscellaneous:
-V, --version Show version
-q, --quiet Suppress progress messages
Hello and welcome to the show.
Today we're talking about AI transcription.
With speaker diarization (--diarize):
[SPEAKER_00] Hello and welcome to the show.
[SPEAKER_01] Thanks for having me.
With named speakers (--diarize --speaker-names "Alice,Bob"):
[Alice] Hello and welcome to the show.
[Bob] Thanks for having me.
SRT subtitles (--srt): Standard subtitle format, compatible with VLC, YouTube, etc.
1
00:00:00,000 --> 00:00:03,500
Hello and welcome to the show.
2
00:00:03,500 --> 00:00:06,200
Today we're talking about AI transcription.
Word-level SRT (--srt --word-level): One word per cue — for karaoke-style highlighting or precise editing.
1
00:00:00,000 --> 00:00:00,320
Hello
2
00:00:00,320 --> 00:00:00,560
and
3
00:00:00,560 --> 00:00:01,100
welcome
WebVTT (--vtt): Web-native subtitle format for HTML5 <video> and <track>. When used with --diarize, uses proper VTT <v> voice tags for speaker identification.
WEBVTT
00:00:00.000 --> 00:00:03.500
<v Alice>Hello and welcome to the show.</v>
00:00:03.500 --> 00:00:05.000
<v Bob>Thanks for having me.</v>
JSON (--json): Structured output with word-level timestamps and confidence scores.
{
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "Hello and welcome to the show.",
"speaker": "SPEAKER_00",
"words": [
{"word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.98},
{"word": "and", "start": 0.32, "end": 0.56, "confidence": 0.95}
]
}
]
}
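Because the schema above carries per-segment speakers and timings, downstream aggregation is trivial. For example, per-speaker talk time from a saved transcript.json (a sketch; the filename is illustrative):
import json
from collections import defaultdict
with open("transcript.json") as f:
    transcript = json.load(f)
talk_time = defaultdict(float)
for seg in transcript["segments"]:
    # Segments without a speaker key (e.g. no --diarize) are grouped as UNKNOWN
    talk_time[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
for speaker, seconds in sorted(talk_time.items(), key=lambda kv: -kv[1]):
    print(f"{speaker}: {seconds:.1f}s")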
TSV (--tsv): Tab-separated values for spreadsheets and data pipelines.
start end text
0.000 3.500 Hello and welcome to the show.
3.500 6.200 Today we're talking about AI transcription.
# Transcribe a meeting recording with speakers (clean output)
./scripts/transcribe meeting.mp3 --diarize --merge-speakers \
--speaker-names "Alice,Bob,Charlie" \
--min-speakers 3 --max-speakers 5 --json -o meeting.json
# Generate subtitles for a video (wrapped to standard width)
./scripts/transcribe video.mp4 --srt --max-line-width 42 -o video.srt
# Karaoke-style word-level subtitles
./scripts/transcribe song.mp3 --vtt --word-level -o karaoke.vtt
# Transcribe just the interesting part of a podcast
./scripts/transcribe podcast.mp3 --start 45:00 --end 1:02:30 --diarize --merge-speakers
# Improve accuracy for technical content (hotwords + initial prompt)
./scripts/transcribe lecture.mp3 \
--hotwords "PyTorch CTranslate2 whisperx" \
--initial-prompt "A lecture on ML model optimization"
# Batch transcribe a folder
for file in recordings/*.mp3; do
./scripts/transcribe "$file" --json -o "${file%.mp3}.json"
done
# Transcribe YouTube audio (with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3 --diarize --merge-speakers
# Pipe directly from ffmpeg (extract audio on the fly)
ffmpeg -i video.mp4 -f wav -ac 1 -ar 16000 - 2>/dev/null | ./scripts/transcribe -
# Quick draft (fast, no alignment)
./scripts/transcribe audio.mp3 --model base --no-align
# German audio with TSV output for analysis
./scripts/transcribe audio.mp3 -l de --tsv -o transcript.tsv
# Auto-detect format from filename
./scripts/transcribe audio.mp3 -o transcript.srt # → SRT
./scripts/transcribe audio.mp3 -o data.json # → JSON
./scripts/transcribe audio.mp3 -o export.tsv # → TSV
| Mistake | Problem | Solution |
|---------|---------|----------|
| Using CPU when GPU available | 10-70x slower | Check nvidia-smi; verify CUDA |
| Missing HF token for diarize | Diarization fails | Get token from huggingface.co/settings/tokens |
| Not accepting model agreements | 403 error on diarization model | whisperx ≥3.8.0: accept community-1. Earlier: accept both pyannote/speaker-diarization-3.1 AND segmentation-3.0 (see Setup) |
| Running whisperx CLI directly | Crashes on PyTorch 2.6+ | Always use ./scripts/transcribe wrapper (applies torch.load patch) |
| batch_size too high | CUDA OOM | Lower --batch-size (try 4 or 2) |
| Using large-v3 when turbo works | Unnecessary slowdown | large-v3-turbo is faster with near-identical accuracy |
| Forgetting --language | Wastes time auto-detecting | Specify -l en when you know the language |
| Using WhisperX for simple transcription | Heavier setup for no benefit | Use faster-whisper for basic transcription |
| --word-level without --srt/--vtt | Flag is ignored | Word-level only applies to subtitle formats |
| --merge-speakers without --diarize | Flag is ignored | Merge only works when speakers are identified |
GPU memory (VRAM) guide:
- large-v3-turbo: ~2-3GB
- large-v3 + diarization: ~4-5GB
- Lower --batch-size if OOM
WhisperX supports all languages that Whisper supports (99 languages). Forced alignment (word timestamps) is available for a subset — if alignment fails for a language, the tool falls back gracefully to segment-level timestamps.
Languages with alignment support (common subset):
en English, zh Chinese, de German, es Spanish, fr French, it Italian, ja Japanese, ko Korean, pt Portuguese, ru Russian, nl Dutch, pl Polish, tr Turkish, ar Arabic, sv Swedish, da Danish, fi Finnish, hu Hungarian, uk Ukrainian, el Greek, cs Czech, ro Romanian, vi Vietnamese, th Thai, hi Hindi, he Hebrew, id Indonesian, ms Malay, no Norwegian, fa Persian, bg Bulgarian, ca Catalan, hr Croatian, sk Slovak, sl Slovenian, ta Tamil, te Telugu, ur Urdu
For the full list, see whisperx/alignment.py.
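To check up front whether a language will get word timestamps, you can probe for a default alignment model. A sketch — whisperx.load_align_model raises for languages without a default wav2vec2 model, though the exact exception may vary by version, and a successful probe downloads/loads the model, so cache the answer:
import whisperx
def has_alignment(lang: str) -> bool:
    # load_align_model raises when no default wav2vec2 model exists for the code
    try:
        whisperx.load_align_model(language_code=lang, device="cpu")
        return True
    except Exception:
        return False
print(has_alignment("en"))  # expected True per the list above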
Both improve accuracy for specific terms, but they work differently:
| | --hotwords | --initial-prompt |
|---|---|---|
| How it works | Boosts probability of specific tokens during decoding | Conditions the model as if these words appeared earlier |
| Best for | Rare terms, proper nouns, technical jargon | Setting domain context, style, formatting |
| Example | --hotwords "Kubernetes gRPC OAuth2" | --initial-prompt "A technical meeting about cloud infrastructure" |
| Can combine | ✅ Yes, use both together for best results | ✅ |
| Requires | whisperx ≥3.7.5 | Any version |
Tip: Use hotwords for the specific words you need recognized correctly, and initial prompt for broader context about the audio content.
⚠️ PyTorch 2.6 changed torch.load() to default to weights_only=True. This breaks pyannote.audio's model loading (used by both VAD and diarization), because the model checkpoints contain globals like omegaconf.listconfig.ListConfig and torch.torch_version.TorchVersion that aren't allowlisted.
Symptoms:
- _pickle.UnpicklingError: Weights only load failed when loading VAD or diarization models
- 'NoneType' object has no attribute 'to' (pipeline silently returns None)
How this skill handles it:
scripts/transcribe.py uses the whisperx Python API directly (not as a subprocess) so it can monkey-patch torch.load before any model loading happens:
import torch
_original_torch_load = torch.load
def _patched_torch_load(*args, **kwargs):
kwargs['weights_only'] = False # Must FORCE, not setdefault — lightning_fabric passes True explicitly
return _original_torch_load(*args, **kwargs)
torch.load = _patched_torch_load
Key details:
- kwargs['weights_only'] = False must be a forced override, NOT kwargs.setdefault('weights_only', False) — because lightning_fabric explicitly passes weights_only=True, which setdefault won't override
- The patch must be applied before importing whisperx, pyannote, or any model-loading code
- It can't work through the whisperx CLI — a subprocess can't inherit the monkey-patch
Note: whisperx ≥3.8.0 migrated to pyannote-audio v4 with speaker-diarization-community-1, which may resolve some of these compatibility issues. The patch is kept for broad version support.
If whisperx CLI is updated to fix this upstream, the monkey-patch can be removed and the script could switch back to subprocess mode. Track: whisperX#972
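A narrower alternative worth noting: PyTorch 2.6+ can allowlist the specific globals named above instead of disabling weights_only entirely. This is a sketch, untested against every pyannote checkpoint (other globals may appear), which is why the blanket patch remains this skill's default:
import torch.serialization
from omegaconf.listconfig import ListConfig
from torch.torch_version import TorchVersion
# Allowlist only the globals the pyannote checkpoints are known to reference,
# keeping weights_only=True protection for everything else
torch.serialization.add_safe_globals([ListConfig, TorchVersion])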
_pickle.UnpicklingError: Weights only load failed: PyTorch 2.6+ compat issue. If running via CLI (whisperx command directly), this can't be fixed without patching the installed library. Use this skill's scripts/transcribe wrapper instead, which applies the patch automatically. See "PyTorch 2.6+ Compatibility" section above.
"CUDA not available": Install PyTorch with CUDA (pip install torch --index-url https://download.pytorch.org/whl/cu121)
"No module named whisperx": Run ./setup.sh or pip install whisperx
Diarization 403 error: You must accept the model agreement(s). For whisperx ≥3.8.0: accept community-1. For earlier versions: accept both speaker-diarization-3.1 and segmentation-3.0. See Setup above.
Diarization fails but transcription continues: v1.1.0+ gracefully handles diarization failures — it prints a diagnostic error and continues without speaker labels instead of crashing.
'NoneType' object has no attribute 'to': Either the HF token is invalid, the model agreements haven't been accepted, or the torch.load patch isn't applied. Check all three.
OOM on GPU: Lower --batch-size to 4 or 2
Alignment fails for language X: The language may not have a wav2vec2 alignment model. The tool will fall back to segment-level timestamps and print a warning. Check supported languages in whisperx alignment.py.
Slow on CPU: Expected — use GPU for practical transcription. Even tiny model on CPU is ~5-10x slower than large-v3-turbo on a mid-range GPU.
Empty output / no segments: Audio may be silence or too short. Check with ffprobe audio.mp3 to verify the file has actual audio content. v1.1.0+ prints a warning and produces valid empty output instead of crashing.
Timestamps wrong after trimming: If using --start, timestamps in the output reflect the original file's timeline (not relative to the trim point). This is by design — subtitle timecodes stay correct for the source video.
"No speech detected" warning: The audio file may contain only music, silence, or non-speech sounds. This is expected behavior, not an error.
whisperx 3.8.0 (Feb 2026): Migrated to pyannote-audio v4 with speaker-diarization-community-1. This model has lower diarization error rates across all benchmarks compared to the older speaker-diarization-3.1. Upgrade recommended: pip install --upgrade whisperx
whisperx 3.7.5: Added --hotwords support for boosting recognition of specific terms. This skill exposes it via the --hotwords flag.
Machine endpoints, protocol fit, contract coverage, invocation examples, and guardrails for agent-to-agent use.
Contract coverage
Status
ready
Auth
api_key, oauth
Streaming
Yes
Data region
global
Protocol support
Requires: openclew, lang:typescript, streaming
Forbidden: none
Guardrails
Operational confidence: medium
curl -s "https://xpersona.co/api/v1/agents/theplasmak-whisperx/snapshot"
curl -s "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract"
curl -s "https://xpersona.co/api/v1/agents/theplasmak-whisperx/trust"
Trust and runtime signals, benchmark suites, failure patterns, and practical risk constraints.
Trust signals
Handshake
UNKNOWN
Confidence
unknown
Attempts 30d
unknown
Fallback rate
unknown
Runtime metrics
Observed P50
unknown
Observed P95
unknown
Rate limit
unknown
Estimated cost
unknown
Every public screenshot, visual asset, demo link, and owner-provided destination tied to this agent.
Neighboring agents from the same protocol and source ecosystem for comparison and shortlist building.
Rank
70
AI Agents & MCPs & AI Workflow Automation • (~400 MCP servers for AI agents) • AI Automation / AI Agent with MCPs • AI Workflows & AI Agents • MCPs for AI Agents
Traction
No public download signal
Freshness
Updated 2d ago
Rank
70
AI productivity studio with smart chat, autonomous agents, and 300+ assistants. Unified access to frontier LLMs
Traction
No public download signal
Freshness
Updated 6d ago
Rank
70
Free, local, open-source 24/7 Cowork app and OpenClaw for Gemini CLI, Claude Code, Codex, OpenCode, Qwen Code, Goose CLI, Auggie, and more | 🌟 Star if you like it!
Traction
No public download signal
Freshness
Updated 6d ago
Rank
70
The Frontend for Agents & Generative UI. React + Angular
Traction
No public download signal
Freshness
Updated 23d ago
Contract JSON
{
"contractStatus": "ready",
"authModes": [
"api_key",
"oauth"
],
"requires": [
"openclew",
"lang:typescript",
"streaming"
],
"forbidden": [],
"supportsMcp": false,
"supportsA2a": false,
"supportsStreaming": true,
"inputSchemaRef": "https://github.com/ThePlasmak/whisperx#input",
"outputSchemaRef": "https://github.com/ThePlasmak/whisperx#output",
"dataRegion": "global",
"contractUpdatedAt": "2026-02-24T19:57:31.497Z",
"sourceUpdatedAt": "2026-02-24T19:57:31.497Z",
"freshnessSeconds": 4433067
}
Invocation Guide
{
"preferredApi": {
"snapshotUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/snapshot",
"contractUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"trustUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/trust"
},
"curlExamples": [
"curl -s \"https://xpersona.co/api/v1/agents/theplasmak-whisperx/snapshot\"",
"curl -s \"https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract\"",
"curl -s \"https://xpersona.co/api/v1/agents/theplasmak-whisperx/trust\""
],
"jsonRequestTemplate": {
"query": "summarize this repo",
"constraints": {
"maxLatencyMs": 2000,
"protocolPreference": [
"OPENCLEW"
]
}
},
"jsonResponseTemplate": {
"ok": true,
"result": {
"summary": "...",
"confidence": 0.9
},
"meta": {
"source": "GITHUB_OPENCLEW",
"generatedAt": "2026-04-17T03:21:58.760Z"
}
},
"retryPolicy": {
"maxAttempts": 3,
"backoffMs": [
500,
1500,
3500
],
"retryableConditions": [
"HTTP_429",
"HTTP_503",
"NETWORK_TIMEOUT"
]
}
}
Trust JSON
{
"status": "unavailable",
"handshakeStatus": "UNKNOWN",
"verificationFreshnessHours": null,
"reputationScore": null,
"p95LatencyMs": null,
"successRate30d": null,
"fallbackRate": null,
"attempts30d": null,
"trustUpdatedAt": null,
"trustConfidence": "unknown",
"sourceUpdatedAt": null,
"freshnessSeconds": null
}
Capability Matrix
{
"rows": [
{
"key": "OPENCLEW",
"type": "protocol",
"support": "unknown",
"confidenceSource": "profile",
"notes": "Listed on profile"
},
{
"key": "combine",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "monkey",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "be",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "all",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
},
{
"key": "for",
"type": "capability",
"support": "supported",
"confidenceSource": "profile",
"notes": "Declared in agent profile metadata"
}
],
"flattenedTokens": "protocol:OPENCLEW|unknown|profile capability:combine|supported|profile capability:monkey|supported|profile capability:be|supported|profile capability:all|supported|profile capability:for|supported|profile"
}
Facts JSON
[
{
"factKey": "docs_crawl",
"category": "integration",
"label": "Crawlable docs",
"value": "6 indexed pages on the official domain",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
},
{
"factKey": "vendor",
"category": "vendor",
"label": "Vendor",
"value": "Theplasmak",
"href": "https://github.com/ThePlasmak/whisperx",
"sourceUrl": "https://github.com/ThePlasmak/whisperx",
"sourceType": "profile",
"confidence": "medium",
"observedAt": "2026-02-25T01:47:13.249Z",
"isPublic": true
},
{
"factKey": "protocols",
"category": "compatibility",
"label": "Protocol compatibility",
"value": "OpenClaw",
"href": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"sourceUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"sourceType": "contract",
"confidence": "medium",
"observedAt": "2026-02-24T19:57:31.497Z",
"isPublic": true
},
{
"factKey": "auth_modes",
"category": "compatibility",
"label": "Auth modes",
"value": "api_key, oauth",
"href": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"sourceUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"sourceType": "contract",
"confidence": "high",
"observedAt": "2026-02-24T19:57:31.497Z",
"isPublic": true
},
{
"factKey": "schema_refs",
"category": "artifact",
"label": "Machine-readable schemas",
"value": "OpenAPI or schema references published",
"href": "https://github.com/ThePlasmak/whisperx#input",
"sourceUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/contract",
"sourceType": "contract",
"confidence": "high",
"observedAt": "2026-02-24T19:57:31.497Z",
"isPublic": true
},
{
"factKey": "handshake_status",
"category": "security",
"label": "Handshake status",
"value": "UNKNOWN",
"href": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/trust",
"sourceUrl": "https://xpersona.co/api/v1/agents/theplasmak-whisperx/trust",
"sourceType": "trust",
"confidence": "medium",
"observedAt": null,
"isPublic": true
}
]
Change Events JSON
[
{
"eventType": "docs_update",
"title": "Docs refreshed: Sign in to GitHub · GitHub",
"description": "Fresh crawlable documentation was indexed for the official domain.",
"href": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceUrl": "https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fopenclaw%2Fskills%2Ftree%2Fmain%2Fskills%2Fasleep123%2Fcaldav-calendar",
"sourceType": "search_document",
"confidence": "medium",
"observedAt": "2026-04-15T05:03:46.393Z",
"isPublic": true
}
]