How VoiceLayer — Voice I/O for AI Coding Agents works
Two tools, auto-mode detection. voice_speak handles TTS in three modes: announce, brief, consult. voice_ask does full bidirectional Q&A with session booking. The system picks the interaction pattern from context: fire-and-forget for short updates, slower pacing for explanations, full conversation for Q&A.
voice_speak: TTS, auto rate
voice_ask: Q&A + session lock
Auto mode: context-aware selection
Recording uses sox at 16kHz mono PCM, processed in 1-second chunks with RMS energy detection for silence. Transcription runs through whisper.cpp locally (~200-400ms on Apple Silicon) with automatic model discovery from ~/.cache/whisper/. Cloud fallback via Wispr Flow WebSocket handles cases where local STT isn't available. Stop recording by touching a signal file. Simple Unix.
Record (sox 16kHz mono) → WAV Buffer (1s chunks + RMS) → whisper.cpp (Local ~300ms) → Transcription (Agent response)
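The silence check described above can be sketched as follows. This is a minimal illustration, not VoiceLayer's actual code: the threshold `SILENCE_RMS`, the `SILENT_CHUNKS_TO_STOP` count, and the function names are assumptions chosen for the example. Only the 16 kHz sample rate and the 1-second chunking come from the text.

```typescript
// Assumed parameters for illustration; only the sample rate is from the text.
const SAMPLE_RATE = 16000;        // 16 kHz mono, so one chunk is one second
const SILENCE_RMS = 500;          // hypothetical threshold on the 16-bit scale
const SILENT_CHUNKS_TO_STOP = 2;  // hypothetical: stop after 2 quiet seconds

/** Root-mean-square amplitude of one chunk of signed 16-bit PCM samples. */
function rms(chunk: Int16Array): number {
  if (chunk.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) {
    sumSquares += chunk[i] * chunk[i];
  }
  return Math.sqrt(sumSquares / chunk.length);
}

/** Index of the 1-second chunk at which recording would stop, or -1 if never. */
function findStopChunk(samples: Int16Array): number {
  let quietChunks = 0;
  let chunkIndex = 0;
  for (let start = 0; start < samples.length; start += SAMPLE_RATE) {
    const chunk = samples.subarray(start, start + SAMPLE_RATE);
    // Consecutive quiet chunks accumulate; any loud chunk resets the count.
    quietChunks = rms(chunk) < SILENCE_RMS ? quietChunks + 1 : 0;
    if (quietChunks >= SILENT_CHUNKS_TO_STOP) return chunkIndex;
    chunkIndex++;
  }
  return -1;
}
```

Requiring more than one quiet chunk before stopping is a design choice that keeps a short pause mid-sentence from ending the recording.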
whisper.cpp on Apple Silicon with ggml-large-v3-turbo achieves near-instant transcription. When transcription runs locally, there is no cloud roundtrip, no API keys, and no data leaving your machine.
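Automatic model discovery could work roughly like the sketch below, which picks a model from a directory listing. The preference order and the `pickModel` name are assumptions for illustration. Only ggml-large-v3-turbo is named in the text, and the other entries are standard whisper.cpp model file names used here as plausible fallbacks.

```typescript
// Hypothetical preference order; only ggml-large-v3-turbo comes from the text.
const PREFERRED_MODELS: string[] = [
  "ggml-large-v3-turbo.bin",
  "ggml-base.en.bin",
  "ggml-base.bin",
];

/** Choose a model file name from a directory listing, or undefined if none fits. */
function pickModel(files: string[]): string | undefined {
  for (const name of PREFERRED_MODELS) {
    if (files.indexOf(name) !== -1) return name;
  }
  // Otherwise accept any whisper.cpp-style model file, sorted for determinism.
  const candidates = files
    .filter((f) => f.startsWith("ggml-") && f.endsWith(".bin"))
    .sort();
  return candidates[0];
}
```

In practice the listing would come from reading ~/.cache/whisper/ (for example with `readdirSync`), and an `undefined` result is the point at which the cloud fallback would take over.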
Edge-TTS provides neural-quality speech synthesis for free. Speech rate auto-adjusts by content length: shorter messages play faster (+10%), longer explanations slow down (-15% for 1000+ chars). Each voice mode has its own rate default. Announce is snappy, brief is deliberate. Users interrupt playback by touching a stop signal file, monitored by a 300ms polling loop.
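As a rough sketch of the rate logic, the function below combines a per-mode default with a length adjustment and formats the result as a signed percentage string, the form Edge-TTS accepts for its rate setting. The per-mode defaults, the 100-character cutoff for "short", and the choice to add the two adjustments together are assumptions. The text only gives +10% for shorter messages and -15% for 1000+ characters.

```typescript
type VoiceMode = "announce" | "brief" | "consult";

// Hypothetical per-mode defaults in percent (announce snappy, brief deliberate).
const MODE_RATE: Record<VoiceMode, number> = {
  announce: 10,
  brief: -10,
  consult: 0,
};

const SHORT_MESSAGE_CHARS = 100; // hypothetical cutoff for a "short" message
const LONG_MESSAGE_CHARS = 1000; // from the text: slow down at 1000+ chars

/** Length-based adjustment in percent. */
function lengthAdjust(chars: number): number {
  if (chars >= LONG_MESSAGE_CHARS) return -15;
  if (chars < SHORT_MESSAGE_CHARS) return 10;
  return 0;
}

/** Signed percentage string such as "+20%" or "-25%". */
function speechRate(mode: VoiceMode, text: string): string {
  // Assumption: mode default and length adjustment simply add up.
  const pct = MODE_RATE[mode] + lengthAdjust(text.length);
  return `${pct >= 0 ? "+" : ""}${pct}%`;
}
```

Keeping the adjustment as a plain number until the last step makes it easy to clamp or tune before it is turned into a string.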
Text (Agent response) → Edge-TTS (Neural synthesis) → Rate Adjust (Mode + length aware) → Play (afplay / mpv)
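The stop-signal mechanism used for both recording and playback can be sketched as a small polling loop. This is an illustration under stated assumptions: the function names and the signal file path are hypothetical, and only the 300ms polling interval and the touch-a-file idea come from the text.

```typescript
import { existsSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const POLL_MS = 300; // polling interval from the text

/** Returns true, and removes the file, if the stop signal is present. */
function consumeStopSignal(signalPath: string): boolean {
  if (!existsSync(signalPath)) return false;
  try {
    unlinkSync(signalPath);
  } catch {
    // Another watcher may have removed it first; the stop still counts.
  }
  return true;
}

/** Polls for the signal file and calls onStop once; returns a cancel function. */
function watchStopSignal(signalPath: string, onStop: () => void): () => void {
  const timer = setInterval(() => {
    if (consumeStopSignal(signalPath)) {
      clearInterval(timer);
      onStop();
    }
  }, POLL_MS);
  return () => clearInterval(timer);
}

// Usage: creating the file has the same effect as `touch` from a shell.
// The path below is a hypothetical demo path, not VoiceLayer's real one.
const demoSignal = join(tmpdir(), `voicelayer-stop-demo-${process.pid}`);
watchStopSignal(demoSignal, () => console.log("playback stopped"));
writeFileSync(demoSignal, "");
```

Polling a file is coarser than a signal or socket, but anything that can create a file (a shell alias, a hotkey tool, another agent) can act as the stop button.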
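Only one session can use the microphone at a time. VoiceLayer enforces this with a lockfile mutex: a JSON file at /tmp/voicelayer-session.lock holding the owning PID, session ID, and start timestamp. The lock is created with the wx flag (exclusive create), so the write fails if the file already exists. Because checking and creating happen in one step, there is no window for a TOCTOU race. Dead process detection uses the signal-zero trick: process.kill(pid, 0) sends no signal but throws ESRCH if the process no longer exists, so stale locks get cleaned up automatically. An EPERM error, by contrast, means the process exists but belongs to another user, so it should not be treated as stale.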
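import { writeFileSync, unlinkSync } from 'node:fs';

// Atomic lock creation (TOCTOU-safe): 'wx' throws EEXIST if the file already exists
writeFileSync(lockPath, JSON.stringify({
  pid: process.pid,
  sessionId,
  startedAt: new Date().toISOString()
}), { flag: 'wx' });

// Dead process detection
try {
  process.kill(lockPid, 0); // no error → process is alive
} catch (err) {
  // ESRCH → no such process, lock is stale; EPERM → alive but owned by another user
  if ((err as NodeJS.ErrnoException).code === 'ESRCH') {
    unlinkSync(lockPath); // stale → cleanup
  }
}
Lockfile mutex with dead process cleanup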
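Put together, the two steps above might form an acquire function like the following. This is a sketch rather than VoiceLayer's implementation: `acquireLock`, `isAlive`, the `LockInfo` shape, the single retry, and the demo path are assumptions, while the wx flag and the signal-zero check are taken from the text.

```typescript
import { readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface LockInfo {
  pid: number;
  sessionId: string;
  startedAt: string;
}

/** Signal-zero liveness check: no signal is delivered, only existence is tested. */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it: still alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Try to take the lock; returns false if a live process already holds it. */
function acquireLock(lockPath: string, sessionId: string): boolean {
  const info: LockInfo = {
    pid: process.pid,
    sessionId,
    startedAt: new Date().toISOString(),
  };
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // Exclusive create: fails with EEXIST if the lock file is already there.
      writeFileSync(lockPath, JSON.stringify(info), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    let owner: LockInfo;
    try {
      owner = JSON.parse(readFileSync(lockPath, "utf8")) as LockInfo;
    } catch {
      continue; // lock vanished or is half-written; try once more
    }
    if (isAlive(owner.pid)) return false;
    try {
      unlinkSync(lockPath); // stale lock: remove it and retry the create
    } catch {
      // Another process already cleaned it up.
    }
  }
  return false;
}

// Usage with a hypothetical demo path (the real lock lives in /tmp).
const demoLock = join(tmpdir(), `voicelayer-demo-${process.pid}.lock`);
console.log(acquireLock(demoLock, "demo-session"));
unlinkSync(demoLock);
```

One caveat on this sketch: creation is atomic, but the read-check-unlink cleanup of a stale lock is not, so two processes racing to reclaim the same stale lock could in rare cases both proceed. For a single-user developer tool that trade-off is usually acceptable.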
Lockfiles work across process boundaries without kernel objects or special privileges. Dead process detection via signal-zero is a Unix classic. Simple and reliable.