Architecture

How VoiceLayer, voice I/O for AI coding agents, works

Voice Tools Pipeline

Two tools, auto-mode detection. voice_speak handles TTS in three modes: announce, brief, consult. voice_ask does full bidirectional Q&A with session booking. The system picks the interaction pattern from context: fire-and-forget for short updates, slower pacing for explanations, full conversation for Q&A.

voice_speak: TTS, auto rate

voice_ask: Q&A + session lock

Auto mode: context-aware selection

Speech-to-Text Flow

Recording uses sox at 16kHz mono PCM, processed in 1-second chunks with RMS energy detection for silence. Transcription runs through whisper.cpp locally (~200-400ms on Apple Silicon) with automatic model discovery from ~/.cache/whisper/. Cloud fallback via Wispr Flow WebSocket handles cases where local STT isn't available. Stop recording by touching a signal file. Simple Unix.
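The silence check described above can be sketched as follows. This is a minimal illustration, not VoiceLayer's actual code: it assumes 16-bit signed samples (the text only says 16kHz mono PCM), and the names rmsEnergy and endsInSilence, the 0.01 threshold, and the two-second window are all hypothetical.

```typescript
// Assumption: 16 kHz mono, 16-bit signed PCM, so one second is 16000 samples.
const SAMPLE_RATE = 16_000;

/** Root-mean-square energy of a chunk, normalised to the range 0..1. */
export function rmsEnergy(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const x = samples[i] / 32768; // scale a 16-bit sample to -1..1
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

/**
 * Walks the buffer in complete 1-second chunks and reports whether it ends
 * with at least `silentSeconds` consecutive chunks below `threshold`.
 * Both defaults are hypothetical values chosen for illustration.
 */
export function endsInSilence(
  pcm: Int16Array,
  threshold = 0.01,
  silentSeconds = 2,
): boolean {
  let quietChunks = 0;
  for (let off = 0; off + SAMPLE_RATE <= pcm.length; off += SAMPLE_RATE) {
    const chunk = pcm.subarray(off, off + SAMPLE_RATE);
    // A loud chunk resets the run of quiet chunks.
    quietChunks = rmsEnergy(chunk) < threshold ? quietChunks + 1 : 0;
  }
  return quietChunks >= silentSeconds;
}
```

A recorder could call endsInSilence after each new chunk arrives and stop once it returns true.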

Record (sox 16kHz mono) → WAV Buffer (1s chunks + RMS) → whisper.cpp (local, ~300ms) → Transcription (agent response)

~300ms latency

With the ggml-large-v3-turbo model, whisper.cpp on Apple Silicon transcribes in roughly 300ms. The local path needs no cloud roundtrip and no API keys, and no audio leaves your machine.
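The signal-file stop mechanism mentioned in this section can be sketched as a pair of helpers: one side touches a file, the other polls for it. This is an illustration under assumptions, not the project's code: requestStop and waitForStopSignal are hypothetical names, the signal path is left to the caller, and the 300ms default interval is borrowed from the playback polling loop described for TTS.

```typescript
import { closeSync, existsSync, openSync, unlinkSync } from 'node:fs';

/** Equivalent of `touch`: creates the signal file if it does not exist. */
export function requestStop(signalPath: string): void {
  closeSync(openSync(signalPath, 'a'));
}

/**
 * Polls for the signal file every `intervalMs`. Resolves true when the file
 * appears (and removes it so the next session starts clean), or false when
 * `timeoutMs` elapses first. Both defaults are illustrative assumptions.
 */
export function waitForStopSignal(
  signalPath: string,
  intervalMs = 300,
  timeoutMs = 60_000,
): Promise<boolean> {
  return new Promise((resolve) => {
    const started = Date.now();
    const timer = setInterval(() => {
      if (existsSync(signalPath)) {
        clearInterval(timer);
        try {
          unlinkSync(signalPath);
        } catch {
          // Another process already removed the file; nothing to clean up.
        }
        resolve(true);
      } else if (Date.now() - started >= timeoutMs) {
        clearInterval(timer);
        resolve(false);
      }
    }, intervalMs);
  });
}
```

Polling a file keeps the stop control usable from any shell or editor keybinding, at the cost of up to one interval of delay before the stop is noticed.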

Text-to-Speech Flow

Edge-TTS provides neural-quality speech synthesis for free. Speech rate auto-adjusts by content length: shorter messages play faster (+10%), longer explanations slow down (-15% for 1000+ chars). Each voice mode has its own rate default. Announce is snappy, brief is deliberate. Users interrupt playback by touching a stop signal file, monitored by a 300ms polling loop.
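The rate adjustment can be sketched as a small pure function. Only the +10% and -15% values and the 1000-character threshold come from the text above. The per-mode base rates, the 100-character cutoff for "short", and the choice to add the two adjustments together are assumptions made for illustration, as are all the names. The output uses the signed-percent string form that the edge-tts rate option accepts.

```typescript
type VoiceMode = 'announce' | 'brief' | 'consult';

// Hypothetical per-mode defaults, in percent. The source only says that
// announce is snappy and brief is deliberate.
const MODE_BASE_RATE: Record<VoiceMode, number> = {
  announce: 10,
  brief: -10,
  consult: 0,
};

const SHORT_TEXT_CHARS = 100; // hypothetical cutoff for a "short" message
const LONG_TEXT_CHARS = 1000; // from the text: slow down at 1000+ chars

/** Length-based adjustment in percent: +10 for short text, -15 for long. */
export function lengthAdjustment(text: string): number {
  if (text.length >= LONG_TEXT_CHARS) return -15;
  if (text.length < SHORT_TEXT_CHARS) return 10;
  return 0;
}

/** Formats a percentage as a signed string such as "+10%" or "-15%". */
export function formatRate(percent: number): string {
  const sign = percent < 0 ? '-' : '+';
  return `${sign}${Math.abs(percent)}%`;
}

/** Assumption: the mode default and the length adjustment simply add up. */
export function rateFor(mode: VoiceMode, text: string): string {
  return formatRate(MODE_BASE_RATE[mode] + lengthAdjustment(text));
}
```

Under these assumptions, rateFor('consult', 'Build passed.') yields "+10%", and a 1000-character explanation in the same mode yields "-15%".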

Text (agent response) → Edge-TTS (neural synthesis) → Rate Adjust (mode + length aware) → Play (afplay / mpv)

Session Booking

Only one session can use the microphone at a time. VoiceLayer handles this with a lockfile mutex: a JSON file at /tmp/voicelayer-session.lock with the owning PID, session ID, and start timestamp. Lock creation uses atomic wx write flags to prevent TOCTOU races. Dead process detection uses the signal-zero trick: process.kill(pid, 0) throws ESRCH for dead processes, so stale locks get cleaned up automatically.

typescript

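import { unlinkSync, writeFileSync } from 'node:fs';

// Atomic lock creation (TOCTOU-safe): 'wx' throws EEXIST if the file already exists
writeFileSync(lockPath, JSON.stringify({
  pid: process.pid,
  sessionId,
  startedAt: new Date().toISOString()
}), { flag: 'wx' });

// Dead process detection: signal 0 checks for existence without sending a signal
try {
  process.kill(lockPid, 0); // alive
} catch (err) {
  // ESRCH: no such process. EPERM means the process exists but belongs to another user.
  if ((err as NodeJS.ErrnoException).code === 'ESRCH') {
    unlinkSync(lockPath);   // stale → cleanup
  }
}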

Lockfile mutex with dead process cleanup
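The two fragments above can be combined into a single acquire routine. The following is a sketch of one way to do that, not VoiceLayer's implementation: acquireLock, isAlive, and the SessionLock shape are hypothetical names, and the single retry after stale-lock cleanup is an assumed policy.

```typescript
import { readFileSync, unlinkSync, writeFileSync } from 'node:fs';

interface SessionLock {
  pid: number;
  sessionId: string;
  startedAt: string;
}

/** Signal 0 probes for existence; only ESRCH means the process is gone. */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code !== 'ESRCH';
  }
}

/** Returns true if this process now owns the lock, false if it is held. */
export function acquireLock(lockPath: string, sessionId: string): boolean {
  const payload: SessionLock = {
    pid: process.pid,
    sessionId,
    startedAt: new Date().toISOString(),
  };
  // Two attempts: the second runs only after a stale lock was removed.
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      writeFileSync(lockPath, JSON.stringify(payload), { flag: 'wx' });
      return true;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== 'EEXIST') throw err;
    }
    let owner: SessionLock;
    try {
      owner = JSON.parse(readFileSync(lockPath, 'utf8')) as SessionLock;
    } catch {
      continue; // lock vanished or is unreadable; try once more
    }
    if (isAlive(owner.pid)) return false; // held by a live session
    try {
      unlinkSync(lockPath); // stale lock left by a dead process
    } catch {
      // Another process cleaned it up first; the retry decides who wins.
    }
  }
  return false;
}
```

Because the retry goes back through the 'wx' write, two processes that both notice the same stale lock still cannot both acquire it: only one exclusive create succeeds.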