
Case study

Voice & Voice-to-Text

Turning voice into a reliable, first-class writing surface on iOS

Focus: Reliability architecture, UX states, user trust

Work type: Direction-setting, cross-functional alignment, implementation support

Output: Voice session model, state machine, integration patterns, rollout readiness

Context

Voice and Voice-to-Text (VTT) on iOS was a high-impact surface for mobile writing. For users, it was a faster way to draft messages and edits. For the product, it sat directly on top of the keyboard, where trust and reliability matter more than anywhere else.

Behind the scenes, voice spanned multiple complex domains at once: iOS audio capture and interruptions, backend transcription latency, real-time UI states, and integration into editing and AI workflows. When any part failed, the experience felt broken, and users lost confidence quickly.

The core problem

Voice reliability was inconsistent across devices and hard to debug. Ownership between iOS, backend transcription, and voice infrastructure was unclear. UX states for recording, partial transcription, and failure were under-defined. Iteration speed was slow because UI, audio capture, and transcription were tightly coupled and fragile.

This created two risks at the same time: a user trust risk and an execution risk. Shipping more features would not fix the underlying system.

The key decision

Treat voice as a core writing modality, not a dictation feature. Reliability and composability had to come first, so voice could integrate cleanly with editing and AI without becoming a permanent source of regressions.

This meant prioritizing foundational work before expanding capability, even when feature delivery pressure was high.

My role

I stepped into this space as the end-to-end owner for voice on iOS: setting direction for how voice should behave in the keyboard, providing technical leadership on reliability and architecture, and driving cross-functional alignment across iOS, backend, voice infrastructure, product, and design.

The leverage move

The highest leverage was shifting from bug-fixing to system design. I reframed voice from a set of symptoms to a lifecycle problem: sessions, states, transitions, and failure recovery. That reframing created a shared north star across disciplines and changed what we prioritized.
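That lifecycle framing can be sketched as a small state machine. The sketch below is illustrative only: the type, case, and function names are hypothetical, not the actual implementation, but it shows the property that mattered — every (state, event) pair resolves to a well-defined next state, including failure and retry.

```swift
import Foundation

/// Hypothetical error type for illustration.
enum VoiceError: Error {
    case interrupted
}

/// Illustrative voice session states (names are assumptions, not the real model).
enum VoiceSessionState {
    case idle
    case recording
    case transcribing(partial: String)
    case finished(transcript: String)
    case failed(Error)
}

/// Events that can drive a session forward.
enum VoiceSessionEvent {
    case startRecording
    case partialTranscript(String)
    case finalTranscript(String)
    case interruption        // e.g. incoming call, Siri, audio route change
    case failure(Error)
    case retry
}

/// A pure transition function: deterministic, easy to test, and the reason
/// UI states can always match system reality.
func transition(_ state: VoiceSessionState,
                _ event: VoiceSessionEvent) -> VoiceSessionState {
    switch (state, event) {
    case (.idle, .startRecording):
        return .recording
    case (.recording, .partialTranscript(let text)),
         (.transcribing, .partialTranscript(let text)):
        return .transcribing(partial: text)
    case (.recording, .finalTranscript(let text)),
         (.transcribing, .finalTranscript(let text)):
        return .finished(transcript: text)
    case (.recording, .interruption),
         (.transcribing, .interruption):
        return .failed(VoiceError.interrupted)
    case (_, .failure(let error)):
        return .failed(error)
    case (.failed, .retry):
        return .idle
    default:
        return state  // events that don't apply in this state are ignored
    }
}
```

Because the transition function is pure, interruption and retry paths can be exercised in unit tests without touching audio hardware or the network.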

What I did

  • Established ownership and boundaries — clarified responsibilities across iOS session lifecycle, backend transcription, and voice infrastructure, reducing coordination overhead and decision ambiguity.

  • Decoupled the architecture — introduced clean abstractions separating audio capture, session state, and transcription delivery, making failures easier to reason about and reducing regression risk.

  • Defined deterministic UX states — partnered with Product and Design to specify recording, partial, final, error, and retry states, then translated them into predictable iOS behavior engineers could implement consistently.

  • Raised the reliability bar — prioritized interruption handling (calls, Siri, route changes), recovery paths, guardrails, and observability so issues became diagnosable instead of mysterious.

  • Enabled others — created reusable patterns and integration APIs, reviewed voice-related PRs for resilience, and helped other engineers extend voice safely without reintroducing instability.
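The decoupling and interruption-handling work above can be illustrated with a small Swift sketch. The protocol names are hypothetical; the interruption observer, however, uses the real AVAudioSession notification API, which is the standard way iOS reports calls, Siri, and route changes to an audio client.

```swift
import AVFoundation

/// Illustrative layer boundaries (names are assumptions): each protocol owns
/// exactly one concern, so a failure in one layer can be diagnosed without
/// reasoning about the others.
protocol AudioCapturing {
    func start() throws
    func stop()
}

protocol TranscriptionDelivering {
    func deliverPartial(_ text: String)
    func deliverFinal(_ text: String)
}

/// System interruptions (calls, Siri, route changes) observed in one place
/// and converted into an explicit session event, rather than leaking
/// ad hoc handling into UI or capture code.
final class InterruptionObserver {
    var onInterruption: (() -> Void)?

    init() {
        NotificationCenter.default.addObserver(
            forName: AVAudioSession.interruptionNotification,
            object: nil,
            queue: .main
        ) { [weak self] note in
            guard
                let info = note.userInfo,
                let raw = info[AVAudioSessionInterruptionTypeKey] as? UInt,
                let type = AVAudioSession.InterruptionType(rawValue: raw),
                type == .began
            else { return }
            self?.onInterruption?()
        }
        // Route changes (e.g. headphones unplugged) follow the same pattern
        // via AVAudioSession.routeChangeNotification.
    }
}
```

Funneling every system event through one observer is what made failures diagnosable: the session log shows a single, ordered stream of events instead of side effects scattered across layers.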

Trade-offs and constraints

We chose reliability over new capability in the short term. We resisted adding edge-case features until the session model could handle interruptions and retries deterministically. We also optimized for composability with the keyboard and editing workflows, which meant being strict about boundaries and responsibilities rather than letting layers leak into each other.

What changed

Voice became stable enough for broader rollout. The most common failure modes became rarer and easier to diagnose. The experience felt faster and more predictable because UI states matched system reality. Most importantly, voice shifted from a fragile, high-risk feature to a reliable writing surface the product organization could build on.

Validation approach

We validated the system through reliability signals and shared understanding: could engineers explain the voice session lifecycle, could every UI state map to a clear system state, could failures recover without manual intervention, and could logs make intermittent bugs diagnosable. As those answers converged, confidence increased and rollout became safer.

Reflection

This work reinforced that leadership often looks like creating clarity others can execute on. By defining a session model, state machine, and integration patterns, we unlocked safer iteration and protected user trust. Voice stopped being a constant source of uncertainty and became a platform the product could extend.

In one sentence

I led the direction and execution of Voice and Voice-to-Text on iOS, transforming a fragile dictation feature into a reliable, scalable writing surface through clear ownership, architectural decoupling, cross-functional alignment, and a reliability bar anchored in user trust.
