SayKit: a framework for building voice-driven applications

The current state of development tools

Over a year ago we set out to build an entirely voice-based shopping app to help the blind live more independently. It turned out to be a lot more difficult than we expected. While there are over 300 classes and protocols in iOS to help developers create dynamic, graphical apps, there are fewer than 10 to help manage audio-based applications.

For graphical applications there are abstractions to recognize gestures, to lay out and compose views, to perform animations, to navigate application hierarchies, and to prompt users for immediate responses. There are sophisticated WYSIWYG tools. There are system-standard widgets, fonts, image filters, even physics simulation routines! There are hundreds of pages of documentation to tell you how to use and integrate these tools. And if you can't find an answer there, there are a million Stack Overflow answers you can sift through.

But what about audio interfaces? In iOS, there's a microphone and a speech synthesizer. It's like asking someone to build a modern, complex visual application with nothing but a function to draw shapes and another to do hit testing.

 

An example: Yes/No prompt

Say you want to ask the user a Yes or No question and allow them to respond verbally. You also want a visual element that gives the user feedback about the status of the prompt and lets them respond directly by tapping a Yes or No button.


Creating the prompt without SayKit

  1. Create a view to display on screen while the prompt is active. Add a label for the question text, a Yes button, a No button, and a Microphone button.
  2. Display the view.
  3. Trigger the speech synthesizer to ask the question and wait until it completes. Wire up the microphone button to allow the user to skip to the next step immediately.
  4. Start the speech recognition service via a call to an external service's API.
  5. Rewire the microphone button so that it stops the recognizer immediately (useful if the user is in a loud area and the microphone is having trouble recognizing the end of speech).
  6. If the user taps the Yes or No button, deactivate the microphone and go to step 10.
  7. If the API service successfully returns text, check whether the string contains "yes" or "no". If so, go to step 10. Otherwise, go to step 9.
  8. If the API service fails for some reason, go to step 9.
  9. We didn't get a valid response, so we need to decide whether the user should answer again. If so, go back to step 3. Otherwise, we dismiss our prompt view and tell our caller we didn't get a valid response.
  10. We got a valid response! We dismiss our prompt view and tell our caller the response.
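
For a sense of what those steps amount to in code, here is a rough Swift sketch of that flow written by hand. It is only an illustration: view layout is omitted, step 9's reprompt decision is simplified to giving up after one attempt, and SpeechRecognizerService is a hypothetical stand-in for whatever external recognition API the app actually calls (recall that iOS exposes no public one).

import UIKit
import AVFoundation

// Hypothetical wrapper around a third-party speech-to-text API
// (iOS exposes no public recognition API of its own).
protocol SpeechRecognizerService {
    func startListening(completion: @escaping (String?) -> Void)
    func stopListening()
}

final class YesNoPromptViewController: UIViewController, AVSpeechSynthesizerDelegate {
    private let synthesizer = AVSpeechSynthesizer()
    private let recognizer: SpeechRecognizerService
    private let question: String
    private let completion: (Bool?) -> Void   // true = yes, false = no, nil = no valid answer

    private let questionLabel = UILabel()
    private let yesButton = UIButton(type: .system)
    private let noButton = UIButton(type: .system)
    private let micButton = UIButton(type: .system)

    init(question: String, recognizer: SpeechRecognizerService, completion: @escaping (Bool?) -> Void) {
        self.question = question
        self.recognizer = recognizer
        self.completion = completion
        super.init(nibName: nil, bundle: nil)
    }

    required init?(coder: NSCoder) { fatalError("not supported") }

    // Steps 1-3: build the prompt view, then speak the question and wait for it to finish.
    override func viewDidLoad() {
        super.viewDidLoad()
        questionLabel.text = question
        yesButton.addTarget(self, action: #selector(yesTapped), for: .touchUpInside)
        noButton.addTarget(self, action: #selector(noTapped), for: .touchUpInside)
        micButton.addTarget(self, action: #selector(micTapped), for: .touchUpInside)
        // ...adding subviews and layout constraints omitted...

        synthesizer.delegate = self
        synthesizer.speak(AVSpeechUtterance(string: question))
    }

    // Step 4: once the question has been spoken, start the recognizer.
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        beginListening()
    }

    private func beginListening() {
        recognizer.startListening { [weak self] transcript in
            guard let self = self else { return }
            // Steps 7-8: interpret the transcript, or fall through on failure.
            if let text = transcript?.lowercased() {
                if text.contains("yes") { self.finish(with: true); return }
                if text.contains("no") { self.finish(with: false); return }
            }
            // Step 9: no valid response. A real app might reprompt here; we just give up.
            self.finish(with: nil)
        }
    }

    // Steps 3 and 5: the mic button either skips past the spoken question
    // or force-stops the recognizer, depending on where we are in the flow.
    @objc private func micTapped() {
        if synthesizer.isSpeaking {
            synthesizer.stopSpeaking(at: .immediate)
            beginListening()
        } else {
            recognizer.stopListening()
        }
    }

    // Step 6: touch fallbacks for the two answers.
    @objc private func yesTapped() { finish(with: true) }
    @objc private func noTapped() { finish(with: false) }

    // Step 10: tear everything down and report back to the caller.
    private func finish(with answer: Bool?) {
        recognizer.stopListening()
        dismiss(animated: true) { self.completion(answer) }
    }
}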

 

Creating the prompt with SayKit (in Swift)

VoiceRequestCoordinator.initiate(ConfirmationRequest("Are you sure?")) { result in
   // act on result
}


That’s it.

 

What we’ve made

When we set out to create voice-based apps, we found ourselves in pretty uncharted territory. So we built tools, a lot of tools. Sure, our tools overcome various engineering challenges, making common tasks much easier, but more crucially they encourage design patterns that promote writing maintainable code.


The basics

In iOS, there are no public APIs that support speech recognition. To compensate, we've added support for:

  • Speech recognition
  • Natural Language Understanding
  • Ensuring app speech output doesn't conflict with microphone input
  • Handling ambiguous user commands by asking for clarification
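
The third item above is mostly a sequencing problem: never listen while the app itself is talking, or the recognizer will happily transcribe the synthesizer's output. Here is a minimal sketch of that coordination; the Listener protocol is a hypothetical stand-in for whichever recognition backend is in use.

import AVFoundation

// Hypothetical listener interface standing in for an external recognition backend.
protocol Listener: AnyObject {
    func startListening()
    func stopListening()
}

// Mutes the microphone whenever the app itself is speaking, so speech output
// never leaks into speech input.
final class SpeechTurnCoordinator: NSObject, AVSpeechSynthesizerDelegate {
    private let synthesizer = AVSpeechSynthesizer()
    private weak var listener: Listener?

    init(listener: Listener) {
        self.listener = listener
        super.init()
        synthesizer.delegate = self
    }

    func speak(_ text: String) {
        listener?.stopListening()                        // stop listening before we talk
        synthesizer.speak(AVSpeechUtterance(string: text))
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        listener?.startListening()                       // resume once we've finished talking
    }
}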

Speech output is supported by a system-wide speech synthesizer with a single serial queue. We've built extensions to overcome limitations with this model, including:

  • Integrating sounds into a serial speech queue (listening tones, error tones, etc.)
  • Extending the built-in speech synthesizer to support prioritized speech (e.g. pausing an ongoing utterance temporarily so a notification can be spoken)
  • Guarding against VoiceOver accessibility features conflicting with app speech output
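
To make the queue idea concrete, here is a minimal sketch of a serial queue that interleaves tones with speech and lets an urgent event cut ahead of whatever is playing. It uses nothing beyond AVFoundation; where the behavior described above pauses and later resumes the interrupted utterance, this simplified version just restarts it from the beginning.

import AVFoundation

// Items the queue knows how to play: a spoken phrase or a short tone.
enum AudioEvent {
    case speech(String)
    case tone(URL)        // e.g. a listening or error tone bundled with the app
}

// A minimal serial queue that interleaves tones and speech, and lets an urgent
// event (such as a notification) jump ahead of the current item.
final class AudioEventQueue: NSObject, AVSpeechSynthesizerDelegate, AVAudioPlayerDelegate {
    private let synthesizer = AVSpeechSynthesizer()
    private var player: AVAudioPlayer?
    private var pending: [AudioEvent] = []
    private var current: AudioEvent?

    override init() {
        super.init()
        synthesizer.delegate = self
    }

    func enqueue(_ event: AudioEvent) {
        pending.append(event)
        playNextIfIdle()
    }

    // Interrupt whatever is playing; the interrupted item is requeued right
    // after the urgent one (and will replay from its start).
    func interject(_ event: AudioEvent) {
        if let interrupted = current {
            pending.insert(interrupted, at: 0)
        }
        pending.insert(event, at: 0)
        synthesizer.stopSpeaking(at: .immediate)
        player?.stop()
        current = nil
        playNextIfIdle()
    }

    private func playNextIfIdle() {
        guard current == nil, !pending.isEmpty else { return }
        let next = pending.removeFirst()
        current = next
        switch next {
        case .speech(let text):
            synthesizer.speak(AVSpeechUtterance(string: text))
        case .tone(let url):
            player = try? AVAudioPlayer(contentsOf: url)
            player?.delegate = self
            player?.play()
        }
    }

    // Advance the queue when the current item finishes on its own.
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        current = nil
        playNextIfIdle()
    }

    func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
        current = nil
        playNextIfIdle()
    }
}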



Moving beyond Question and Answer based interactions

Currently, many voice-based applications are limited to a single voice action, or a voice action followed by a clarification. For example:

  • User: What’s the weather?
  • User: What’s the monthly payment on a $300,000 mortgage?
    • App: What is the interest rate and repayment period?

These flat models cause a lot of friction when a developer attempts to use them within a larger, hierarchical application (e.g. selecting a date as part of a flight search app like Hipmunk). To reduce this friction, we've built systems for:

  • Composing and layering audio output in a hierarchical manner
  • Limiting commands to specific contexts and managing multiple contexts
  • Supporting context-specific orientation commands (e.g. "Help", "What can I say") in a natural, declarative manner
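
None of the following is SayKit's actual API; it's just a conceptual sketch of the second and third items in that list: each screen declares the commands it understands, contexts stack up as the user navigates deeper, and a spoken command is resolved against the innermost context that claims it.

// Conceptual sketch only (not SayKit's API): context-scoped voice commands.
struct CommandContext {
    let name: String
    let commands: [String: () -> Void]   // spoken phrase (lowercased) -> handler
}

final class CommandContextStack {
    private var stack: [CommandContext] = []

    func push(_ context: CommandContext) { stack.append(context) }
    func pop() { _ = stack.popLast() }

    // Walk from the innermost context outward, so a command defined by the
    // current screen wins over the same command defined further down the stack.
    @discardableResult
    func handle(_ phrase: String) -> Bool {
        for context in stack.reversed() {
            if let handler = context.commands[phrase.lowercased()] {
                handler()
                return true
            }
        }
        return false
    }

    // "What can I say?" falls out for free: list the commands currently in scope.
    func availableCommands() -> [String] {
        return stack.flatMap { Array($0.commands.keys) }
    }
}

// Usage: a flight-search screen pushes its own context, then the date picker
// layers another one on top of it.
let contexts = CommandContextStack()
contexts.push(CommandContext(name: "flight search",
                             commands: ["search": { print("searching flights") }]))
contexts.push(CommandContext(name: "date picker",
                             commands: ["next month": { print("advancing a month") }]))
contexts.handle("next month")   // handled by the date picker context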



Visual-audio interface coordination

Emphasizing an audio interface should not necessitate de-emphasizing a visual one. However, making the two paradigms work together is far from trivial. We've built tools to facilitate coordinating the two, including:

  • Standard visual interfaces to allow touch-interaction with conversational prompts (e.g. "Are you sure?", "Which color would you like?")
  • Standard "playback" control buttons to work with speech output
  • Generalizations of existing view-based delegation protocols in iOS (e.g. UITableViewDataSource) to support audio, allowing simple audio "mirroring" of visual data (sketched after this list)
  • Support for handling commands outside the visual application context (e.g. "search" from a screen with no search bar)
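
To illustrate the data-source point: the protocol below is not SayKit's actual interface, just a sketch of the shape such a generalization could take, where the same object that feeds a table view its cells also feeds the speech queue a spoken line for each row.

import UIKit

// Illustrative only (not SayKit's protocol): an audio analogue of
// UITableViewDataSource, so one object can drive both the visual list
// and its spoken "mirror".
protocol AudioTableDataSource: AnyObject {
    func numberOfSpokenItems() -> Int
    func spokenText(forItemAt index: Int) -> String
}

final class ProductListController: NSObject, UITableViewDataSource, AudioTableDataSource {
    private let products = ["Red scarf", "Blue scarf", "Green hat"]

    // Visual presentation (UIKit's own protocol).
    func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
        return products.count
    }

    func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
        // Assumes a cell class has been registered under the "Cell" identifier.
        let cell = tableView.dequeueReusableCell(withIdentifier: "Cell", for: indexPath)
        cell.textLabel?.text = products[indexPath.row]
        return cell
    }

    // Audio presentation (the hypothetical mirror).
    func numberOfSpokenItems() -> Int {
        return products.count
    }

    func spokenText(forItemAt index: Int) -> String {
        return "Item \(index + 1): \(products[index])"
    }
}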

This coordination will remain important as long as smartphones continue to be the primary way that people interact with applications. As we move towards wearables and other voice-enabled devices, it will become less and less necessary.

 

The future

SayKit is just the first step towards creating a framework for conversational applications as robust as those we’ve come to expect for graphical apps. There are many more common interactions that need to be supported, not to mention new platforms. We’ll be adding more and more support as we develop these modules.

Stay Tuned!