Hey Alexa, what’s next? Breaking through voice technology’s ceiling

THE RECENT ANNOUNCEMENT from Amazon that it would be reducing staff and
budget for the Alexa department has led to the voice assistant being deemed “a colossal failure.”
In its wake, there has been discussion that voice as an industry is stagnating (or even
worse, on the decline).
I have to say, I disagree.

While it is true that voice has hit its use-case ceiling, that doesn’t equal stagnation.
It simply means that the current state of the technology has a few limitations that are
important to understand if we want it to evolve.

Simply put, today’s technologies do not perform in a way that meets the human
standard. To do so requires three capabilities:
1. Superior natural language understanding (NLU): There are lots of good
companies out there that have conquered this aspect. The technology can
pick up on what you’re saying and knows the usual ways people phrase what
they want. For example, if you say, “I’d like a hamburger with onions,” it
knows that you want the onions on the hamburger, not in a separate bag.

2. Voice metadata extraction: Voice technology needs to be able to pick up
whether a speaker is happy or frustrated and how far they are from the mic,
as well as their identities and accounts. It needs to recognize voices well
enough to know whether you or somebody else is talking.

3. Overcoming crosstalk and untethered noise: The ability to understand speech
in the presence of crosstalk, when other people are talking, and when there are
noises (traffic, music, babble) not independently accessible to noise-cancellation
algorithms.

There are companies that achieve the first two. Their solutions are typically built
on the assumption of a single speaker with background noise mostly canceled.
However, in a typical public setting with multiple sources of noise, that is a
questionable assumption.