Avoid Being Fooled by Parlor Tricks: The Necessity of Real-World Environment Testing for ASR

S. Hamid Nawab — Originally published by Speech Technology Magazine

Voice assistants developed by Amazon, Google, Xiaomi, Alibaba, and others are poised to take over the world. A report by Juniper Research estimates that 70 million U.S. households will have at least one voice assistant-enabled speaker by 2022. The same report says that the majority of voice-assisted activities will occur on smartphones, with voice assistants installed on over 5 billion smartphones worldwide by 2022.

High Stakes for Voice Assistants

While consumers used their voice assistants an average of once a month in 2017, surveys indicate consumers might be harder to please when doing things such as voice shopping. The Information estimates that only 2% of Amazon Echo owners have ever tried to purchase anything through a voice assistant, and of those who tried, 90% never tried again. This suggests that the user experience has to work really, really well for consumers to keep using voice assistants for more than searches and dictation.

That’s why a voice assistant’s ability to perform in varied, and often difficult, sound environments will be a key pillar of the sector’s success. The sheer scale of distribution means voice assistants will be used in many different situations and environments, and many of these will require the assistant to adapt as conditions vary. Failing to adapt is a huge risk for this emerging market.

When Sound Environment Models Are Not Enough

When companies develop their voice assistants, they create synthetic environments that mimic what the product may actually face in the real world. This mimicry is driven by the need for testing control over quantifiable environmental factors, and it generally depends on the device matching an environmental sound profile to the scene when activated. The device then uses that sound profile to direct signal processing and noise cancellation, producing a clean signal for the automatic speech recognition (ASR) software to convert into commands and actions.

In real-world situations, the device, the target speaker, multiple sources of background noise, and other voices will all be present and often moving relative to one another. A sound profile that was effective at the beginning of an interaction may be inadequate a moment later as the scene shifts, again and again. In the current generation of devices, the user is expected to control this environment for the voice assistant. Given that billions of users will be operating voice assistants with no training, the assistants will likely deliver suboptimal results, which could greatly hinder the widespread adoption and use of voice interfaces.
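
To make the problem concrete, here is a minimal sketch in Python of how a profile chosen at activation can go stale as the scene changes. Everything in it is an assumption for illustration: the profile names, the three-band noise energies, the nearest-match rule, and the simulated scene are hypothetical and do not describe any particular vendor's device.

```python
import math

# Hypothetical sound profiles: (low, mid, high) band noise energy in dB.
PROFILES = {
    "quiet_room": (30.0, 25.0, 20.0),
    "kitchen": (55.0, 50.0, 45.0),
    "car_cabin": (70.0, 55.0, 40.0),
}


def match_profile(frame):
    """Return the name of the profile nearest to the frame (Euclidean distance)."""
    return min(PROFILES, key=lambda name: math.dist(PROFILES[name], frame))


def stale_frames(frames):
    """Count frames where the profile picked at activation is no longer the best match."""
    locked = match_profile(frames[0])  # chosen once, when the device wakes up
    return sum(1 for frame in frames if match_profile(frame) != locked)


# Simulated scene (assumed values): starts quiet, then a kitchen appliance turns on.
scene = [
    (31.0, 26.0, 19.0),
    (33.0, 27.0, 22.0),
    (52.0, 49.0, 44.0),
    (56.0, 51.0, 46.0),
]

locked_profile = match_profile(scene[0])
stale = stale_frames(scene)
print(f"Profile at activation: {locked_profile}; stale frames: {stale} of {len(scene)}")
```

In this toy scene the profile locked in at activation stops being the best match halfway through the interaction. Re-matching on every frame would track the change, but a real device also has to cope with moving sources and overlapping voices that no fixed set of profiles anticipates.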