Delivering more natural, personalized, and secure voice control for today’s connected world

The industry has moved from punch card to keyboard and from mouse to touchscreen, all in pursuit of more direct system manipulation to optimize user experience (UX) in today’s increasingly mobile and interconnected world. These are all abstractions of physical devices, though, and voice control has been heralded as the next step toward a more natural UX. Unfortunately, today’s solutions can’t deliver what machines need in order to understand people. The result is poor performance and no convenient way to control a new generation of voice-only products and services.

One of the biggest impediments to satisfactory voice control performance has been ambient noise, including nearby conversations, outdoor sounds, and reverberation when speaking in certain indoor environments. The use of multiple acoustic microphones and microphone arrays to improve directional acquisition has proven expensive and incapable of adequately isolating the speaker for reliable voice control. Now, a new approach is available that leverages lasers and interferometry techniques to gather additional critical information exclusively about the user communicating with a device. Combining this optical information with the output from an acoustic microphone gives automatic speech recognition (ASR) engines something they have never had before – a near-perfect reference audio signal directly from the speaker’s facial vibrations, regardless of noise levels.

Understanding the special challenges of human-to-machine communications

Human-to-machine communications (HMC) technology enables humans to interact with and control a variety of networked devices as quickly and efficiently as possible. While voice is an excellent interface, the problem with using today’s ASR engines for HMC applications is that they have generally been designed for human listeners and typically perform well only if words are spoken clearly and there is no background noise. This, of course, is not the case in a real-world, noisy environment. Machines cannot infer meaning as humans do when background noise periodically drowns out the speaker, and while voice-recognition software can be trained to understand accents and other speech patterns, it cannot be trained to ignore background noise. Solutions must be able to isolate the speaker’s voice from others in the background, as well as from other types of ambient noise.

In tests of voice-recognition solutions in a moving vehicle with windows fully open and with speakers in the background, the word/command recognition rate typically drops to 0 percent (Figure 1). The industry has pursued a number of approaches to solving this problem over the past 20 years but, in general, these efforts have delivered only single-digit percentage improvements in word recognition performance.

[Figure 1 |  Today’s voice control solutions cannot deliver what machines need to understand humans.]

Reducing or eliminating background noise to isolate the speaker’s voice is critical to improving the accuracy of automatic speech recognition engines in HMC applications. Acoustic microphone technology alone does not provide enough directional acquisition capability to achieve this level of speaker isolation, even with multiple microphones and microphone arrays. However, if the output from an acoustic microphone can be paired with additional outputs associated exclusively with the speaker, there is an opportunity to reduce word error rates by at least 60 percent.
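To make the word error rate figure concrete, the sketch below computes word error rate (WER) as a word-level edit distance divided by the number of reference words, along with the relative reduction between two WER values. It assumes the "at least 60 percent" figure is a relative reduction, which the text does not state explicitly; the function names are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the error rate fell relative to the baseline (assumed definition)."""
    return (baseline_wer - improved_wer) / baseline_wer
```

Under this reading, a recognizer that drops from a WER of 0.50 to 0.20 has achieved a 60 percent relative reduction.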

Applying optical laser technology and interferometry techniques

The key to improving voice recognition using optical laser technology is the ability to measure the distance and velocity of facial vibrations during speech. This approach takes advantage of the fact that even in an environment full of acoustic vibrations, a person’s facial skin only vibrates during speech.

The optical sensor and acoustic microphone operate alongside each other. The acoustic microphone extracts signals from the air across the full 4-6 kHz range of normal speech, albeit with a high level of non-speaker-related ambient noise. Meanwhile, an eye-safe optical sensor is pointed at a fixed location on the user’s face such as the mouth, lip, cheek, throat, or behind the ear, and picks up only the signals from the facial skin that are transmitted during speech at lower, 1-2 kHz frequencies (Figure 2). The optical sensor is impervious to acoustic noise in this range. Nanometer-resolution interferometry techniques are then used to measure differences in the distance traveled by light as it reflects from these areas. The data is converted into intensity variations, and algorithms filter out any vibrations not associated with the user’s speech. The intensity variations are then converted to signals, which are converted back to sound.
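The relationship between optical path difference and skin motion can be sketched as follows. This is a minimal illustration, assuming a simple reflective geometry in which the beam travels to the skin and back, so that a surface displacement d changes the optical path by 2d and the phase by 4πd/λ. The wavelength is a hypothetical value, and the text does not describe the sensor's actual optical design or signal chain.

```python
import math

WAVELENGTH_M = 850e-9  # hypothetical near-infrared laser wavelength, not from the article


def displacement_from_phase(phase_rad: float, wavelength_m: float = WAVELENGTH_M) -> float:
    """Surface displacement (m) for a measured phase shift, assuming a round-trip path."""
    return wavelength_m * phase_rad / (4 * math.pi)


def velocity_from_phases(phases_rad, sample_rate_hz: float,
                         wavelength_m: float = WAVELENGTH_M):
    """Surface velocity (m/s) between consecutive phase samples via finite differences."""
    displacements = [displacement_from_phase(p, wavelength_m) for p in phases_rad]
    return [(b - a) * sample_rate_hz for a, b in zip(displacements, displacements[1:])]
```

In this geometry, one full 2π phase cycle corresponds to half a wavelength of skin displacement, which is why sub-wavelength phase measurements can resolve motion on the nanometer scale.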

[Figure 2 | Facial skin vibrates at the same frequency as a person’s voice. Using a multi-sensor approach, a noisy audio signal can be sampled by an acoustic mic while an HMC sensor measures facial skin vibrations created by the speaker at nanometer resolution.]

In essence, the optical HMC sensor creates a virtual “cube” around the speaker. Because vibrations are associated only with the user’s speech, there is an extremely high level of directional pickup and, in turn, near-perfect isolation from extraneous noise and other background voices. No other sounds are detected or sent to the speech recognition engine.
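One simple way to picture this isolation is to use the optical channel as a gate on the acoustic channel: frames in which the skin is not vibrating are treated as non-speech and silenced. The sketch below is an illustration only, not the vendor's algorithm. It assumes both signals are time-aligned, sampled at the same rate, and held as lists of floats; the frame length and threshold are hypothetical tuning parameters.

```python
def frame_energy(samples, frame_len: int):
    """Mean squared amplitude of each consecutive, non-overlapping full frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def gate_acoustic(acoustic, optical, frame_len: int, threshold: float):
    """Zero acoustic frames in which the optical channel shows no skin vibration."""
    if len(acoustic) != len(optical):
        raise ValueError("signals must be the same length and time-aligned")
    gated = list(acoustic)
    for i, energy in enumerate(frame_energy(optical, frame_len)):
        if energy < threshold:
            start = i * frame_len
            gated[start:start + frame_len] = [0.0] * frame_len
    # Any trailing partial frame is left untouched.
    return gated
```

A production system would more likely attenuate or spectrally filter the acoustic signal than hard-mute it, but the gate shows how a speaker-only reference lets background voices be rejected regardless of how loud they are.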

Implementation options

The first implementation option is to connect an HMC optical sensor to the noise reduction section of a voice control solution. This improves noise reduction performance with an associated improvement in speech-recognition performance, creating a platform for significantly improving current products without requiring changes to existing speech recognition architectures.

Alternatively, an HMC optical sensor can be connected directly to a speech recognition engine, eliminating the need for noise reduction modification. The speech recognition engine simultaneously processes the acoustic and optical signals and performs all necessary noise compensation using both sets of input.

Using the latter approach, a speech recognition engine leverages the best characteristics of the acoustic microphone and optical sensor. This change to the speech recognition model has interesting implications for not only improving voice control performance but also exploring new use cases in environments that were previously considered prohibitively noisy. Today’s sensor technology is small enough (sub-3 mm form factor) with sufficient power efficiency for use in very small devices, for both head-mounted (virtual and augmented reality glasses, headsets, helmets) and remote (voice-controlled infotainment and access control) applications.

The in-vehicle use case is particularly compelling. Speech recognition has been nearly impossible with windows rolled down and background passenger conversations. With optical HMC sensor technology installed in the infotainment center or rear-view mirror and pointed at the driver, however, all commands are clean and isolated from background noise. The sensor in this application operates at ranges up to 1 m across a field of view that accommodates typical driver movements.

Speech recognition in head-mounted devices has also been difficult, especially in noisy environments. Adding an optical HMC sensor to the headset isolates the speaker from ambient noise and removes the requirement for acoustic mics to be positioned close to the user’s mouth. Designers can “remove the boom” and create new, more convenient designs and a better user experience in applications including emergency response communications solutions, motorcycle helmets, aviation headphones, and gaming and virtual/augmented reality gear. Optical HMC sensors used in these applications support an up-to-50 mm range when pointed at a fixed location on a user’s face.

New ways to measure performance

In addition to changing how ASR engines operate and creating new voice-control use cases, optical HMC laser technology is also poised to change how the speech-recognition industry measures performance. In the past, performance was typically calculated using a Mean Opinion Score (MOS) that measures intelligibility and whether the experience is a good or bad one from a human user’s perspective. The MOS has been used for decades in the telephony industry to measure quality based on a user’s assessment.

In the HMC world, however, it may be more important to know how many times a command must be given before execution. Early developers of HMC solutions are now looking at such metrics as how much time it takes for a single task to be performed, for example, using speech recognition to identify the barcode on a box on the factory floor so that an automated transport system can move it from one location to another.
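Task-oriented metrics of this kind are straightforward to log and summarize. The sketch below assumes each trial records how many times a command was spoken before it executed and how long the task took; the `CommandTrial` record and the metric names are hypothetical and do not represent an industry standard.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CommandTrial:
    attempts: int    # times the command was spoken before it executed
    seconds: float   # time from first utterance to task completion


def summarize(trials):
    """Aggregate repetition and completion-time metrics over a list of trials."""
    if not trials:
        raise ValueError("at least one trial is required")
    return {
        "mean_attempts": mean(t.attempts for t in trials),
        "first_try_rate": sum(t.attempts == 1 for t in trials) / len(trials),
        "mean_seconds": mean(t.seconds for t in trials),
    }
```

Unlike a MOS, these figures need no human listener panel: they can be collected automatically from device logs in the target environment.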

Future developments

As HMC optical sensors move to the higher end of their available frequency range, it will be possible to achieve unlimited-vocabulary speech recognition, independent of the acoustic microphone.

Another opportunity is to point two or more optical sensors at different locations on the speaker’s face, such as behind the ear and on the jaw in a head-mounted application. Add this to acoustic microphones with noise-cancellation and beam-forming capabilities, and speech recognition engines can benefit from an unprecedented level of speaker isolation for HMC applications plus ultra-high-quality audio for human-to-human communication.

It will also be possible to use a single optical HMC sensor for multiple high-value functions. For instance, optical sensors can perform proximity sensing, touch sensing (which would eliminate the need for buttons on wearable devices), and always-on voice-trigger functions. They also can be used to turn voice into another authentication factor for an expanding range of personalized online and mobile financial, healthcare, smart , and other secure -based services. Implementing a sensor in this way would enable system developers to replace from $10 to $20 in sensors with a single solution that leverages sensor-based interferometry for numerous applications.

The industry is moving into a new generation of capabilities in an increasingly connected world. Voice is the optimal UX, but it isn’t feasible without dramatic improvements. HMC optical sensor technology provides an important new solution while also creating opportunities for many new voice control applications moving forward.

Rammy Bahalul, Vice President, Sales and Business Development, VocalZoom.

VocalZoom

vocalzoom.com

LinkedIn: www.linkedin.com/company/vocalzoom