Below, Ford explains how voice recognition technology works on a phone, tablet or car infotainment system. Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana and Ford’s SYNC infotainment system all rely on voice recognition technology.
1. It’s not really about the sound; it’s about the sound wave that is produced when we say something
A sound is created through tiny changes in air pressure, and it enters our ears as one continuous sound wave. Computers cannot hear the way people do, so they need a way to “hear” the words that are said and turn them into text. When a sound enters one of our devices, its computer measures the sound wave at one point in time, stores that measurement, then measures again, repeating this many times a second for as long as the sound lasts. The result: the sound you made is now digitized in a form the computer can work with. This process has to be very precise, which is one reason our smart devices sometimes mistake what we say. If the computer detects a gap in the wave, what gets measured may not be correct.
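The measure-store-repeat cycle described above is commonly called sampling. Here is a minimal sketch of the idea, not the method any particular device uses. The function names, the 16,000 measurements per second and the 16-bit storage format are all assumptions chosen for illustration, and a pure 440 Hz tone stands in for a spoken sound.

```python
import math

def sample_wave(wave, duration_s, sample_rate_hz):
    """Measure a continuous wave at evenly spaced points in time."""
    n_samples = round(duration_s * sample_rate_hz)
    return [wave(i / sample_rate_hz) for i in range(n_samples)]

def quantize(samples, bits=16):
    """Store each measurement as a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1
    # Clip to the range [-1.0, 1.0] before scaling so values cannot overflow
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]

def tone(t):
    """A 440 Hz sine wave standing in for a spoken sound (illustrative only)."""
    return math.sin(2 * math.pi * 440 * t)

# Measure 10 milliseconds of the wave, 16,000 times per second (assumed rate)
samples = sample_wave(tone, duration_s=0.01, sample_rate_hz=16000)
digital = quantize(samples)
print(len(digital), digital[:4])
```

Each number in `digital` is one stored measurement. Taking measurements more often, or storing them with more bits, captures the wave more faithfully at the cost of more data.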
2. The sound of a word vs. the sound of something else
Once a sound is recorded digitally, the computer uses algorithms to figure out which sounds it has to pay attention to. To determine whether chunks of digitized sound are actually words, rather than noise from a car engine or a radio, it applies a series of mathematical operations that separate speech from everything else.
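As a rough illustration of what one such mathematical operation might look like, the sketch below splits the digitized sound into short chunks and flags the chunks whose average energy is above a threshold. This is a deliberately simple stand-in. The function names, the chunk size and the threshold are assumptions, and an energy check alone can only tell loud from quiet. Separating speech from a loud engine or radio would take richer measurements than this.

```python
import math

def frame_energy(frame):
    """Average of the squared sample values in one chunk of sound."""
    return sum(s * s for s in frame) / len(frame)

def detect_active_frames(samples, frame_size=160, threshold=0.01):
    """Return one boolean per full chunk: True if its energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_energy(frame) > threshold)
    return flags

# Illustrative input: two quiet chunks followed by two louder ones
quiet = [0.01 * (-1) ** i for i in range(320)]
loud = [0.5 * math.sin(2 * math.pi * 200 * i / 16000) for i in range(320)]

print(detect_active_frames(quiet + loud))
```

Only the chunks flagged `True` would be passed on to the later recognition steps.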
3. Same word, different accents
Voice recognition works by breaking speech up into small segments called phonemes. English alone has about 40 different phonemes. The computer is trained to recognize what each speech segment looks like digitally, but a given segment does not always look the same. For instance, sounds vary with accent and with placement in a word, and words that sound identical can have different spellings (e.g. “to” vs. “two” vs. “too”). Based on a dictionary word list and contextual relationships, the computer in your gadget can make an educated guess at what you’re saying. So, if your friend Mary is in your contact list, the command “call Mary” is linked to “Mary” and not “merry”.
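The “call Mary” example can be sketched as a lookup followed by a context check. Everything here is hypothetical: the tiny pronunciation table, the phoneme spellings and the rule that a name from the contact list wins after the word “call” are assumptions made for illustration, not a description of how any real assistant decides.

```python
# Hypothetical pronunciation table: one phoneme sequence, several spellings.
# The phoneme notation is illustrative; in many accents these words sound alike.
HOMOPHONES = {
    "T UW": ["to", "two", "too"],
    "M EH R IY": ["merry", "Mary"],
}

def pick_word(phonemes, previous_word, contacts):
    """Choose a spelling for a phoneme sequence using simple context."""
    candidates = HOMOPHONES.get(phonemes, [])
    if not candidates:
        return None
    # Assumed context rule: after "call", prefer a name in the contact list
    if previous_word == "call":
        for word in candidates:
            if word in contacts:
                return word
    # Otherwise fall back to the first dictionary entry
    return candidates[0]

contacts = {"Mary", "Ahmed"}
print(pick_word("M EH R IY", "call", contacts))
print(pick_word("M EH R IY", "very", contacts))
```

With “call” as the previous word, the contact name is chosen. In any other context the sketch falls back to the first dictionary spelling.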
4. Predicting what the next word in a sentence might be
A single stream of speech can match many different word combinations, simply because many phonemes sound similar to one another when said quickly. Sometimes the result is a wacky sequence of words that doesn’t really make sense. To avoid this, the computer system applies models based on how people actually talk to estimate how likely one word is to follow another.
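One simple way to estimate how likely one word is to follow another is a bigram model, which scores each pair of neighbouring words. The sketch below assumes that approach. The probabilities are invented for illustration, whereas a real system would learn them from large amounts of actual speech or text, and the names used are hypothetical.

```python
import math

# Toy word-pair probabilities, invented for illustration only.
# "<s>" marks the start of a sentence.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.1,
    ("nice", "beach"): 0.05,
}
UNSEEN_PROB = 1e-6  # small assumed probability for word pairs not in the table

def sequence_log_prob(words):
    """Sum the log-probabilities of each word given the word before it."""
    log_prob = 0.0
    previous = "<s>"
    for word in words:
        log_prob += math.log(BIGRAM_PROB.get((previous, word), UNSEEN_PROB))
        previous = word
    return log_prob

def best_hypothesis(hypotheses):
    """Return the candidate word sequence with the highest score."""
    return max(hypotheses, key=sequence_log_prob)

# Two candidate transcriptions of similar-sounding speech
hypotheses = [
    ["wreck", "a", "nice", "beach"],
    ["recognize", "speech"],
]
print(best_hypothesis(hypotheses))
```

Logarithms are used so that multiplying many small probabilities becomes a sum, which avoids the numbers shrinking toward zero on long sentences.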
5. Presenting the best result as quickly as possible
Once all the calculations are done and the guesses are made, the computer can finally present its best result, whether by showing it on a screen, selecting from a pre-set menu or giving a spoken response.
With faster, more accurate technology now available, voice-activated commands are making our lives better in myriad ways. Although at times it may seem like your device is just out to annoy you with its bizarre answers, consider all the tedious calculations and complex transformations it has to do behind the scenes to recognize a single word, let alone an entire sentence.