Speech recognition, sometimes referred to as "automatic speech recognition (ASR)" or "voice recognition," isn't a new technology — researchers at Bell Labs pursued ASR with funding from the defense department as far back as 1936. But because human language is so complex, decades passed before computer systems sophisticated enough to decode speech with a high degree of accuracy became a practical reality.
In recent years, however, speech recognition software has become increasingly popular as companies like Google and Apple have made the technology a marquee feature of modern smartphones — the most prevalent personal computers ever. In just the past few weeks, with the introduction of Siri, a voice-controlled artificial intelligence system on the iPhone 4S, and Android 4.0, which promises real-time voice-transcription anywhere you can type, speech recognition has exploded into the mainstream like never before.
It seems obvious now that the future is one where you will be able to talk to your devices and have them understand you and respond accordingly. But how is this even possible? Below, a brief, comprehensible primer on how computers learn the language.
Before a computer can figure out what you're saying, it has to receive the most basic sounds you're making in a way it can understand. Speech recognition begins with a simple microphone and a computer's analog-to-digital converter, which receives audio input and converts it to a string of ones and zeros. The computer measures the incoming sound wave at frequent, regular intervals and filters out variation that carries no meaning, such as differences in volume or background noise.
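To make the conversion step concrete, here is a minimal sketch in Python. It is an illustration only: the 16,000-samples-per-second rate, the 16-bit depth and the function names are assumptions chosen for the example, and a pure tone stands in for a real voice.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Measure a pure tone at regular intervals and round each
    measurement to a signed integer, roughly what a converter does."""
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16-bit audio
    count = round(duration_s * sample_rate)    # number of measurements
    return [
        round(max_level * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(count)
    ]

def normalize(samples):
    """Scale samples so the loudest one has magnitude 1.0, so that a
    shout and a whisper of the same sound look alike."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s / peak for s in samples]

samples = sample_and_quantize(freq_hz=440, duration_s=0.01)
print(len(samples))                          # number of samples taken
print(format(samples[1] & 0xFFFF, "016b"))   # one sample as ones and zeros
print(max(abs(s) for s in normalize(samples)))
```

Real systems do considerably more to deal with background noise; the normalization here only removes differences in overall loudness.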
From there, the speech recognition software breaks up the audio signal into very small segments of just a few hundredths or even thousandths of a second. In these small bits of sound, the computer looks for phonemes — the most basic elements of a language. A phoneme is one of a fixed number of sounds used to pronounce all the words in a given language. There are between 40 and 50 phonemes used in English, and all of these will have been installed into the memory of the speech recognition program.
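The slicing step might look something like the sketch below. The 25-millisecond segment length and 10-millisecond step are assumed values picked for illustration, not figures from any particular product, and the function name is hypothetical.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Cut a list of audio samples into short, overlapping segments.

    frame_ms is the length of each segment and hop_ms is how far the
    window slides each time; both values are illustrative assumptions.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per segment
    hop_len = sample_rate * hop_ms // 1000       # samples between starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

one_second_of_audio = [0] * 16000   # placeholder for real samples
frames = split_into_frames(one_second_of_audio)
print(len(frames), len(frames[0]))
```

Each of these short segments is what the software would then examine for evidence of a particular phoneme.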
Once phonemes have been identified, the software compares these against a vast database of tens of thousands of words either pre-installed in its memory or stored in the cloud. Using this database, it begins to work out which words are contained in the original audio signal.
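As a rough illustration of that lookup, the sketch below matches a run of phonemes against a tiny, made-up pronunciation dictionary. The three entries, the phoneme spellings and the greedy longest-match strategy are all assumptions for the example; a real system holds tens of thousands of words and weighs many candidate matches at once.

```python
# Hypothetical three-word dictionary: phoneme sequence -> word.
PRONUNCIATIONS = {
    ("K", "AE", "T"): "cat",
    ("B", "AE", "T"): "bat",
    ("HH", "EH", "L", "OW"): "hello",
}

def phonemes_to_words(phonemes):
    """Greedily match the longest known phoneme sequence at each position."""
    longest = max(len(key) for key in PRONUNCIATIONS)
    words = []
    i = 0
    while i < len(phonemes):
        for length in range(min(longest, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + length])
            if key in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[key])
                i += length
                break
        else:
            # No dictionary entry starts here: mark it and move on.
            words.append("<unknown>")
            i += 1
    return words

print(phonemes_to_words(["HH", "EH", "L", "OW", "K", "AE", "T"]))
```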
In order for a computer to comprehend a wide variety of people using different accents and speaking patterns with any degree of accuracy, it has to rely on context and probability. In other words, the computer has to be able to discern between what it thinks it heard and what was most likely said.
To do this, modern speech recognition programs use a number of complex algorithms and statistical models developed by language scientists. The most popular is the Hidden Markov Model, which strings phonemes together into words and phrases based on statistical likelihood. In a Hidden Markov Model, the likelihood of each phoneme is weighed against the one that came before it. The most likely chain of phonemes is used to determine the word, and the most likely chain of words is used to determine the phrase. Because there are literally trillions of possible word combinations in a language, doing this detective work (and fast) typically requires quite a bit of processing power.
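One standard way to find the most likely chain in a Hidden Markov Model is the Viterbi algorithm, sketched below on a toy problem. Every number here is invented for illustration: three phonemes, two kinds of sound ("stop" and "vowel"), and hand-picked probabilities. In this toy, K and T sound identical, so only the surrounding context can tell them apart.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Compare each candidate with the state that came before it.
            back = max(states, key=lambda p: prev[p][0] * trans_p[p][s])
            column[s] = (prev[back][0] * trans_p[back][s] * emit_p[s][obs], back)
        best.append(column)
    # Start from the most likely final state and walk backward.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Toy model with made-up probabilities.
states = ["K", "AE", "T"]
start_p = {"K": 0.6, "AE": 0.2, "T": 0.2}
trans_p = {
    "K": {"K": 0.1, "AE": 0.8, "T": 0.1},
    "AE": {"K": 0.1, "AE": 0.2, "T": 0.7},
    "T": {"K": 0.3, "AE": 0.4, "T": 0.3},
}
emit_p = {
    "K": {"stop": 0.8, "vowel": 0.2},
    "AE": {"stop": 0.1, "vowel": 0.9},
    "T": {"stop": 0.8, "vowel": 0.2},
}

print(viterbi(["stop", "vowel", "stop"], states, start_p, trans_p, emit_p))
```

With these numbers the first and last sounds are acoustically ambiguous, yet the algorithm settles on K, AE, T ("cat") because that chain of transitions is the most probable overall.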
The better a computer knows its audience, the better it will be at understanding them. To increase accuracy, speech recognition systems created for specific markets, like hospitals or customer support centers, are programmed to be particularly sensitive to certain kinds of terms or phrases that are common to the industry.
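One simple way to picture that kind of tuning is to boost candidate transcriptions that contain industry terms. The sketch below is an assumption about how such a bias could work, not a description of any vendor's method; the candidate scores, the term list and the boost factor are all invented.

```python
def rescore(candidates, domain_terms, boost=2.0):
    """Multiply each candidate's score by `boost` once per domain term it
    contains, then rescale so the scores add up to 1 again."""
    weighted = {}
    for text, score in candidates.items():
        hits = sum(1 for word in text.split() if word in domain_terms)
        weighted[text] = score * (boost ** hits)
    total = sum(weighted.values())
    return {text: weight / total for text, weight in weighted.items()}

# Hypothetical scores from the acoustic step: a near tie.
candidates = {
    "the patient has a cute angina": 0.55,
    "the patient has acute angina": 0.45,
}
medical_terms = {"patient", "acute", "angina"}

rescored = rescore(candidates, medical_terms)
print(max(rescored, key=rescored.get))
```

Here the second phrase starts out slightly less likely, but it overtakes the first once the medical vocabulary is taken into account.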
Advanced consumer-grade speech recognition software, like that used by Siri in the new iPhone, becomes increasingly familiar with its individual user over time. So if you have a habit of speaking in slang terms or running your words together when you talk, the program can retain this information and become better at deciphering your future questions, comments and ramblings.
The recent breakthroughs in speech recognition technology are helping computers to figure out not only what we're saying, but what we really mean. And as the number of people using the technology increases, the technology itself will improve, accommodating more and more of the nuances in our ever-evolving language.
The history of interfacing with computers is one defined by intermediaries: the keyboard, the mouse, the stylus, the touchscreen. Modern speech recognition promises a future where we'll be able to perform complex computing tasks merely by wishing them out loud.