Speech Recognition
From Isopedia
Speech recognition, also known as automatic speech recognition, computer speech recognition, or erroneously as voice recognition, is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. More simply, it is the ability of machines to respond to spoken commands.
History of Speech Recognition
Early Efforts - Mechanical Synthesis
The earliest efforts to produce synthetic speech began over two hundred years ago. In St. Petersburg in 1779, Professor Christian Kratzenstein explained the physiological differences between five long vowels and made an apparatus to produce them artificially. He constructed acoustic resonators similar to the human vocal tract and activated the resonators with vibrating reeds, as in musical instruments.
In 1791, in Vienna, Wolfgang von Kempelen introduced an "Acoustic-Mechanical Speech Machine" which could produce single sounds and some combinations of sounds. His machine had a pressure chamber for the lungs, a vibrating reed which acted as vocal cords, and a leather tube for vocal tract action. By manipulating the shape of the tube, he could produce different vowel sounds. Kempelen received negative publicity and was not taken seriously because other inventions of his had proved to be fraudulent. However, his machine did lead to new theories regarding human vocalization.
In the mid-1800s, Charles Wheatstone constructed a version of von Kempelen's speaking machine which could produce vowels and most consonant sounds. He could even produce some full words. Alexander Graham Bell, inspired by Wheatstone's machine, constructed a similar machine of his own.
Research and experiments with mechanical vocal systems continued until the 1960s, but with little success. It was not until the advent of electrical synthesizers that voice recognition truly got its start.
1930s-'40s - Homer Dudley and the Vocoder
The first experiments in voice encoding were conducted in 1928 by engineer Homer Dudley at AT&T's Bell Labs. This work led to a machine called the Voder and to the Vocoder, whose name is derived from "voice encoder". Dudley patented his invention in 1935. He, along with fellow engineers Riesz and Watkins, produced the first electronic speech synthesizer in 1936. It was demonstrated at the 1939 World's Fairs by experts who used a keyboard and foot pedals to play the machine and emit speech. The Vocoder was originally developed as a speech coder for telecommunications applications in the 1930s, the idea being to code speech for transmission. In this fashion, it was used for secure radio communication, where voice was digitized, encrypted, and then transmitted on a narrow, voice-bandwidth channel.
Most early research in voice encoding and speech recognition was funded and performed by universities and the U.S. Government, primarily by the military and the Defense Advanced Research Projects Agency. Dudley's Vocoder was used in the SIGSALY system, built by Bell Labs engineers in 1943. This system was used for encrypting high-level communications for the Allies during World War II. After the '30s and early '40s, there was little improvement on Dudley's Vocoder and speech recognition.
1950s-'60s - Synthesizers
In 1951, Franklin Cooper developed a Pattern Playback synthesizer at Haskins Laboratories. It reconverted recorded spectrogram patterns into sounds, either in original or modified forms. The patterns were recorded optically on a transparent belt. A spectrogram displays the frequency spectrum of a compound signal: it is a three-dimensional plot of the energy of the signal's frequency content as it changes over time.
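The idea of a spectrogram can be illustrated with a minimal sketch. The code below is only an illustration, not how Pattern Playback worked: it assumes a digitized signal, and the function name `spectrogram`, the frame size, the hop size, and the 1 kHz test tone are all hypothetical choices made for this example.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-frequency-bin energies for each time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        energies = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform of bin k (slow, but fine for a demo).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]

spec = spectrogram(tone)
# The strongest bin in the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
print(len(spec), "frames; strongest frequency:", peak_hz, "Hz")
```

Each row of `spec` is a moment in time, each column a frequency, and each value an energy, which gives the three dimensions described above.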
In 1953, Walter Lawrence introduced the first formant synthesizer, PAT (Parametric Artificial Talker). It consisted of three electronic formant resonators connected in parallel. A buzz or noise was used as input, and a moving glass slide converted painted patterns into six time functions to control the three formant frequencies, voicing amplitude, fundamental frequency, and noise amplitude. It was the first successful synthesizer to describe the reconstruction process in terms of vocal tract resonances.
At the same time, Gunnar Fant introduced the first cascade formant synthesizer, called OVE I (Orator Verbis Electris), which consisted of resonators connected in cascade. In 1962, Fant and his colleague Martony introduced the OVE II synthesizer, which had separate parts to model the transfer function of the vocal tract for vowels, nasals, and consonants. These synthesizers led to the OVE III and GLOVE projects.
1970s to Today - The HMM Model and the Commercial Market
In the early 1970s, Lenny Baum of Princeton University invented the Hidden Markov Modeling approach to speech recognition. This is a statistical model that outputs a sequence of symbols or quantities and is used to match patterns. This approach became the basis for modern speech recognition and was adopted by all leading speech recognition companies. Baum shared his invention with several Advanced Research Projects Agency contractors, including IBM.
In 1971, DARPA (Defense Advanced Research Projects Agency) established the Speech Understanding Research program to develop a computer system that could understand continuous speech. Lawrence Roberts spent $3 million per year of government funds on the program for five years. This led to the establishment of many Speech Understanding Research groups and was the largest speech recognition project ever.
In 1978, Texas Instruments introduced a popular toy called "Speak and Spell". It used a speech chip which led to huge strides in the development of more human-like digitally synthesized sound.
In 1982, Dragon Systems was founded by doctors Jim and Janet Baker. It has a long history of speech and language technology innovations and patents. In 1984, SpeechWorks, which is the leading provider of over-the-phone automated speech recognition, was founded.
In 1995, Dragon Systems released word dictation-level speech recognition software, which was the first time dictation speech recognition technology was available to consumers. IBM and Kurzweil soon followed the trend.
In 1996, Charles Schwab devoted resources towards developing the program Voice Broker, becoming the first company to do so. The program allowed up to 360 simultaneous customers to call in and get quotes on stocks and options, and it handled 50,000 requests daily. It was 95% accurate and set the stage for many other companies to follow in its footsteps. BellSouth launched the first voice portal, called Val. A voice portal is a type of web portal that can be accessed entirely by voice, and it is used by both consumers and businesses.
In 1997, Dragon introduced "Naturally Speaking" which was the first "continuous speech" dictation software available, meaning that you no longer needed to pause between words for the computer to understand what was being said.
In 1998, Lernout and Hauspie bought Kurzweil. Microsoft invested $45 million in Lernout and Hauspie to form a partnership, eventually allowing Microsoft to use their speech recognition technology in their systems. In 1999, Microsoft acquired Entropic, giving them access to the "most accurate speech recognition system in the world".
In 2000, Lernout and Hauspie acquired Dragon Systems for $460 million. In the same year, TellMe introduced the first world-wide voice portal.
In 2001, ScanSoft acquired Lernout and Hauspie along with their speech and language assets. In 2003, ScanSoft also acquired SpeechWorks and closed a deal to distribute and support IBM desktop products that employ speech recognition.
Today, Microsoft and Alcatel-Lucent both hold patents in speech recognition, and the two are currently in dispute.
Technical Information About Speech Recognition
Early speech recognition tried to apply grammatical and syntactical rules to speech: when the words fit a certain set of rules, the software could work out what was being said. But human language has so many variations that this type of software did not achieve a very high level of accuracy. Today, speech recognition systems use statistical modeling, which relies on mathematical functions and probability.
Common Steps in Most Speech Recognition Software
1. Audio recording, or capture of the voice
2. Pre-filtering, normalization, banding, etc.
3. Framing and windowing: chopping the data into smaller pieces
4. Filtering: each frame is filtered to isolate the sounds of the voice
5. Comparison and matching: taking the sound segments and matching them against the system's dictionary of words
6. Action: performing the action called for by the recognized word
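Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the method of any particular product: the helper names (`frame_signal`, `hamming`, `frame_energy`), the frame and hop sizes, the energy threshold, and the synthetic signal are all hypothetical, and a simple energy check stands in for the real filtering a recognizer would do.

```python
import math

def frame_signal(samples, frame_size, hop):
    """Step 3: chop the signal into overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def hamming(frame):
    """Step 3: taper the frame edges with a Hamming window."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def frame_energy(frame):
    """Average squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

# Hypothetical input: 80 samples of silence followed by 80 samples of a tone.
signal = [0.0] * 80 + [math.sin(2 * math.pi * 0.1 * n) for n in range(80)]

frames = [hamming(f) for f in frame_signal(signal, frame_size=40, hop=20)]
# Step 4 (simplified): keep only frames loud enough to contain voice.
voiced = [frame_energy(f) > 0.01 for f in frames]
print(voiced)
```

Overlapping the frames (a hop smaller than the frame size) is a common choice so that a sound falling on a frame boundary is still seen whole in a neighboring frame.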
Different Types of Speech Recognition
Isolated Word
These systems usually have "listening" and "not listening" states, and the user must wait between the utterance of each word for the system to process the request.
Connected Words
This system is similar to the isolated word system, but it allows the user to run utterances together without waiting between them.
Continuous Speech
This allows the user to speak naturally while the computer works out what is being said. These systems are very hard to build, since the system must use special methods to determine where one utterance ends and the next begins.
Spontaneous Speech
This system is like continuous speech recognition, but it can also cope with fillers such as "um", "hmmm", and "ahh", which allows it to handle a variety of natural speech features.
Voice Verification/Identification
These systems have the ability to tell their different users apart. This is not the same as a security feature or dedicated verification software.
Small Vocabulary/Many-User Software
These systems are usually used for telephone answering systems. The software does not need to distinguish many words, since it only has to recognize a small vocabulary. The advantage is that users' voices can have different accents, pitches, and patterns and the system will still be able to recognize the command. But because the vocabulary is small, the commands the program can carry out are limited to predetermined options or numbers.
Large Vocabulary/Limited-User Software
These systems are usually used in a business environment where the users are few. The more users such a system has, the less accurate the software becomes. Usually this software's accuracy is around 85 percent, but it can fall drastically with the addition of another user. These systems can have a dictionary of over 10,000 words, which usually has to be trained by the user to work properly.
With systems made over ten years ago, the user could choose between a discrete system and a continuous system: with a continuous system you could talk normally, while with a discrete system you had to pause between each word spoken. Since then almost all software has become continuous, since it was found that users did not like to pause between every word.
Hidden Markov Model
The Hidden Markov Model is a system of processes that are linked together to make a word chain. The chain can branch off into different sounds, with a probability assigned to each phoneme; the software then uses its dictionary and these probabilities to find the correct word.
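One standard way to find the most probable chain through a Hidden Markov Model is the Viterbi algorithm; the article does not say which algorithm any given product uses, so the sketch below is only an illustration. The two hidden phonemes ("s" and "z"), the acoustic labels ("hiss" and "buzz"), and every probability are made-up values for this example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the probability and path of the best chain ending in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the previous state that makes arriving in s most probable.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical toy model: hidden phonemes and the sounds they tend to produce.
states = ("s", "z")
start_p = {"s": 0.6, "z": 0.4}
trans_p = {"s": {"s": 0.7, "z": 0.3}, "z": {"s": 0.4, "z": 0.6}}
emit_p = {"s": {"hiss": 0.8, "buzz": 0.2}, "z": {"hiss": 0.3, "buzz": 0.7}}

prob, path = viterbi(["hiss", "buzz", "buzz"], states, start_p, trans_p, emit_p)
print(path, prob)
```

A real recognizer would work with many more states and with probabilities learned from recorded speech, but the branching-and-scoring structure is the same.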
Here is an example from Howstuffworks.com, in which two phrases are broken down into phonemes that the software might take for the same words.
r eh k ao g n ay z s p iy ch
recognize speech
r eh k ay n ay s b iy ch
wreck a nice beach
For this reason the user must train the software so that it can pick out the right phrase as the user talks; the choice is usually based on word patterns and the spacing between words.
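Choosing between two similar-sounding phrases on the basis of word patterns can be sketched with a simple bigram model, which scores each phrase by how likely each word is to follow the previous one. This is one common technique, offered here as an assumption rather than as the method the article's sources describe; the probability table, the `UNSEEN` fallback value, and the function name `log_score` are hypothetical.

```python
import math

# Hypothetical bigram probabilities P(word | previous word).
# "<s>" marks the start of a sentence. The numbers are invented for illustration.
BIGRAMS = {
    ("<s>", "recognize"): 0.002,
    ("recognize", "speech"): 0.05,
    ("<s>", "wreck"): 0.0005,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}
UNSEEN = 1e-6  # assumed fallback probability for word pairs not in the table

def log_score(words):
    """Sum of log probabilities of each word given the word before it."""
    score = 0.0
    prev = "<s>"
    for word in words:
        score += math.log(BIGRAMS.get((prev, word), UNSEEN))
        prev = word
    return score

candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=log_score)
print(" ".join(best))
```

Logarithms are used so that multiplying many small probabilities does not underflow; with these made-up numbers the first phrase scores higher.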
The Conversion of Speech to Data
The software program has to go through many steps in order to create text on the screen from the words that the user has spoken.
When a person speaks, they create vibrations in the air. An analog-to-digital converter translates these into data that the computer can understand by digitizing the sound wave, taking measurements at many closely spaced intervals. The software then removes unwanted sounds and may separate the signal into different bands of frequency. Next, it normalizes the speech to one constant volume. The software then cuts the signal into small pieces, down to thousandths of a second, to find common sounds within words such as "s" or "p". Finally, it matches these sounds to phonemes, which are the smallest elements of a language, and puts them together into a meaningful expression.
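The first stages described above, measuring the wave at regular intervals and bringing it to a constant volume, can be sketched as follows. This is a simplified illustration only: the function names, the 8 kHz sample rate, the 8-bit depth, and the quiet 440 Hz test tone are assumptions made for this example, and a mathematical function stands in for the microphone.

```python
import math

def sample_and_quantize(wave, duration_s, sample_rate, bits=8):
    """Measure the wave at regular intervals and round to integer levels."""
    max_level = 2 ** (bits - 1) - 1
    count = round(duration_s * sample_rate)
    return [round(wave(n / sample_rate) * max_level) for n in range(count)]

def normalize(samples, target_peak):
    """Scale the samples so the loudest one reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]

def quiet_tone(t):
    """Hypothetical 'analog' input: a quiet 440 Hz tone at quarter volume."""
    return 0.25 * math.sin(2 * math.pi * 440 * t)

digital = sample_and_quantize(quiet_tone, duration_s=0.01, sample_rate=8000)
leveled = normalize(digital, target_peak=1.0)
print(len(digital), "samples; loudest after leveling:",
      max(abs(s) for s in leveled))
```

After normalization a quiet speaker and a loud speaker produce signals of similar size, which makes the later matching against phonemes less sensitive to volume.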
Market Conditions
Although the market for speech recognition technology has historically been small, it has begun to boom in the last two years, doubling to 1.7 billion in the United States alone. In the past there have been many problems associated with voice recognition technology, such as the inability to handle different languages, accents, speeds of speech, and other variables that must be accounted for to recognize speech. However, it is clear that as the technology improves the market will in turn grow immensely, with countless opportunities for new advancements and new products. Speech recognition developers include TellMe Inc, Nuance, Convergys, Genesys, Aspect, IBM, Netsuite, Softel, Microsoft, and others.
Cell phones
Voice recognition software paired with cell phones has been the best-selling voice recognition software. On cell phones, voice recognition is used to dictate text messages as well as to navigate through one's speed dial. This technology has seen its greatest sales in states where laws prohibit the use of cell phones while driving. In a recent survey conducted by the Technology Manufacturing Corporation, 75% of drivers said they would be interested in voice recognition technology in cell phones while driving.
Typing and the Computer
When voice recognition software was first conceived, it was predicted that the keyboard would be a thing of the past by now. Unfortunately, the technology has not progressed as fast as many hoped. The speech recognition keyboard suffers from errors in dictation as well as from a limited vocabulary. In 2006 Microsoft was embarrassed when its latest voice recognition keyboard failed to work in a live demonstration. However, as the technology advances there is an enormous opportunity for sales: the product will ultimately save the user time, help avoid typing-related injuries such as carpal tunnel syndrome, and allow disabled consumers who cannot type to use the software. It may ultimately make secretaries, as well as keyboards, a thing of the past.
Server Based Voice Recognition
Server-based voice recognition is considered to be the most lucrative and useful aspect of voice recognition, as the user is able to navigate through large databases (such as medical records) using speech. In 2006 the market for server-based voice recognition technology to power call centers and the like reached nearly $600 million, and it is expected to double by 2009 according to Opus Research.
Other Uses for Speech Recognition
Other uses for speech recognition include GPS devices and car navigation systems, navigating MP3 players, video games, and various search engine based tools.
Key Markets
United States
The United States is currently the largest market for voice recognition software; however, the market is considered to be reaching maturity in terms of the various products available to the public, making growth difficult.
United Kingdom
The United Kingdom is the second-largest market, totaling 49 million in the last year. With the highest number of cell phone call centers in Europe and strong media interest in voice solutions, the market is poised for continued strong growth.
Europe
In Europe, the markets' long-term revenue is impeded by their small individual size; however, the Nordic region is considered to be the leader in developing new consumer voice services and technologies.
China
The market for voice technologies in China is currently minimal; however, growth is inevitable. Although uptake of mobile and voice technologies in China is currently low, China is the most populous nation in the world, and its current number of mobile phone users exceeds the population of most European countries.
Japan
In Japan, the voice recognition market has numerous barriers to entry and is difficult to navigate. However, trends show that Japanese businesses demand the highest quality products and services.
Future of Speech Recognition
Immediate Future - Improving On Current Technology
As soon as the next six to twelve months, speech recognition should reach a point where the computer is able to accurately transcribe close to 100% of what is said. This belief is backed by the principle of Moore's Law, which anticipates a substantial increase in available memory, capacity, and computing power. The key is to teach the machine vocabulary phonetically so that it can understand what it is hearing and assign specific sounds to each combination of letters. Already, speech recognition has exceeded the productivity of manual labor in some areas. For example, when there is a large quantity of data, as there is in a directory, machines are already more capable than people of sifting through it.
Further Down the Road - Human Computers
In the distant future, speech recognition may turn into speech understanding, meaning that the machines would not only recognize the words that are being said, but would also be able to understand what they mean. In fact, there are people who believe that eventually, the computers will have the capability of talking back, and even carrying on a conversation. However, even humans are imperfect, and the expectation cannot be that the machines will make no mistakes. A human scribe will sometimes mishear or mistype a sentence or two, and nothing more should be expected of these machines. They are merely assuming the responsibilities of a human assistant with an equal amount of productivity, in turn enabling human assistants to focus their time on something that machines have not been invented for yet.
The Future Overseas
The Defense Advanced Research Projects Agency is developing a program, Global Autonomous Language Exploitation, that will translate international news broadcasts and newspapers. The goal is to produce a product that can translate two languages with at least 90% accuracy. The ultimate goal, however, is a universal translator. This translator would be able to translate any language, but this product is still quite far from completion. The problem is that it is very difficult to combine speech recognition with automatic translation. Inconsistencies between languages such as slang, dialects, accents and background noises are difficult for a machine to recognize.
Troubleshooting - Roadblocks
Not surprisingly, with the tremendous increase in technological advances each year, the availability of technology for speech recognition development should not pose any real problem. The problems that speech recognition will face in the future are more business problems than technological ones. While there is a reasonable return on investment for speech recognition applications, the solutions they provide are strictly vertical, meaning they are very specific to the product and do not ensure any sense of a permanent solution. Once speech recognition technology becomes more horizontal, that is, more widely spread throughout the market, the price will be driven down and it will become a more attractive product to consumers.
Sources
http://www.tmaa.com/Meisel%20commentary%20on%20market%20segments.pdf
http://www.computerworld.com/blogs/node/193
http://www.dragon-medical-transcription.com/historyspeechrecognition.html
http://cslu.cse.ogi.edu/HLTsurvey/ch1node4.html
http://en.wikipedia.org/wiki/Speech_recognition
http://ccrma.stanford.edu/~jhw/bioauth/andre/VoiceRecognitionmktApr02.pdf
http://en.wikipedia.org/wiki/Voder
http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/chap2.html
http://www.faqs.org/docs/Linux-HOWTO/Speech-Recognition-HOWTO.html#TYPES
http://news.com.com/The+future+of+talking+computers/2008-1011_3-5090381.html
Team Members
Michelle Liptak
John Polchowski
Douglas Friedrich
Daniel Toplitt
