Speech recognition and synthetic speech

A computer does not understand natural language. Sound and text are easy to convert to bits that the computer can read, but conveying the meaning of words and sentences is difficult. Language and speech technology develops methods and computer programs that improve computers’ capacity to make sense of natural language and to process it as people do.

We do not know exactly how the human brain processes language. Problems in computational linguistics (speech recognition, processing and production) have been tackled by making use of computers’ greatest advantage: their ability to compute. Computational linguists do not try to imitate language processing in the human brain; instead, they concentrate on developing a computationally efficient model of language.

When we speak, we don’t produce separate words, but a flow of words merged together. A speech recognition system needs a way to observe sound and the ability to recognize the words it contains. Usually, a speech recognition device takes the sound it has observed and compares it to speech samples stored in a speech database. Recognition is more likely to succeed if the samples compared are short. If they are whole sentences, recognition is almost impossible. Some speech recognition devices can pick out certain command words from the speech stream, provided they have been programmed to recognize these words and carry out the commands.
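The comparison of an observed sound against stored samples can be illustrated with a minimal sketch. It assumes that each recording has already been reduced to a short sequence of numbers (real systems use much richer acoustic features), and it uses dynamic time warping as one possible way to compare sequences of different lengths. The article does not name a specific method, so the technique, the template values, and the function names `dtw_distance` and `recognize` are all illustrative assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers.

    The warping lets a slowly spoken word match a quickly spoken template.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # stretch the template
                cost[i][j - 1],      # stretch the observation
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(observed, templates):
    """Return the command word whose stored sample is closest to the observation."""
    return min(templates, key=lambda word: dtw_distance(observed, templates[word]))


# Hypothetical one-dimensional "feature" sequences standing in for stored samples.
TEMPLATES = {
    "stop": [1.0, 3.0, 4.0, 2.0],
    "go": [5.0, 6.0, 5.0],
}

observed = [1.0, 1.0, 3.0, 4.0, 4.0, 2.0]  # "stop" spoken more slowly
print(recognize(observed, TEMPLATES))  # prints "stop"
```

The table of stored samples is deliberately small. Every additional or longer template increases the amount of comparison work, which is one way to see why short command words are easier to match than whole sentences.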

Even when a speech recognition program recognizes the words, interpreting their meaning may be difficult. People interpret words as parts of sentence structures, and those structures are not always unambiguous. The meaning of the same words may also change depending on the stress, tone, context, or the speaker’s facial expression. “Rock” may refer to music or to stone. An animated model of a talking head can be used to illustrate how a person interprets speech not only from the voice but also from the speaker’s facial expressions. People understand the meaning of what they hear because they can compare it to their existing knowledge of the world. A computer does not have this knowledge.
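As a rough illustration of how context can help choose between the two senses of “rock”, the sketch below counts how many context words in a sentence overlap with a small hand-written cue list for each sense. The cue words, the sense labels, and the function name `guess_sense` are assumptions made for this example; a real system would need far more knowledge than a word list.

```python
# Hypothetical cue words for two senses of "rock".
SENSE_CUES = {
    "music": {"band", "guitar", "concert", "listen"},
    "stone": {"mountain", "hard", "climb", "granite"},
}


def guess_sense(sentence, sense_cues):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears, i.e. the context does not help.
    """
    words = set(sentence.lower().split())
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("The band played rock at the concert", SENSE_CUES))
print(guess_sense("We had to climb a hard rock", SENSE_CUES))
print(guess_sense("I like rock", SENSE_CUES))
```

The last call returns `None`: without surrounding context, the word list gives the program nothing to go on, whereas a person could often still rely on world knowledge or the speaker’s tone.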

The specific features of languages also affect the requirements set for speech technology. In the Finnish language, the affixes added to the root of the word can change the word’s meaning; so can the word order of the sentence and the context. In addition, words and expressions are accompanied by associations and symbolic meanings. In other words, language is open to many interpretations.
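To make the role of affixes concrete, here is a toy sketch that splits a Finnish word into a stem and a chain of suffixes. It assumes a tiny hand-written stem list and suffix table, and it ignores real-world complications such as stem changes, so it is only an illustration of the idea, not a description of how any actual morphological analyzer works. The names `SUFFIXES`, `STEMS`, and `segment` are hypothetical.

```python
# Tiny hand-written tables; a real analyzer needs a full lexicon and rules.
SUFFIXES = {
    "ni": "my",       # possessive suffix
    "ssa": "in",      # inessive case
    "sta": "out of",  # elative case
}
STEMS = {"talo": "house"}


def segment(word, stems, suffixes):
    """Split a word into (stem, [suffix, ...]), or return None if no split fits."""
    if word in stems:
        return word, []
    for suffix in suffixes:
        if word.endswith(suffix):
            rest = segment(word[: -len(suffix)], stems, suffixes)
            if rest is not None:
                stem, found = rest
                return stem, found + [suffix]
    return None


print(segment("talossa", STEMS, SUFFIXES))    # "in the house"
print(segment("talossani", STEMS, SUFFIXES))  # "in my house"
print(segment("talosta", STEMS, SUFFIXES))    # "out of the house"
```

A single stem thus yields many surface forms, which means that a recognizer working with whole-word samples would need a very large vocabulary to cover them.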

A speech synthesizer does not understand the meaning of the language it produces; it just parrots the words mechanically. Most speech synthesizers have a database that contains all of the language’s sounds as they appear in various combinations. Language technology develops rules and techniques that the machines can use as a basis for combining these sounds.
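The idea of a database of sounds in combination can be sketched as follows. The example assumes that the database stores one short recording for each transition between two neighbouring sounds, and that synthesis simply joins those recordings end to end. The unit inventory, the sample values, the `_` symbol for silence, and the function name `synthesize` are invented for illustration; real synthesizers also smooth the joins and adjust pitch and duration.

```python
# Hypothetical database: one short "recording" (a list of samples) for each
# transition between two neighbouring sounds. "_" stands for silence.
UNIT_DB = {
    ("_", "k"): [0.0, 0.1],
    ("k", "a"): [0.3, 0.5],
    ("a", "t"): [0.4, 0.2],
    ("t", "_"): [0.1, 0.0],
}


def synthesize(sounds, unit_db):
    """Join the stored units for each pair of neighbouring sounds."""
    padded = ["_"] + list(sounds) + ["_"]
    samples = []
    for left, right in zip(padded, padded[1:]):
        unit = unit_db.get((left, right))
        if unit is None:
            raise KeyError(f"no recorded unit for {left!r} -> {right!r}")
        samples.extend(unit)
    return samples


print(synthesize(["k", "a", "t"], UNIT_DB))
```

Nothing in this procedure involves meaning: the program looks up and joins units, which is the sense in which a synthesizer repeats words mechanically.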

Speech recognition and synthesis are even more challenging because everything must occur quickly, keeping pace with natural speech.

Practical applications of speech recognition and synthetic speech include the user interfaces of mobile phones and computers, as well as various telephone services. In these applications, language technology makes the interaction between people and machines easier and more natural.