Speech Processing

Speech processing, and speech recognition in particular, is a challenging problem where neural networks have a lot to offer. They have been widely used for many tasks and subtasks of speech processing.

Speech Synthesis: NETtalk

One of the first neural network applications in speech processing was NETtalk, developed by Sejnowski and Rosenberg (1987). It is an MLP trained with the backpropagation algorithm to pronounce English text. The input layer consists of seven modules, one per letter in the input window, each with 29 nodes (26 for the letters of the alphabet plus three for punctuation), giving 7 × 29 = 203 input nodes. The network has 80 hidden nodes and 26 output nodes, which encode phonemes.
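The layer sizes above can be made concrete with a small sketch. This is not the original NETtalk code: the sigmoid activation, the bias terms, the weight range, and all names (`init_layer`, `layer_forward`, `forward`) are assumptions chosen for illustration, and the network here is untrained.

```python
import math
import random

WINDOW = 7     # letters seen at once
SYMBOLS = 29   # 26 letters + 3 punctuation symbols
HIDDEN = 80
OUTPUTS = 26
INPUTS = WINDOW * SYMBOLS  # 7 * 29 = 203 input nodes


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); the small uniform range is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_forward(inputs, layer):
    """Apply one fully connected sigmoid layer to a list of inputs."""
    weights, biases = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_layer, output_layer):
    """Input -> hidden -> output, as in a two-layer MLP."""
    return layer_forward(layer_forward(inputs, hidden_layer), output_layer)


rng = random.Random(0)
hidden_layer = init_layer(INPUTS, HIDDEN, rng)
output_layer = init_layer(HIDDEN, OUTPUTS, rng)

activations = forward([0.0] * INPUTS, hidden_layer, output_layer)
# Weights plus biases (biases are an assumption of this sketch).
n_parameters = INPUTS * HIDDEN + HIDDEN + HIDDEN * OUTPUTS + OUTPUTS
print(len(activations), n_parameters)  # prints: 26 18426
```

Even at this size the network has fewer than twenty thousand adjustable parameters, which is part of why it could be trained with plain backpropagation on the hardware of the time.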

The input text "slides" past the network from left to right, as through a window. The desired output is the phoneme that corresponds to the letter at the center of the window. The outputs drive a speech generator that pronounces the predicted phonemes. The network was trained on 1024 words from an English phoneme source. After only 50 epochs the accuracy on the training data was 95%; when tested, an accuracy of 78% was achieved.
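A minimal sketch of the sliding window, assuming one-hot coding of each letter. The three punctuation symbols (space, period, comma), the space padding at the text boundaries, and the function names are assumptions; the source states only that 29 symbols are used.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
PUNCTUATION = " .,"               # assumed choice of the three extra symbols
SYMBOLS = ALPHABET + PUNCTUATION  # 29 symbols in total
WINDOW = 7
HALF = WINDOW // 2                # three letters of context on each side


def one_hot(ch):
    """Encode one symbol as 29 nodes with a single active node.

    Raises ValueError for characters outside the 29-symbol set.
    """
    vec = [0] * len(SYMBOLS)
    vec[SYMBOLS.index(ch)] = 1
    return vec


def windows(text):
    """Yield (center_letter, 203-element input vector) for each position."""
    padded = " " * HALF + text.lower() + " " * HALF
    for i in range(len(text)):
        window = padded[i:i + WINDOW]
        vec = [bit for ch in window for bit in one_hot(ch)]
        yield window[HALF], vec


examples = list(windows("a cat."))
center, vec = examples[2]
print(center, len(vec), sum(vec))  # prints: c 203 7
```

Each training example pairs one such 203-element vector with the target phoneme of the center letter, so a text of n characters yields n examples.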

Speech Recognition

Neural networks can be used for the pattern matching phase as well as for the language analysis phase. The signal processing may be performed in a standard way: a sampling frequency of 22,050 Hz, a 256-point FFT, and 26 Mel-scale cepstrum coefficients obtained for each 11.6 ms segment of the speech signal. Consecutive time segments overlap by 50%.
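The segment length follows from the other two figures: 256 samples at 22,050 Hz last 256 / 22,050 ≈ 11.6 ms, and a 50% overlap means a new segment starts every 128 samples. The sketch below shows only this framing step; the FFT and Mel-scale cepstrum computation that would follow are not shown, and the names `frame_signal`, `FRAME_LEN`, and `HOP` are assumptions.

```python
SAMPLE_RATE = 22_050     # Hz
FRAME_LEN = 256          # samples per segment, matching the 256-point FFT
HOP = FRAME_LEN // 2     # 50% overlap: a new segment every 128 samples


def frame_signal(samples, frame_len=FRAME_LEN, hop=HOP):
    """Split a list of samples into overlapping fixed-length segments.

    Trailing samples that do not fill a whole segment are dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE
one_second = [0.0] * SAMPLE_RATE          # placeholder signal (silence)
frames = frame_signal(one_second)
print(round(frame_ms, 1), len(frames))    # prints: 11.6 171
```

With 26 coefficients per segment, one second of speech would therefore be represented by roughly 171 feature vectors of 26 values each.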

Some of the most commonly used connectionist models for speech recognition are the MLP, the SOM, time-delay networks, and recurrent networks. Which one is used depends on the type of recognition performed, for example whole-word recognition or subword recognition such as phoneme recognition.