Computing Reviews
Speech synthesis and recognition
Holmes J., Holmes W., Taylor & Francis, Inc., Bristol, PA, 2002. 304 pp. Type: Book (9780748408573)
Date Reviewed: Jan 3 2003

Over the past 30 years, the field of speech processing has seen a consolidation of advanced techniques in signal processing and statistical modeling, which has resulted in widespread availability of speech synthesis and recognition products. This book represents a guide to this technology that is as coherent and comprehensive as is possible in a volume of this size.

The emphasis of the book is on the engineering, computing, and to a certain extent, the phonetics background needed to build speech synthesis and recognition engines, as opposed to aspects of incorporating speech functionality in the user interface. The book consists of three main parts: mechanisms and models of human speech communication, speech synthesis, and speech recognition. This core is complemented by chapters that contain a survey of current applications, a discussion of current technology, and a discussion of future directions. There is also a summary and a set of exercises at the end of each chapter, which I found very appropriate. The summary and exercises help the reader regain perspective before starting a new chapter; the exposition is quite dense at times, notably in the first four chapters, and the reader might otherwise lose sight of the global picture.

The book starts with an explanation of the complexity of speech communication, taking the reader from the basics of phonetics and phonology to acoustic and electronic models of speech production and perception via the physiology of the human production and auditory systems. In addition to clear explanations of the basic concepts needed for later, more technology-oriented chapters, chapters 1, 2, and 3 contain enough information to give the reader a good idea of the multidisciplinary nature of the problem. The text often links the discussion of current technology with its historical background. An interesting example of this style can be seen in the section on spectrograms, where the authors describe the operation of a spectrograph, a device used before cheap computing power made the computation of Fourier transforms the preferred method for producing spectrograms.

Before moving on to speech synthesis, the authors review the main methods of coding speech in digital form. This review covers simple waveform coders, vocoders, and intermediate systems, as well as the basics of speech coding evaluation. The presentation of speech synthesis proceeds in a bottom-up fashion: low-level speech production techniques are introduced first (chapters 5 and 6), and then the overall architecture of a typical text-to-speech system is discussed (chapter 7). Chapter 5 describes concatenative synthesis, or synthesis of messages by concatenation of stored human speech. The discussion ranges from early systems, based on concatenation of word-size units, to the currently dominant implementation paradigm: concatenation of short waveform segments by means of pitch-synchronous overlap-add (PSOLA) techniques. Chapter 6 discusses an alternative to concatenative synthesis that consists of using acoustic-phonetic rules to drive a formant synthesizer. Although synthesis by rule is rarely used in practical text-to-speech systems these days, chapter 6 is still relevant, since synthesis by rule has a more flexible and conceptually transparent architecture than its concatenative rival. Chapter 7 concludes the speech synthesis part by discussing synthesis from textual and, to a lesser extent, conceptual input. This chapter introduces an architecture for text-to-speech systems, and briefly describes the natural language processing modules that it encompasses, including dictionary lookup and morphological and syntactic analysis.

As with synthesis, the introduction to automatic speech recognition (ASR) is situated in a historical context. After a brief mention of early (unsuccessful) rule-based attempts, chapter 8 describes speech recognition by template matching, with emphasis on explaining the general principles that also underlie the more powerful methods used today. The chapter also covers the basics of signal processing and introduces distance metrics. It then describes dynamic time warping, an instance of the dynamic programming techniques used in modern ASR, which is essentially what makes efficient single-word matching possible. This discussion sets the stage for the prevailing ASR paradigm of the past 20 years: probabilistic pattern matching by hidden Markov models (HMMs). Chapter 9 introduces HMMs, along with the dynamic programming techniques widely used in HMM recognition and parameter estimation, such as the Viterbi algorithm and Baum-Welch re-estimation. Chapter 10 extends chapter 8’s description of the signal processing techniques used in the first stages of ASR, known as front-end analysis. The part on ASR concludes with chapters on large-vocabulary recognition and advanced performance improvement techniques, and some discussion of the use of neural networks in ASR.
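
For readers unfamiliar with template matching, the idea behind dynamic time warping can be illustrated with a minimal sketch. This is not taken from the book: the function names `dtw_distance` and `recognize` are hypothetical, the "features" are plain numbers rather than spectral frames, and the absolute difference stands in for whatever distance metric a real recognizer would use.

```python
import math


def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b.

    Assumption: each element is a single number; a real recognizer would
    compare feature vectors (e.g. spectral frames) with a suitable metric.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # local frame-to-frame distance
            # Extend the cheapest of the three allowed predecessor paths:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

With toy templates such as `{"yes": [1, 2, 3], "no": [3, 2, 1]}`, an input like `[1, 2, 2, 3]` (the "yes" pattern spoken more slowly) is matched to `"yes"`, because the warping absorbs the repeated frame at no extra cost. The table is filled in time proportional to the product of the two sequence lengths, which is what makes the exhaustive search over alignments tractable.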

The final chapters provide an overview of applications of speech technology, and the authors’ appraisal of future research directions. The closing chapter provides a useful guide for further reading.

I believe a chapter on evaluation should have been included in the text; it would be a natural complement to the speech synthesis and recognition sections, and tie in closely with the initial chapters. The book is nevertheless a good first text for advanced students with a serious interest in speech technology.

Reviewer:  Saturnino Luz Review #: CR126813 (0303-0234)
Speech Recognition And Synthesis (I.2.7 ... )