Computing Reviews, the leading online review service for computing literature.

Speech synthesis and recognition
Holmes J., Holmes W., Taylor & Francis, Inc., Bristol, PA, 2002. 304 pp. Type: Book (9780748408573)

Date Reviewed: Jan 3 2003

Over the past 30 years, the field of speech processing has seen a consolidation of advanced techniques in signal processing and statistical modeling, which has resulted in widespread availability of speech synthesis and recognition products. This book represents a guide to this technology that is as coherent and comprehensive as is possible in a volume of this size. The emphasis of the book is on the engineering, computing, and to a certain extent, the phonetics background needed to build speech synthesis and recognition engines, as opposed to aspects of incorporating speech functionality in the user interface.

The book consists of three main parts: mechanisms and models of human speech communication, speech synthesis, and speech recognition. This core is complemented by chapters that contain a survey of current applications, a discussion of current technology, and a discussion of future directions. There is also a summary and a set of exercises at the end of each chapter, which I found very appropriate. The summary and exercises help the reader regain perspective before starting a new chapter; the exposition is quite dense at times, notably in the first four chapters, and the reader might otherwise lose sight of the global picture.

The book starts with an explanation of the complexity of speech communication, taking the reader from the basics of phonetics and phonology to acoustic and electronic models of speech production and perception via the physiology of the human production and auditory systems. In addition to clear explanations of the basic concepts needed for later, more technology-oriented chapters, chapters 1, 2, and 3 contain enough information to give the reader a good idea of the multidisciplinary nature of the problem. The text often links the discussion of current technology with its historical background.
An interesting example of this style can be seen in the section on spectrograms, where the authors describe the operation of a spectrograph, a device used before cheap computing power made the computation of Fourier transforms the preferred method for producing spectrograms.

Before moving on to speech synthesis, the authors review the main methods of coding speech in digital form. This review covers simple waveform coders, vocoders, and intermediate systems, as well as the basics of speech coding evaluation.

The presentation of speech synthesis proceeds in a bottom-up fashion: low-level speech production techniques are introduced first (chapters 5 and 6), and then the overall architecture of a typical text-to-speech system is discussed (chapter 7). Chapter 5 describes concatenative synthesis, or synthesis of messages by concatenation of stored human speech. The discussion ranges from early systems, based on concatenation of word-size units, to the currently dominant implementation paradigm: concatenation of short waveform segments by means of pitch-synchronous overlap-add (PSOLA) techniques. Chapter 6 discusses an alternative to concatenative synthesis that consists of using acoustic-phonetic rules to drive a formant synthesizer. Although synthesis by rule is rarely used in practical text-to-speech systems these days, chapter 6 is still relevant, since synthesis by rule has a more flexible and conceptually transparent architecture than its concatenative rival. Chapter 7 concludes speech synthesis by discussing synthesis from textual, and to a lesser extent, conceptual input. This chapter introduces an architecture for text-to-speech systems, and briefly describes the natural language processing modules that it encompasses, including dictionary lookup, and morphological and syntactic analysis.

As with synthesis, the introduction to automatic speech recognition (ASR) is situated in a historical context.
After a brief mention of early (unsuccessful) rule-based attempts, chapter 8 describes speech recognition by template matching, with emphasis on explaining the general principles that also underlie the more powerful methods used today. The chapter also covers the basics of signal processing, introduces distance metrics, and describes dynamic time warping, an instance of the dynamic programming techniques used in modern ASR and essentially what makes efficient single-word matching possible. This discussion sets the stage for the prevailing ASR paradigm of the past 20 years: probabilistic pattern matching by hidden Markov models (HMMs).

Chapter 9 introduces HMMs, along with the dynamic programming techniques widely used in HMM recognition and parameter estimation, such as the Viterbi algorithm and Baum-Welch re-estimation. Chapter 10 extends chapter 8’s description of signal processing techniques used in the first stages of ASR, known as front-end analysis. The part on ASR concludes with chapters on large-vocabulary recognition and advanced performance improvement techniques, and some discussion of the use of neural networks in ASR.

The final chapters provide an overview of applications of speech technology, and the authors’ appraisal of future research directions. The closing chapter provides a useful guide for further reading.

I believe a chapter on evaluation should have been included in the text; it would be a natural complement to the speech synthesis and recognition sections, and tie in closely with the initial chapters. The book is nevertheless a good first text for advanced students with a serious interest in speech technology.

Reviewer: Saturnino Luz	Review #: CR126813 (0303-0234)

Speech Recognition And Synthesis (I.2.7 ... )
