Computing Reviews
Analysis and modeling of F0 contours for Cantonese text-to-speech
Li Y., Lee T., Qian Y. ACM Transactions on Asian Language Information Processing 3(3): 169-180, 2004. Type: Article
Date Reviewed: Feb 24 2005

The fundamental frequency (F0) of human speech is the critical factor in creating synthetic speech with natural prosody, that is, the temporal and rhythmic properties of an utterance that make speech sound natural rather than robotic. Mechanical techniques do a fairly good job of synthesizing intelligible speech by imitating local intonation or tone contours, but the result does not sound truly natural because these techniques do not model F0 contours at the phrase and sentence level. Telephone robots have become much more sophisticated, but they have not become more natural sounding.

Li, Lee, and Qian take on an interesting challenge by developing a text-to-speech system for Cantonese, a Chinese dialect with many tones. In Indo-European languages, intonation is employed mainly to convey emotion; in monosyllabic, isolating languages like the Chinese dialects, however, it conveys lexical and, to some extent, syntactic information. If the authors’ technique works for Cantonese, it seems likely to work for any language: Cantonese has nine tones in all, of which three are entering tones, while the other six occur throughout an utterance. Fukienese, with its 13 tones, might be an interesting stress test as well.

The authors capture the change in F0 over an utterance as a phrase curve, and local (syllabic) intonation is delimited by any break that exceeds a given length. While the beginning and end frequencies of each of the six tones may vary over the phrase, their ratio (interval) remains constant, so the phrase curve can be determined by linear regression over the converted tone heights. The authors analyzed 1,200 utterances, comprising 4,937 intonation phrases, to develop a Cantonese text-to-speech system called CUTalk, which consists of three modules: text analysis, acoustic synthesis, and prosody generation.
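
The regression idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tone ratios, the 0.3-second pause threshold, the dictionary keys, and the function names are all hypothetical placeholders chosen for illustration, and reading the "break" as a pause is an assumption. The sketch splits an utterance into phrases at long pauses, divides each syllable's F0 by an assumed tone-specific ratio to obtain a common tone height, and fits a least-squares line through those heights.

```python
# Hypothetical tone-to-reference ratios (placeholders, NOT values from the paper).
HYPOTHETICAL_TONE_RATIO = {1: 1.30, 2: 1.10, 3: 1.00, 4: 0.80, 5: 0.90, 6: 0.85}


def split_phrases(syllables, max_pause=0.3):
    """Start a new phrase when the pause before a syllable exceeds max_pause seconds.

    The threshold value is an assumption made for this sketch.
    """
    phrases, current = [], []
    for syl in syllables:
        if current and syl["pause_before"] > max_pause:
            phrases.append(current)
            current = []
        current.append(syl)
    if current:
        phrases.append(current)
    return phrases


def fit_phrase_curve(phrase):
    """Fit a least-squares line through tone-normalized F0 heights.

    Returns (slope, intercept), where x is the syllable position 0, 1, 2, ...
    """
    # Convert each observed F0 to a common tone height using the assumed ratio.
    heights = [s["f0"] / HYPOTHETICAL_TONE_RATIO[s["tone"]] for s in phrase]
    n = len(heights)
    if n < 2:
        raise ValueError("need at least two syllables to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(heights) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (h - y_mean) for i, h in enumerate(heights))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean
```

A negative slope from fit_phrase_curve would correspond to a phrase curve that declines over the phrase; whether the authors fit in Hz, on a log scale, or in some other unit is not stated in this review.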

Subjective tests of the system used sentences taken from local newspapers, with native speakers rating naturalness on a scale from one to five. The results showed a marked improvement in the generation of natural spoken Chinese by a computer, but also revealed some opportunities for additional improvement. This always seems to be the result of trying to analyze or synthesize human language mechanically: we discover that language is even more complex than we thought, and we learn as much about the natural language as we learn about computation for linguistic applications.

Reviewer:  P. C. Patton Review #: CR130856 (0509-1051)
Speech Recognition And Synthesis (I.2.7 ... )
 
 
Pattern Analysis (I.5.2 ... )
 
