Computing Reviews
A multi-modal approach for determining speaker location and focus
Siracusa M., Morency L., Wilson K., Fisher J., Darrell T. Multimodal interfaces (Proceedings of the 5th international conference, Vancouver, British Columbia, Canada, Nov 5-7, 2003), 77-80. 2003. Type: Proceedings
Date Reviewed: Mar 1 2004

This paper describes research done at MIT to develop a computer system that can locate a speaker in a scene and determine to whom they are speaking. The authors do not describe an intended application, but we can guess there will be uses in security systems, automated TV studios, robotic systems, or clandestine spy activities.

The system includes dual stereo cameras and microphones in a fixed position. To calibrate the system, a person speaks directly to the cameras and microphones. The audio is used to determine which person is speaking. The video is used to identify the speaker, by detecting moving lips, and to determine to whom they are speaking.

The direction of the speaker is derived by sampling the one- to six-kilohertz (kHz) band of each microphone independently. The system then computes a time-difference correlation between the microphone signals to determine speaker direction. The video can be used to locate a face, to locate face movements derived from speech, and to determine a face’s orientation (speaking direction). The system can then correlate the speaking face and the sound. The computations are described using statistical formulas for the correlation of Gaussian density and distribution functions derived from the audio and video inputs. It would have been helpful if the authors had included a graphical representation of the distribution functions, to demonstrate the microphone and video inputs and the formula output.
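
To make the audio step concrete, the following is a minimal sketch of band-limited time-difference estimation between two microphones. It is not the authors' implementation: the function names, the 16 kHz sample rate, the FFT-mask band-pass filter, the far-field geometry, and the 343 m/s speed of sound are all assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def band_limit(signal, sample_rate, low_hz=1000.0, high_hz=6000.0):
    """Keep only the low_hz..high_hz band by zeroing the other FFT bins."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def estimate_delay(sig_a, sig_b, sample_rate):
    """Return the delay of sig_b relative to sig_a, in seconds.

    The delay is taken as the lag that maximizes the cross-correlation
    of the two band-limited signals.
    """
    a = band_limit(sig_a, sample_rate)
    b = band_limit(sig_b, sample_rate)
    corr = np.correlate(b, a, mode="full")
    # Index len(a) - 1 of the full correlation corresponds to zero lag.
    lag = int(np.argmax(corr)) - (len(a) - 1)
    return lag / sample_rate


def delay_to_angle(delay_s, mic_spacing_m):
    """Convert a delay to a bearing in radians, assuming a far-field source."""
    ratio = np.clip(SPEED_OF_SOUND * delay_s / mic_spacing_m, -1.0, 1.0)
    return float(np.arcsin(ratio))
```

Given two equal-length recordings, `estimate_delay(mic1, mic2, 16000)` returns the estimated arrival-time difference, and `delay_to_angle` turns that into an approximate bearing for a hypothetical microphone spacing. A real system would likely need sub-sample interpolation and some robustness to reverberation, which this sketch omits.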

The research results show that the system performs satisfactorily if the speaker stands out from any clutter and is facing at least somewhat toward the camera. Future research will seek to enhance the system using a speech recognizer and a human facial model.

Reviewer: Neil Karl
Review #: CR129163 (0408-0970)
 
General (I.0)
 

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2023 ThinkLoud®