19 - Automatic Speech Recognition by Machines

from Section IV - Audition and Perception

Published online by Cambridge University Press: 11 November 2021

Rachael-Anne Knight
Affiliation: City, University of London
Jane Setter
Affiliation: University of Reading

Summary

Building machines to converse with human beings through automatic speech recognition (ASR) and understanding (ASU) has long been a topic of great interest for scientists and engineers, and we have recently witnessed rapid technological advances in this area. Here, we first cast the ASR problem as a pattern-matching and channel-decoding paradigm. We then discuss the Hidden Markov Model (HMM), the most successful technique for modelling fundamental speech units, such as phones and words, so that ASR can be solved as a search through a top-down decoding network. Recent advances using deep neural networks as parts of an ASR system are also highlighted. We then compare the conventional top-down decoding approach with the recently proposed automatic speech attribute transcription (ASAT) paradigm, which can better leverage knowledge sources in speech production, auditory perception and language theory through bottom-up integration. Finally, we discuss how the processing-based speech engineering and knowledge-based speech science communities can work collaboratively to improve our understanding of speech and enhance ASR capabilities.

Type
Chapter
Information
Publisher: Cambridge University Press
Print publication year: 2021

