
Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework

Published online by Cambridge University Press: 07 September 2015

JOSEF ROBERT NOVAK, NOBUAKI MINEMATSU and KEIKICHI HIROSE
Affiliation: The University of Tokyo, Graduate School of Information Science and Technology, Tokyo, Japan
e-mails: [email protected], [email protected], [email protected]

Abstract

This paper provides an analysis of several practical issues in the theory and implementation of Grapheme-to-Phoneme (G2P) conversion systems built on the Weighted Finite-State Transducer (WFST) paradigm, addressing system accuracy, training time, and practical implementation. The focus is on joint n-gram models, which have proven to provide an excellent trade-off between accuracy and training complexity. The paper argues for simple, productive approaches to G2P that balance training time, accuracy, and model complexity. It also introduces the first instance of using joint-sequence RNNLMs directly for G2P conversion, and achieves new state-of-the-art performance via ensemble methods that combine RNNLMs with n-gram based models. In addition to detailed descriptions of the approach, minor yet novel implementation solutions, and experimental results, the paper introduces Phonetisaurus, a fully functional, flexible, open-source, BSD-licensed G2P conversion toolkit that leverages the OpenFst library. The work is intended to be accessible to a broad range of readers.
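
To make the joint-sequence idea concrete, the following minimal Python sketch scores candidate pronunciations with a bigram model over "graphone" tokens, i.e. joint (grapheme-chunk, phoneme-chunk) units. The toy aligned lexicon and the add-one smoothing are illustrative assumptions only; the actual toolkit learns the alignments with EM over a many-to-many alignment lattice and compiles smoothed n-gram models (e.g. Kneser-Ney) into a WFST for decoding.

# A minimal sketch of the joint-sequence ("graphone") n-gram idea behind
# Phonetisaurus-style G2P. The tiny pre-aligned lexicon and the bigram
# model with add-one smoothing are illustrative assumptions, not the
# paper's actual pipeline.
from collections import defaultdict
import math

# Each entry is a sequence of graphones: (grapheme chunk, phoneme chunk).
# "_" stands in for an epsilon (deleted) phoneme.
aligned_lexicon = [
    [("r", "R"), ("igh", "AY"), ("t", "T")],           # right
    [("n", "N"), ("igh", "AY"), ("t", "T")],           # night
    [("r", "R"), ("i", "AY"), ("t", "T"), ("e", "_")]  # rite
]

BOS, EOS = ("<s>", "<s>"), ("</s>", "</s>")

# Count graphone unigrams and bigrams over the joint token sequences.
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for entry in aligned_lexicon:
    tokens = [BOS] + entry + [EOS]
    for prev, cur in zip(tokens, tokens[1:]):
        unigrams[prev] += 1
        bigrams[(prev, cur)] += 1

vocab_size = len({t for e in aligned_lexicon for t in e} | {BOS, EOS})

def log_prob(sequence):
    """Add-one smoothed bigram log-probability of a graphone sequence."""
    tokens = [BOS] + sequence + [EOS]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log(
            (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        )
    return total

# Score two candidate joint segmentations of the unseen word "rite":
# the model prefers cand_a because its graphone bigrams were observed.
cand_a = [("r", "R"), ("i", "AY"), ("t", "T"), ("e", "_")]
cand_b = [("r", "R"), ("i", "IH"), ("t", "T"), ("e", "_")]
print(log_prob(cand_a), log_prob(cand_b))

In the toolkit itself this scoring is not done in Python: the n-gram model over graphones is converted to a weighted transducer, an input word is composed with it, and the best pronunciation is read off the shortest path.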

Type: Articles
Copyright: © Cambridge University Press 2015

