
Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework

Published online by Cambridge University Press: 07 September 2015

JOSEF ROBERT NOVAK, NOBUAKI MINEMATSU and KEIKICHI HIROSE
Affiliation: The University of Tokyo, Graduate School of Information Science and Technology, Tokyo, Japan
e-mails: [email protected], [email protected], [email protected]

Abstract

This paper provides an analysis of several practical issues in the theory and implementation of Grapheme-to-Phoneme (G2P) conversion systems built on the Weighted Finite-State Transducer (WFST) paradigm, addressing system accuracy, training time, and practical implementation. The focus is on joint n-gram models, which have proven to provide an excellent trade-off between accuracy and training complexity. The paper argues for simple, productive approaches to G2P that balance training time, accuracy, and model complexity. It also introduces the first instance of using joint-sequence RNNLMs directly for G2P conversion, and achieves new state-of-the-art performance via ensemble methods that combine RNNLMs with n-gram based models. In addition to detailed descriptions of the approach, minor yet novel implementation solutions, and experimental results, the paper introduces Phonetisaurus, a fully functional, flexible, open-source, BSD-licensed G2P conversion toolkit that leverages the OpenFst library. The work is intended to be accessible to a broad range of readers.
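
To make the joint-sequence idea concrete, the following minimal Python sketch scores candidate pronunciations with a bigram model over "graphone" tokens, i.e. joint (grapheme-chunk, phoneme-chunk) units. The toy aligned lexicon and the add-one smoothing are illustrative assumptions only; the actual toolkit learns the alignments with EM over a many-to-many alignment lattice and compiles smoothed n-gram models (e.g. Kneser-Ney) into a WFST for decoding.

# A minimal sketch of the joint-sequence ("graphone") n-gram idea behind
# Phonetisaurus-style G2P. The tiny pre-aligned lexicon and the bigram
# model with add-one smoothing are illustrative assumptions, not the
# paper's actual pipeline.
from collections import defaultdict
import math

# Each entry is a sequence of graphones: (grapheme chunk, phoneme chunk).
# "_" stands in for an epsilon (deleted) phoneme.
aligned_lexicon = [
    [("r", "R"), ("igh", "AY"), ("t", "T")],           # right
    [("n", "N"), ("igh", "AY"), ("t", "T")],           # night
    [("r", "R"), ("i", "AY"), ("t", "T"), ("e", "_")]  # rite
]

BOS, EOS = ("<s>", "<s>"), ("</s>", "</s>")

# Count graphone unigrams and bigrams over the joint token sequences.
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for entry in aligned_lexicon:
    tokens = [BOS] + entry + [EOS]
    for prev, cur in zip(tokens, tokens[1:]):
        unigrams[prev] += 1
        bigrams[(prev, cur)] += 1

vocab_size = len({t for e in aligned_lexicon for t in e} | {BOS, EOS})

def log_prob(sequence):
    """Add-one smoothed bigram log-probability of a graphone sequence."""
    tokens = [BOS] + sequence + [EOS]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log(
            (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        )
    return total

# Score two candidate joint segmentations of the unseen word "rite":
# the model prefers cand_a because its graphone bigrams were observed.
cand_a = [("r", "R"), ("i", "AY"), ("t", "T"), ("e", "_")]
cand_b = [("r", "R"), ("i", "IH"), ("t", "T"), ("e", "_")]
print(log_prob(cand_a), log_prob(cand_b))

In the toolkit itself this scoring is not done in Python: the n-gram model over graphones is converted to a weighted transducer, an input word is composed with it, and the best pronunciation is read off the shortest path.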

Type: Articles
Copyright: © Cambridge University Press 2015

