Language identification in the spoken term detection system for the Kazakh language in a multilingual environment

Authors

  • Zh. Kozhirbayev L.N.Gumilyov Eurasian National University, Astana city
  • Zh. Yessenbayev National Laboratory Astana, Astana city
  • A. Sharipbay L.N.Gumilyov Eurasian National University, Astana city

Keywords:

Language identification, Long Short-Term Memory Recurrent Neural Networks, Automatic Speech Recognition, Spoken Term Detection

Abstract

Processing big data is currently one of the most important tasks in the IT industry, and audio
materials are one of the main sources of such data. As the volume of audio information grows,
effective systems are needed for retrieving information from audio materials, such as spoken
term detection (STD). Since audio data can be in different languages, it is essential to
recognize the language of each recording. Automatic language identification (LID) is the task
of automatically determining the language spoken in a speech sample. Recent progress in signal
processing and in areas such as pattern recognition, machine learning and neural networks has
improved the performance of LID. In this work we applied state-of-the-art Long Short-Term
Memory (LSTM) recurrent neural networks (RNNs) to raw audio features in order to identify
audio samples in the Kazakh language. An LSTM network is a type of RNN that uses special units
alongside standard ones. These LSTM units contain a "memory cell" that can retain information
for long periods of time. With LID, an STD system can select audio materials in Kazakh and thus
avoid spending computing resources on audio data in other languages. We report the results of
automatic speech recognition, spoken term detection and language identification experiments,
with the LSTM RNN evaluated on 1 s, 2 s and 3 s segments of audio samples in the Kazakh language.
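
The role of LID as a gate in front of the STD system can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the averaging of per-segment scores, and the 0.5 threshold are all assumptions made for illustration, and `score_segment` is a hypothetical stand-in for a trained LSTM RNN that returns the probability that a segment is Kazakh.

```python
from typing import Callable, Dict, List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: float) -> List[Sequence[float]]:
    """Cut a waveform into consecutive fixed-length segments.

    A trailing piece shorter than one full segment is dropped.
    """
    size = int(sample_rate * segment_seconds)
    if size <= 0:
        raise ValueError("segment length must be positive")
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def is_kazakh(samples: Sequence[float], sample_rate: int,
              segment_seconds: float,
              score_segment: Callable[[Sequence[float]], float],
              threshold: float = 0.5) -> bool:
    """Decide whether a recording is Kazakh from per-segment LID scores.

    Averaging the scores and comparing to a threshold is an assumed
    decision rule, chosen only to keep the sketch short.
    """
    segments = split_into_segments(samples, sample_rate, segment_seconds)
    if not segments:
        return False  # recording too short to score
    mean_score = sum(score_segment(s) for s in segments) / len(segments)
    return mean_score >= threshold


def filter_for_std(recordings: Dict[str, Sequence[float]], sample_rate: int,
                   segment_seconds: float,
                   score_segment: Callable[[Sequence[float]], float]) -> List[str]:
    """Return the names of recordings that should be passed on to STD."""
    return [name for name, samples in recordings.items()
            if is_kazakh(samples, sample_rate, segment_seconds, score_segment)]
```

Only recordings returned by `filter_for_std` would then be indexed and searched, so no STD computation is spent on audio in other languages. Calling it with `segment_seconds` set to 1, 2 or 3 corresponds to the segment lengths mentioned in the abstract.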

Published

2019-01-14