Development of automatic speech recognition model for energy facilities

V.D. Nechaev, S.V. Kosyakov

Vestnik IGEU, 2023 issue 4, pp. 94—100

Abstract in English: 

Background. Automatic speech recognition models for specialized subject areas, in particular for energy facilities, are currently built on deep neural network architectures that require large amounts of training data. At the same time, the resulting models often prove poorly suited to specific information systems because they recognize highly specialized domain vocabulary badly. Further training a model to improve its quality in a specific recognition context runs into the difficulty of obtaining enough data and the labor intensity of annotating it. An urgent task, therefore, is to create methods that reduce the effort of developing applied speech recognition models and improve their quality in subject domains, in particular in the energy sector.

Materials and methods. Topic modeling of texts based on language models is applied to adapt open data to the target domain. A pretrained deep neural network serves as the speech recognition model. Open datasets are used for training.
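
For intuition, the data-adaptation step can be pictured as topic-filtering an open corpus. The sketch below is a minimal, hypothetical illustration assuming BERTopic with an MPNet sentence encoder (tools that appear in the paper's references); the corpus path and the query phrase are placeholders, not the authors' actual configuration.

```python
# A hypothetical sketch of topic-based selection of open text data.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Transcripts from an open speech corpus, one utterance per line (placeholder path).
with open("open_corpus_transcripts.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]

# Embed the transcripts with a pretrained MPNet language model and cluster
# them into topics (BERTopic applies UMAP and HDBSCAN internally).
encoder = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=encoder)
topics, _ = topic_model.fit_transform(docs)

# Find the topics closest to the target domain and keep only their documents
# as the intermediate training subset. The query phrase is illustrative.
domain_topics, _ = topic_model.find_topics("power grid substation transformer", top_n=5)
domain_docs = [doc for doc, t in zip(docs, topics) if t in domain_topics]
print(f"selected {len(domain_docs)} of {len(docs)} transcripts")
```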

Results. A method for developing automatic speech recognition models for specialized subject areas has been proposed. It adds an intermediate training stage on subject-area vocabulary, using open data selected by topic-based sampling. Following the method, the authors have developed and studied an automatic speech recognition model for energy facilities. It has shown better recognition quality than models obtained by traditional methods.
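
One plausible realization of the intermediate training stage is fine-tuning a publicly available pretrained checkpoint on the topic-sampled data. The sketch below assumes the NVIDIA NeMo toolkit and a Conformer CTC model (both referenced by the paper); the manifest path, checkpoint name, and hyperparameters are illustrative, not the authors' exact setup.

```python
# A hedged sketch of intermediate fine-tuning on domain data with NeMo.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Load a publicly available pretrained ASR model (placeholder checkpoint name).
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_en_conformer_ctc_small"
)

# Point the model at the topic-sampled manifest (NeMo JSON-lines format:
# {"audio_filepath": ..., "duration": ..., "text": ...} per line).
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "energy_domain_train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))

# Run a short intermediate training pass on the domain subset.
trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
model.set_trainer(trainer)
trainer.fit(model)
```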

Conclusions. Testing of the proposed method has confirmed its effectiveness. The applied neural network model built with the method has demonstrated that it can operate in the information systems of energy facilities, in both Russian and English, without additional training on proprietary data.
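
One way such effectiveness could be checked numerically is to score the baseline and adapted models by word error rate on the same utterances and compare the per-utterance errors with a paired t-test, which the paper references. The snippet below uses invented toy data purely for illustration.

```python
# A toy illustration of comparing a baseline and an adapted model: word
# error rate per utterance, then a paired t-test over the per-utterance
# scores. All strings here are invented examples, not the paper's data.
import jiwer
from scipy.stats import ttest_rel

refs = [
    "включить выключатель на подстанции",  # Russian domain phrase
    "check the transformer oil level",     # English domain phrase
]
baseline_hyps = [
    "включить выключатель на станции",
    "check the transform oil level",
]
adapted_hyps = list(refs)  # the adapted model recognizes both correctly

wer_baseline = [jiwer.wer(r, h) for r, h in zip(refs, baseline_hyps)]
wer_adapted = [jiwer.wer(r, h) for r, h in zip(refs, adapted_hyps)]

# Paired t-test: with a real test set (hundreds of utterances) this asks
# whether the adapted model's per-utterance WER is significantly lower.
stat, p_value = ttest_rel(wer_baseline, wer_adapted)
print(f"baseline WER={sum(wer_baseline)/2:.2f}, "
      f"adapted WER={sum(wer_adapted)/2:.2f}, p={p_value:.3f}")
```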

Key words in English: 
automatic speech recognition models, machine learning, thematic modelling methods, neural network, language model
DOI: 
10.17588/2072-2672.2023.4.094-100