Practical Aspects of Using Subword Units in Slovenian-English Machine Translation

Authors

  • Gregor Donaj, University of Maribor, Faculty of Electrical Engineering and Computer Science, https://orcid.org/0000-0002-0297-2714
  • Mirjam Sepesy Maučec, University of Maribor, Faculty of Electrical Engineering and Computer Science

DOI:

https://doi.org/10.4312/slo2.0.2023.1.275-301

Keywords:

machine translation, vocabulary size, subword units, graphics processing units

Abstract

Most modern machine translation systems are based on neural network architectures. This holds for online machine translation providers, for research systems, and for tools that can assist professional translators in their work. Although neural network systems can run on the ordinary central processing units of personal computers and servers, operating at a reasonable speed requires graphics processing units. On these, however, the vocabulary size is limited, which lowers translation quality. The size of a word-level vocabulary is an especially pressing problem for highly inflected languages. We address it with subword units, which give greater coverage of the language. In this article we present several methods for splitting words into subword units with vocabularies of different sizes and compare their use in a machine translation system for the Slovenian-English language pair. The comparison also includes a translation system without word splitting. We report translation quality measured with the BLEU metric, model training speed, translation speed, and model size. We also give an overview of the practical aspects of using subword units in a machine translation system that is used together with computer-assisted translation tools.
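To make the idea of subword units concrete, the following is a minimal sketch of one widely used segmentation technique, byte-pair encoding (BPE). The abstract does not state which methods the article compares, so the choice of BPE is an assumption made purely for illustration. The toy word counts, the function names (learn_bpe, merge_pair, segment), and the end-of-word marker "</w>" are likewise hypothetical and do not come from the article.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subword units by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy data (hypothetical): inflected forms of a Slovenian stem with made-up counts.
toy_counts = {"prevod": 5, "prevoda": 3, "prevodu": 2, "prevajalnik": 4}
merges = learn_bpe(toy_counts, num_merges=10)

# "prevodom" is not in the toy data, yet it can still be represented with known units.
print(segment("prevodom", merges))
```

The number of merges controls the size of the subword vocabulary: fewer merges give shorter units and longer sequences, more merges approach whole words. A word form that never occurred in the training data can still be built from known units, which is why subword units cover a highly inflected language better than a word-level vocabulary of limited size.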

Published

12 September 2023

How to Cite

Donaj, G., & Sepesy Maučec, M. (2023). Praktični vidiki uporabe podbesednih enot v strojnem prevajanju slovenščina-angleščina. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 11(1), 275–301. https://doi.org/10.4312/slo2.0.2023.1.275-301

Issue

Vol. 11 No. 1 (2023)

Section

Articles, Part 2: Language Resources and Technologies