A Case Study Demonstrating an Approach to the Statistical Analysis of the Variation of Multiword Expressions in Slovene Corpora
DOI:
https://doi.org/10.4312/linguistica.65.1.45-61
Keywords:
multiword expressions, multiword expression variants, statistical analysis, automatic extraction, corpora
Abstract
In Slovene linguistics, much research in phraseology has been either theoretical in nature or focused on compiling lexicographic resources for human users. Although several machine-readable lexicographic resources containing multiword expressions (MWEs) have been developed in recent years, Slovene phraseology and Slovene computational linguistics still largely proceed on separate tracks. We attempt to bridge this gap with a brief demonstration of the benefits that computational and statistical approaches based on machine-readable data can offer linguists and phraseologists. We briefly present the SUK Training Corpus of Slovene, the largest machine-readable dataset for Slovene that contains annotations of multiword expressions, as well as the Q-CAT Corpus Annotation Tool that was used to annotate it. We then extract examples of two Slovene MWEs (priti na zeleno vejo and podirati se kot hišica iz kart) from the morphosyntactically annotated Gigafida 2.1 Corpus of Written Standard Slovene using a rule-based approach that leverages syntactic structures, and perform a statistical analysis to determine the degree of variation within the extracted examples. We aim to show that machine-readable data is not only intended for developers of NLP tools but can also provide additional insight into the structure and variation of MWEs for their linguistic description.
License
Copyright (c) 2025 Jaka Čibej

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.