Defining collocation for Slovenian lexical resources
DOI:
https://doi.org/10.4312/slo2.0.2020.2.1-27
Keywords:
collocation, multiword lexical unit, word combination, Slovene, lexicography, dictionary database
Abstract
In this paper, we define the notion of collocation for use in machine-readable language resources, which will serve in the creation of electronic dictionaries and language applications for Slovene. Drawing on theoretical and lexicographically driven studies, we define collocation as a lexical phenomenon characterized by three key aspects: statistical, syntactic, and semantic. We take lexicographic relevance as the point of departure for defining collocations within the typology of word combinations, as well as for distinguishing them from free combinations. Free combinations are (frequent) syntactically valid word combinations without lexicographic value; consequently, there is no need to describe their meaning or syntactic role. Next, we distinguish collocations from all multiword lexical units (compounds, phraseological units and lexico-grammatical units) using the lexicographic view that multiword lexical units, whose meaning is not the sum of their parts, require a description of their meaning, whereas collocations do not. In the final part, we return to the three aspects of collocation and their role in the automatic extraction of collocational information from corpora. The semantic criterion, that is, the dictionary relevance of extracted collocations, has particularly exposed the problem of semantically broad collocates, such as certain types of adverbs, adjectives and verbs, and of words which feature in different syntactic roles (e.g. pronouns and adjuncts). We also discuss the particular issue of collocations involving proper names and the decisions about their inclusion in the dictionary, which are based on the evaluation of lexicographers.
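To make the statistical aspect more concrete, the following is a minimal sketch of how candidate word pairs might be scored when extracting collocations from a corpus. It is not the extraction method used in the paper. The function name `score_bigrams`, the `min_freq` threshold, and the toy token list are all hypothetical. Plain adjacency stands in for the syntactic aspect, pointwise mutual information (PMI) and logDice are used as two common association measures, and the semantic aspect (lexicographic relevance) is left to human evaluation.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_freq=2):
    """Score adjacent word pairs; returns {(w1, w2): (freq, pmi, log_dice)}.

    Assumption: adjacency is a crude stand-in for a real syntactic relation.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)  # corpus size, used to approximate probabilities
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        if f12 < min_freq:  # statistical filter: ignore rare pairs
            continue
        f1, f2 = unigrams[w1], unigrams[w2]
        # PMI: log2 of observed pair probability over the expected one
        pmi = math.log2(f12 * n / (f1 * f2))
        # logDice: 14 + log2 of the Dice coefficient (maximum value is 14)
        log_dice = 14 + math.log2(2 * f12 / (f1 + f2))
        scores[(w1, w2)] = (f12, pmi, log_dice)
    return scores


# Toy example (hypothetical data, English used for readability)
tokens = "strong tea strong tea weak tea the cup".split()
scores = score_bigrams(tokens)
for pair, (freq, pmi, log_dice) in scores.items():
    print(pair, freq, round(pmi, 3), round(log_dice, 3))
```

A ranked list of this kind only supplies candidates. Whether a frequent, syntactically valid pair is a collocation or a free combination remains a lexicographic decision.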
License
Copyright (c) 2020 Iztok Kosem, Simon Krek, Polona Gantar

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All content of Slovenščina 2.0 is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Slovenščina 2.0 applies the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license to all published material. Under this license, authors retain ownership of the copyright for their content, but allow anyone to download, reuse, reprint, modify, distribute, copy, remix, transform and/or build upon the content for any purpose, even commercial, as long as the original authors and source are cited. No permission is required from the authors or the publishers. Appropriate attribution can be provided by simply citing the original article. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. For any reuse or redistribution of a work, users must also make clear the license terms under which the work was published.
No separate publishing agreements are signed between the author and the publisher. Authors retain copyright and the publishing rights of their work without any restrictions.
Authors are permitted and encouraged to post the journal’s published version of the work online (e.g., in institutional repositories, on their own websites), with an acknowledgement of its initial publication in Slovenščina 2.0.