Collocation ranking: frequency vs semantics

Authors

  • Nikola Ljubešić, Jožef Stefan Institute, Ljubljana, Slovenia; University of Ljubljana, Faculty of Computer and Information Science, Slovenia
  • Nataša Logar, University of Ljubljana, Faculty of Social Sciences, Slovenia
  • Iztok Kosem, University of Ljubljana, Faculty of Arts, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia

DOI:

https://doi.org/10.4312/slo2.0.2021.2.41-70

Keywords:

collocations, word embeddings, logDice, general language, academic language

Abstract

Collocations play an important role in language description, especially in identifying the meanings of words. In modern lexicography, lists of collocates ranked by a statistical measure are an indispensable part of deducing meaning. In this paper, we compare two approaches to ranking collocates: (a) the logDice method, which is frequency-based and currently the dominant approach, and (b) a newer, semantics-based method that uses fastText word embeddings. The comparison was made on two Slovene datasets: one consisting of general-language headwords and their collocates, the other of headwords and their collocates extracted from a corpus of language for special purposes. The evaluation had two parts: in the quantitative part, we used supervised machine learning with a support-vector machine (SVM) algorithm, scored by the area under the ROC curve (AUC); in the qualitative part, lexicographers evaluated the rankings produced by the two methods. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach ranked collocates better than the frequency-based one, lexicographers in most cases considered the collocate lists produced by the two methods very similar.
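
To make the frequency-based side of the comparison concrete, the following is a minimal sketch of logDice as it is commonly defined, 14 + log2(2·f_xy / (f_x + f_y)), together with a plain-Python reading of the AUC score as the share of (good, bad) collocate pairs that a ranking orders correctly. The counts, collocate names, and good/bad judgements are hypothetical and exist only for illustration. The sketch does not reproduce the fastText features or the SVM model used in the paper.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice score from the joint frequency f_xy and the two marginal frequencies."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def roc_auc(scores, labels):
    """AUC as the fraction of (good, bad) pairs where the good item scores higher.

    Ties between a good and a bad item count as half a win.
    """
    good = [s for s, label in zip(scores, labels) if label == 1]
    bad = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if g > b else 0.5 if g == b else 0.0
        for g in good
        for b in bad
    )
    return wins / (len(good) * len(bad))


# Hypothetical data for one headword:
# (collocate, joint frequency, headword frequency, collocate frequency, judged good?)
candidates = [
    ("collocate_a", 120, 5000, 800, 1),
    ("collocate_b", 40, 5000, 20000, 0),
    ("collocate_c", 15, 5000, 60, 1),
    ("collocate_d", 300, 5000, 900000, 0),
]

scores = [log_dice(f_xy, f_x, f_y) for _, f_xy, f_x, f_y, _ in candidates]
labels = [good for *_, good in candidates]

# Rank collocates from the highest to the lowest logDice score.
ranking = sorted(zip(scores, (c[0] for c in candidates)), reverse=True)
auc = roc_auc(scores, labels)

for score, collocate in ranking:
    print(f"{collocate}\t{score:.2f}")
print(f"AUC: {auc:.2f}")
```

Any other scoring function, for instance one derived from word embeddings, could be passed through the same `roc_auc` helper, which is what makes AUC a convenient common yardstick for comparing ranking methods.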

Published

29.12.2021

How to Cite

Ljubešić, N., Logar, N., & Kosem, I. (2021). Collocation ranking: frequency vs semantics. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 9(2), 41–70. https://doi.org/10.4312/slo2.0.2021.2.41-70

Section

Articles