Ranking Collocates in a List: Frequency versus Semantics

Authors

  • Nikola Ljubešić, Jožef Stefan Institute, Ljubljana; University of Ljubljana, Faculty of Computer and Information Science
  • Nataša Logar, University of Ljubljana, Faculty of Social Sciences
  • Iztok Kosem, University of Ljubljana, Faculty of Arts; Jožef Stefan Institute, Ljubljana

DOI:

https://doi.org/10.4312/slo2.0.2021.2.41-70

Keywords:

collocations, word embeddings, logDice, general language, academic language

Abstract

Collocations play a very important role in language description, especially in identifying the meanings of words. For this reason, lists of collocates ranked by one of the statistical association measures have become an indispensable part of sense division in modern lexicography. This paper compares two approaches to ranking collocates: (a) logDice, a well-established frequency-based method, and (b) word embeddings, a newer method based on machine learning and lexical semantics. We compared the results of the two approaches on two Slovene datasets, one containing headwords and their collocations from general language, the other containing headwords and their collocations from academic language. The results were evaluated in two ways: in the quantitative part, we ran supervised machine learning with a support vector machine (SVM) classifier, evaluated by the area under the ROC curve (ROC AUC); in the qualitative part, lexicographers assessed the rankings produced by both approaches. The findings are not clear-cut: while the quantitative evaluation showed that the approach based on machine learning and distributional semantics produced better collocate rankings than the frequency-based approach, the lexicographers mostly judged the collocate lists produced by the two approaches to be very similar.
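To make the two quantitative notions in the abstract concrete, here is a minimal Python sketch of frequency-based ranking with logDice and of the ROC AUC statistic. It uses the commonly cited definition logDice = 14 + log2(2·f_xy / (f_x + f_y)). The function names, the toy frequencies, and the candidate labels are hypothetical illustrations, not the authors' data or pipeline, and the sketch does not reproduce the paper's SVM or word-embedding setup.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of headword and collocate
    f_x:  frequency of the headword
    f_y:  frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(headword_freq, candidates):
    """Rank collocates by logDice, highest score first.

    candidates maps a collocate to (co-occurrence freq, collocate freq).
    """
    scores = {
        collocate: log_dice(f_xy, headword_freq, f_y)
        for collocate, (f_xy, f_y) in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive item is scored
    above a random negative item (ties count as one half)."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy counts for one headword with frequency 1000.
toy_candidates = {"a": (50, 200), "b": (50, 5000), "c": (5, 10)}
ranking = rank_collocates(1000, toy_candidates)
```

In this toy setting, a ranking method can be compared with another by scoring the same candidates with each and computing `roc_auc` against labels that mark which candidates appear in a manually compiled collocation list.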

Published

29 December 2021

Issue

Vol. 9 No. 2 (2021)

Section

Articles

How to cite

Razvrščanje kolokatorjev v seznam: pogostost proti semantiki [Ranking collocates in a list: frequency versus semantics]. (2021). Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 9(2), 41–70. https://doi.org/10.4312/slo2.0.2021.2.41-70
