Universal Dependencies for Slovenian

New guidelines, manually annotated data and a parsing model

Authors

  • Kaja Dobrovoljc, University of Ljubljana, Faculty of Arts; Jožef Stefan Institute, Ljubljana
  • Luka Terčon, University of Ljubljana, Faculty of Computer and Information Science
  • Nikola Ljubešić, Jožef Stefan Institute, Ljubljana; University of Ljubljana, Faculty of Computer and Information Science

DOI:

https://doi.org/10.4312/slo2.0.2023.1.218-246

Keywords:

grammatically annotated corpora, dependency grammar, treebank, syntactic parsing, natural language processing

Abstract

Universal Dependencies (UD) is an internationally coordinated annotation scheme for cross-linguistically comparable morphological and syntactic annotation of texts, based on the principles of dependency grammar. Alongside more than 130 other languages, it has also been applied successfully to Slovenian. This paper presents the results of recent UD-related activities within the project Development of Slovene in a Digital Environment (Razvoj slovenščine v digitalnem okolju). Within the project, the existing infrastructure was upgraded in four ways: the UD annotation guidelines for Slovenian were revised and documented in detail; the SSJ-UD treebank of written Slovenian was extended with new sentences from the ssj500k and ELEXIS-WSD corpora; a test set was built from texts of the SentiCoref corpus for the SloBENCH web portal; and the morphological tags of the SUK and Janes-Tag reference training corpora were converted semi-automatically. A new dependency parsing model for the CLASSLA-Stanza tool was also trained on the extended SSJ-UD treebank. To support further linguistic applications, the paper evaluates this model in more detail in terms of overall parsing accuracy and the most frequent types of errors.
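
As an illustration of what "overall parsing accuracy" usually means for a UD parser, the sketch below computes the unlabeled and labeled attachment scores (UAS and LAS) from two CoNLL-U strings. It is a minimal sketch and not the evaluation procedure used in the paper: the function names and the toy sentence are hypothetical, identical tokenization in both inputs is assumed, and full relation labels (including any subtypes) are compared.

```python
def read_heads_and_relations(conllu_text):
    """Return a list of (HEAD, DEPREL) pairs for the syntactic words in a CoNLL-U string."""
    pairs = []
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        pairs.append((cols[6], cols[7]))  # columns 7 and 8: HEAD and DEPREL
    return pairs


def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS) as fractions; assumes both inputs share the same tokenization."""
    gold = read_heads_and_relations(gold_text)
    pred = read_heads_and_relations(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("inputs must be non-empty and identically tokenized")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # correct head
    las_hits = sum(g == p for g, p in zip(gold, pred))        # correct head and label
    return uas_hits / len(gold), las_hits / len(gold)


def make_row(idx, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U row (hypothetical helper for the toy example)."""
    return "\t".join([idx, form, lemma, upos, "_", "_", head, deprel, "_", "_"])


# Toy sentence "Pes laja ." (invented for illustration, not taken from SSJ-UD).
gold = "\n".join([
    make_row("1", "Pes", "pes", "NOUN", "2", "nsubj"),
    make_row("2", "laja", "lajati", "VERB", "0", "root"),
    make_row("3", ".", ".", "PUNCT", "2", "punct"),
])
# Hypothetical parser output: the head of "Pes" is right, but its label is wrong.
pred = "\n".join([
    make_row("1", "Pes", "pes", "NOUN", "2", "obj"),
    make_row("2", "laja", "lajati", "VERB", "0", "root"),
    make_row("3", ".", ".", "PUNCT", "2", "punct"),
])

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.3f}, LAS = {las:.3f}")
```

In the toy example every head is correct, so UAS is 1.0, while one of the three labels is wrong, so LAS is 2/3. Shared-task evaluation scripts additionally align differing tokenizations, which this sketch does not attempt.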

Published

12.09.2023

How to cite

Dobrovoljc, K., Terčon, L., & Ljubešić, N. (2023). Universal Dependencies za slovenščino: Nove smernice, ročno označeni podatki in razčlenjevalni model. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 11(1), 218–246. https://doi.org/10.4312/slo2.0.2023.1.218-246

Section

Articles – Part 2: Language Resources and Technologies
