Linguistic data citation in Slovene scientific publications: Analysis and recommendations

Authors

  • Jakob Lenardič, University of Ljubljana, Faculty of Arts, Slovenia
  • Tomaž Erjavec, Jožef Stefan Institute, Ljubljana, Slovenia
  • Darja Fišer, University of Ljubljana, Faculty of Arts, Slovenia

DOI:

https://doi.org/10.4312/slo2.0.2020.1.1-34

Keywords:

Open Science, research data citation, language resources, Austin Principles, Slovenian journals and conference proceedings

Abstract

Open science is based on freely and openly available scientific publications and data. Open data enable the verification and improvement of previous research and, in the context of language technologies and manually annotated language resources, the training of new text processing tools. However, just like scientific publications, research data need to be properly cited: only proper citation makes experiments reproducible, and it is the most important indicator of how interesting and useful researchers' work is to the community, which in turn plays a major role in their success with research grant proposals and in their career trajectory. In this paper, we survey the landscape of linguistic data (mainly language corpora) citation in six leading Slovene scientific journals (Jezik in slovstvo, Slavistična revija, Slovenščina 2.0, Linguistica, Slovene Linguistic Studies and Jezikoslovni zapiski) and in the proceedings of two scientific conferences focused on linguistics (Jezikovne tehnologije in digitalna humanistika and Obdobja) over a seven-year period, from 2013 to 2019. We consider 1,074 papers and analyse the results both quantitatively and qualitatively. From the quantitative perspective, we show that, overall, only about a fourth of the papers use language resources, and that in the later period (2018–2019) the use of language resources is over twice as frequent as in the earlier period (2013–2017). We classify the manner of language resource citation into five categories (e.g. citing the hyperlink in the text or citing the key paper about the resource) and show that how a resource is cited depends, to a large extent, on the instructions for authors of the particular publication. Our qualitative analysis focuses mainly on resources deposited in the repository of the CLARIN.SI research infrastructure, where we show that they are, with few exceptions, incorrectly cited.
We summarise the findings using the so-called Austin Principles, show what has already been achieved within the CLARIN.SI infrastructure, and propose guidelines for citing linguistic research data, together with suggestions on how to implement them.

Published

06.08.2020

How to Cite

Lenardič, J., Erjavec, T., & Fišer, D. (2020). Linguistic data citation in Slovene scientific publications: Analysis and recommendations. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 8(1), 1–34. https://doi.org/10.4312/slo2.0.2020.1.1-34

Section

Articles
