The Role of Citizen Science in the Crowdsourced Collection of Speech Resources in Slovene
DOI:
https://doi.org/10.4312/slo2.0.2025.1.58-103
Keywords:
speech resources, speech acquisition, citizen science, crowdsourcing, spontaneous speech, spoken Slovene
Abstract
Speech resources of adequate quality are crucial for developing speech technologies and for research on spoken language, yet for spontaneously produced spoken Slovene they remain scarce because they are difficult to collect. Building speech corpora and databases is costly, so researchers increasingly recognize the potential of citizen science, which uses crowdsourcing and other methods to collect large volumes of speech data remotely and efficiently. This paper discusses the key factors (technical, financial, legal, ethical, and motivational) that must be considered when designing a sustainable and scalable system for acquiring speech recordings. Based on a literature review and on analyses of existing methods and global initiatives for collecting speech resources, we offer recommendations suited to implementation in the Slovenian context.
License
Copyright (c) 2025 Andreja Bizjak

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.