A Sentiment-Annotated Lexicon for the Bosnian Language

Authors

  • Sead Jahić, University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Koper
  • Jernej Vičič, University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Koper; ZRC SAZU, Fran Ramovš Institute of the Slovenian Language, Ljubljana

DOI:

https://doi.org/10.4312/slo2.0.2023.2.59-83

Keywords:

Bosnian lexicon, corpus, sentiment analysis, affirmative and non-affirmative words (AnA-words), stopwords, log-likelihood, annotation

Abstract

This paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation procedure and methodology are presented together with a usability study that focuses on language coverage. The initial lexicon was compiled by translating an annotated Slovenian lexicon and then manually checking the translations and annotations. Language coverage was evaluated on two reference corpora. Bosnian is still considered a low-resource language. A reference corpus built from automatically crawled web pages is available for Bosnian, but the authors found that no corpus with a clear time frame for the texts it contains is available. A corpus of contemporary texts was compiled by collecting news from several Bosnian web portals. Two methods of measuring language coverage were used in the study. The first used a frequency list of all words extracted from the two Bosnian reference corpora, while the second disregarded frequencies as the main factor in counting. The coverage computed with the first method was 19.24% for the first corpus and 28.05% for the second. The second method yields a coverage of 2.34% for the first corpus and 6.98% for the second. The results of the study show a language coverage comparable to that of known methods in the field. The usefulness of the lexicon has already been demonstrated through a comparison on Twitter.
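The two coverage measures described in the abstract can be illustrated with a minimal Python sketch. It rests on an assumed reading of the abstract, not on the authors' actual implementation: the first measure is taken to weight each word by its corpus frequency (token coverage), and the second to count each distinct word once (type coverage). The function names, the toy sentence, and the two-word lexicon are hypothetical.

```python
from collections import Counter


def token_coverage(freq, lexicon):
    """Frequency-weighted coverage: share of corpus tokens found in the lexicon."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freq.items() if word in lexicon)
    return covered / total


def type_coverage(freq, lexicon):
    """Frequency-blind coverage: share of distinct corpus words found in the lexicon."""
    if not freq:
        return 0.0
    return sum(1 for word in freq if word in lexicon) / len(freq)


# Toy data (hypothetical): 8 tokens, 7 distinct words, 2 lexicon entries.
corpus_tokens = "dobar dan i dobar film ali loš kraj".split()
freq = Counter(corpus_tokens)
lexicon = {"dobar", "loš"}

print(f"token coverage: {token_coverage(freq, lexicon):.3f}")  # 3 of 8 tokens
print(f"type coverage:  {type_coverage(freq, lexicon):.3f}")   # 2 of 7 types
```

Under this reading, frequent lexicon words raise the first measure much more than the second, which is consistent with the first method reporting the higher percentages on both corpora.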

Published

22 December 2023

Issue

Vol. 11 No. 2 (2023)

Section

Articles

How to Cite

Jahić, S., & Vičič, J. (2023). Razpoloženjsko označeni leksikon v bosanskem jeziku. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 11(2), 59–83. https://doi.org/10.4312/slo2.0.2023.2.59-83