A Sentiment-Annotated Lexicon of the Bosnian Language
DOI: https://doi.org/10.4312/slo2.0.2023.2.59-83
Keywords: Bosnian lexicon, corpus, sentiment analysis, affirmative and non-affirmative words (AnA words), stopwords, log-likelihood, annotation
Abstract
This paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented together with a usability study that focuses on language coverage. The initial version was built by translating the Slovenian annotated lexicon and then manually checking the translations and annotations. Language coverage was checked against two reference corpora. Bosnian is still considered a low-resource language. A reference corpus built from automatically crawled web pages is available for Bosnian, but the authors found that no corpus with a clear time frame for the texts it contains is available. A corpus of contemporary texts was therefore compiled by collecting news from several Bosnian web portals. Two methods of measuring language coverage were used in the study. The first used a frequency list of all words extracted from the two Bosnian reference corpora, while the second disregarded frequencies as the main factor in counting. The coverage computed with the first method was 19.24% for the first corpus and 28.05% for the second. The second method yields 2.34% coverage for the first corpus and 6.98% for the second. The results show a language coverage comparable to that of known methods in the field. The usability of the lexicon has already been demonstrated in a comparison on Twitter data.
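The difference between the two coverage methods can be illustrated with a minimal Python sketch. It assumes that the first method corresponds to frequency-weighted (token) coverage and the second to coverage over distinct word forms (types); the abstract does not spell out the exact formulas, so this reading is an assumption. The function names, the toy corpus, and the two-word lexicon are hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_coverage(freqs, lexicon):
    """Share of all corpus tokens whose word form is in the lexicon.

    Assumed reading of the first method: frequent words weigh more.
    """
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freqs.items() if word in lexicon)
    return covered / total


def type_coverage(freqs, lexicon):
    """Share of distinct corpus word forms that are in the lexicon.

    Assumed reading of the second method: frequencies are ignored.
    """
    if not freqs:
        return 0.0
    covered = sum(1 for word in freqs if word in lexicon)
    return covered / len(freqs)


# Hypothetical toy data, for illustration only.
corpus_tokens = "dobar dan dobar film loš dan i to je sve".split()
freqs = Counter(corpus_tokens)   # frequency list of the corpus
lexicon = {"dobar", "loš"}       # stand-in for the annotated lexicon

print(f"token coverage: {token_coverage(freqs, lexicon):.2%}")
print(f"type coverage:  {type_coverage(freqs, lexicon):.2%}")
```

Under this reading, a lexicon that contains common words scores higher on the frequency-weighted measure than on the type-based one, which matches the direction of the figures reported in the abstract.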
License
Copyright (c) 2023 Sead Jahić, Jernej Vičič
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.