Tviterasi, tviteraši or twiteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets

Authors

  • Maja Miličević
  • Nikola Ljubešić

DOI:

https://doi.org/10.4312/slo2.0.2016.2.156-188

Keywords:

computer-mediated communication, CMC corpora, Twitter, normalisation

Abstract

In this paper we present a parallel manual normalisation of samples extracted from corpora of Croatian and Serbian tweets. We first describe the dataset, then give the unified guidelines for annotators, and finally present an analysis of the transformations from nonstandard to standard language found in the data. The results show that closed word classes (those that rarely or never accept new words, i.e. mainly grammatical word classes) are transformed more often than open ones (those that accept new items more readily); that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns; that deletions are more frequent than insertions or replacements; and that transformations occur more often at the ends of words than in other positions. Croatian and Serbian share many transformation patterns, but not all of them. While some of the differences can be attributed to structural differences between the two languages, others seem to be more readily explained by extralinguistic factors. The datasets and initial analyses can be used both for studying nonstandard language and for developing language technologies for nonstandard language data.
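One way to make the deletion, insertion and replacement categories concrete is a character-level edit-distance alignment between a standard form and its nonstandard counterpart. The following is a minimal Python sketch under that assumption. The function name `edit_ops`, the alignment direction (from the standard form to the nonstandard one) and the example word pairs are illustrative choices, not the authors' procedure or data.

```python
from collections import Counter


def edit_ops(standard, nonstandard):
    """Return (operation, index) pairs that turn `standard` into `nonstandard`.

    The index refers to a character position in `standard`.
    Hypothetical helper for illustration only.
    """
    n, m = len(standard), len(nonstandard)
    # dist[i][j] = edit distance between standard[:i] and nonstandard[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if standard[i - 1] == nonstandard[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or replacement
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and standard[i - 1] == nonstandard[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1             # characters match, no operation
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(("insert", i))
            j -= 1
        else:
            ops.append(("delete", i - 1))
            i -= 1
    return ops[::-1]


# Illustrative (standard, nonstandard) pairs, NOT taken from the dataset.
pairs = [("tviteraši", "tviterasi"), ("ali", "al"), ("rekao", "reko")]

# Count how often each operation type occurs across the pairs.
op_counts = Counter(op for std, non in pairs for op, _ in edit_ops(std, non))
print(op_counts)
```

On these three made-up pairs the tally contains two deletions and one replacement. Because each operation keeps its position in the standard form, the same output could also be used to compare word-final transformations with those elsewhere in the word.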

Published

27. 09. 2016

How to cite

Miličević, M., & Ljubešić, N. (2016). Tviterasi, tviteraši or twiteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 4(2), 156–188. https://doi.org/10.4312/slo2.0.2016.2.156-188
