Tviterasi, tviteraši or twiteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
DOI:
https://doi.org/10.4312/slo2.0.2016.2.156-188
Keywords:
computer-mediated communication, CMC corpora, Twitter, normalisation
Abstract
This paper presents a parallel manual normalisation of samples extracted from corpora of Croatian and Serbian tweets. We first describe the dataset, give the unified annotation guidelines and present an analysis of the non-standard-to-standard transformations captured in the data. The results show that closed parts of speech (those that rarely or never accept new words, i.e. mainly grammatical word classes) are transformed more often than open ones (those that more readily accept new members), that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns, that deletions are more frequent than insertions or substitutions, and that transformations occur more often at the end of words than in other positions. Croatian and Serbian share many, but not all, transformation patterns. While some of the differences can be attributed to structural differences between the two languages, others appear to be more easily explained by extralinguistic factors. The datasets produced and the initial analyses can be used for studying non-standard language as well as for developing language technologies for non-standard data.
Published
27 September 2016
Issue
Section
Articles
License
Copyright (c) 2016 Nikola Ljubešić, Maja Miličević

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
How to Cite
Miličević, M., & Ljubešić, N. (2016). Tviterasi, tviteraši or twiteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 4(2), 156-188. https://doi.org/10.4312/slo2.0.2016.2.156-188