Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
Keywords: computer-mediated communication, CMC corpora, Twitter, normalisation
Abstract
In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Serbian Twitter corpora. We describe the datasets, outline the unified guidelines provided to annotators, and present a series of analyses of standard-to-non-standard transformations found in the Twitter data. The results show that closed part-of-speech classes are transformed more frequently than open classes; that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns; that character deletions are more frequent than insertions and replacements; and that more transformations occur at the end of a word than in other positions. Croatian and Serbian are found to share many, but not all, transformation patterns: some of the discrepancies can be ascribed to structural differences between the two languages, while others appear to be better explained by extralinguistic factors. The datasets and their initial analyses can be used for studying the properties of non-standard language, as well as for developing language technologies for non-standard data.
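To make the character-level analysis concrete, the following is a minimal sketch of how deletions, insertions and replacements between a standard form and its non-standard counterpart could be counted and assigned to a word position. It assumes a plain Levenshtein alignment; the abstract does not specify the exact procedure used in the paper, so the function names (`edit_ops`, `position_label`, `summarise`), the three-way start/middle/end split, and the example word pairs are all illustrative assumptions rather than material from the dataset.

```python
from collections import Counter


def edit_ops(standard, nonstandard):
    """Return one minimal edit script turning `standard` into `nonstandard`.

    Each item is (operation, index in the standard form).
    Hypothetical helper: plain Levenshtein alignment with a backtrace.
    """
    n, m = len(standard), len(nonstandard)
    # dist[i][j] = edit distance between standard[:i] and nonstandard[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if standard[i - 1] == nonstandard[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a standard character
                dist[i][j - 1] + 1,         # insert a non-standard character
                dist[i - 1][j - 1] + cost,  # keep or replace
            )

    # Walk back from the bottom-right cell to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and standard[i - 1] == nonstandard[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no operation
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1))
            i -= 1
        else:
            ops.append(("insert", i))
            j -= 1
    ops.reverse()
    return ops


def position_label(index, length):
    """Coarse position of an operation (assumed three-way split)."""
    if index == 0:
        return "start"
    if index >= length - 1:
        return "end"
    return "middle"


def summarise(pairs):
    """Count operation types and positions over (standard, non-standard) pairs."""
    op_counts, pos_counts = Counter(), Counter()
    for standard, nonstandard in pairs:
        for op, index in edit_ops(standard, nonstandard):
            op_counts[op] += 1
            pos_counts[position_label(index, len(standard))] += 1
    return op_counts, pos_counts


# Illustrative pairs only; they are not taken from the published dataset.
pairs = [("rekao", "reko"), ("ništa", "nista"), ("ili", "il"), ("kao", "ko")]
op_counts, pos_counts = summarise(pairs)
print(op_counts)
print(pos_counts)
```

When several minimal edit scripts exist, the backtrace above prefers replacements over deletions and deletions over insertions; a different tie-breaking order could shift some counts between positions, so any comparison across languages should fix one convention.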
Copyright (c) 2016 Nikola Ljubešić, Maja Miličević
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.