JANES v0.4: Corpus of Slovene User-Generated Content
DOI:
https://doi.org/10.4312/slo2.0.2016.2.67-99
Keywords:
corpus construction, computer-mediated communication, user-generated content, Internet Slovene, non-standard Slovene
Abstract
The paper presents the current version of Janes, a corpus of Slovene netspeak that contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present automatic and manual procedures for enriching the corpus with metadata, such as user type, gender and region, as well as text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow, which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.
License
Copyright (c) 2016 Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Slovenščina 2.0 applies the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license to all published material. Under this license, authors retain ownership of the copyright for their content, but allow anyone to download, reuse, reprint, modify, distribute, copy, remix, transform and/or build upon the content for any purpose, including commercial purposes, as long as the original authors and source are cited. No permission is required from the authors or the publishers. Appropriate attribution can be provided by simply citing the original article. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. For any reuse or redistribution of a work, users must also make clear the license terms under which the work was published.
No separate publishing agreements are signed between the author and the publisher. Authors retain copyright and the publishing rights of their work without any restrictions.
Authors are permitted and encouraged to post the journal’s published version of the work online (e.g., in institutional repositories, on their own websites), with an acknowledgement of its initial publication in Slovenščina 2.0.