Semi-Semantic Annotation: A guideline for the URDU.KON-TB treebank POS annotation
Keywords: semi-semantic part of speech, rich information, deep learning, parsing aid, linguistically motivated annotation, humanistic annotation
This work elaborates the semi-semantic part of speech annotation guidelines for the URDU.KON-TB treebank, an annotated corpus. A hierarchical annotation scheme was designed to label parts of speech and was then applied to the corpus. The raw corpus was collected from the Urdu Wikipedia and the Jang newspaper and annotated with the proposed semi-semantic part of speech labels. It contains text on local and international news, social stories, sports, culture, finance, religion, traveling, etc. This exercise contributed a part of speech annotation to the URDU.KON-TB treebank. The scheme has twenty-two main part of speech categories, each divided into subcategories that encode morphological and semantic information. This article mainly reports the annotation guidelines; it also briefly describes the development of the URDU.KON-TB treebank, including the raw corpus collection, the design and application of the annotation scheme, and its statistical evaluation and results. The guidelines presented here will be useful to the linguistic community for annotating sentences not only in the national language Urdu but also in other indigenous languages such as Punjabi, Sindhi, and Pashto.
Copyright (c) 2016 Qaiser Abbas, Miriam Butt
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.