Devising a Sketch Grammar for Academic Portuguese
Keywords: sketch grammar, Portuguese, corpus, dictionary, evaluation
Abstract: This paper presents the development of a new sketch grammar designed specifically for CoPEP, a newly compiled 40-million corpus of texts from academic journals. The corpus is tagged with Freeling v3, the default tagger available in the Sketch Engine for Portuguese corpora. We first provide an overview and evaluation of existing sketch grammars for Portuguese, then describe in detail how the new sketch grammar was developed, and present some of the problems encountered. We conclude by summarizing the main findings, highlighting important implications, and offering suggestions for further improvement of the sketch grammar. The new sketch grammar yields more accurate and more varied word sketch results than the current default sketch grammar, which indicates that it can be used for advanced lexicographic tasks such as automatic extraction of lexical data from CoPEP, the knowledge acquisition methodology planned for the compilation of the proposed dictionary of Portuguese for university students. Moreover, the new sketch grammar can be used with any other corpus of Portuguese tagged with Freeling v3, which makes it an important resource for lexicographic and corpus linguistic research on the Portuguese language.
Atkins, S. B. T., and Rundell, M. (2008): The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press.
Benko, V. (2014a): Aranea: Yet Another Family of (Comparable) Web Corpora. In P. Sojka, A. Horák, I. Kopeček, and K. Pala (eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings. LNCS 8655: 257–264. Brno: Springer International Publishing Switzerland.
Benko, V. (2014b): Compatible Sketch Grammars for Comparable Corpora. In A. Abel, C. Vettori, and N. Ralli (eds.): Proceedings of the XVI EURALEX International Congress: The User in Focus: 417–430. Bolzano/Bozen: Institute for Specialised Communication and Multilingualism.
Benko, V. (2015): Araneum Portugallicum Maius, version 15.05. Praha: Ústav Českého národního korpusu FF UK. Available at: https://kontext.korpus.cz/first_form?corpname=aranea%2Faranport_pt_ar13__b_a# (Accessed on 23 November 2016).
Biber, D., Conrad, S., and Leech, G. (2015): Longman Student Grammar of Spoken and Written English. Harlow: Pearson Education.
Bick, E. (2000): The Parsing System Palavras, Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press.
Capes. Available at: www.capes.gov.br (Accessed on 4 February 2016).
Capes’ Areas of Knowledge Classification. Available at: http://www.capes.gov.br/avaliacao/instrumentos-de-apoio/tabela-de-areas-do-conhecimento-avaliacao (Accessed on 4 February 2016).
Cegalla, D. P. (2008): Novíssima gramática da língua portuguesa. São Paulo: Ed. Nacional.
CLUL (Centro de Linguística da Universidade de Lisboa). Online Resources. Available at: http://clul.ul.pt/en/resources (Accessed on 20 November 2016).
Corpus Brasileiro. Available at: http://corpusbrasileiro.pucsp.br/cb/Acesso.html (Accessed on 20 November 2016).
Corpus do Português: genre/historical. Available at: www.corpusdoportugues.org/hist-gen/ (Accessed on 20 November 2016).
CRPC - Corpus de Referência do Português Contemporâneo. Available at: http://alfclul.clul.ul.pt/CQPweb/crpcfg16/ (Accessed on 23 November 2016).
Gantar, P., Kosem, I., and Krek, S. (2016): Discovering Automated Lexicography: The Case of the Slovene Lexical Database. International Journal of Lexicography, 29 (2): 200–225.
Généreux, M., Hendrickx, I., and Mendes, A. (2012): Introducing the Reference Corpus of Contemporary Portuguese On-Line. Proceedings of the Eighth International Conference on Language Resources and Evaluation - LREC 2012: 2237–2244. Istanbul.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., and Suchomel, V. (2013): The TenTen Corpus Family. Proceedings of the 7th International Corpus Linguistics Conference: 125–127. Lancaster.
Kallas, J., Kilgarriff, A., Koppel, K., Kudritski, E., Langemets, M., Michelfeit, J., Tuulik, M., and Viks, Ü. (2015): Automatic generation of the Estonian Collocations Dictionary database. In I. Kosem, M. Jakubíček, J. Kallas, and S. Krek (eds.): Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom: 1–20. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing.
Kilgarriff, A., and Kosem, I. (2012): Corpus tools for lexicographers. In S. Granger, and M. Paquot (eds.): Electronic Lexicography: 31–55. Oxford: Oxford University Press.
Kilgarriff, A., Baisa, V., Rychlý, P., and Jakubíček, M. (2015): Longest-commonest Match. In I. Kosem, M. Jakubíček, J. Kallas, and S. Krek (eds.): Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom: 397–404. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing.
Kilgarriff, A., Kovář, V., Krek, S., Srdanovic, I., and Tiberius, C. (2010): A Quantitative Evaluation of Word Sketches. In A. Dykstra, and T. Schoonheim (eds.): Proceedings of the XIV Euralex International Congress: 372–379. Leeuwarden: Fryske Akademy; Afûk.
Kilgarriff, A., Rychlý, P., Smrz, P., and Tugwell, D. (2004): The Sketch Engine. In G. Williams, and S. Vessier (eds.): Proceedings of the 11th EURALEX International Congress: 105–115. Lorient: Université de Bretagne-Sud, Faculté des lettres et des sciences humaines.
Kosem, I., Gantar, P., and Krek, S. (2013): Automation of lexicographic work: an opportunity for both lexicographers and crowdsourcing. In I. Kosem, J. Kallas, P. Gantar, S. Krek, M. Langemets, and M. Tuulik (eds.): Electronic Lexicography in the 21st Century: Thinking Outside the Paper: Proceedings of the eLex 2013 Conference, 17-19 October 2013, Tallinn, Estonia: 32–48. Ljubljana/Tallinn: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut.
Kosem, I., Gantar, P., Logar, N., and Krek, S. (2014): Automation of lexicographic work using general and specialized corpora: two case studies. In A. Abel, C. Vettori, and N. Ralli (eds.): Proceedings of the XVI EURALEX International Congress: The User in Focus: 355–364. Bolzano/Bozen: Institute for Specialised Communication and Multilingualism.
Kuhn, T.Z., and Ferreira, J.P. (2016): Building a corpus of written academic texts in Portuguese. Teaching and Language Corpora Conference (TaLC12). Book of Abstracts: 103. Giessen.
Linguateca. Available at: http://www.linguateca.pt/ (Accessed on 20 November 2016).
Logar, N., and Kosem, I. (2013): TERMIS: a corpus-driven approach to compiling an e-dictionary of terminology. In I. Kosem, J. Kallas, P. Gantar, S. Krek, M. Langemets, and M. Tuulik (eds.): Electronic Lexicography in the 21st Century: Thinking Outside the Paper: Proceedings of the eLex 2013 Conference, 17-19 October 2013, Tallinn, Estonia: 164–178. Ljubljana/Tallinn: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut.
Newspapers in Portuguese (CetemPúblico, CetenFolha). Available at: https://the.sketchengine.co.uk/bonito/corpus/first_form?corpname=preloaded/portuguese (Accessed on 28 November 2016).
NILC (Interinstitutional Center for Computational Linguistics). Tool and Resources. Available at: http://www.nilc.icmc.usp.br/nilc/index.php/tools-and-resources (Accessed on 20 November 2016).
Oxford Portuguese Dictionary (2015). S. Lopez, A. Frankenberg-Garcia, and H. Newstead. Oxford: Oxford University Press.
Padró, L., and Stanilovsky, E. (2012): FreeLing 3.0: Towards Wider Multilinguality. Proceedings of the Language Resources and Evaluation Conference (LREC 2012) ELRA: 1–7. Istanbul.
Peixoto, R. M. T. (2015): O Fenômeno (De)Queísta no Corpus do Português Brasileiro Acadêmico. Unpublished Master’s Degree Dissertation. Porto Alegre: PUCRS.
Perini, M. A. (2002): Modern Portuguese: A reference grammar. New Haven: Yale University Press.
Portuguese Web 2011 (ptTenTen11, Palavras parsed). Available at: https://the.sketchengine.co.uk/bonito/corpus/first_form?corpname=preloaded/pttenten11 (Accessed on 6 April 2016).
Portuguese Web 2011 (ptTenTen11, Freeling v3). Available at: https://the.sketchengine.co.uk/bonito/corpus/corp_info?corpname=preloaded/pttenten11_freeling_v3_1 (Accessed on 23 November 2016).
Rundell, M., and Kilgarriff, A. (2011): Automating the creation of dictionaries: where will it all end? In F. Meunier, S. De Cock, G. Gilquin, and M. Paquot (eds.): A Taste for Corpora: In honour of Sylviane Granger. Amsterdam: John Benjamins.
Scielo Brazil Analytics. Available at: http://analytics.scielo.org/w/publication/article?collection=scl (Accessed on 24 November 2016).
Scielo Brazil. Available at: www.scielo.br (Accessed on 15 February 2016).
Scielo Portugal Analytics. Available at: http://analytics.scielo.org/w/publication/article?collection=prt (Accessed on 24 November 2016).
Scielo Portugal. Available at: www.scielo.mec.pt (Accessed on 1 February 2016).
Scielo. Available at: www.scielo.org (Accessed on 23 November 2016).
Sketch Engine. Available at: https://www.sketchengine.co.uk (Accessed on 20 November 2016).
Copyright (c) 2017 Tanara Zingano Kuhn, Iztok Kosem
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Slovenščina 2.0 applies the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license to all published material. Under this license, authors retain ownership of the copyright for their content, but allow anyone to download, reuse, reprint, modify, distribute, copy, remix, transform and/or build upon the content for any purpose, even commercial, as long as the original authors and source are cited. No permission is required from the authors or the publishers. Appropriate attribution can be provided by simply citing the original article. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. For any reuse or redistribution of a work, users must also make clear the license terms under which the work was published.
No separate publishing agreements are signed between the author and the publisher. Authors retain copyright and the publishing rights of their work without any restrictions.
Authors are permitted and encouraged to post the journal’s published version of the work online (e.g., in institutional repositories, on their own websites), with an acknowledgement of its initial publication in Slovenščina 2.0.