Learning Languages from Parallel Corpora

A Blueprint for Turning Corpus Examples into Language Learning Exercises

Authors

  • Johannes Graën, University of Zurich, Institute of Computational Linguistics, Switzerland; University of Gothenburg, Sweden

DOI:

https://doi.org/10.4312/slo2.0.2022.2.101-131

Keywords:

ICALL, language learning exercises, parallel corpora, data-driven learning, crowdsourcing

Abstract

This article describes the architecture of an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures make it possible to assess sentence pairs in the source and target language automatically, while the users of the application continuously improve the quality of the data through their interactions and thereby crowdsource parallel language learning material. Through triangulation, their assessments can also be transferred to other language pairs if multiparallel corpora serve as the source.
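The triangulation idea can be illustrated with a minimal sketch. It is not the application's implementation: the function name `triangulate`, the score range of 0 to 1, and the choice of taking the weaker of the two ratings as the inferred value are all assumptions made for illustration only.

```python
from itertools import permutations


def triangulate(ratings):
    """Infer ratings for unrated language pairs through a pivot language.

    `ratings` maps frozenset({lang_a, lang_b}) to a score in [0, 1]
    (hypothetical data model). A pair (a, c) without a rating inherits
    the weaker of the ratings for (a, pivot) and (pivot, c); this
    combination rule is an assumption, not taken from the article.
    """
    langs = sorted({lang for pair in ratings for lang in pair})
    inferred = {}
    for a, pivot, c in permutations(langs, 3):
        target = frozenset((a, c))
        if target in ratings:
            continue  # users already rated this pair directly
        left = ratings.get(frozenset((a, pivot)))
        right = ratings.get(frozenset((pivot, c)))
        if left is None or right is None:
            continue  # no complete path through this pivot
        score = min(left, right)
        # Keep the best-supported estimate if several pivots exist.
        inferred[target] = max(inferred.get(target, 0.0), score)
    return inferred


# One aligned sentence, rated by users for German-English and English-Swedish.
ratings = {
    frozenset(("de", "en")): 0.9,
    frozenset(("en", "sv")): 0.8,
}
inferred = triangulate(ratings)
```

Here the unrated German-Swedish pair receives an estimate via English as the pivot. Taking the minimum is a deliberately cautious choice; a product or a learned combination would be equally plausible alternatives.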

Several challenges must be addressed for such an application to work, and we discuss three of them here. First, the question of how to identify suitable learning material in corpora has received some attention over the last decade; we describe in detail how the structure of parallel corpora affects this selection. Second, we consider which types of exercises can be generated automatically from parallel corpora so that they foster learning and keep learners motivated. Third, we examine the options for involving users, that is teachers and learners, as a crowd that helps improve the material.

We have partially implemented the application described in this article and tested it in various experimental settings. Several features that will be part of the final software were developed and evaluated separately. Implementing all the parts detailed in this article, however, still requires considerable work, as well as access to actual teachers and learners for testing. Confirming the intended positive effects of user contributions will require the final application to be in use over a longer period, which poses an additional challenge.

Published

29.12.2022

How to cite

Graën, J. (2022). Učenje jezikov iz vzporednih korpusov: Zasnova za spreminjanje korpusnih primerov v vaje za učenje jezikov. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 10(2), 101–131. https://doi.org/10.4312/slo2.0.2022.2.101-131