A Comparison of Slovene and Croatian Word Embeddings from a Gender Perspective on Occupational Analogies
DOI: https://doi.org/10.4312/slo2.0.2021.1.26-59

Keywords: word embeddings, gender bias, word analogies, occupations, natural language processing

Abstract
In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to a series of excellent results in natural language understanding. At the same time, it has been shown that word embeddings often capture biases related to gender, race, and other attributes. This paper evaluates Slovene and Croatian word embeddings from a gender perspective using word analogies. We compiled a list of masculine and feminine occupational nouns in Slovene and evaluated the gender bias of fastText, word2vec, and ELMo embedding models under different configurations and different approaches to computing the analogies. The fastText embeddings turned out to contain the least occupational gender bias. For the Croatian evaluation, we likewise used lists of occupations and compared several fastText embeddings.
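The analogy-based evaluation described above is commonly computed with the vector-offset (3CosAdd) method of Mikolov et al. (2013b): given the pair *moški* : *ženska*, the feminine counterpart of an occupation is predicted as the word closest to `v(ženska) − v(moški) + v(poklic)`. The sketch below illustrates the idea with toy two-dimensional vectors; the word list and vector values are invented for illustration and are not the paper's actual data or method.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb, a, b, c):
    """3CosAdd: return the word d maximizing cos(v(d), v(b) - v(a) + v(c)),
    excluding the query words themselves (standard practice)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = cos(vec, target)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-D embeddings (illustrative only; real experiments use
# pretrained fastText / word2vec / ELMo vectors).
emb = {
    "moski":      np.array([1.0, 0.0]),
    "zenska":     np.array([0.0, 1.0]),
    "zdravnik":   np.array([1.0, 0.2]),
    "zdravnica":  np.array([0.0, 1.1]),
    "ucitelj":    np.array([0.9, 0.1]),
    "uciteljica": np.array([0.1, 0.9]),
}

print(analogy(emb, "moski", "zenska", "zdravnik"))  # → zdravnica
```

In a gender-bias evaluation of this kind, the interesting cases are precisely where the prediction fails, i.e. where the embedding space does not map a masculine occupational noun onto its feminine form (or vice versa).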
References
Argamon, S., Koppel, M., Fine, J., & Shimoni, A. R. (2003). Gender, genre, and writing style in formal written texts. TEXT, 23, 321–346. DOI: https://doi.org/10.1515/text.2003.014
Baker, P. (2010). Will Ms ever be as frequent as Mr? A corpus-based comparison of gendered terms across four diachronic corpora of British English. Gender & Language, 4(1), 125–149. DOI: https://doi.org/10.1558/genl.v4i1.125
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. DOI: https://doi.org/10.1162/tacl_a_00051
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS’16) (pp. 4356–4364).
Bordia, S., & Bowman, S. (2019). Identifying and Reducing Gender Bias in Word-Level Language Models. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, (pp. 7–15). DOI: https://doi.org/10.18653/v1/N19-3002
Brunet, M. E., Alkalay-Houlihan, C., Anderson, A., & Zemel, R. S. (2019). Understanding the Origins of Bias in Word Embeddings. Proceedings of International Conference on Machine Learning (ICML 2019).
Caldas-Coulhard, C. R., & Moon, R. (2010). ‘Curvy, hunky, kinky’: Using corpora as tools for critical analysis. Discourse & Society, 21(2), 99–133. DOI: https://doi.org/10.1177/0957926509353843
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora necessarily contain human biases. Science, 356(6334), 183–186. DOI: https://doi.org/10.1126/science.aal4230
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jegou, H. (2018). Word translation without parallel data. Proceedings of the International Conference on Learning Representation (ICLR).
Dobrovoljc, K., Krek, S., Holozan, P., Erjavec, T., Romih, T., Arhar Holdt, Š., Čibej, J., Krsnik L., & Robnik-Šikonja, M. (2019). Morphological lexicon Sloleks 2.0. CLARIN.SI. http://hdl.handle.net/11356/1230
Eurostat (2021). Gender statistics. Retrieved from https://ec.europa.eu/eurostat/statistics-explained/index.php/Gender_statistics#Labour_market
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. PNAS, 115(16). DOI: https://doi.org/10.1073/pnas.1720347115
Garimella, A., Banea, C., Hovy, D., & Mihalcea, R. (2019). Women’s syntactic resilience and men’s grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. Proceedings of the 57th Annual Meeting of the ACL (pp. 3493–3498). DOI: https://doi.org/10.18653/v1/P19-1339
Gigafida 2.0. Retrieved from https://viri.cjvt.si/gigafida
Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. Proceedings of NAACL-HLT 2019 (pp. 609–614). DOI: https://doi.org/10.18653/v1/N19-1061
Gorjanc, V. (2007). Kontekstualizacija oseb ženskega in moškega spola v slovenskih tiskanih medijih. In I. Novak-Popov (Ed.), Stereotipi v slovenskem jeziku, literaturi in kulturi: zbornik predavanj 43. seminarja slovenskega jezika, literature in kulture (pp. 173–180). Ljubljana: Center za slovenščino kot drugi/tuji jezik.
Hill, B., & Shaw, A. (2013). The Wikipedia gender gap revisited: Characterising survey response bias with propensity score estimation. PloS One, 8. DOI: https://doi.org/10.1371/journal.pone.0065782
Hirasawa, T., & Komachi, M. (2019). Debiasing Word Embeddings Improves Multimodal Machine Translation. Proceedings of Machine Translation Summit XVII, Vol. 1 (pp. 32–42). DOI: https://doi.org/10.18653/v1/N19-3012
Hovy, D., & Søgaard, A. (2015). Tagging performance correlates with author age. Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP (pp. 483–488). DOI: https://doi.org/10.3115/v1/P15-2079
Hovy, D. (2015). Demographic factors improve classification performance. Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP (pp. 752–762). DOI: https://doi.org/10.3115/v1/P15-1073
Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., & Denuyl, S. (2020). Social Biases in NLP Models as Barriers for Persons with Disabilities. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5491–5501). DOI: https://doi.org/10.18653/v1/2020.acl-main.487
Kern, B., & Dobrovoljc, H. (2017). Pisanje moških in ženskih oblik in uporaba podčrtaja za izražanje »spolne nebinarnosti«. Jezikovna svetovalnica. Retrieved from https://svetovalnica.zrc-sazu.si/topic/2247/pisanje-mo%C5%A1kih-in-%C5%BEenskih-oblik-in-uporaba-pod%C4%8Drtaja-za-izra%C5%BEanje-spolne-nebinarnosti
Kiritchenko, S., & Mohammad, S. (2018). Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (pp. 43–53). DOI: https://doi.org/10.18653/v1/S18-2005
Koolen, C., & van Cranenburgh, A. (2017). These are not the stereotypes you are looking for: Bias and fairness in authorial gender attribution. Proceedings of the First Ethics in NLP workshop (pp. 12–22). DOI: https://doi.org/10.18653/v1/W17-1602
Lakoff, R. (1973). Language and woman’s place. Language in Society, 2(1), 45–80. DOI: https://doi.org/10.1017/S0047404500000051
Liang, P. P., Li, I. M., Zheng, E., Lim, Y. C., Salakhutdinov, R., & Morency, L. (2020). Towards Debiasing Sentence Representations. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5502–5515). DOI: https://doi.org/10.18653/v1/2020.acl-main.488
Ljubešić, N., & Erjavec, T. (2018). Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1204
Ljubešić, N. (2018). Word embeddings CLARIN.SI-embed.hr 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1205
Martinc, M., Škrjanec, I., Zupan, K., & Pollak, S. (2017). PAN 2017: Author profiling - gender and language variety prediction: notebook for PAN at CLEF 2017. Proceedings of the Conference and Labs of the Evaluation Forum.
Mikolov, T., Corrado, G. S., Chen, K., & Dean, J. (2013a). Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations (pp. 1–12).
Mikolov, T., Yih, W-t., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. Proceedings of the 2013 Conference of the North American Chapter of the ACL: Human Language Technologies (pp. 746–751).
Nozza, D., Volpetti, C., & Fersini, E. (2019). Unintended Bias in Misogyny Detection. Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence (pp. 149–155). DOI: https://doi.org/10.1145/3350546.3352512
Nissim, M., van Noord, R., & van der Goot, R. (2019). Fair is better than sensational: Man is to doctor as woman is to doctor. Computational Linguistics, 46(3), 487–497. DOI: https://doi.org/10.1162/coli_a_00379
Pearce, M. (2008). Investigating the collocational behaviour of man and woman in the BNC using Sketch Engine. Corpora, 3(1), 1–29. DOI: https://doi.org/10.3366/E174950320800004X
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT 2018 (pp. 2227–2237). DOI: https://doi.org/10.18653/v1/N18-1202
Plahuta, M. (2020). O slovarju. Retrieved from https://kontekst.io/oslovarju
Popič, D., & Gorjanc, V. (2018). Challenges of adopting gender-inclusive language in Slovene. Suvremena lingvistika, 44(86), 329–350. DOI: https://doi.org/10.22210/suvlin.2018.086.07
Prates, M. O. R., Avelar, P. H., & Lamb, L. C. (2020). Assessing gender bias in machine translation: A case study with Google Translate. Neural Computing and Applications, 32, 6363–6381. DOI: https://doi.org/10.1007/s00521-019-04144-6
Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., & Daelemans, W. (2015). Overview of the 3rd author profiling task at PAN 2015. In L. Cappellato, N. Ferro, G. J. F. Jones, & E. SanJuan (Eds.), CLEF 2015 Labs and Workshops, Notebook Papers.
Schick, T., Udupa, S., & Schütze, H. (2021). Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. arXiv preprint arXiv:2103.00453. DOI: https://doi.org/10.1162/tacl_a_00434
Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., Mirza, D., Belding, E., Chang, K-W., & Wang, W. Y. (2019). Mitigating gender bias in natural language processing: Literature review. Proceedings of the 57th Annual Meeting of the ACL (pp. 1630–1640). DOI: https://doi.org/10.18653/v1/P19-1159
Supej, A., Plahuta, M., Purver, M., Mathioudakis, M., & Pollak, S. (2019). Gender, language, and society: Word embeddings as a reflection of social inequalities in linguistic corpora. Proceedings of the Slovensko sociološko srečanje 2019 – Znanost in družbe prihodnosti (pp. 75–83).
Supej, A., Ulčar, M., Robnik-Šikonja, M., & Pollak, S. (2020). Primerjava slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. Proceedings of the Conference on Language Technologies & Digital Humanities 2020 (pp. 93–100).
Svoboda, L., & Beliga, S. (2018). Evaluation of Croatian Word Embeddings. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 1512–1518).
Škrjanec, I., Lavrač, N., & Pollak, S. (2018). Napovedovanje spola slovenskih blogerk in blogerjev. In D. Fišer (Ed.), Viri, orodja in metode za analizo spletne slovenščine (pp. 356–373). Ljubljana: Znanstvena založba FF.
Tannen, D. (1990). You Just Don’t Understand: Women and Men in Conversation. New York: Ballantine Books.
Ulčar, M. (2019). ELMo embeddings model, Slovenian. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1257
Vanmassenhove, E., Hardmeier, C., & Way, A. (2018). Getting gender right in neural machine translation. Proceedings of the EMNLP (pp. 3003–3008). DOI: https://doi.org/10.18653/v1/D18-1334
Verhoeven, B., Škrjanec, I., & Pollak, S. (2017). Gender profiling for Slovene Twitter communication: The influence of gender marking, content and style. Proceedings of the 6th BSNLP Workshop (pp. 119–125). DOI: https://doi.org/10.18653/v1/W17-1418
Vlada RS (1997). 1641. uredba o uvedbi in uporabi standardne klasifikacije poklicev. Uradni list RS, 28, 2217. Retrieved from https://www.uradni-list.si/glasilo-uradni-listrs/vsebina?urlid=199728&stevilka=1641
Volkova, S., Wilson, T., & Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. Proceedings of the EMNLP (pp. 1815–1827).
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. Proceedings of the EMNLP (pp. 2979–2989). DOI: https://doi.org/10.18653/v1/D17-1323
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. Proceedings of the NAACL-HLT (pp. 15–20). DOI: https://doi.org/10.18653/v1/N18-2003
Versions
- 6 July 2021 (2)
- 1 July 2021 (1)
License

Copyright (c) 2021 Matej Ulčar, Anka Supej, Marko Robnik-Šikonja, Senja Pollak

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.