The Quantitative Turn: NLP and AI Methods in Romance Linguistics
39. Romanistiktag Universität Konstanz | 22.–25. September 2025
Section conveners and contact
Iris Ferrazzo (Universität Bonn / Bonner Center for Digital Humanities)
Olga Kellert (Universität A Coruña / Universität Göttingen)
In recent decades, linguistic research has undergone a significant shift towards empirical methods and, relatedly, towards mathematical formalisation and modelling; scholars speak of a genuine “Quantitative Turn” (Kortmann 2021). This shift is driven partly by criticism of earlier theoretical work that relied largely on the subjective introspection of a few experts or speakers, and partly by easier access to quantitative or quantifiable data, such as crowdsourced data from judgement tasks run on platforms like Amazon’s Mechanical Turk (Winter 2022), or social media data such as posts from X (formerly Twitter), which capture more spontaneous linguistic behaviour (Kellert et al. 2023).
However, crowdsourced data and naturally occurring speech are often unstructured and require pre-processing before they can be modelled quantitatively, and manual pre-processing of large amounts of natural language data is time-consuming and costly. To overcome these shortcomings, Natural Language Processing (NLP) and AI (Artificial Intelligence) language models have been shown to handle large amounts of data well. A prominent example are Large Language Models (LLMs), which process and generate coherent natural language on the basis of word embeddings and the transformer architecture (Vaswani et al. 2017). This combination makes it possible to encode semantic relations between words numerically in an embedding space without extensive manual pre-processing (e.g., word2vec, Mikolov et al. 2013; BERT, Devlin et al. 2019), and to achieve sophisticated understanding and generation of human language. These and other language models enable the use of NLP and AI methods across linguistic fields, including dialect variation and language change (Kellert & Zaman 2022), syntactic parsing of unstructured data and semantic role labelling (Zhang et al. 2022), coreference resolution (Dobrovolskii 2021), and text summarization and machine reading comprehension, among others.
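For readers less familiar with embedding spaces, the idea that semantic relations are “numerically encoded” can be sketched in a few lines: each word is mapped to a vector, and relatedness is measured as the cosine of the angle between vectors. The three-dimensional vectors below are invented purely for illustration; real models such as word2vec or BERT learn vectors with hundreds of dimensions from large corpora.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: the standard measure of semantic
    relatedness between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy 3-dimensional "embeddings" (invented for illustration only).
vectors = {
    "rey":     [0.90, 0.80, 0.10],
    "reina":   [0.85, 0.90, 0.15],
    "manzana": [0.10, 0.20, 0.95],
}

print(cosine(vectors["rey"], vectors["reina"]))    # high: semantically related
print(cosine(vectors["rey"], vectors["manzana"]))  # low: unrelated
```

In a trained model, such similarities fall out of distributional patterns in the corpus rather than being stipulated by hand, which is what allows semantic relations to be recovered without prior manual annotation.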
However, the newest NLP and AI methods are applied largely to major languages such as English and to more standardised varieties, while Romance languages and smaller language varieties are neglected (Kellert & Zaman 2023). As a result, datasets, tools, and methods are often not yet adapted for use in Romance Linguistics, and the opportunity to complement traditional linguistic methodologies with unstructured data and automated research pipelines is missed.
In this workshop, we address the need to further explore data sources and data-processing methods by means of NLP and AI in order to answer questions in Romance Linguistics.
Possible topics and questions for contributions include:
- What are the challenges of unstructured data types for linguistic analysis and how can we address them?
- (How) Can Romance languages and Romance varieties benefit from the newest developments in NLP and AI?
- How can we ensure the accuracy and reliability of linguistic insights derived from large-scale social media data?
- How can interdisciplinary collaboration enhance the application of NLP and AI methods in Romance Linguistics?
- What are the implications of the Quantitative Turn in Linguistics for language policy and planning in Romance-speaking communities? How can the results influence decisions about language use, education, preservation, and other aspects of language policy?
- What are the emerging trends and challenges in NLP and/or AI specific to Romance languages, and how do they impact the development of NLP and AI models, techniques, and applications?
Invited speakers: Yoshifumi Kawasaki (University of Tokyo; talk title: Diachronic Studies of Romance Languages in the Era of Deep Learning) and Alessandro Lenci (University of Pisa; title to be announced).
References
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. doi: 10.18653/v1/N19-1423
Dobrovolskii, V. (2021). Word-level coreference resolution. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7670–7675. doi: 10.18653/v1/2021.emnlp-main.605
Kellert, O., & Zaman, M. (2022). Using neural topic models to track context shifts of words: a case study of COVID-related terms before and after the lockdown in April 2020. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, 131–139. doi: 10.18653/v1/2022.lchange-1.14
Kellert, O., & Zaman, M. (2023). Use of NLP in the context of belief states of ethnic minorities in Latin America. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), Association for Computational Linguistics, 1–5. doi: 10.18653/v1/2023.americasnlp-1.1
Kellert, O., Zaman, M., Matlis, N., & Gomez-Rodriguez, C. (2023). Experimenting with UD adaptation of an unsupervised rule-based approach for sentiment analysis of Mexican tourist texts. In Alvarez-Carmona et al. (Eds.), CEUR Workshop Proceedings, Vol. 3496.
Kortmann, B. (2021). Reflecting on the quantitative turn in linguistics. Linguistics, 59(5), 1207–1226. doi: 10.1515/ling-2019-0046
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR. https://arxiv.org/abs/1301.3781
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010.
Winter, B. (2022). Mapping the landscape of exploratory and confirmatory data analysis in linguistics. In D. Tay, & M. Pan (Eds.), Data analytics in cognitive linguistics: methods and insights, 13–48. doi: 10.1515/9783110687279-002
Zhang, Y., Xia, Q., Zhou, S., Jiang, Y., Fu, G., & Zhang, M. (2022). Semantic role labeling as dependency parsing: Exploring latent tree structures inside arguments. In Proceedings of the 29th International Conference on Computational Linguistics, 4212–4227.