What Does It Take for Large Language Models to Represent the World's Languages?
Wednesday, 20 May 2026, 18:15
| Organizer: | Collegium generale |
|---|---|
| Speaker: | Prof. Dr. Antoine Bosselut, Laboratoire de traitement du langage naturel, EPFL |
| Date: | 20 May 2026 |
| Time: | 18:15 - 19:45 |
| Location: | Auditorium maximum, Room 110, Main Building (Hauptgebäude), Hochschulstrasse 4, 3012 Bern |
| Registration: | No registration required |
| Details: | Public, free of charge |
Abstract
What does it take for large language models to truly represent the world's languages? In this talk, I will argue that meaningful multilingual representation requires rethinking LLM development from the ground up, from data curation and tokenization to evaluation. I will walk through four interconnected challenges we tackled in building Apertus, the most globally representative LLM to date, trained on 15T tokens spanning over 1000 languages. First, I will describe how we mixed multilingual data while pretraining Apertus, finding the right balance to maintain English performance while narrowing the scope of the "curse of multilinguality" to represent as many languages as possible. Second, I will discuss how standard tokenizers silently disadvantage non-English languages, and how we can design fairer forms of tokenization that achieve parity across languages. Finally, I will show that mainstream benchmarks fail to capture whether models actually understand local and regional context, and how the novel multiregional evaluation we designed addresses this gap.
