Collegium generale lecture series: How We Speak

What Does It Take for Large Language Models to Represent the World's Languages?

Wednesday, 20.05.2026, 18:15

Organizer: Collegium generale
Speaker: Prof. Dr. Antoine Bosselut, Laboratoire de traitement du langage naturel, EPFL
Date: 20.05.2026
Time: 18:15 - 19:45
Location: Auditorium maximum, Room 110
Main Building
Hochschulstrasse 4
3012 Bern
Registration: No registration required
Features: Public
free of charge

Abstract

What does it take for large language models to truly represent the world's languages? In this talk, I will argue that meaningful multilingual representation requires rethinking LLM development from the ground up, from data curation and tokenization to evaluation. I will walk through four interconnected challenges we tackled in building Apertus, the most globally representative LLM to date, trained on 15T tokens spanning over 1000 languages. First, I will describe how we mixed multilingual data while pretraining Apertus, finding the right balance to maintain English performance while narrowing the scope of the "curse of multilinguality" to represent as many languages as possible. Second, I will discuss how standard tokenizers silently disadvantage non-English languages, and how we can design fairer forms of tokenization that achieve parity across languages. Finally, I will show that mainstream benchmarks fail to capture whether models actually understand local and regional context, and how the novel multiregional evaluation we designed addresses this gap.

Speaker's website