Chennai Mathematical Institute

Seminar Announcement
Date: Wednesday, 16 April 2025
Time: 2.30 to 3.30 PM
Venue: Seminar Hall
Bridging Non-Roman Scripts with English-Centric LLMs: Evidence and Insights from RomanSetu and RomanLens

Ratish Surendran Puduppully
IT University of Copenhagen


Abstract

While large language models (LLMs) are predominantly trained on English text, they can still exhibit meaningful capabilities in non-English languages, including those written in non-Roman scripts. In this talk, I present RomanSetu and RomanLens, two studies exploring how romanization (the representation of non-Roman scripts with Roman characters) can serve as an effective interface for English-centric LLMs.

RomanSetu shows that continued pretraining and instruction tuning on romanized text reduces token overhead by 2–4x and often outperforms native-script baselines across natural language understanding, generation, and translation tasks. RomanLens employs mechanistic interpretability methods to reveal that, even when generating native-script output, intermediate model layers frequently encode tokens in a latent romanized form. This suggests a shared underlying representation that enhances transfer to underrepresented languages. Ultimately, these findings highlight the potential of romanization as both a practical and a theoretical bridge for extending LLMs beyond English-centric applications.