Jakhongir Saydaliev

I research large language and vision models as an MSc student at EPFL. I’m fortunate to have worked at the NLP, DHLAB, and LINX labs, as well as at SwissAI and Logitech. I completed my Bachelor’s at Politecnico di Torino.

Research Interests

I work on building inclusive, multimodal reasoning AI systems that work for everyone. Below are some areas I’ve been working on:

Multilingual NLP: I want to bridge the gaps in multilingual NLP and ensure AI benefits linguistically diverse and underrepresented communities
- ConLID: Contrastive language identification for low-resource languages
- Apertus: The first large-scale language model developed in Switzerland
Multimodal Reasoning: Models need to reason across modalities, not just text, to handle real-world scenarios
- Multimodal reasoning: Bounding-box-based multimodal reasoning
- Tool-augmented visual reasoning: Multi-turn VLM training with RL
- GUI agents: Building autonomous agents for mouse/keyboard operations (ongoing)
Efficient Reasoning: As we scale to multimodal scenarios, we need computationally efficient reasoning to make deployment practical
- Investigating the “overthinking” phenomenon in LLMs (ongoing)

I am seeking a PhD position starting in Fall 2026; a brief overview of my proposed work is available in this research proposal video.

News

Jan 2026	Our ConLID paper was accepted to EACL
Oct 2025	Our paper was published at Computational Humanities Research
Sep 2025	Joined Logitech as an ML Research Intern to work on Computer Use Agents

Selected Publications

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan^*, Jakhongir Saydaliev^*, Ye Eun Kim, and 1 more author

European Chapter of the Association for Computational Linguistics (EACL), 2026

Winner of WMDQS Shared Task #2 at COLM 2025

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.

LLM Agents for Interactive Exploration of Historical Cadastre Data: Framework and Application to Venice

Tristan Karch^*, Jakhongir Saydaliev^*, Isabella Di Lenardo, and 1 more author

Computational Humanities Research (CHR), 2025

Cadastral data reveal key information about the historical organization of cities but are often non-standardized due to diverse formats and human annotations, complicating large-scale analysis. We explore as a case study Venice’s urban history during the critical period from 1740 to 1808, capturing the transition following the fall of the ancient Republic and the Ancien Régime. This era’s complex cadastral data, marked by its volume and lack of uniform structure, presents unique challenges that our approach adeptly navigates, enabling us to generate spatial queries that bridge past and present urban landscapes. We present a text-to-programs framework that leverages Large Language Models (LLMs) to process natural language queries as executable code for analyzing historical cadastral records. Our methodology implements two complementary techniques: a SQL agent for handling structured queries about specific cadastral information, and a coding agent for complex analytical operations requiring custom data manipulation. We propose a taxonomy that classifies historical research questions based on their complexity and analytical requirements, mapping them to the most appropriate technical approach. This framework is supported by an investigation into the execution consistency of the system, alongside a qualitative analysis of the answers it produces. By ensuring interpretability and minimizing hallucination through verifiable program outputs, we demonstrate the system’s effectiveness in reconstructing past population information, property features, and spatiotemporal comparisons in Venice.

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Apertus Team

Submitted to Association for Computational Linguistics (ACL), 2026

Contributed to the pre-training data through my ConLID project

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.

Other contributions

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Angelika Romanou, Negar Foroutan, Anna Sotnikova, and 54 more authors

International Conference on Learning Representations (ICLR), 2025

Contributed to collecting the Uzbek dataset

Spotlight Paper (Top 5% of Papers)

The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.