Post

Benchmark project presented at a RadiUnce research meeting in Utrecht

Benchmark project presented at a RadiUnce research meeting in Utrecht

Summary

Bastián González-Bustamante, together with Christopher Klamm, Tom Bellens, and Marta Koch, presented our benchmark project at a RadiUnce research meeting in Utrecht on 20 May 2026.

The presentation introduced the project’s broader goal of developing a transparent and reproducible framework for evaluating LLMs and fine-tuned models on multilingual policy agenda annotation in parliamentary speeches. It also provided an opportunity to discuss the benchmark in the context of the RadiUnce project, Politicians under Radical Uncertainty, which examines how politicians respond to different forms of uncertainty across political systems and reflects on lessons and guidelines when working with computational analysis of parliamentary speech.

Key highlight

A central highlight of the presentation was a comparative plot showing our preliminary benchmark results for original-language speeches and their machine-translated English versions. The figure compares model performance across recent OpenAI GPT models, DeepSeek, and Qwen 3, providing an initial view of how these model families perform under the two input conditions.

The main takeaway so far is the very high level of agreement between classifications based on original-language speeches and those based on machine translations, with Krippendorff’s alpha = 0.976. This suggests that, at least in the current benchmark stage, machine translation preserves the information needed for highly consistent policy agenda classification across languages.

Preliminary comparison of benchmark results in original-language speeches and machine translations across OpenAI GPT models, DeepSeek, and Qwen 3

More information

This post is licensed under CC BY 4.0 by the authors.