Benchmarking Chatbot Performance
Category: Project
Description: This study introduces the Public-sector Chatbot Performance (PCP) framework, a novel and comprehensive approach to systematically assess AI chatbot performance in public administration. The framework evaluates both technical competence—factual accuracy, completeness, and source reliability—and normative integrity, including lawfulness, transparency, equality, and privacy. To demonstrate the applicability of the PCP framework, we benchmark the full set of municipal chatbot systems currently deployed in Dutch local governments, alongside two leading proprietary large language models (LLMs): ChatGPT-4o and Gemini 2.5 Pro. Using a pragmatic mixed-methods approach, we developed 26 prompts with systematic user-based variation to explore algorithmic bias, resulting in a dataset of n=326 user-chatbot interactions. Quantitative analysis revealed that ChatGPT-4o achieved a composite performance score of 95.7%, significantly outperforming all municipal systems. Municipal chatbots exhibited notable shortcomings in both competence and integrity, with some failing to meet basic standards of lawful and equal service provision. Exploratory qualitative analysis further uncovered algorithmic opacity, discretionary advice in violation of Dutch good-governance regulations, and discriminatory responses based on “ethnic” usernames. These insights challenge assumptions about neutrality in public-sector AI and underscore the need for ethical benchmarks in chatbot evaluation. The PCP framework offers actionable guidance for policymakers, technologists, and scholars committed to responsible digital governance.
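The composite score mentioned above aggregates ratings across the framework's seven dimensions. The description does not specify the scoring scale or weighting scheme, so the sketch below is an illustration only: it assumes each response is rated 0–1 on every dimension and that all dimensions are weighted equally. The function name and data layout are hypothetical, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code): aggregating per-dimension
# ratings into a composite PCP-style score, assuming 0-1 ratings and equal
# weights across all seven dimensions named in the description.

from statistics import mean

COMPETENCE = ["factual_accuracy", "completeness", "source_reliability"]
INTEGRITY = ["lawfulness", "transparency", "equality", "privacy"]
DIMENSIONS = COMPETENCE + INTEGRITY


def composite_score(interactions: list[dict[str, float]]) -> float:
    """Average all dimension ratings across a chatbot's rated interactions.

    Each element of `interactions` maps every dimension name to a 0-1 rating,
    e.g. {"factual_accuracy": 1.0, ..., "privacy": 1.0}.
    """
    per_response = [
        mean(ratings[d] for d in DIMENSIONS) for ratings in interactions
    ]
    return mean(per_response)


# Example: two rated interactions for a single chatbot.
example = [
    {d: 1.0 for d in DIMENSIONS},
    {d: (0.5 if d == "completeness" else 1.0) for d in DIMENSIONS},
]
print(f"Composite score: {composite_score(example):.1%}")  # -> 96.4%
```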