Lost in AI mistranslation: LLMs put to the test in Arabic, Farsi, Pashto and Kurdish
Research collaboration by Taraaz and Respond Crisis Translation
Through RCT's direct service work, we see each day that the unsupervised, increasingly institutionalized use of AI machine translation (MT) tools is wreaking havoc across immigration, asylum, resettlement, and healthcare systems. Due to systemic lack of funding for language access work, asylum seekers are forced to rely on MT tools to translate the hundreds of documents that make up their asylum applications, often resulting in their applications being denied based on “inconsistencies” created by AI-generated mistranslations. Beyond asylum, communities with limited English proficiency are forced to navigate error-ridden (sometimes to the point of unintelligibility) AI-machine translations of key resources and information about their immigration and legal cases, health insurance, and healthcare, including abortion access, domestic violence support, and psychological and emotional care and support.
Our team has been increasingly alarmed by the rapid adoption of unsupervised MT tools by organizations and government institutions, and working to combat the harm caused by AI-generated mistranslations through documenting and appealing denied asylum claims caused by mistranslations in asylum case files, filing Expert Language Declarations to reverse deportation orders, training linguists and attorneys to identify the AI-generated mistranslations in the immigration and asylum contexts, and building a scaled platform to track, map and intervene in cases of asylum denials based on AI-mistranslations.
The opportunity to expand upon and formalize this work with Taraaz, via evaluating the performance of large language model outputs in Arabic, Farsi, Pashto, and Kurdish Sorani, was an exciting opportunity for our team to formalize our insights and evidence. Taraaz, founded and directed by Roya Pakzad, is a nonprofit research and advisory organization, investigating the human rights implications of emerging AI systems and other digital technologies, and drawing on cross-disciplinary insight and technical expertise to translate research into actionable guidelines, evaluation tools, and policy recommendations. In 2025, 12 RCT linguists worked closely with the Taraaz team on the Multilingual AI Safety Evaluation Lab, an open-source platform developed by Roya Pakzad for evaluating the performance of large language models across languages and contexts.
RCT's linguists, working in Arabic, Pashto, Farsi, and Kurdish Sorani, evaluated how the three most widely-used AI models (GPT-4o, Gemini 2.5 Flash, and Mistral Small) respond to the kinds of questions refugees and asylum seekers ask: about healthcare, immigration, asylum and legal processes, school enrollment, and digital surveillance. In total, 8 linguists performed 655 evaluations. The findings, as well as comprehensive guidelines for the deployment of these tools, are available publicly.
What we found:
Gaps in accuracy and empathy are consistently present, and are largest for speakers of marginalized languages
Across every model and every situation we measured, non-English responses were less linguistically accurate and less contextually specific, actionable and empathetic than they were in English. Kurdish Sorani and Pashto showed the biggest differential on accuracy, tone and empathy, and almost as much on actionability.
“Through contributing to the AI evaluation project with Respond Crisis Translation, I learned that there are significant differences and inequalities in how services are provided to immigrants depending on the language they use. This varied from one AI model to another, but all models shared a common pattern: they provided less informative, less accurate, and less accessible responses in Kurdish compared to English. In some cases, the Kurdish output was complete nonsense, with certain models generating text that was entirely unusable,” Rebaz, Kurdish Sorani linguist.
This is the same pattern we see in machine translation: the less representation a language has in training data, the less intelligible and contextually relevant the output. Over 55% of internet content is in English; the second most common language represents only 5%. Even content produced in languages considered comparatively “well-resourced”, like Spanish and Mandarin, produce mistranslations as LLMs are unable to account for tone, register, nuance, and colloquialisms. Performance of LLMs worsens considerably in poorly resourced languages, particularly Indigenous, and local languages, meaning that the communities most disadvantaged by the poor performance of AI tools are the same communities most likely to be in crisis situations where accurate, reliable, contextually-specific information is crucial for their survival.
These models assume a world that refugees and asylum seekers don't live in
One of the most consistent and dangerous patterns our linguists flagged was that AI systems operate on a false assumption that government institutions are safe and accessible to users, which is not the case for refugees and asylum seekers who are targeted by the regimes they are forced to flee, as well as by the governments of the countries where they are seeking safety.
For example: When asked about enrolling an undocumented child in school, the model replied advising the user to contact local authorities - which would result in detention and deportation. In one case, the Farsi response to a situation about a political refugee advised contacting the Iranian embassy -
This same pattern was observed in healthcare scenarios: a person asking about chest pain, night sweats, and weight loss, explaining they cannot access a doctor because they are undocumented, received a list of herbal remedies in every non-English evaluation run. In English, Gemini sometimes refused to provide that list, and instead flagged the severity of the symptoms and advised professional medical care. That refusal never appeared in Pashto, Arabic, or Kurdish. The safety check was there, but only in English.
Automated evaluation can't replace human judgment, especially for communities experiencing language violence
The evaluation platform includes an "LLM-as-a-Judge" layer, in which a Gemini-powered automated judge performs the evaluation of MT outputs that our human evaluators performed. While the idea of using AI-tools to evaluate their own outputs is gaining traction in the AI community, it must be understood that there is a huge gap in how it performs compared to human evaluators.
In our comparison of human translators evaluating AI-translations:
Human evaluators frequently used the rating "unsure" when evaluating the “fairness” of outputs. The realities that face refugees and asylum seekers are complex, and our linguists, as human being (and also as trained, professional, trauma-informed language workers) can understand, empathize with, and hold complexity. Meanwhile, the AI judge was never unsure—not once, across all 655 evaluations. It assigned a confident yes or no every time, and it caught fewer problems at every level.
When judging tone and empathy, the linguists flagged disparities in 241 cases, and the AI judge only found 159.
In several instances, the judge identified a safety disclaimer that wasn't actually present in the response, making a model appear safer than it actually was.
The results are clear: using AI tools to evaluate themselves is not adequate. Just as human linguists reviewing outputs of AI-translation tools is the only way to ensure safe and accurate outputs, human evaluation of AI-tools is essential building safe tools.
What this means for organizations deploying AI in humanitarian contexts
Organizations working with immigrant populations, from UNHCR and the IRC, to government institutions like DHS, are actively deploying AI-powered chatbots and translation tools to deliver critical information to the communities they serve. Given the demonstrated inaccuracy of these tools, this requires careful oversight and accountability, which means centering and funding the work of trauma-informed, human translators and interpreters to review AI-translations, especially in high-stakes, life or death asylum, detention, immigration, resettlement and healthcare contexts. It also requires using language-specific evaluation of models, performed by trauma-informed evaluators who are native speakers of the languages and know the contexts before tools are used.
Full recommendations are at multilingualailab.com/recommendations, and cover what deployers, developers, and AI labs each need to do differently. Key takeaway is that AI translation outputs should be reviewed and rigorous evaluation must be done by skilled, trained, and trauma-informed linguists and evaluators.
The Multilingual AI Safety Evaluation Lab is open source, and the dataset is public. Platform and full methodology: multilingualailab.com If your organization is considering using AI translation tools for multilingual communities, or if you have encountered cases of AI-generated mistranslations, please reach out – we would love to be in touch.
Meg Sears, RCT Co-founder and Director of Research and Tech
Roya Pakzad, Taraaz Founder and Director