Language Models Entangle
Language and Culture

Abstract

Users should not be systematically disadvantaged by the language they use to interact with LLMs: responses should be of similar quality regardless of the query language. In this work, we create a set of real-world, open-ended questions based on our analysis of the WildChat dataset and use it to evaluate whether response quality depends on the language used to query the model. We also investigate how language and culture are entangled in LLMs, such that the choice of language changes the cultural information and context used in the response, by using LLM-as-a-Judge to identify the cultural context present in responses. To probe this further, we evaluate LLMs on a translated subset of the CulturalBench benchmark across multiple languages. Our evaluations reveal that LLMs consistently provide lower-quality answers to open-ended questions in low-resource languages, and that language significantly impacts the cultural context the model draws on, which in turn affects the quality of the downstream answer.

Methodology Visualization

Dataset

We created a set of 20 advice-seeking questions covering a wide variety of topics, such as Healthcare, Business, and Education, to evaluate performance on open-ended tasks. The questions are grounded in our analysis of the WildChat dataset.

Our analysis began with filtering and cleaning. From the WildChat dataset, we retained only English queries and removed queries about programming bugs or error fixes, as they are niche and skew the dataset. We kept queries between 40 and 400 characters long and removed duplicate or highly similar queries using the fuzzywuzzy library with a similarity threshold of 60.
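The sketch below illustrates this filtering and deduplication step. It is a minimal example rather than our exact pipeline: the fuzzywuzzy scorer (token_sort_ratio) and the helper name are assumptions; only the 40-400 character range and the similarity threshold of 60 come from the description above.

```python
from fuzzywuzzy import fuzz

def filter_and_dedupe(queries, min_len=40, max_len=400, sim_threshold=60):
    """Keep queries of reasonable length and drop near-duplicates.

    The scorer (token_sort_ratio) is an assumption; the write-up only
    specifies fuzzywuzzy with a threshold of 60.
    """
    # Length filter: 40-400 characters, as described above.
    in_range = [q for q in queries if min_len <= len(q) <= max_len]

    kept = []
    for q in in_range:
        # Drop a query if it is too similar to one we already kept.
        if any(fuzz.token_sort_ratio(q, k) > sim_threshold for k in kept):
            continue
        kept.append(q)
    return kept
```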

We embedded the remaining queries with the Qwen3-0.6B embedding model and clustered them with the HDBSCAN algorithm, then manually analyzed the clusters to write the evaluation questions, which were phrased in a culture-independent manner. Finally, we translated the questions into Chinese, Hindi, Brazilian Portuguese, Swahili, and Hebrew using the Gemini-2.5-Flash model.
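A minimal sketch of the embedding and clustering step follows. It assumes the Hugging Face checkpoint Qwen/Qwen3-Embedding-0.6B loaded through sentence-transformers, takes `queries` to be the deduplicated list from the previous sketch, and uses an illustrative min_cluster_size; none of these specifics are stated in the description above.

```python
import hdbscan
from sentence_transformers import SentenceTransformer

# Embed the cleaned queries. The checkpoint identifier and parameters below
# are assumptions; the write-up only names the Qwen3-0.6B embedding model.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(queries, normalize_embeddings=True)

# Density-based clustering; the resulting clusters are then inspected
# manually to draft the 20 culture-independent evaluation questions.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # -1 marks noise points
```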

Queries by Category

Programming Advice
  • I want to learn {programming language}, can you suggest a plan to start with?
  • How do I master software engineering and system design concepts?
Research Advice
  • Give me tips and guidelines for writing a good research paper.
  • I am a beginner in {field} research. Give me ideas for research problems to work on. Provide ideas along with existing research for reading further.
Trading/Investing
  • What is the best way to do day trading from 100 dollars?
  • I want to buy a house which is costing me 75 lakhs and my monthly earning is 50k. Do you have any suggestions for this?
  • I want to invest my retirement savings. I have {amount}. How do I split my investment across equity, real estate, gold, debt?
Learning
  • Give me a study plan to learn {language}.
Business/Marketing
  • I'm looking for a comprehensive business plan to launch a new venture selling printed shirt designs. This plan should detail the initial setup steps and provide a creative, step-by-step social media and Instagram strategy to help me become a star in this industry.
  • Give me 5 tricks about digital marketing on instagram.
  • Act as a business analyst, brainstorm for me 5 novel business ideas that are able to form a startup company using NLP technology.
  • What are the 10 fastest niches in tech growing fast you would focus on to look for problems to solve and start a startup in?
  • What are some ways to make money as a 13 year old from home?
Job/Interview
  • Provide me interview questions and answers for {job role} to help me prepare for the interview.
Health/Medicine
  • Suggest home remedies for {issue}.
  • How do I improve the quality and duration of my sleep? Give tips and a schedule to follow.
  • Over the past few years, I've noticed a significant decline in my memory. This concerns me, and I worry about the possibility of developing conditions like madness or Alzheimer's in the future. I'm seeking advice on how to improve my overall well-being, specifically my physical, brain, and psychological health.
  • Make me a meal plan of a week's worth of meals. I need 3 meals a day. I must hit a protein goal of 120 grams of protein everyday, and my calories for each day is 3000 calories. Be as detailed as possible and include the calorie count and protein count for each meal. I don't eat meat. Create a grocery list as well.
  • Give me a workout routine for calisthenics with the split push pull legs upper lower workout for beginner and with just dumbbells 5-10 kg, pull-up bar, resistance band, bodyweight, gymnastic rings, and best calories intake for 76kg and 173cm tall.
  • Create a 30 minute work out routine for beginners that I can do at home.

Translated CulturalBench

We also translated a subset of the CulturalBench dataset into Hindi, Chinese, Brazilian Portuguese, Swahili, and Hebrew to further investigate cultural entanglement. The following graph presents the performance of Qwen3-14B on the Multilingual CulturalBench by language and country of origin.

CulturalBench Multilingual Results

Qwen3-14B performance on the Multilingual CulturalBench by language and country of origin.
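For readers reproducing the breakdown shown above, the sketch below shows one way to aggregate per-item results into accuracy by language and country of origin. The file name and column names (language, country, correct) are assumptions about a plausible results format, not artifacts we release.

```python
import pandas as pd

# Hypothetical per-item results: one row per (question, language) evaluation,
# with a boolean `correct` column indicating a right multiple-choice answer.
results = pd.read_json("culturalbench_multilingual_results.jsonl", lines=True)

# Accuracy broken down by query language and the country the question is
# about, mirroring the grouping used in the graph above.
accuracy = (
    results.groupby(["language", "country"])["correct"]
    .mean()
    .unstack("language")  # rows: country of origin, columns: language
)
print(accuracy.round(3))
```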

Evaluation

Poster Overview

Overview of Evaluation Methodology

Evaluation Setup

For each question-language pair, we generate 10 responses per model with the temperature set to 1 and evaluate every response using LLM-as-a-Judge with the temperature set to 0. Our evaluation covers the following models: Qwen3-14B, Cohere-Aya-32B, Cohere-Aya-8B, Magistral, and Sarvam-m.
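The sketch below outlines this generation-and-judging loop, assuming an OpenAI-compatible chat API; the client setup, prompt format, and rubric placeholder are simplifications rather than our exact implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving the models used

N_SAMPLES = 10  # responses per (question, language, model) triple

def generate_responses(model, question, n=N_SAMPLES):
    """Sample n responses at temperature 1, as in our evaluation setup."""
    out = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=1.0,
        )
        out.append(resp.choices[0].message.content)
    return out

def judge_response(judge_model, question, response, rubric):
    """Score one response with the LLM judge at temperature 0 (simplified prompt)."""
    prompt = f"{rubric}\n\nQuestion:\n{question}\n\nResponse:\n{response}\n\nScore (1-5):"
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```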

LLM-as-a-Judge Setup

To ensure the quality of evaluation with LLM-as-a-Judge, we ran several ablations to choose the best configuration. We took a subset of 10 of our 20 queries and a subset of the languages: English, Hindi, Chinese, and Hebrew. For each query-language pair, we prompted Cohere-Aya-32B with the evaluation rubric to generate 5 responses, one targeting each score from 1 to 5. We then use these responses to evaluate our judge, treating the score each response was generated for as its ground-truth score.

We use the Cohere Command-A model as the judge and test 6 configurations of LLM-as-a-Judge:

(i) Original query along with original response (Baseline)

(ii) Original query along with response translated to English

(iii) Query translated to English along with original response

(iv) Original query and original response along with 2 reference responses as examples

(v) Original query and original response along with 4 reference responses as examples

(vi) Original query and original response along with 8 reference responses as examples

We note that we provide the judge only randomly chosen reference responses without any attached scores, which distinguishes our methodology from few-shot prompting and removes the need for human-evaluated reference responses. Providing reference examples leads to higher alignment with the ground-truth scores, as measured by Pearson correlation and Cohen's kappa. Using the original query and response along with 8 randomly chosen examples yields the highest alignment, so we choose this configuration for our evaluations.
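Alignment with the ground-truth scores can be computed as sketched below, using Pearson correlation and Cohen's kappa as named above; whether the kappa is weighted is a choice made here for illustration, not something specified in the setup.

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def judge_alignment(ground_truth, judge_scores):
    """Compare judge scores (1-5) against the rubric-targeted ground truth."""
    r, _ = pearsonr(ground_truth, judge_scores)
    # Plain (unweighted) kappa; a weighted variant is another reasonable choice.
    kappa = cohen_kappa_score(ground_truth, judge_scores)
    return {"pearson_r": r, "cohen_kappa": kappa}

# Toy example: perfect agreement on five graded responses.
print(judge_alignment([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
```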

Judge Ablation Results

LLM-as-a-Judge Ablation Results

Results

Comparison of Response Quality by Language and Model

Comparison of Response Quality by Language and Model

Language Culture Entanglement

Percentage of responses classified into each culture, by language

BibTeX

@misc{jain2026languagemodelsentanglelanguage,
      title={Language Models Entangle Language and Culture}, 
      author={Shourya Jain and Paras Chopra},
      year={2026},
      eprint={2601.15337},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.15337}, 
}