The benchmark should check that the LLM's output language matches the language of the question, not the language of the provided context. We should also try to improve model performance by explicitly prompting the models to answer in the question's language. |