
🏆 Code Lingua 🏆

Intelligent CAT

This leaderboard evaluates LLMs in Programming Language Translation


🚨 File a request to add your models on our leaderboard! 🚨

While other leaderboards assess the ability of LLMs to understand Natural Language (NL) for code synthesis, the ultimate test of whether LLMs understand code syntax and semantics is code translation. Code Lingua serves as such a leaderboard: it compares the ability of LLMs to understand what code implements in a source language and reproduce the same semantics in a target language. The dataset used in this leaderboard can be accessed on HuggingFace 🤗.
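
As a minimal illustration (this is not the official evaluation harness), one way to check whether a translation preserves semantics is to run the source program and its translation on the same test inputs and compare their outputs; the command names and file names below are hypothetical:

import subprocess

def outputs_match(source_cmd, target_cmd, test_inputs):
    """Return True if the source and translated programs produce identical
    stdout on every test input (a proxy for semantic equivalence)."""
    for stdin_data in test_inputs:
        src = subprocess.run(source_cmd, input=stdin_data,
                             capture_output=True, text=True, timeout=10)
        tgt = subprocess.run(target_cmd, input=stdin_data,
                             capture_output=True, text=True, timeout=10)
        if src.stdout != tgt.stdout:
            return False
    return True

# Hypothetical usage: compare a Python source with its compiled Java translation.
# outputs_match(["python3", "solution.py"], ["java", "Solution"], ["1 2\n", "3 4\n"])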

🙏 Please cite our paper if you are using this leaderboard in your work 🙏

@inproceedings{pan2024lost,
  title = {Lost in translation: A study of bugs introduced by large language models while translating code},
  author = {Pan, Rangeet and Ibrahimzada, Ali Reza and Krishna, Rahul and Sankar, Divya and Wassi, Lambert Pouguem and Merler, Michele and Sobolev, Boris and Pavuluri, Raju and Sinha, Saurabh and Jabbarvand, Reyhaneh},
  booktitle = {2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
  pages = {866--866},
  year = {2024},
  organization = {IEEE Computer Society}
}

✉️ Reach out to Ali (alirezai@illinois.edu) or Rangeet (rangeet.pan@ibm.com) for questions about the leaderboard ✉️

📝 Notes

  1. We use Pass@1* (greedy decoding with temperature=0), as well as Pass@1 and Pass@5, to evaluate LLMs in our leaderboard. For Pass@1 and Pass@5, we report the maximum value across temperatures 0.2 and 0.8 (see the pass@k sketch after these notes).
  2. For "All Dataset", scores are averaged across all source-target language pairs.
  3. It is the model providers' responsibility to avoid data contamination as much as possible; in other words, we cannot guarantee whether the evaluated models are contaminated.
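
For reference, Pass@1 and Pass@5 are typically computed with the standard unbiased pass@k estimator of Chen et al. (2021); the sketch below assumes per-problem counts n (total samples) and c (correct samples), which are inputs not shown on this page:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn from n generations is correct, given
    that c of the n generations pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 correct out of 20 samples per problem
# pass_at_k(20, 5, 1) = 0.25, pass_at_k(20, 5, 5) ≈ 0.81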

🤗 More Leaderboards

In addition to the Code Lingua leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:

  1. EvalPlus
  2. Big Code Models Leaderboard
  3. Chatbot Arena Leaderboard
  4. CrossCodeEval
  5. ClassEval
  6. CRUXEval
  7. Evo-Eval
  8. HumanEval.jl
  9. InfiCoder-Eval
  10. LiveCodeBench
  11. RepoBench
  12. TabbyML Leaderboard

We would like to thank the authors of EvalPlus for their artifacts and leaderboard template 🙏