LLM evaluation

Posted by lightsong on 2024-08-01

TinyEval

https://github.com/datawhalechina/tiny-universe/tree/main/content/TinyEval

https://huzixia.github.io/2024/05/29/eval/

https://meeting.tencent.com/user-center/shared-record-info?id=8b9cf6ca-add6-477b-affe-5b62e2d8f27e&from=3


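First, choose an appropriate evaluation metric based on the task type of the target dataset.
Then, write a guiding prompt for the model based on the format of the target data.
Next, pick a suitable answer-extraction method based on the model's preliminary predictions.
Finally, compute the score from each pred and its corresponding answer.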
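The four steps above can be sketched for a multiple-choice task as follows. This is a minimal illustration, not TinyEval's actual code: the prompt wording, the regex, and the names `accuracy`, `build_prompt`, `extract_choice`, `evaluate`, and the `generate` callable are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Optional

# Step 1: metric. Accuracy is a reasonable choice for single-answer multiple choice.
def accuracy(preds: List[Optional[str]], answers: List[str]) -> float:
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(preds, answers))
    return correct / len(answers)

# Step 2: prompt. A hypothetical template that mirrors the data format.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question. "
    "Reply with a single letter (A, B, C, or D).\n\n"
    "{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer:"
)

def build_prompt(item: Dict[str, str]) -> str:
    # Extra keys such as "answer" are ignored by str.format.
    return PROMPT_TEMPLATE.format(**item)

# Step 3: extraction. Take the first standalone capital letter A-D in the raw output.
def extract_choice(raw: str) -> Optional[str]:
    match = re.search(r"\b([ABCD])\b", raw)
    return match.group(1) if match else None

# Step 4: scoring. Compare each extracted pred with its reference answer.
def evaluate(items: List[Dict[str, str]], generate: Callable[[str], str]) -> float:
    preds = [extract_choice(generate(build_prompt(item))) for item in items]
    answers = [item["answer"] for item in items]
    return accuracy(preds, answers)
```

Here `generate` stands in for whatever model call is being evaluated. The extraction rule is deliberately naive (a sentence-initial "A" would be picked up as a choice), which is exactly why the third step says to inspect preliminary predictions before settling on an extraction method.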

opencompass

https://opencompass.org.cn/home

OpenCompass is an open-source, efficient, and comprehensive large model evaluation system and open platform from Shanghai AI Laboratory.

C-Eval

https://opendatalab.com/OpenDataLab/C-Eval/tree/main

https://hub.opencompass.org.cn/dataset-detail/C-Eval

New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.
Meta Data

Each record in the dataset has the following fields:

Question: The body of the question
A, B, C, D: The options which the model should choose from
Answer: (Only in dev and val set) The correct answer to the question
Explanation: (Only in dev set) The reason for choosing the answer.

Example

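Question: For the UDP protocol, if reliable transmission is required, at which layer should it be implemented ____
A. Data link layer
B. Network layer
C. Transport layer
D. Application layer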
Answer: D
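As a rough sketch of how a record with these fields could be held in code and turned back into the text layout shown in the example, consider the following. The `CEvalRecord` class and `format_record` helper are hypothetical names for illustration, not part of C-Eval or OpenCompass, and the sample text is an English rendering of the question above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CEvalRecord:
    question: str
    A: str
    B: str
    C: str
    D: str
    answer: Optional[str] = None       # present only in the dev and val sets
    explanation: Optional[str] = None  # present only in the dev set

def format_record(rec: CEvalRecord, include_answer: bool = False) -> str:
    """Render a record in the Question / options / Answer layout."""
    lines = [f"Question: {rec.question}"]
    for letter in "ABCD":
        lines.append(f"{letter}. {getattr(rec, letter)}")
    if include_answer and rec.answer is not None:
        lines.append(f"Answer: {rec.answer}")
    else:
        # Leave the answer slot open for the model to complete.
        lines.append("Answer:")
    return "\n".join(lines)

sample = CEvalRecord(
    question="For the UDP protocol, if reliable transmission is required, "
             "at which layer should it be implemented ____",
    A="Data link layer",
    B="Network layer",
    C="Transport layer",
    D="Application layer",
    answer="D",
)

print(format_record(sample, include_answer=True))
```

Calling `format_record` with `include_answer=True` on dev-set records gives worked examples that could be placed ahead of a test question, while the default leaves the final `Answer:` open for the model.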

lmdeploy

https://lmdeploy.readthedocs.io/zh-cn/latest/benchmark/evaluate_with_opencompass.html

https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/evaluation_turbomind.html

Issues with OpenAI

https://github.com/open-compass/opencompass/discussions/1100

https://github.com/open-compass/opencompass/issues/673

dataset

https://github.com/open-compass/opencompass/releases/tag/0.2.2.rc1

https://zhuanlan.zhihu.com/p/669291064

LVLM

https://mmbench.opencompass.org.cn/home

https://github.com/open-compass/VLMEvalKit

https://github.com/open-compass/MMBench/tree/main/samples
