HumanEval (2021)

Posted by lusonixs on 2024-05-30

Original URL: https://www.cnblogs.com/epicblue/p/18223297

HumanEval: Hand-Written Evaluation Set

HumanEval is an evaluation dataset for code-generating LLMs, built by OpenAI in 2021.

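Data format
All data is stored in a single JSON Lines file (one record per line). Each record contains a prompt, a reference solution, and several test cases.

Below is an excerpt of the first record.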

{
"task_id": "HumanEval/0",
"prompt":
"from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
",

"entry_point": "has_close_elements",

"canonical_solution":
"    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
",

"test":
"METADATA = {
    'author': 'jt',
    'dataset': 'test'
}

def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

"
}
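
To make the role of each field concrete, here is a minimal sketch of how one record could be turned into a runnable test: the prompt, a completion (here the canonical solution), the test code, and a call to check on the entry point are concatenated and executed. The dict below is a trimmed copy of the record above with only two assertions, and run_problem is a hypothetical helper name, not part of the dataset. It uses plain exec with no sandboxing, so it is only suitable for code you trust.

```python
# Trimmed copy of the HumanEval/0 record; field names follow the record above.
problem = {
    "task_id": "HumanEval/0",
    "prompt": (
        "from typing import List\n\n\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        '    """ Check if in given list of numbers, are any two numbers closer\n'
        "    to each other than given threshold.\n"
        '    """\n'
    ),
    "entry_point": "has_close_elements",
    "canonical_solution": (
        "    for idx, elem in enumerate(numbers):\n"
        "        for idx2, elem2 in enumerate(numbers):\n"
        "            if idx != idx2:\n"
        "                distance = abs(elem - elem2)\n"
        "                if distance < threshold:\n"
        "                    return True\n\n"
        "    return False\n"
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n"
        "    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n"
    ),
}


def run_problem(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the record's check function."""
    # The completion is a function body, so it is appended directly to the prompt.
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    namespace: dict = {}
    try:
        exec(program, namespace)  # assumption: trusted code, no sandbox
    except Exception:
        return False
    return True


passed = run_problem(problem, problem["canonical_solution"])
print(problem["task_id"], "passed" if passed else "failed")
```

Any model completion that is a bare function body can be passed as the second argument in place of the canonical solution.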

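Prompt style
As the record shows, the prompt consists of the function signature line plus the function docstring. The docstring describes the behavior and gives several input/output examples.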
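
The same style can be reproduced for your own tasks. The following is a minimal sketch that assembles such a prompt from its parts; build_prompt and its parameters are hypothetical names introduced for illustration, and the layout simply mirrors the HumanEval/0 prompt shown above.

```python
from typing import List, Tuple


def build_prompt(
    signature: str,
    description: str,
    examples: List[Tuple[str, str]],
    imports: str = "",
) -> str:
    """Assemble a prompt: optional imports, the signature line, then a docstring
    holding the description and doctest-style input/output examples."""
    lines: List[str] = []
    if imports:
        lines += [imports, "", ""]
    lines.append(signature)
    lines.append(f'    """ {description}')
    for call, expected in examples:
        lines.append(f"    >>> {call}")
        lines.append(f"    {expected}")
    lines.append('    """')
    return "\n".join(lines) + "\n"


prompt = build_prompt(
    signature="def has_close_elements(numbers: List[float], threshold: float) -> bool:",
    description="Check if in given list of numbers, are any two numbers closer "
                "to each other than given threshold.",
    examples=[
        ("has_close_elements([1.0, 2.0, 3.0], 0.5)", "False"),
        ("has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)", "True"),
    ],
    imports="from typing import List",
)
print(prompt)
```

The prompt deliberately ends right after the docstring, so a completion model continues with the function body.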

Format the prompt above as an ollama prompt.

Prompt
"""from typing import List

Testing a newly released LLM: codestral 22b, quantized to Q4_1 (oddly, a 4060 Ti 16G cannot fully hold Q5_0)

Output
This problem can be solved by sorting the list of numbers and then checking if any two adjacent numbers are less
than or equal to the threshold apart from each other. If such a pair is found, we return True; otherwise, we
return False after checking all pairs. Here's how you could do this in Python:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    numbers.sort()  # sort the list of numbers
    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i - 1]) <= threshold:
            return True  # two close elements are found
    return False  # no two close elements are found

You can test this function with the provided examples:

print(has_close_elements([1.0, 2.0, 3.0], 0.5))  # prints: False
print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))  # prints: True

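The code passes all of the test cases, so codestral adapts to this prompt style and its reasoning is sound, although it compares with <= where the docstring asks for numbers strictly closer than the threshold.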
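
The pass result can be reproduced by running the dataset's check function against the generated function. A minimal sketch, with the function copied from the output above and check copied from the test field of the HumanEval/0 record:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Function as generated by codestral (note: sorts the caller's list in place).
    numbers.sort()
    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i - 1]) <= threshold:
            return True
    return False


def check(candidate):
    # Assertions copied from the "test" field of HumanEval/0.
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False


check(has_close_elements)  # raises AssertionError on the first failing case
print("all assertions passed")
```

Sorting first means only adjacent pairs need comparing, which is why this solution is shorter than the nested-loop canonical solution.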

Testing a veteran LLM: codellama:34b-python-q4_K_M

Output
# turn into list sorted in ascending order
numbers = sorted(numbers)

for i, element in enumerate(numbers):
    if (i >= len(numbers)):
        return False

    diff = abs(element - numbers[i+1])
    print(f"diff: {diff}")

    if (diff < threshold):
        return True

return False

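This code reportedly passes all test cases, suggesting that codellama 34B also adapts to this prompt style and reasons correctly, which is not bad. As printed, however, the guard i >= len(numbers) can never be true, so numbers[i+1] raises IndexError on the last iteration whenever no pair is closer than the threshold. The test cases that expect False only pass if the guard reads i >= len(numbers) - 1.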
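
Because this model returns only a function body, checking it means placing the body under the prompt's signature and running the dataset's check function. The sketch below does that for the body as printed above (debug print omitted). with_fixed_guard is a hypothetical variant whose guard compares against len(numbers) - 1; it is not model output.

```python
from typing import List


def as_printed(numbers: List[float], threshold: float) -> bool:
    # Body copied from the output above, placed under the prompt's signature.
    numbers = sorted(numbers)
    for i, element in enumerate(numbers):
        if i >= len(numbers):  # enumerate stops at len(numbers) - 1
            return False
        diff = abs(element - numbers[i + 1])  # out of range on the last index
        if diff < threshold:
            return True
    return False


def with_fixed_guard(numbers: List[float], threshold: float) -> bool:
    # Hypothetical patch: stop before the last element, which has no successor.
    numbers = sorted(numbers)
    for i, element in enumerate(numbers):
        if i >= len(numbers) - 1:
            return False
        diff = abs(element - numbers[i + 1])
        if diff < threshold:
            return True
    return False


def check(candidate):
    # Assertions copied from the "test" field of HumanEval/0.
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False


def passes(candidate) -> bool:
    """Return True if the candidate survives every assertion without raising."""
    try:
        check(candidate)
    except (AssertionError, IndexError):
        return False
    return True


print("as printed:", passes(as_printed))
print("fixed guard:", passes(with_fixed_guard))
```

Under these assumptions the two candidates differ only on inputs where no adjacent pair is closer than the threshold, which is exactly where the loop reaches the last index.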

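Summary
The dataset can be used to test LLMs.
Its prompt style can be reused when writing prompts for local LLMs.
The dataset is somewhat dated.
The solutions it expects are roughly 10 lines of code.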

References
https://blog.csdn.net/qq_27590277/article/details/135163862
https://github.com/openai/human-eval