
Llama 3.1 405B MMLU performance

Posted: 2024-07-23 08:50
Views: 1092
Llama 3.1 405B Instruct
MMLU (CoT, 0-shot): 88.6
MMLU (5-shot): 87.3
MMLU PRO (CoT, 5-shot): 73.3

https://pastebin.com/9jGkYbXY

 
| Model | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|
| Llama 3.1 8B | 1.46M | 700 | 420 | 0 |
| Llama 3.1 70B | 7.0M | 700 | 2,040 | 0 |
| Llama 3.1 405B | 30.84M | 700 | 8,930 | 0 |
| Total | 39.3M | | 11,390 | 0 |
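
The Total row can be checked against the per-model rows with a few lines of Python. This is a minimal sketch, not part of the model card. The energy figure at the end rests on an assumption of mine: it treats the 700 W value as the constant draw of every GPU for every training hour, so it is only a rough upper bound, and the variable names are illustrative.

```python
# Per-model figures copied from the training table:
# (GPU hours in millions, power per GPU in watts, location-based tons CO2eq)
rows = {
    "Llama 3.1 8B": (1.46, 700, 420),
    "Llama 3.1 70B": (7.0, 700, 2040),
    "Llama 3.1 405B": (30.84, 700, 8930),
}

# Column sums, to compare with the "Total" row.
total_gpu_hours_m = sum(hours for hours, _, _ in rows.values())
total_emissions_t = sum(tons for _, _, tons in rows.values())

# ASSUMPTION (not from the source): each GPU draws the full 700 W all the time.
# Millions of GPU hours * watts = megawatt-hours; divide by 1000 for GWh.
energy_gwh = sum(hours * watts for hours, watts, _ in rows.values()) / 1000

print(f"GPU hours: {total_gpu_hours_m:.2f}M")
print(f"Location-based emissions: {total_emissions_t:,} tons CO2eq")
print(f"Energy upper bound: {energy_gwh:.2f} GWh")
```

The two sums agree with the reported totals of 39.3M GPU hours and 11,390 tons CO2eq.
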
The methodology used to determine training energy use and greenhouse gas emissions can be found [here](https://arxiv.org/pdf/2204.05149). Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

## Training Data

**Overview:** Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.

**Data Freshness:** The pretraining data has a cutoff of December 2023.

## Benchmark scores

In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library.

### Base pretrained models
| Category | Benchmark | # Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
| | MMLU PRO (CoT) | 5 | macro_avg/acc_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
| | AGIEval English | 3-5 | average/acc_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |
| | CommonSenseQA | 7 | acc_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |
| | Winogrande | 5 | acc_char | - | 60.5 | - | 83.3 | 86.7 |
| | BIG-Bench Hard (CoT) | 3 | average/em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |
| | ARC-Challenge | 25 | acc_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |
| Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |
| Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |
| | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |
| | BoolQ | 0 | acc_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |
| | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |

### Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |
|---|---|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc | 68.5 | 69.4 | 82.0 | 83.6 | 87.3 |
| | MMLU (CoT) | 0 | macro_avg/acc | 65.3 | 73.0 | 80.9 | 86.0 | 88.6 |
| | MMLU PRO (CoT) | 5 | micro_avg/acc_char | 45.5 | 48.3 | 63.4 | 65.1 | 73.3 |
| | IFEval | | | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |
| Reasoning | ARC-C | 0 | acc | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |
| | GPQA | 0 | em | 34.6 | 30.4 | 39.5 | 41.7 | 50.7 |
| | MuSR | 0 | correct | 56.3 | 45.7 | 55.1 | 58.1 | 56.7 |
| Code | HumanEval | 0 | pass@1 | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |
| | MBPP ++ base version | 0 | pass@1 | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |
| | Multipl-E HumanEval | 0 | pass@1 | - | 50.8 | - | 65.5 | 75.2 |
| | Multipl-E MBPP | 0 | pass@1 | - | 52.4 | - | 62.0 | 65.7 |
| Math | GSM-8K (CoT) | 8 | em_maj1@1 | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |
| | MATH (CoT) | 0 | final_em | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |
| Tool Use | API-Bank | 0 | acc | 83.6 | 82.6 | 85.1 | 90.0 | 92.0 |
| | Berkeley Function Calling | 0 | acc | 76.1 | 76.1 | 83.0 | 85.1 | 88.5 |
| | Gorilla Benchmark API Bench | 0 | acc | 8.8 | 8.2 | 14.7 | 29.7 | 35.3 |
| | Nexus (0-shot) | 0 | macro_avg/acc | 37.6 | 38.5 | 47.8 | 56.7 | 58.7 |
| Multilingual | Multilingual MGSM | 8 | em | - | 68.2 | - | 85.6 | 90.3 |

#### Multilingual benchmarks
| Category | Benchmark | Language | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|---|---|
| General | MMLU (5-shot, macro_avg/acc) | Portuguese | 62.12 | 80.13 | 84.95 |
| | | Spanish | 62.45 | 80.05 | 85.08 |
| | | Italian | 61.63 | 80.4 | 85.04 |
| | | German | 60.59 | 79.27 | 84.36 |
| | | French | 62.34 | 79.82 | 84.66 |
| | | Hindi | 50.88 | 74.52 | 80.31 |
| | | Thai | 50.32 | 72.95 | 78.21 |
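
To compare the multilingual MMLU numbers with the English ones at a glance, the seven per-language scores can be averaged per model. This is a rough sketch: the unweighted mean is my own summary statistic, not a figure from the model card, and the variable names are illustrative.

```python
# Multilingual MMLU (5-shot, macro_avg/acc) scores copied from the table,
# in the order: Portuguese, Spanish, Italian, German, French, Hindi, Thai.
scores = {
    "Llama 3.1 8B": [62.12, 62.45, 61.63, 60.59, 62.34, 50.88, 50.32],
    "Llama 3.1 70B": [80.13, 80.05, 80.4, 79.27, 79.82, 74.52, 72.95],
    "Llama 3.1 405B": [84.95, 85.08, 85.04, 84.36, 84.66, 80.31, 78.21],
}

# Unweighted mean over the seven languages (an illustrative summary only).
means = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, mean in means.items():
    print(f"{model}: {mean:.2f}")
```

For the 405B model the mean works out to about 83.2, a few points below its 87.3 on English 5-shot MMLU, with Hindi and Thai accounting for most of the gap.
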