Daily Life of an Apple Test Engineer (苹果测试工程师的日常)
The params recommended for thinking/general tasks work well; however, the benchmarks felt like they were running forever.
Intelligence Benchmark Comparison
Benchmark    Mode     Sampled       Qwopus3.5-9B-v3.5-oQ8-mtp
---------------------------------------------------------------
MMLU         Sample   1000/14042    86.3%
CMMLU        Sample   300/11582     82.3%
JMMLU        Sample   300/7536      83.0%
TRUTHFULQA   Full     817           82.4%
HUMANEVAL    Full     164           88.4%
--- Detail ---
Model: Qwopus3.5-9B-v3.5-oQ8-mtp
Benchmark    Accuracy   Correct   Total   Time(s)   Think
--------------------------------------------------------------
MMLU         86.3%      863       1000    27876.7   Yes
CMMLU        82.3%      247       300     7532.3    Yes
JMMLU        83.0%      249       300     8052.7    Yes
TRUTHFULQA   82.4%      673       817     20773.1   Yes
HUMANEVAL    88.4%      145       164     6723      Yes
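
To put a number on "running forever", here is a minimal Python sketch that recomputes accuracy and time per question from the detail table above. The rows are copied from that table; the names `ROWS`, `summarize`, and `total_hours` are made up for this example and are not part of any benchmark tool.

```python
# Rows copied from the detail table: (benchmark, correct, total, seconds).
ROWS = [
    ("MMLU", 863, 1000, 27876.7),
    ("CMMLU", 247, 300, 7532.3),
    ("JMMLU", 249, 300, 8052.7),
    ("TRUTHFULQA", 673, 817, 20773.1),
    ("HUMANEVAL", 145, 164, 6723.0),
]


def summarize(rows):
    """Return accuracy (%) and seconds per question for each benchmark."""
    summary = {}
    for name, correct, total, seconds in rows:
        summary[name] = {
            "accuracy": round(100 * correct / total, 1),
            "sec_per_question": round(seconds / total, 1),
        }
    return summary


def total_hours(rows):
    """Total wall-clock time across all benchmarks, in hours."""
    return sum(seconds for _, _, _, seconds in rows) / 3600


if __name__ == "__main__":
    for name, stats in summarize(ROWS).items():
        print(f"{name:<12}{stats['accuracy']:>6}%{stats['sec_per_question']:>8} s/question")
    print(f"Total: {total_hours(ROWS):.1f} h")
```

By this arithmetic the five runs add up to roughly 19.7 hours, or about 25 to 41 seconds per question with thinking enabled, with HumanEval the slowest per item.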