The Daily Life of an Apple Test Engineer
The parameters recommended for thinking/general tasks work well; however, the benchmarks seemed to run forever.
qwen3.5-9b original

qwopus-9b-v3.5 (thinking/general tasks)

MMLU↓
CMMLU↓
JMMLU↑
TRUTHFULQA↓
HUMANEVAL↑

The gains outweigh the losses, though the result is not very impressive.
 
 