The Daily Life of an Apple Test Engineer
The params recommended for thinking/general tasks work well; however, the benchmarks seemed to run forever.
qwen3.5-9b original → qwopus-9b-v3.5 (thinking/general tasks)
MMLU↓
CMMLU↓
JMMLU↑
TRUTHFULQA↓
HUMANEVAL↑
The gains outweigh the losses, yet the result is not very impressive.