It seems there's a big gap between the benchmark results and users' experience with the Llama 4 models. The two smaller Llama models fail at reasoning and at several ordinary tests run by AI YouTubers. Maybe it's a configuration error, maybe the high LMSYS results were actually from the Behemoth model, but something seems off to me.
Anyway, these are the models I use at the moment (IMHO the best, and free on Groq, Cerebras, OpenRouter and others): 80% QwQ, 15% R1, and DeepSeek V3 for non-thinking tasks. It used to be Llama 3.3 70B for most things, but then DeepSeek and reasoning models happened.
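If anyone wants to try the same mix, here's a minimal sketch of calling QwQ through OpenRouter's OpenAI-compatible endpoint. The model slugs and prompt are just illustrative assumptions on my part; check the provider's model page for the exact IDs.

```python
# Minimal sketch: QwQ via OpenRouter's OpenAI-compatible API.
# Model slugs below are assumptions -- verify them on openrouter.ai.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],  # your OpenRouter key
)

# QwQ for reasoning-heavy prompts; swap the slug for "deepseek/deepseek-r1"
# or "deepseek/deepseek-chat" (V3) for the non-thinking case.
response = client.chat.completions.create(
    model="qwen/qwq-32b",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

print(response.choices[0].message.content)
```

The same snippet works against Groq by pointing base_url at their OpenAI-compatible endpoint and using their model names instead.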