Interesting, I had missed that there are "non-official" models that can be used with Ollama just like the official ones. e.g. https://ollama.com/huihui_ai/deephermes3-abliterated
And it gave a good explanation of my "litmus test" code snippet.
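For anyone who wants to try it, here's a minimal Python sketch of pulling that community tag and prompting it, assuming a local Ollama server on its default port (11434) and using its REST API; the snippet below is just a placeholder standing in for my actual litmus test:

```python
# Minimal sketch: pull a community model from the Ollama registry and ask it
# to explain a code snippet via the local REST API.
import requests

MODEL = "huihui_ai/deephermes3-abliterated"  # community tag from the link above

# Pull the model through the local Ollama server (default port 11434);
# stream=False makes the call block until the pull completes.
requests.post("http://localhost:11434/api/pull",
              json={"model": MODEL, "stream": False})

# Placeholder "litmus test" snippet.
snippet = "def f(xs): return [x for x in xs if x > 0]"

resp = requests.post("http://localhost:11434/api/chat",
                     json={"model": MODEL,
                           "messages": [{"role": "user",
                                         "content": f"Explain what this code does:\n{snippet}"}],
                           "stream": False})
print(resp.json()["message"]["content"])
```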
Have you tried QwQ or DeepSeek yet? FuseAI? Or any of the "thinking" models?
They are mind-blowingly good for the size, albeit very slow unless you keep them fully offloaded.
I've tried the official DeepSeek R1 Qwen 2.5 14B distill and a few unofficial Mistral fine-tunes trained on R1 CoT. They are indeed pretty amazing, and I found myself switching between a general-purpose model and a thinking model regularly before this was released.
DeepHermes is a thinking-model family with R1-distilled CoT that lets you toggle between standard short output and spending a few thousand tokens thinking through a solution.
I found that pure thinking models are fantastic for certain kinds of problem-solving questions, but awful at following system prompt changes for roleplay scenarios or adopting complex personality archetypes.
This lets you have your cake and eat it too by making CoT optional while keeping regular system prompt capabilities.
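In practice the toggle amounts to a system-prompt switch. Here's a rough sketch against the Ollama chat endpoint; the reasoning-trigger text below is paraphrased from memory, so check the model card for the exact wording the model was trained on:

```python
# Sketch: toggle DeepHermes between direct answers and long CoT via the system prompt.
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"
MODEL = "huihui_ai/deephermes3-abliterated"

# Paraphrased reasoning trigger -- the model card has the canonical wording.
THINKING_SYSTEM_PROMPT = (
    "You are a deep thinking AI. You may use extremely long chains of thought "
    "to deliberate before answering. Enclose your internal monologue in "
    "<think> </think> tags, then give your final answer."
)

def ask(question: str, think: bool = False) -> str:
    """Send one question; prepend the reasoning trigger only when think=True."""
    messages = []
    if think:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    resp = requests.post(OLLAMA_CHAT,
                         json={"model": MODEL, "messages": messages, "stream": False})
    return resp.json()["message"]["content"]

print(ask("What's 17 * 24?"))              # short, direct answer
print(ask("What's 17 * 24?", think=True))  # burns thinking tokens in <think> tags first
```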
The thousands of tokens spent thinking can get time-consuming when you're only getting 3 t/s on the larger 24B models, so it's important to choose between a direct answer and spending five minutes letting it really think. Its abilities are impressive even if it takes 300 seconds to fully think out a problem at 2.5 t/s.
That's why I'm so happy the 8B model is pretty intelligent with CoT enabled: I can fit a thinking model entirely in VRAM, and it's not dumb as rocks knowledge-wise either. I'm getting 15-20 t/s with the 8B instead of 2.5-3 t/s partially offloading a larger model. A roughly 6.4x speed increase on the CoT is a huge W for the real-life human time I spend waiting for a complete output.
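The back-of-the-envelope math, with an assumed ~3000-token thinking pass (the t/s figures are the ones quoted above):

```python
# Rough wait-time math: how long a chain of thought takes at different speeds.
def wait_seconds(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

cot_tokens = 3000  # assumed "few thousand tokens" of thinking

for label, tps in [("24B partially offloaded", 2.75), ("8B fully in VRAM", 17.5)]:
    minutes = wait_seconds(cot_tokens, tps) / 60
    print(f"{label}: {minutes:.1f} min of thinking at {tps} t/s")

# Midpoint speedup from ~2.5-3 t/s to ~15-20 t/s:
print(f"speedup ~ {17.5 / 2.75:.1f}x")  # ~ 6.4x
```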