Interesting, I had missed that there are "non-official" models that can be used with Ollama just like the official ones. e.g. https://ollama.com/huihui_ai/deephermes3-abliterated
And it gave a good explanation of my "litmus test" code snippet.
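For anyone who wants to try it, here's a minimal Python sketch of pulling that community tag and prompting it, assuming a local Ollama server on its default port (11434) and using its REST API; the snippet below is just a placeholder standing in for my actual litmus test:

```python
# Minimal sketch: pull a community model from the Ollama registry and ask it
# to explain a code snippet via the local REST API.
import requests

MODEL = "huihui_ai/deephermes3-abliterated"  # community tag from the link above

# Pull the model through the local Ollama server (default port 11434);
# stream=False makes the call block until the pull completes.
requests.post("http://localhost:11434/api/pull",
              json={"model": MODEL, "stream": False})

# Placeholder "litmus test" snippet.
snippet = "def f(xs): return [x for x in xs if x > 0]"

resp = requests.post("http://localhost:11434/api/chat",
                     json={"model": MODEL,
                           "messages": [{"role": "user",
                                         "content": f"Explain what this code does:\n{snippet}"}],
                           "stream": False})
print(resp.json()["message"]["content"])
```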
Have you tried QwQ or DeepSeek yet? FuseAI? Or any of the "thinking" models?
They are mind-blowingly good for the size, albeit very slow unless you keep them fully offloaded.
I've tried the official DeepSeek R1 Qwen 2.5 14B distill and a few unofficial Mistral fine-tunes trained on R1 CoT. They are indeed pretty amazing, and I found myself switching between a general-purpose model and a thinking model regularly before this was released.
DeepHermes is a thinking-model family with R1-distilled CoT that lets you toggle between standard short output and spending a few thousand tokens thinking through a solution.
I found that pure thinking models are fantastic for certain kinds of problem-solving questions, but awful at following system prompt changes for roleplay scenarios or adopting complex personality archetypes.
This lets you have your cake and eat it too by making CoT optional while keeping regular system prompt capabilities.
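In practice the toggle amounts to a system-prompt switch. Here's a rough sketch against the Ollama chat endpoint; the reasoning-trigger text below is paraphrased from memory, so check the model card for the exact wording the model was trained on:

```python
# Sketch: toggle DeepHermes between direct answers and long CoT via the system prompt.
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"
MODEL = "huihui_ai/deephermes3-abliterated"

# Paraphrased reasoning trigger -- the model card has the canonical wording.
THINKING_SYSTEM_PROMPT = (
    "You are a deep thinking AI. You may use extremely long chains of thought "
    "to deliberate before answering. Enclose your internal monologue in "
    "<think> </think> tags, then give your final answer."
)

def ask(question: str, think: bool = False) -> str:
    """Send one question; prepend the reasoning trigger only when think=True."""
    messages = []
    if think:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    resp = requests.post(OLLAMA_CHAT,
                         json={"model": MODEL, "messages": messages, "stream": False})
    return resp.json()["message"]["content"]

print(ask("What's 17 * 24?"))              # short, direct answer
print(ask("What's 17 * 24?", think=True))  # burns thinking tokens in <think> tags first
```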
The thousands of tokens spent thinking can get time-consuming when you're only getting 3 t/s on the larger 24B models, so it's important to choose between a direct answer and spending five minutes letting it really think. Its abilities are impressive even if it takes 300 seconds to fully think out a problem at 2.5 t/s.
That's why I'm so happy the 8B model is pretty intelligent with CoT enabled: I can fit a thinking model entirely in VRAM, and it's not dumb as rocks knowledge-wise either. I'm getting 15-20 t/s with the 8B instead of 2.5-3 t/s partially offloading a larger model. A roughly 6.4x speed increase on the CoT is a huge W for the real-life human time I spend waiting for a complete output.
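The back-of-the-envelope math, with an assumed ~3000-token thinking pass (the t/s figures are the ones quoted above):

```python
# Rough wait-time math: how long a chain of thought takes at different speeds.
def wait_seconds(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

cot_tokens = 3000  # assumed "few thousand tokens" of thinking

for label, tps in [("24B partially offloaded", 2.75), ("8B fully in VRAM", 17.5)]:
    minutes = wait_seconds(cot_tokens, tps) / 60
    print(f"{label}: {minutes:.1f} min of thinking at {tps} t/s")

# Midpoint speedup from ~2.5-3 t/s to ~15-20 t/s:
print(f"speedup ~ {17.5 / 2.75:.1f}x")  # ~ 6.4x
```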