37 points (100.0% liked)
submitted 08 Apr 2024 (5 months ago) by ylai@lemmy.ml to c/localllama@sh.itjust.works
1 comment
[-] fhein@lemmy.world 15 points 5 months ago* (last edited 5 months ago)

Very nice speedups for people running CPU inference on supported hardware, but unfortunately it does not help the CPU+GPU split case, according to a comment on one of the PRs. That person says that for prompt evaluation, where these kernels would make a difference, llama.cpp performs all the calculations on the GPU. And token generation is memory-bandwidth bound (IO-bound), so the faster CPU calculation becomes negligible.
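To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch. It uses the usual roofline-style approximation (every generated token streams the full weight set from RAM, so tokens/s is capped by bandwidth divided by model size); the 4 GB model and 50 GB/s figures are illustrative assumptions, not measurements from the PR.

```python
# Rough ceiling on token-generation speed when weight streaming is the
# bottleneck: every token reads all model weights from memory once, so
# throughput <= memory bandwidth / model size, no matter how fast the
# matmul kernels are. Numbers below are illustrative assumptions.

def tokens_per_second_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when memory bandwidth is the limit."""
    return bandwidth_gb_s / model_size_gb

# Example: a ~7B model quantized to ~4 GB on dual-channel DDR4 (~50 GB/s).
print(tokens_per_second_ceiling(4.0, 50.0))  # ~12.5 tok/s ceiling
# Faster CPU compute cannot raise this ceiling; only more bandwidth
# (or keeping the weights in faster GPU memory) can.
```

Which is why these kernels shine for prompt evaluation (compute-bound, many tokens processed per weight read) but barely move single-token generation.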
