384 points by marban 30 days ago | 39 comments
axegon_ 29 days ago
I did something similar but using a K80 and M40 I dug up from eBay for pennies. Be advised though, stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing. That said, I had a decent-ish HP workstation lying around with a 1200 watt power supply, so I had somewhere to put those two. The one thing to note here is that these types of GPUs do not have cooling of their own. My solution was to 3D print a bunch of brackets, attach several Noctua fans and have them blow at full speed 24/7. Surprisingly it worked way better than I expected - I've never gone above 60 degrees. As a side effect, the CPUs are also benefiting from this hack: at idle, they are in the mid-20 degrees range. Mind you, the Noctua fans are located on the front and the back of the case: the ones on the front act as intake and the ones on the back as exhaust, and there are two more inside the case positioned right in front of the GPUs.

The refurbished workstation was just over 600 bucks, plus another 120 bucks for the GPUs and another ~60 for the fans.

Edit: and before someone asks - no, I have not uploaded the STLs anywhere cause I haven't had the time, but also since this is a very niche use case, though I might: the back (exhaust) bracket came out brilliant on the first try - it was a sub-millimeter fit. Then I got cocky and thought that I'd also nail it first try on the intake and ended up re-printing it 4 times.

yjftsjthsd-h 28 days ago
> Be advised though, stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing.

I thought the problem was that those cards have loads of RAM but lack really important compute capabilities such that they're kind of useless for actually running AI workloads on. Is that not the case?

almostgotcaught 28 days ago
> Is that not the case?

it is - they're laughably slow and not even supported by the latest CUDA

> NVIDIA Driver support for Kepler is removed beginning with R495. CUDA Toolkit development support for Kepler continues through CUDA 11.x.

GTP 28 days ago
But Deepseek R1 doesn't use CUDA, so maybe for this specific case, it isn't a big deal?
almostgotcaught 28 days ago
> it isn't a big deal?

Friend, you shouldn't make comments like this unless you understand the definitions of the words. DeepSeek wrote some parts of their kernels using PTX. Newsflash: PTX support for features is in lockstep with CUDA support for the same features, i.e. the fact that CUDA doesn't support it means you couldn't write the PTX to use those features either.

therealfiona 28 days ago
It is poor form to condemn someone for asking a question.

Thank you for providing the information to clear up ignorance though.

almostgotcaught 28 days ago
this is a question:

> is deepseak's use of PTX instead of CUDA relevant here?

this is a conclusion/assumption thinly veiled as a question

> Deepseek R1 doesn't use CUDA, so ... it isn't a big deal?

note, genuine questions don't already presuppose an answer.

GTP 22 days ago
Asking if it is a big deal or not is definitely a question ;) Thank you for providing the information I was missing though.
numpad0 28 days ago
The PTX hack is for the backend runner and training infra; the public weights are often executed using existing backends, especially the R1-distill-* models.
almostgotcaught 26 days ago
the two things (weights and kernels) have nothing to do with each other in the slightest. again i wish people would take a beat before commenting out of their depth and consider whether their comment adds to the conversation or not.
TrueDuality 28 days ago
I'm running P40s in one of my test boxes. These don't have support for BF16 but they do support F16 and F32, and those are accelerated to a certain degree. They're lacking kernels that are as optimized, but it's not terribly hard to adapt other ones for the purpose.

You don't get great out-of-the-box performance, but it only took me three work days or so to adapt, test, and validate a kernel using the acceleration hardware that was available, with no prior experience writing these kernels.

They're not as powerful as others but still significantly better than running on a CPU alone and I'd bet my kernel is missing more advanced optimizations.

My issue with these was the power cable and fans. The author touches on the fans, and I did try a 3D printed shroud and some of the higher pressure fans, but I could only run the cards in short stints. I ended up making an enclosure that went straight out of the case, using two high pressure SAN array fans per card (harvested from the IT graveyard) and cutting a hole with an angle grinder.

The power cable is NOT STANDARD on these. I had to find a weird specific cable to adapt the standard 8-pin GPU connector and each card takes two of these bad boys.

egorfine 29 days ago
> K80 - the drivers were one of the most painful tech things I've ever had to endure

Well, for a dedicated LLM box it might be feasible to suffer with drivers a bit, no? What was your experience like with the software side?

JKCalhoun 29 days ago
Curious what HP workstation you have?
9front 29 days ago
HP Z440, it's in the article.
JKCalhoun 29 days ago
My comment was not directed at the blog but at the person I responded to.
BizarroLand 29 days ago
What kind of performance did you get out of that?
deadbabe 29 days ago
What’s the most pain you’ve ever felt?
kamranjon 29 days ago
For the same price ($1799) you could buy a Mac Mini with 48gb of unified memory and an m4 pro. It’d probably use less power and be much quieter to run and likely could outperform this setup in terms of tokens per second. I enjoyed the write up still, but I would probably just buy a Mac in this situation.
diggan 29 days ago
> likely could outperform this setup in terms of tokens per second

I've heard arguments both for and against this, but they always lack concrete numbers.

I'd love something like "Here is Qwen2.5 at Q4 quantization running via Ollama + these settings, and M4 24GB RAM gets X tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're just propagating mostly anecdotes without any reality-checks.

fkyoureadthedoc 29 days ago
On an M1 Max 64GB laptop running gemma2:27b same prompt and settings from blog post

    total duration:       24.919887458s
    load duration:        39.315083ms
    prompt eval count:    37 token(s)
    prompt eval duration: 963.071ms
    prompt eval rate:     38.42 tokens/s
    eval count:           441 token(s)
    eval duration:        23.916616s
    eval rate:            18.44 tokens/s
I have a gaming PC with a 4090 I could try, but I don't think this model would fit
condiment 29 days ago
On a 3090 (24gb vram), same prompt & quant, I can report more than double the tokens per second, and significantly faster prompt eval.

    total_duration:       10530451000
    load_duration:        54350253
    prompt_eval_count:    36
    prompt_eval_duration: 29000000
    prompt_token/s:       1241.38
    eval_count:           460
    eval_duration:        10445000000
    response_token/s:     44.04
Fast prompt eval is important when feeding larger contexts into these models, which is required for almost anything useful. GPUs have other advantages for traditional ML, whisper models, vision, and image generation. There's a lot of flexibility that doesn't really get discussed when folks trot out the 'just buy a mac' line.

Anecdotally I can share my revealed preference. I have both an M3 (36gb) as well as a GPU machine, and I went through the trouble of putting my GPU box online because it was so much faster than the mac. And doubling up the GPUs allows me to run models like the deepseek-tuned llama 3.3, with which I have completely replaced my use of chatgpt 4o.

svachalek 29 days ago
Thanks for the numbers! People should include their LLM runner as well, I think, as there are differences in hardware optimization support. I haven't tested it, but I've heard MLX is noticeably faster than Ollama on Macs.
diggan 29 days ago
> gemma2:27b

What quantization are you using? What's the runtime+version you run this with? And the rest of the settings?

Edit: Turns out parent is using Q4 for their test. Doing the same test with LM Studio and a 3090ti + Ryzen 5950X (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.

fkyoureadthedoc 29 days ago
Fresh install from brew, ollama version is 0.5.7

Only settings I did were the ones shown in the blog post

    OLLAMA_FLASH_ATTENTION=1
    OLLAMA_KV_CACHE_TYPE=q8_0
Ran the model like

    ollama run gemma2:27b --verbose
With the same prompt, "Can you write me a story about a tortoise and a hare, but one that involves a race to get the most tokens per second?"
diggan 29 days ago
When you run that, what quantization do you get? Ollama's library website (https://ollama.com/library/gemma2:27b) isn't exactly good at surfacing useful information like what the default quantization is.
mkesper 29 days ago
If you leave the :27b off from that URL you'll see the default size which is 9b. Ollama seems to always use Q4_0 even if other quants are better.
fkyoureadthedoc 29 days ago
not sure how to tell, but here's the full output from ollama serve https://pastes.io/ollama-run-gemma2-27b
navbaker 29 days ago
If you hit the drop-down menu for the size of the model, then tap “view all”, you will see the size and hash of the model you have selected and can compare it to the full list below it that has the quantization specs in the name.
diggan 29 days ago
Still, I don't see a way (from the web library) to see the default quantization (from Ollama's POV) at all, is that possible somehow?
navbaker 29 days ago
The model displayed in the drop-down when you access the web library is the default that will be pulled. Compare the size and hash to the more detailed model listing below it and you will see what quantization you have.

Example: the default model weights for Llama 3.3 70b, after hitting the “view all” have this hash and size listed next to it - a6eb4748fd29 • 43GB

Now scroll down through the list and you will find the one that matches that hash and size is “70b-instruct-q4_K_M”. That tells you that the default weights for Llama 3.3 70B from Ollama are 4-bit quantized (q4) while the “K_M” tells you a bit about what techniques were used during quantization to balance size and performance.

diggan 29 days ago
Thanks, that seems to indicate Q4 for the quantization, you're probably able to run that on the 4090 as well FWIW, the size of the model is just 14.55 GiB.
rahimnathwani 29 days ago
gemma2:27b-instruct-q4_0 (checksum 53261bc9c192)
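(If you'd rather confirm locally than decode hashes on the web library: below is a minimal sketch against Ollama's /api/show endpoint. It assumes Ollama is serving on the default localhost:11434, and the exact request/response field names can vary a bit between Ollama versions.)

    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/show"

    def show_model(model: str) -> dict:
        # Older Ollama versions expect "name", newer ones accept "model";
        # sending both keeps this sketch version-agnostic.
        resp = requests.post(OLLAMA_URL, json={"name": model, "model": model}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        info = show_model("gemma2:27b")
        # The quantization usually shows up under "details",
        # e.g. {"details": {"quantization_level": "Q4_0", ...}, ...}
        print(json.dumps(info.get("details", info), indent=2))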
fkyoureadthedoc 28 days ago
7800X3D, 32GB DDR5, 4090:

    total duration:       10.5922028s
    load duration:        21.1739ms
    prompt eval count:    36 token(s)
    prompt eval duration: 546ms
    prompt eval rate:     65.93 tokens/s
    eval count:           467 token(s)
    eval duration:        10.023s
    eval rate:            46.59 tokens/s
cruffle_duffle 29 days ago
I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by. I almost want to use the word “hobbyist stage”, where almost all of the “data” and “best practice” is anecdotal, but I think we are a step above that.

Still, it’s way too early and there are simply way too many hardware and software combinations that change almost weekly to establish “the best practice hardware configuration for training / inferencing large language models locally”.

Some day there will be established guides with solid data. In fact, someday there will be PCs that specifically target LLMs and will feature all kinds of stats aimed at getting you to bust out your wallet. And I even predict they’ll come up with metrics that all the players will chase well beyond when those metrics make sense (megapixels, clock frequency, etc)… but we aren’t there yet!

motorest 29 days ago
> I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by.

What's hard about it? You get the hardware, you run the software, you take measurements.

GTP 29 days ago
Yes, but we don't have enough people doing that to get quality data. Not many people are building this kind of setup, and even fewer are publishing their results. Additionally, if I just run a test a couple of times and then average the results, this is still far from a solid measurement.
diggan 29 days ago
> but we don't have enough people doing that to get quality data

But how are we supposed to get enough people doing those things if everyone says "There isn't enough data right now for it to be useful"? We have to start somewhere.

unshavedyak 29 days ago
I don't think they're saying anything counter to that. The people who don't require the volume of data will run these. Ie the super early adopters.
colonCapitalDee 29 days ago
We've already started, we just haven't finished yet
diggan 29 days ago
Right, but how are we supposed to be getting anywhere else unless people start being more specific and stop leaning on anecdotes or repeating what they've heard elsewhere?

Saying "Apple seems to be somewhat equal to this other setup" doesn't really contribute to someone getting an accurate picture if it is equal or not, unless we start including raw numbers, even if they aren't directly comparable.

I don't think it's too early to say "I get X tokens/second with this setup + these settings" because then we can at least start comparing, instead of just guessing which seems to be the current SOTA.

kamranjon 29 days ago
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89

But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.

It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even basic training.

diggan 29 days ago
I guess I'm mostly lamenting how unscientific these discussions are in general, on HN and elsewhere (besides specific GitHub repositories). Every community is filled with anecdotal stories, or numbers that don't specify the settings + model + runtime details that would let people at least compare them to something.

Still, thanks for the links :)

t1amat 29 days ago
In fairness it’s become even more difficult now than ever before.

* hardware spec

* inference engine

* specific model - differences to tokenizer will make models faster/slower with equivalent parameter count

* quantization used - and you need to be aware of hardware specific optimizations for particular quants

* kv cache settings

* input context size

* output token count

This is probably not a complete list either.

nickthegreek 29 days ago
Best place to get that kinda info is gonna be /r/LocalLlama
vladgur 29 days ago
As someone who is paying $0.50 per kWh, I'd also like to see kWh per 1000 tokens or something, to give me a sense of the cost of ownership of these local systems.
troyvit 28 days ago
That would be an awesome thing across the industry -- even for the big commercial models -- for those who care not only about price but also carbon footprint.
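For what it's worth, the arithmetic is simple enough to sketch yourself. All figures below are placeholder assumptions, not measurements - plug in your own wall-meter power draw, measured tokens/s, and electricity price:

    def cost_per_1k_tokens(watts: float, tokens_per_s: float, price_per_kwh: float) -> float:
        """Electricity cost of generating 1,000 tokens on a local box."""
        seconds_per_1k = 1000 / tokens_per_s              # time to emit 1k tokens
        kwh_per_1k = watts * seconds_per_1k / 3_600_000   # W*s -> kWh
        return kwh_per_1k * price_per_kwh

    # Example with assumed figures: a 350 W rig doing 18 tokens/s at $0.50/kWh
    print(f"${cost_per_1k_tokens(350, 18, 0.50):.4f} per 1k tokens")  # ~$0.0027

Note that this only covers generation time and ignores the idle draw of an always-on box.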
un_ess 29 days ago
Per the screenshot, this is a DeepSeek running on a 192GB M2 Studio https://nitter.poast.org/ggerganov/status/188461277009384272...

The same on Nvidia (various models) https://github.com/ggerganov/llama.cpp/issues/11474

[1] this is a the model: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...

diggan 29 days ago
So Apple M2 Studio does ~15 tks/second and A100-SXM4-80GB does 9 tks/second?

I'm not sure I'm reading the results wrong or missing some vital context, but that sounds unlikely to me.

achierius 29 days ago
The studio has a lot more ram available to the GPU (up to 192gb) than the a100 (80gb), and iirc at least comparable memory bandwidth -- those are what matter when you're doing LLM inference, so the studio tends to win out there.

Where the a100 and other similar chips dominate is in training &c, which is mostly a question of flops.

diggan 29 days ago
> and iirc at least comparable memory bandwidth

I don't think they do.

From Wikipedia:

> the M2 Pro, M2 Max, and M2 Ultra have approximately 200 GB/s, 400 GB/s, and 800 GB/s respectively

From techpowerup:

> NVIDIA A100 SXM4 80 GB - Memory bandwidth - 2.04 TB/s

Seems to be a magnitude of difference, and that's just the bandwidth.

motorest 29 days ago
> For the same price ($1799) you could buy a Mac Mini with 48gb of unified memory and an m4 pro.

Around half that price tag was attributed to the blogger reusing an old workstation he had lying around. Beyond this point, OP slapped two graphics cards into an old rig. A better description would be something like "what buying two graphics cards gets you in terms of AI".

Capricorn2481 29 days ago
> Beyond this point, OP slapped two graphics cards into an old rig

Meaning what? This is largely what you do on a budget since RAM is such a difference maker in token generation. This is what's recommended. OP could buy an a100, but that wouldn't be a budget build.

oofbaroomf 29 days ago
The bottleneck for single batch inference is memory bandwidth. The M4 Pro has less memory bandwidth than the P40, so it would be slower. Also, the setup presented in the OP has system RAM, allowing you to run larger models than what fits in 48GB of VRAM (and with good speeds too if you offload with something like ktransformers).
anthonyskipper 29 days ago
>>M4 Pro has less memory bandwidth than the P40, so it would be slower

Why do you say this? I thought the P40 only had a memory bandwidth of 346 GB/s. The M4 is 546 GB/s. So the MacBook should kick the crap out of the P40.

oofbaroomf 29 days ago
The M4 Max has up to 546 GB/s. The M4 Pro, what GP was talking about, has only 273 GB/s. An M4 Max with that much RAM would most likely exceed OP's budget.
ekianjo 29 days ago
Mac Mini will be very slow for context ingestion compared to nvidia GPU, and the other issue is that they are not usable for Stable Diffusion... So if you just want to use LLMs, maybe, but if you have other interests in AI models, probably not the right answer.
drcongo 29 days ago
I use a Mac Studio for Stable Diffusion, what's special about the Mac Mini that means it won't work?
vunderba 29 days ago
What models are you using? Stable diffusion 1.5, SDXL, or flux?

I've heard that Macs are pretty slow with XL and borderline unusable for flux requiring minutes at a time to generate a single image - whereas an RTX4090 can generate a 1024x1024 image with the higher quality Flux Dev model (not schnell) in 14 seconds.

OP is probably correct that if you want to branch out of just strictly LLM's, cuda is the way to go. I've never heard of anyone getting LTX or hunyuan running on a Mac for example.

drcongo 27 days ago
I've used 1.5 through 3.5, XL has been kinda fine for me but tiny adjustments can take it from good to terrible. Good point about Flux though, that's awful on the Mac.
JKCalhoun 29 days ago
For this use case though, I would prefer something more modular than Apple hardware — where down the road I could upgrade the GPUs, for example.
UncleOxidant 29 days ago
I wish Apple would offer a 128GB option in the Mac Mini - That would require an M4 Max which they don't offer in the mini. I know they have a MBP with M4 Max and 128GB, but I don't need another laptop.
kridsdale1 29 days ago
I’m waiting until this summer with the M4 Ultra Studio.
UncleOxidant 29 days ago
Which will likely be over five grand for 128GB.
joshstrange 29 days ago
I’d really love to build a machine for local LLMs. I’ve tested models on my MBP M3 Max with 128GB of ram and it’s really cool but I’d like a dedicated local server. I’d also like an excuse to play with proxmox as I’ve just run raw Linux servers or UnRaid w/ containers in the past.

I have OpenWebUI and LibreChat running on my local “app server” and I’m quite enjoying that but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.

Privacy is not something to ignore at all but the cost of inference online is very hard to beat, especially when I’m still learning how best to use LLMs.

cwalv 29 days ago
> but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.

Same, esp. if you factor in the cost of renting. Even if you run 24/7 it's hard to see it paying off in half the time it will take to be obsolete

datadrivenangel 29 days ago
You pay a premium to get the theoretical local privacy and reliability of hosting your own models.

But to get commercially competitive models you need 5 figures of hardware, and then need to actually run it securely and reliably. Pay as you go with multiple vendors as fallback is a better option right now if you don't need harder privacy.

joshstrange 29 days ago
Yeah, really I'd love for my Home Assistant to be able to use a local LLM/TTS/STT which I did get working but was way too slow. Also it would fun to just throw some problems/ideas at the wall without incurring (more) cost, that's a big part of it. But each time I run the numbers I would be better off using Anthropic/OpenAI/DeepSeek/other.

I think sooner or later I'll break down and buy a server for local inference even if the ROI is upside down, because it would be a fun project. I also find that these things fall in the "You don't know what you will do with it until you have it and it starts unlocking things in your mind" category. I'm sure there are things I would have it grind on overnight just to test/play with an idea, which is something I'd be less likely to do on a paid API.

nickthegreek 29 days ago
You shouldn't be having slow response issues with LLM/TTS/STT for HA on a mbp m3 max 128gb. I'd either limit the entities exposed or choose a smaller model.
joshstrange 29 days ago
Oh, I can get smaller models to run reasonably fast but I'm very interested in tool calling and I'm having a hard time finding a model that runs fast and is good at calling tools locally (I'm sure that's due to my own ignorance).
nickthegreek 29 days ago
I decided on the OpenAI API for now after setting up so many different methods. The local stuff isn't up to snuff yet for what I am trying to accomplish, but decent for basic control.
joshstrange 29 days ago
I use a combo of Anthropic and OpenAI for now through my bots and my chat UIs and that lets me iterate faster. My hope is once I've done all my testing I could consider moving to local models if it made sense.
cruffle_duffle 29 days ago
> You don't know what you will do with it until you have it and it starts unlocking things in your mind

Exactly. Once the price and performance get to the level where buying stuff for local training and inference makes sense… that is when we will start to see LLMs break out of their current “corporate lawyer safe” stage and really begin to shake things up.

rsanek 29 days ago
With something like OpenRouter, you don't even have to manually integrate with multiple vendors
wkat4242 29 days ago
Is that like LiteLLM? I have that running but never tried OpenRouter. I wonder now if it's better :)
pelatimtt 29 days ago
You can also try LangDB or Portkey
whalesalad 29 days ago
The juice aint worth the squeeze to do this locally.

But you should still play with proxmox, just not for this purpose. My recommendation would be to get an i7 HP Elitedesk. I have multiple racks in my basement, hundreds of gigs of ram, multiple 2U 2x processor enterprise servers etc.... but at this point all of it is turned off and a single HP Elitedesk with a 2nd NIC added and 64GB of ram is doing everything I ever needed and more.

joshstrange 29 days ago
Yeah, right now I'm running a tower PC (Intel Core i9-11900K, 64GB Ram) with Unraid as my local "app server". I want to play with Proxmox (for professional and mostly fun reasons) though. Someday I'd like a rack in my basement as my homelab stuff has overgrown the space it's in and I'm going to need to add a new 12-bay Synology (on top of 2x12-bay) soon since I'm running out of space again. For now I've been sticking with consumer/prosumer equipment but my needs are slowly outstripping that I think.
smith7018 29 days ago
For what it's worth, looking at the benchmarks, I think the machine they built is comparable to what your MBP can already do. They probably have a better inference speed, though.
moffkalast 29 days ago
A Strix Halo minipc might be a good mid tier option once they're out, though AMD still isn't clear on how much they'll overprice them.

Core Ultra Arc iGPU boxes are pretty neat too for being standalone and can be loaded up with DDR5 shared memory, efficient and usable in terms of speed, though that's definitely low end performance, plus SYCL and IPEX are a bit eh.

reacharavindh 29 days ago
The thing is though.... the locally hosted models on such hardware are cute as toys, and sure, they write funny jokes and, importantly, perform private tasks that I would never consider passing to non-selfhosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI etc). If I could run deepseek-r1-671b locally without breaking the bank, I would. But, for now, opex > capex at a consumer level.
walterbell 29 days ago
200+ comments, https://news.ycombinator.com/item?id=42897205

> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.

elorant 29 days ago
"Runs" is an overstatement though. At 4 tokens/second you can't use it in production.
mechagodzilla 29 days ago
I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.
xienze 29 days ago
I get where you’re coming from, but the problem with LLMs is that you very regularly need a lot of back-and-forth with them to tease out the information you’re looking for. A more apt analogy might be a coworker that you have to follow up with three or four times, at an hour per. Not so appealing anymore. Doubly so when you have to stand up $2k+ of hardware for the privilege. If I’m paying good money to host something locally, I want decent performance.
MonkeyClub 29 days ago
> If I’m paying good money to host something locally

The thing is, however, that at 2k one is not paying good money, one is paying near the least amount possible. TFA specifically is about building a machine on a budget, and as such cuts corners to save costs, e.g. by buying older cards.

Just because 2k is not a negligible amount in itself, that doesn't also automatically make it adequate for the purpose. Look for example at the 15k, 25k, and 40k price range tinyboxes:

https://tinygrad.org/#tinybox

It's like buying a $2k used car and expecting it to perform as well as a $40k one.

unshavedyak 29 days ago
Agreed. Furthermore, for some tasks like large-context code assistant windows I want really fast responses. I've not found a UX I'm happy with yet, but for anything I care about I'd want very fast token responses. Small blocks of code which instantly autocomplete, basically.
Aurornis 29 days ago
> give it a prompt and come back an hour or two later.

This is the problem.

If your use case is getting a small handful of non-urgent responses per day then it's not a problem. That's not how most people use LLMs, though.

deadbabe 29 days ago
Isn’t 4 tps good enough for local use by a single user, which is the point of a personal AI computer?
IanCal 28 days ago
4 tokens per second is pretty slow. That's 5-10s for a comment the length of yours (and R1 specifically likes to output a lot of tokens). It's 10-20x slower than many top end models, which are available cheaply. Even high cost versions of R1 (at more than twice the price of sonnet) are $7/million tokens. For $2K you get 285 million tokens. You'd have to run the box at full whack for over two years (for 4tps) to hit that spending, and that ignores electricity prices. Sonnet 3.5 is half that price, and other R1 providers you could probably hit about a billion tokens for $2k. Gemini flash 2 is over 100 tokens per second and $2k gets you something like 5+B tokens (more really but I'm taking an easy estimate over the more expensive part).

If there are things you cannot send to a random party, you might want to look at hosted versions with agreements (if it's a code issue, if you're fine with github then azure is probably fine too).

Outside of that, if you really need to then sure, but these are the kinds of things that really benefit from being able to get high usage on GPUs for short periods of time.
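A quick sketch of that break-even arithmetic, using the figures quoted above ($2K of hardware, ~$7 per million tokens hosted, 4 tok/s locally) and ignoring electricity:

    HARDWARE_COST = 2_000        # USD for the local box
    API_PRICE_PER_M = 7          # USD per million tokens (quoted above for hosted R1)
    LOCAL_TPS = 4                # local tokens per second

    tokens_for_same_money = HARDWARE_COST / API_PRICE_PER_M * 1_000_000
    seconds_needed = tokens_for_same_money / LOCAL_TPS
    years_needed = seconds_needed / (3600 * 24 * 365)

    print(f"{tokens_for_same_money / 1e6:.0f}M tokens, "
          f"{years_needed:.1f} years of 24/7 generation to break even")
    # -> ~286M tokens, ~2.3 years of running flat out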

JKCalhoun 29 days ago
It is for me. I'm happy to switch over to another task (and maybe that task is refilling my coffee) and come back when the answer is fully formed.
ErikBjare 29 days ago
I tend to get impatient at less than 10tok/s: If the answer is 600tok (normal for me) that's a minute.
Cascais 29 days ago
I agree with elorant. Indirectly, some YouTubers ended up demonstrating that it's difficult to run the best models for less than $7k, even if NVIDIA hardware is very efficient.

In the future, I expect this to not be the case, because models will be far more efficient. At this pace, maybe even 6 months can make a difference.

walterbell 29 days ago
Some LLM use cases are async, e.g. agents, "deep research" clones.
diggan 29 days ago
Not to mention even simpler things, like wanting to tag all of your local notes based on their content: basically a bash loop you can run indefinitely where speed doesn't matter much, as long as it eventually finishes.
vunderba 29 days ago
Additionally if all you're doing is simple tagging and classification, you can probably get away with a significantly smaller model (sub 14b parameter model) like Mistral 7b or Qwen 14b.
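A minimal sketch of that kind of overnight tagging loop, hitting a small local model through Ollama's /api/generate endpoint. The model name, prompt wording, and notes directory are placeholders, and it assumes Ollama is listening on the default localhost:11434:

    import pathlib
    import requests

    MODEL = "qwen2.5:14b"                        # any small local model will do
    NOTES = pathlib.Path("~/notes").expanduser() # placeholder notes directory

    def tag_note(text: str) -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": MODEL,
                "prompt": "Give 3-5 short topic tags, comma separated, "
                          "for this note:\n\n" + text[:4000],
                "stream": False,
            },
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

    for note in sorted(NOTES.glob("**/*.md")):
        # Speed barely matters here; it just has to finish eventually.
        print(f"{note.name}: {tag_note(note.read_text(errors='ignore'))}")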
CamperBob2 29 days ago
What I'd like to know is how well those dual-Epyc machines run the 1.58 bit dynamic quant model. It really does seem to be almost as good as the full Q8.
DrNosferatu 28 days ago
I tried that: ~1.5 to 3 tokens/sec.
CamperBob2 26 days ago
Ouch, thanks. About what I get now on a single-CPU box with 128 GB+a 4090. Was hoping for a major speedup.
DrNosferatu 26 days ago
Peak performance is achieved at ~21 cores. Bottleneck - without any special configs - is RAM to CPU bandwidth.

Let me know if you find some config that really leverages more cores!

cratermoon 29 days ago
This is not because the models are better. These services have unknown and opaque levels of shadow prompting[1] to tweak the behavior. The subject article even mentions "tweaking their outputs to the liking of whoever pays the most". The more I play with LLMs locally, the more I realize how much of the prompting going on under the covers shapes the results from the big tech services.

1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...

CamperBob2 29 days ago
The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no joke. It just needs a lot of RAM and some patience.
jaggs 29 days ago
There seems to be a LOT of work going on to optimize the 1.58-bit option in terms of hardware and add-ons. I get the feeling that someone from Unsloth is going to have a genuine breakthrough shortly, and the rig/compute costs are going to plummet. Hope I'm not being naïve or over-confident.
vanillax 29 days ago
Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro. You can run Phi4, Qwen2.5, or Llama 3.3. They work great for coding tasks.
3s 29 days ago
Yeah but as one of the replies points out the resulting tokens/second would be unusable in production environments
vanillax 28 days ago
What? Literally use it at work to write code.
jt_b 28 days ago
I think they're talking about using it to power inference for self hosted user facing applications.
vanillax 28 days ago
ahhhhh yes ok. Totally agree here.
jmyeet 29 days ago
The author mentions it but I want to expand on it: Apple is a seriously good option here, specifically the M4 Mac Mini.

What makes Apple attractive is (as the author mentions) that RAM is shared between main and video RAM whereas NVidia is quite intentionally segmenting the market and charging huge premiums for high VRAM cards. Here are some options:

1. Base $599 Mac Mini: 16GB of RAM. Stocked in store.

2. $999 Mac Mini: 24GB of RAM. Stocked in store.

3. Add RAM to either of the above up to 32GB. It's not cheap at $200/8GB but you can buy a Mac Mini with 32GB of shared RAM for $999, substantially cheaper than the author's PC build but less storage (although you can upgrade that too).

4. M4 Pro: $1399 w/ 24GB of RAM. Stocked in store. You can customize this all the way to 64GB of RAM for +$600 so $1999 in total. That is amazing value for this kind of workload.

5. The Mac Studio is really the ultimate option. Way more cores and you can go all the way to 192GB of unified memory (for a $6000 machine). The problem here is that the Mac Studio is old, still on the M2 architecture. An M4 Ultra update is expected sometime this year, possibly late this year.

6. You can get into clustering these (eg [1]).

7. There are various MacBook Pro options, the highest of which is a 16" MacBook Pro with 128GB of unified memory for $4999.

But the main takeaway is the M4 Mac Mini is fantastic value.

Some more random thoughts:

- Some Mac Minis have Thunderbolt 5 ("TB5"), which is up to either 80Gbps or 120Gbps bidirectional (I've seen it quoted as both);

- Mac Minis have the option of 10GbE (+$200);

- The Mac Mini has 2 USB3 ports and either 3 TB4 or 3 TB5 ports.

[1]: https://blog.exolabs.net/day-2/

atwrk 29 days ago
Worth pointing out that you "only" get <= 270GB/s of memory bandwidth with those Macs, unless you choose the Max/Ultra models.

If that is enough for your use case, it may make sense to wait 2 months and get a Ryzen AI Max+ 395 APU, which will have the same memory bandwidth but allows for up to 128GB of RAM. For probably ~half the Mac's price.

Usual AMD driver disclaimer applies, but then again inference is most often way easier to get running than training.

sofixa 29 days ago
The issue with Macs is that below Max/Ultra processors, the memory bandwidth is pretty slow. So you need to spend a lot on a high level processor and lots of memory, and the current gen processor, M4, doesn't even have an Ultra, while the Max is only available in a laptop form factor (so thermal constraints).

An M4 Pro still has only 273GB/s, while even the 2 generations old RTX 3090 has 935GB/s.

https://github.com/ggerganov/llama.cpp/discussions/4167

jmyeet 29 days ago
That's a good point. I checked the M2 Mac Studio and it's 400GB/s for the M2 Max and 800GB/s for the M2 Ultra so the M4 Ultra when we get it later this year should really be a beast.

Oh and the top end Macbook Pro 16 (the only current Mac with an M4 Max) has 410GB/s memory bandwidth.

Obviously the Mac Studio is at a much higher price point.

Still, you need to spend $1500+ to get an NVidia GPU with >12GB of RAM. Multiple of those starts adding up quick. Put multiple in the same box and you're talking more expensive case, PSU, mainboard, etc and cooling too.

Apple has a really interesting opportunity here with their unified memory architecture and power efficiency.

diggan 29 days ago
How is the performance difference between using a dedicated GPU from Nvidia for example compared to whatever Apple does?

So lets say we'd run a model on a Mac Mini M4 with 24GB RAM, how many tokens/s are you getting? Then if we run the exact same model but with a RTX 3090ti for example, how many tokens/s are you getting?

Do these comparisons exist somewhere online already? I understand it's possible to run the model on Apple hardware today, with the unified memory, but how fast is that really?

redman25 29 days ago
Not the exact same comparison but I have an M1 mac with 16gb ram and can get about 10 t/s with a 3B model. The same model on my 3060ti gets more than 100 t/s.

Needless to say, ram isn't everything.

diggan 29 days ago
Could you say what exact model+quant you're using for that specific test + settings + runtime? Just so I could try to compare with other numbers I come across.
oofbaroomf 29 days ago
Unified memory is great because it's fast, but you can also get a lot of system memory on a "conventional" machine like OP's, and offload MOE layers like what Ktransformers did, so you can run huge models with acceptable speeds. While the Mac mini may have better value for anything that fits in the unified memory, if you want to run Deepseek R1 or other large models, then it's best to max out system RAM and get a GPU to offload.
sethd 29 days ago
For sure and the Mac Mini M4 Pro with 64GB of RAM feels like the sweet spot right now.

That said, the base storage option is only 512GB, and if this machine is also a daily driver, you’re going to want to bump that up a bit. Still, it’s an amazing machine for under $3K.

wolfhumble 29 days ago
It would be better/cheaper to buy an external Thunderbolt 5 enclosure for the NVME drive you need.
sethd 29 days ago
I looked into this a couple months ago and external TB5 was still more expensive at 1-2 TB; not sure about above that, though.
wolfhumble 28 days ago
Going from 500GB to 2TB built-in is €600 in the US at the moment.

Samsung 990PRO 2TB is $170 and Acasis T5 80Gbps is €300. So it makes sense to buy external for ≥ 2TB, more flexible as well :-)

For 1TB it makes more sense to buy built-in as you note above.

sethd 27 days ago
It also depends on the quality of that enclosure and whether or not it adds heat or fan noise.
iamleppert 29 days ago
The hassle of not being able to work with native CUDA isn't worth it for a huge amount of AI. Good luck getting that latest paper or code working quickly just to try it out if the author didn't explicitly target M4 (unlikely for all but the most mainstream stuff).
darkwater 29 days ago
In a homelab scenario, having your own AI assistant not run by someone else, that is not an issue. If you want to tinker/learn AI it's definitely an issue.
ollybee 29 days ago
The middle ground is to rent a GPU VPS as needed. You can get an H100 for $2/h. Not quite the same privacy as fully local offline, but better than a SaaS API and good enough for me. Hopefully in a year or three it will truly be cost effective to run something useful locally, and then I can switch.
anonzzzies 29 days ago
That is what I do, but it costs a lot of $, more than just using OpenRouter. I would like to have a machine so I can have a model talk to itself 24/7 for a relatively fixed price. I have enough solar and wind + cheap net electric, so it would basically be free after buying. Just hard to pick what to buy without forking out a fortune on GPUs.
1shooner 29 days ago
Do you have a recommended provider or other pointers for GPU rental?
birktj 29 days ago
I was wondering if anyone here has experimented with running a cluster of SBCs for LLM inference? E.g. the Radxa ROCK 5C has 32GB of memory and also an NPU, and only costs about 300 euros. I'm not super up to date on the architecture of modern LLMs, but as far as I understand you should be able to split the layers between multiple nodes? It is not that much data that needs to be sent between them, right? I guess you won't get quite the same performance as a modern Mac or Nvidia GPU, but it could be quite acceptable and possibly a cheap way of getting a lot of memory.

On the other hand, I am wondering what the state of the art is in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into the GPU for initial prompt processing and then switching to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (a few layers at a time so they fit in VRAM) and then switch to the CPU for the memory-bound token generation.

Eisenstein 29 days ago
> I was wondering if anyone here has experimented with running a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB of memory and also a NPU and only costs about 300 euros.

Look into RPC. Llama.cpp supports it.

* https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...

> Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.

Moving layers over the PCIe bus to do this is going to be slow, which seems to be the issue with that strategy. I think the key is to use MoE and be smart about which layers go where. This project seems to be doing that with great results:

* https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
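For the plain CPU+GPU split, the common approach today is still a static partial offload (some layers resident in VRAM, the rest on the CPU) rather than the staged "GPU for prompt processing, then CPU for generation" scheme described above. A minimal sketch with the llama-cpp-python bindings, where the model path and layer count are placeholders and the right n_gpu_layers is simply whatever fits your VRAM:

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/qwen2.5-coder-32b-q4_k_m.gguf",  # placeholder path
        n_gpu_layers=40,   # layers kept in VRAM; -1 offloads everything, 0 is CPU-only
        n_ctx=8192,        # context window
    )

    out = llm("Explain what layer offloading does, in one paragraph.", max_tokens=200)
    print(out["choices"][0]["text"])

The RPC route linked above spreads those same layers across machines instead of across devices in one box.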

dgrabla 29 days ago
Great breakdown! The "own your own AI" at home is a terrific hobby if you like to tinker, but you are going to spend a ton of time and money on hardware that will be underutilized most of the time. If you want to go nuts, check out Mitko Vasilev's dream machine. It makes no sense if you don't have a very clear use case that only requires small models or really slow token generation speeds.

If the goal however is not to tinker but to really build and learn AI, it is going to be financially better to rent those GPUs/TPUs as needs arise.

theshrike79 29 days ago
Any M-series Mac is "good enough" for home LLMs. Just grab LM studio and a model that fits in memory.

Yes, it will not rival OpenAI, but it's 100% local with no monthly fees and depending on the model no censoring or limits on what you can do with it.

jrm4 29 days ago
For what purpose? I'm asking this as someone who threw in one of the cheap $500 Nvidias with 16GB of VRAM, and I'm already overwhelmed with what I can do with Ollama, Krita+ComfyUI etc. etc.
lioeters 29 days ago
> spend a ton of time and money

Not necessarily. For non-professional purposes, I've spent zero dollars (no additional memory or GPU) and I'm running a local language model that's good enough to help with many kinds of tasks including writing, coding, and translation.

It's a personal, private, budget AI that requires no network connection or third-party servers.

ImPostingOnHN 29 days ago
on what hardware (and how much did you spend on it)?
memhole 29 days ago
This is correct. The cost makes no sense outside of hobby and interest. You're far better off renting. I think there is some merit to having a local inference server if you're doing development. You can manage models and have a little more control over your infra as the main benefits.
JKCalhoun 29 days ago
Terrific hobby? Sign me up!
miniwark 29 days ago
2 x Nvidia Tesla P40 cards for €660 is not a thing I consider to be "on a budget".

People can play with "small" or "medium" models on less powerful and cheaper cards. An Nvidia GeForce RTX 3060 card with "only" 12GB VRAM can be found for around €200-250 on the second hand market (and they are around 300~350 new).

In my opinion, 48GB of VRAM is overkill to call it "on a budget"; for me this setup is nice, but it's for semi-professional or professional usage.

There is of course a trade-off to using medium or small models, but being "on a budget" is also about making trade-offs.

whywhywhywhy 29 days ago
> A Nvidia Geforce RTX 3060 card with "only" 12Gb VRAM can be found around €200-250 on second hand market

A 1080 Ti might even be a better option: it also has a 12GB model and some reports say it even outperforms the 3060, in non-RTX workloads I presume.

Eisenstein 29 days ago
CUDA compute capability is a big deal. The 1080 Ti is 6.1, the 3060 is 8.6. The 3060 also has tensor cores.

Note that CUDA version numbers are confusing, the compute number is a different thing than the runtime/driver version.

Melatonic 29 days ago
Not sure what used prices are like these days but the Titan XP (similar to the 1080 ti) is even better
mock-possum 29 days ago
Yeesh, yeah, that was my first thought too - whose budget??

less than $500 total feels more fitting as a ‘budget’ build - €1700 is more along the lines of ‘enthusiast’ or less charitably “I am rich enough to afford expensive hobbies”

If it’s your business and you expect to recoup the cost and write off the cost on your taxes, that’s one thing - but if you’re just looking to run a personal local LLM for funnies, that’s not an accessible price tag.

I suppose “or you could just buy a Mac” should have tipped me off though.

cwoolfe 29 days ago
As others have said, a high powered Mac could be used for the same purpose at a comparable price and lower power usage. Which makes me wonder: why doesn't Apple get into the enterprise AI chip game and compete with Nvidia? They could design their own ASIC for it with all their hardware & manufacturing knowledge. Maybe they already are.
gmueckl 29 days ago
The primary market for such a product would be businesses. And Apple isn't particularly good at selling to companies. The consumer product focus may just be too ingrained to be successful with such a move.

A beefed up home pod with a local LLM-based assistant would be a more typical Apple product. But they'd probably need LLMs to become much, much more reliable to not ruin their reputation over this.

fragmede 29 days ago
Why? Siri's still total crap but that doesn't seem to have slowed down iPhone sales.
gmueckl 29 days ago
Siri mostly hit the expectations they themselves were able to set through their ads when launching that product - having a voice based assistant at all was huge back then. With an LLM-based assistant, the market has set the expectations for them and they are just unreasonably high and don't mirror reality. That's a potentially big trap for Apple now.
lolinder 29 days ago
> And Apple isn't particularly good at selling to companies.

With a big glaring exception: developer laptops are overwhelmingly Apple's game right now. It seems like they should be able to piggyback off of that, given that the decision makers are going to be in the same branch of the customer company.

jrm4 29 days ago
For roughly the same reason Steve Jobs et al killed Hypercard; too much power to the users.
gregwebs 29 days ago
The problem for me with making such an investment is that next month a better model will be released. It will either require more or less RAM than the current best model- making it either not runnable or expensive to run on an overbuilt machine.

Using cloud infrastructure should help with this issue. It may cost much more per run but money can be saved if usage is intermittent.

How are HN users handling this?

michaelt 29 days ago
Among people who are running large models at home, I think the solution is basically to be rich.

Plenty of people in tech earn enough to support a family and drive a fancy car, but choose not to. A used RTX 3090 isn't cheap, but you can afford a lot of $1000 GPUs if you don't buy that $40k car.

Other options include only running the smaller LLMs; buying dated cards and praying you can get the drivers to work; or just using hosted LLMs like normal people.

tempoponet 29 days ago
Most of these new models release several variants, typically in the 8b, 30b, and 70b range for personal use. YMMV with each, but you usually use the models that fit your hardware, and the models keep getting better even in the same parameter range.

To your point about cloud models, these are really quite cheap these days, especially for inference. If you're just doing conversation or tool use, you're unlikely to spend more than the cost of a local server, and the price per token is a race to the bottom.

If you're doing training or processing a ton of documents for RAG setups, you can run these in batches locally overnight and let them take as long as they need, only paying for power. Then you can use cloud services on the resulting model or RAG for quick and cheap inference.

idrathernot 29 days ago
There is also an overlooked “tail risk” with cloud services that can end up costing you more than a few entire on-premise rigs if you don’t correctly configure services or forget to shut down a high end VM instance. Yeah, you can implement additional scripts and services as a fail-safe, but this adds another layer of complexity that isn’t always trivial (especially for a hobbyist).

I’m not saying that dumping $10k into rapidly depreciating local hardware is the more economical choice, just that people often discount the likelihood and cost of making mistakes in the cloud during their evaluations and the time investment required to ensure you have the correct safeguards in-place.

anon373839 28 days ago
Yes. And somehow, those cloud providers just can’t seem to work out how to build a spend limit feature for customers who’d like to prevent that. It must be a really difficult engineering problem…
nickthegreek 29 days ago
I plan to wait for the NVIDIA Digits release and see what the token/sec is there. Ideally it will work well for at least 2-3 years then I can resell and upgrade if needed.
3s 29 days ago
Exactly! While I have llama running locally on RTX and it’s fun to tinker with, I can’t use it for my workflows and don’t want to invest 20k+ to run a decent model locally

> How are HN users handling this?

I’m working on a startup for end-to-end confidential AI using secure enclaves in the cloud (think of it like extending a local+private setup to the cloud with verifiable security guarantees). Live demo with DeepSeek 70B: chat.tinfoil.sh

JKCalhoun 29 days ago
I think the solution is already in the article and comments here: go cheap. Even next year the author will still have, at the very least, their P40 setup running late 2024 models.

I'm about to plunge in as others have to get my own homelab running the current crop of models. I think there's no time like the present.

walterbell 29 days ago
> expensive to run on an overbuilt machine

There's a healthy secondary market for GPUs.

xienze 29 days ago
The price goes up dramatically once you go past 12GB though, that’s the problem.
JKCalhoun 29 days ago
Not on these server GPUs.

I'm seeing 24GB M40 cards for $200, 24GB K80 cards for $40 on eBay.

xienze 29 days ago
Well OK, I should have been more specific that, even for server GPUs on eBay:

* Cheap

* Fast

* Decent amount of RAM

Pick two.

These old GPUs are as cheap as they are because they don’t perform well.

JKCalhoun 29 days ago
That's fair. From what I read though I think there is some interplay between Fast and Decent amount of RAM. Or at least there is a large falloff in performance when RAM is too small.

So Cheap and Decent amount of RAM work for me.

diggan 29 days ago
> How are HN users handling this?

Combine the best of both worlds. I have a local assistant (communicate via Telegram) that handles tool-calling and basic calendar/todo management (running on a RTX 3090ti), but for more complicated stuff it can call out to more advanced models (currently using OpenAI APIs for this), provided the request itself doesn't involve personal data; if it does, it flat out refuses, for better or worse.
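A rough sketch of that kind of routing, with placeholder model names and a deliberately naive privacy check (in practice the classification step is the part that needs real care):

    import requests

    def looks_personal(prompt: str) -> bool:
        # Deliberately naive placeholder; a real setup might ask the local
        # model itself to classify the request.
        keywords = ("calendar", "todo", "email", "address", "my ")
        return any(k in prompt.lower() for k in keywords)

    def ask_local(prompt: str) -> str:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]

    def ask_hosted(prompt: str) -> str:
        # Stand-in for an OpenAI/Anthropic/etc. call; anything flagged as
        # personal never reaches this branch.
        raise NotImplementedError("wire up your hosted provider here")

    def route(prompt: str) -> str:
        return ask_local(prompt) if looks_personal(prompt) else ask_hosted(prompt)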

refibrillator 29 days ago
Pay attention to IO bandwidth if you’re building a machine with multiple GPUs like this!

In this setup the model is sharded between cards so data must be shuffled through a PCIe 3.0 x16 link which is limited to ~16 GB/s max. For reference that’s an order of magnitude lower than the ~350 GB/s memory bandwidth of the Tesla P40 cards being used.

Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.

Building on a budget is really hard. In my experience 5-15 tok/s is a bit too slow for use cases like coding, but I admit once you’ve had a taste of 150 tok/s it’s hard to go back (I’ve been spoiled by RTX 4090 with vLLM).

Miraste 29 days ago
Unless you run the GPUs in parallel, which you have to go out of your way to do, the IO bandwidth doesn't matter. The cards hold separate layers of the model, they're not working together. They're only passing a few kilobytes per second between them.
Xenograph 29 days ago
Which models do you enjoy most on your 4090? and why vLLM instead of ollama?
ekianjo 29 days ago
> Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.

How would you setup NVLink, if the cards support it?

zinccat 29 days ago
I feel that you are mistaking the two bandwidth numbers
gwern 29 days ago
> Another important finding: Terry is by far the most popular name for a tortoise, followed by Turbo and Toby. Harry is a favorite for hares. All LLMs are loving alliteration.

Mode-collapse. One reason that the tuned (or tuning-contaminated models) are bad for creative writing: every protagonist and place seems to be named the same thing.

diggan 29 days ago
Couldn't you just up the temperature/change some other parameter to get it to be more random/"creative"? It wouldn't be active/intentional randomness/novelty like what a human would do, but at least it shouldn't generate exactly the same naming.
gwern 29 days ago
No. The collapse is not as simple as simply shifting down most of the logits, so ramping up the temperature does little until outputs start degenerating.
lewisl9029 29 days ago
This article is coming out at an interesting time for me.

We probably have different definitions of "budget", but I just ordered a super janky eGPU setup for my very dated 8th gen Intel NUC, with an m2->pcie adapter, a PSU, and a refurb Intel A770, for about 350 all-in. Not bad considering that's about the cost of a proper Thunderbolt eGPU enclosure alone.

The overall idea: A770 seems like a really good budget LLM GPU since it has more memory (16GB) and more memory bandwidth (512GB/s) than a 4070, but costs a tiny fraction. The m2-pcie adapter should give it a bit more bandwidth to the rest of the system than Thunderbolt as well, so hopefully it'll make for a decent gaming experience too.

If the eGPU part of the setup doesn't work out for some reason, I'll probably just bite the bullet and order the rest of the PC for a couple hundred more, and return the m2-pcie adapter (I got it off of Amazon instead of Aliexpress specifically so I could do this), looking to end up somewhere around 600 bux total. I think that's probably a more reasonable price of entry for something like this for most people.

Curious if anyone else has experience with the A770 for LLM? Been looking at Intel's https://github.com/intel/ipex-llm project and it looked pretty promising, that's what made me pull the trigger in the end. Am I making a huge mistake?

UncleOxidant 29 days ago
> refurb Intel A770 for about 350

I'm seeing A770s for about $500 - $550. Where did you find a refurb one for $350 (or less since you're also including other parts of the system)

lewisl9029 29 days ago
I got this one from Acer's Ebay for $220: https://www.ebay.com/itm/266390922629

It's out of stock now unfortunately, but it does seem to pop up again from time to time according to Slickdeals: https://slickdeals.net/newsearch.php?q=a770&pp=20&sort=newes...

I would probably just watch the listing and/or set up a deal alert on Slickdeals and wait. If you're in a hurry though, you can probably find a used one on Ebay for not too much more.

rcarmo 29 days ago
Given the power and noise involved, a Mac Mini M4 seems like a much nicer approach, although the RAM requirements will drive up the price.
dandanua 29 days ago
I doubt it is that efficient. Even though it has 48GB of VRAM, it's more than twice as slow as a single 3090 GPU.

In my budget AI setup I use a 7840 Ryzen based miniPC with a USB4 port and connect a 3090 to it via an eGPU adapter (ADT-link UT3G). It cost me about $1000 total and I can easily achieve 35 t/s with qwen2.5-coder-32b using ollama.

mrbonner 29 days ago
Wouldn't an eGPU defeat the purpose of having fast memory bandwidth? Have you tried it with Stable Diffusion?
dandanua 29 days ago
The 40Gbps of USB4 is plenty. I tried these PyTorch benchmarks https://github.com/aime-team/pytorch-benchmarks/ and saw only a 10% drop in performance, and no drop at all for LLM inference once the model is already loaded into VRAM.
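The reason the link barely matters for inference: once the weights sit in VRAM, only prompts, generated tokens, and small activations cross it, so the one-time weight transfer dominates. A back-of-the-envelope sketch of load times; the effective throughput figures below are assumptions, not measurements:

```python
# Back-of-the-envelope model-load times over different host-to-GPU links.
# The effective GB/s figures are rough assumptions, not measured values.
model_gb = 20.0                                   # e.g. a ~32B model at 4-5 bits/weight
links_gb_per_s = {
    "USB4 / Thunderbolt eGPU (~40 Gbit/s)": 3.0,
    "PCIe 3.0 x4 via M.2 adapter":          3.0,
    "PCIe 4.0 x16":                         25.0,
}
for name, gb_per_s in links_gb_per_s.items():
    print(f"{name}: ~{model_gb / gb_per_s:.0f} s to load {model_gb:.0f} GB of weights")
```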
mrbonner 28 days ago
Wow, that makes sense now if you can load the entire model into VRAM. What eGPU dock and GPU setup do you use, if you don't mind?
dandanua 28 days ago
The ADT-Link UT3G is the eGPU dock; you just need an ordinary PSU to power it and the GPU (I use an old one I had lying around). Though to make everything work you might need some time to iron out quirks: e.g., I had to use an "nvidia error 43 fixer" on Windows, and find the right configuration on my NixOS system (you need to load the Nvidia driver into the kernel during boot, etc.). Here is how it looks: https://imgur.com/a/qySDN4n
cratermoon 29 days ago
From the article: In the future, I fully expect to be able to have a frank and honest discussion about the Tiananmen events with an American AI agent, but the only one I can afford will have assumed the persona of Father Christmas who, while holding a can of Coca-Cola, will intersperse the recounting of the tragic events with a joyful "Ho ho ho... Didn't you know? The holidays are coming!"

How unfortunate that people are discounting the likelihood that American AI agents will avoid saying things their masters think should not be said. Anyone want to take bets on when the big three (OpenAI, Meta, and Google) will quietly remove anything to do with DEI, trans people, or global warming? They'll start by changing all mentions of "Gulf of Mexico" to "Gulf of America", but then what?

CamperBob2 29 days ago
It is easy to get a local R1 model to talk about Tiananmen Square to your heart's content. Telling it to replace problematic terms with "Smurf" or another nonsense word is very effective, but with the local model you don't even have to do that in many cases. (e.g., https://i.imgur.com/btcI1fN.png)
cratermoon 29 days ago
Ah yes, evidence of shadow prompting[1]. Likely their API is given additional, hidden prompt context making Tiananmen verboten.

1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...

hexomancer 29 days ago
Isn't the fact that the P40 has horrible fp16 performance a deal-breaker for local setups?
behohippy 29 days ago
You probably won't be running anything at fp16 locally. We typically run Q5 or Q6 quants to maximize the model size and context length we can fit in the VRAM available. The quality loss is negligible at Q6.
Eisenstein 29 days ago
But the inference doesn't necessarily run at the quant precision.
wkat4242 29 days ago
As far as I understand it does, if you also quantize the K/V cache (the context). And that's pretty standard now because it can increase the maximum context size a lot.
Eisenstein 29 days ago
It is available in most inference engines, but I wouldn't call it standard use, as it can degrade quality tremendously.
wkat4242 28 days ago
Even at q8_0? I thought it wasn't bad, just like with the models themselves. But I'm very interested to hear more.

And q8_0 already halves the memory usage compared to fp16.

One of the Ollama devs called the quality impact negligible at q8_0: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

But perhaps quantizing the KV cache doesn't scale as gracefully as quantizing the model itself?
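For a sense of scale, the KV cache is roughly 2 (K and V) x layers x KV heads x head_dim x context length x bytes per element. The sketch below uses assumed Llama-3-8B-style dimensions (check the model card for real values) and treats q8_0 as about one byte per element, ignoring block scales:

```python
# Rough KV-cache size estimate. Dimensions are assumed Llama-3-8B-style values
# (32 layers, 8 KV heads via GQA, head_dim 128) -- check the model card.
def kv_cache_gb(layers=32, kv_heads=8, head_dim=128, ctx=8192, bytes_per_elem=2.0):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

print(f"fp16 KV cache @ 8k context: {kv_cache_gb(bytes_per_elem=2.0):.2f} GB")
print(f"q8_0 KV cache @ 8k context: {kv_cache_gb(bytes_per_elem=1.0):.2f} GB (about half)")
```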

Eisenstein 28 days ago
It depends heavily on the model and how the context is used. A model like command-r, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks that depend heavily on context, like translation or evaluation, will be more affected than, say, code generation or creative output.
behohippy 28 days ago
Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping top_p a bit. I think Qwen likes lower temps too.
Eisenstein 28 days ago
We are talking about dynamically quantizing KV cache, not the model weights.
behohippy 27 days ago
I run the KV cache at Q8 even on that model. Is it not working well for you?
wkat4242 28 days ago
Interesting, I didn't know that. I thought it was basically a 'free' space saving. Would you happen to know how Llama 3.1 fares?
numpad0 29 days ago
Is it even cheaper in $/GB than a used Vega 56 (8GB HBM2)? There are mining boards with a bunch of x1 slots that could probably run half a dozen of them for the same 48GB.
magicalhippo 29 days ago
AFAIK this doesn't really work for interactive use, because the model's layers get split across the cards and each token has to pass through all of them in sequence. That means a lot of PCIe traffic and hence latency. Better than nothing, but only really useful if you can batch requests so each GPU stays busy, rather than working one at a time.
numpad0 28 days ago
Clearly I wasn't aware of how tightly interconnected a DNN is by default. Makes sense that it gets bottlenecked by the slowest link. Thanks...
42lux 29 days ago
Would take a bunch of time just to load the model...
Havoc 29 days ago
As best I can tell, most of the disadvantages relate to larger batches, and for home use you're likely running a batch size of 1 anyway.
apples_oranges 29 days ago
Does having 2x24GB of VRAM mean that a model between 24 and 48 GB in size can be fully loaded into memory? I somehow doubt it; at least I don't think ollama works like that. But does anyone know?
memhole 29 days ago
No. Hopefully someone with more knowledge can explain better, but my understanding is that you also need room for the KV cache, and you need to factor in the size of the context window. If anyone has good resources on this, that would be awesome. Presently it feels very much like a dark art to host these models without crashing or being massively over-provisioned.
htrp 29 days ago
The dark art is to massively overprovision hardware.
cratermoon 29 days ago
Thus you get OpenAI spending billions while DeepSeek comes along with (shocking!) actual understanding of the hardware and how to optimize for it[1] and spends $6 million[2].

1. https://arxiv.org/abs/2412.19437v1

2. Quibble over the exact figure if you like; it's far less than OpenAI, doing more with less.

michaelt 29 days ago
For a great many LLMs, you can find someone on HuggingFace who has produced a set of different quantised versions, with approximate RAM requirements.

For example, if you want to run "CodeLlama 70B" from https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF there's a table saying the "Q4_K_M" quantised version is a 41.42 GB download and runs in 43.92 GB of memory.

Eisenstein 29 days ago
No, you need to have extra space for the context (which requires more space the larger the model is).

But it should be said that judging model quality by its size in GB is like judging a video by its size in GB. The same video can be small or huge, with anywhere from negligible to enormous differences in quality between the two.

You will be running quantized model weights, which can range in precision from 1 to 16 bits per parameter (the "B" in the model name is the parameter count in billions). At Q8, the weights generally take roughly as many GB as the model has billions of parameters (Llama 3 8B at Q8 would be ~8GB). There are many different quantization strategies as well, so this is just a rough guide.

So basically if you can't fit the 48GB model into your 48GB of VRAM, just download a lower precision quant.
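As a rough worked example of the weights-plus-context arithmetic (every dimension below is an assumption for illustration): a 70B model at ~4.5 bits per weight plus an 8k-token fp16 KV cache lands around 44 GB, which is consistent with the Q4_K_M figure quoted above.

```python
# Very rough "will it fit in 2x24GB?" estimate: quantized weights + KV cache
# + a couple of GB of overhead. Every number here is an assumption for illustration.
def weights_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8          # 70B at ~4.5 bpw ~ 39 GB

def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem=2.0):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

total = (weights_gb(70, 4.5)
         + kv_cache_gb(layers=80, kv_heads=8, head_dim=128, ctx=8192)  # assumed dims
         + 2.0)                                          # buffers, runtime overhead
print(f"~{total:.0f} GB needed vs 48 GB across two 24 GB cards")
```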

brianzelip 29 days ago
Useful recent podcast about homelab LLMs, https://changelog.com/friends/79
whalesalad 29 days ago
How much does this get you in cloud compute or OpenAI/Anthropic API credits? For ~$1700 I can accomplish way more without local hardware. Don't get me wrong, I enjoy tinkering and building projects like this, but it doesn't make financial sense to me here. Unless, of course, you live 100% off-grid and have Stallman-level privacy concerns.

Of course I do want my own local GPU compute setup, but the juice just isn't worth the squeeze.

rjurney 29 days ago
A lot of people build personal deep learning machines. The economics and convenience can definitely work out... I am confused, however, by "dummy GPU" - I searched the article for "dummy" but didn't find an explanation. Modern motherboards all include integrated video output, so I'm not sure what this would be for?

My personal DL machine has a 24 core CPU, 128GB RAM and 2 x 3060 GPUs and 2 x 2TB NVMe drives in a RAID 1 array. I <3 it.

T-A 29 days ago
Look under "Available Graphics" at

https://www.hp.com/us-en/shop/mdp/business-solutions/z440-wo...

No integrated graphics.

Author's explanation of the problem:

The Teslas are intended to crunch numbers, not to play video games with. Consequently, they don't have any ports to connect a monitor to. The BIOS of the HP Z440 does not like this. It refuses to boot if there is no way to output a video signal.

rjurney 23 days ago
Okay, wow I've never heard of one of those!
renewiltord 29 days ago
I have 6x 4090s in a rack with an Epyc driving them, but tbh I am selling them all to get a Mac Studio. Simpler to work with, I think.
DrPhish 29 days ago
This is just a limited recreation of the ancient mikubox from https://rentry.org/lmg-build-guides

It's funny to see people independently "discover" these builds that are a year-plus old.

Everyone is sleeping on these guides, but I guess the stink of 4chan scares people away?

cratermoon 29 days ago
"ancient" guide. Pub: 10 May 2024 21:48 UTC
DrPhish 29 days ago
The build guide index page is newer, but to be fair, the mikubox rentry is from Oct 6, 2023.

If that isn't "ancient" in terms of AI workstation build guides, then I don't know what is.

walterbell 29 days ago
One reason to bother with private AI: cloud AI ToS for consumers may have legal clauses about usage of prompt and context data, e.g. data that is not already on the Internet. Enterprise customers can exclude their data from future training.

https://stratechery.com/2025/deep-research-and-knowledge-val...

> Unless, of course, the information that matters is not on the Internet. This is why I am not sharing the Deep Research report that provoked this insight: I happen to know some things about the industry in question — which is not related to tech, to be clear — because I have a friend who works in it, and it is suddenly clear to me how much future economic value is wrapped up in information not being public. In this case the entity in question is privately held, so there aren’t stock market filings, public reports, barely even a webpage! And so AI is blind.

(edited for clarity)

icepat 29 days ago
ToS can change. Companies can (and do) act illegally. Data breaches happen. Insider threats happen.

Why trust the good will of a company, over a box that you built yourself, and have complete control over?

rovr138 29 days ago
Cost, for one.
almosthere 29 days ago
I bought a Mac M4 Mini (the cheapest one) at Costco for $559, and while I don't know exactly how many tokens per second it gets, it seems to generate text from Llama 3.2 (through Ollama) about as fast as ChatGPT.
pg5 28 days ago
I can run the 17B DeepSeek models (I know, these smaller ones are not actually DeepSeek) on my old 1080 Ti gaming desktop with 64 GB of RAM. Not exactly speedy, but pretty neat nonetheless.
robblbobbl 29 days ago
Deeply true words. I'm sorry for the author, but thanks for the article!
asasidh 29 days ago
You can run 32B and even 70B (a bit slow) models on an M4 Pro Mac mini with 48 GB of RAM, out of the box using Ollama. If you enjoy putting together a desktop, that's understandable.

https://deepgains.substack.com/p/running-deepseek-locally-fo...

gytisgreitai 29 days ago
1.7k€ and 300W for a playground. Man, this world is getting crazy and I'm getting f-kin old by not understanding it.
_boffin_ 29 days ago
For around 1800, I was able to get myself a Dell T5820 with 2x Dell 3090s. Can’t complain at all.
rootsudo 28 days ago
I did something similar and bought a Mac mini. Everyone’s budget is different, right?
windex 28 days ago
I've been running DeepSeek R1 1.5B on an RPi 4. Slow as hell, but satisfying.
zinccat 29 days ago
The P40 doesn't support fp16 well; buy a 3090 instead.
JKCalhoun 29 days ago
Cost is about 4X for the 24 GB 3090 on eBay.
brador 29 days ago
You can run 8B models locally on the latest iPhones.
0xEF 29 days ago
How useful is this, though? In my modest experience, these tiny models aren't good for much more than tinkering with, definitely not something I'd integrate into my workflow since the output quality is pretty low.

Again, though, my experience is limited. I imagine others know something I do not and would absolutely love to hear more from people who are running tiny models on low-end hardware for things like code assistance, since that's where the use-case would lie for me.

At the moment, I subscribe to "cloud" models that I use for various tasks and that seems to be working well enough, but it would be nice to have a personal model that I could train on very specific data. I'm sure I am missing something, since it's also hard to keep up with all the developments in the Generative AI world.

DemetriousJones 29 days ago
I tried running the 8B model on my 8GB M2 MacBook Air through Ollama and it was awful. It took ages to do anything and the responses were bad at best.
redman25 29 days ago
Doesn't 8B need at least 16GB of RAM? Otherwise you're swapping, I would imagine...
blebo 29 days ago
Depends on quantization selected - see https://www.canirunthisllm.net/
jll29 29 days ago
Can we just call it "PC to run ML models on" on a budget?

"AI computer" sounds pretentious and misleading to outsiders.

hemant1041 29 days ago
Interesting read!