This web page was created programmatically, to learn the article in its unique location you possibly can go to the hyperlink bellow:
https://blog.tymscar.com/posts/v100localllm/
and if you wish to take away this text from our web site please contact us
I already had an RTX 4080. 16GB of VRAM. Good sufficient for gaming, not ok for the fashions I wished to run regionally. The subsequent step up in GPU land is both lay our a fortune on a card with extra VRAM, or discover one other approach.
I discovered one other approach.
I purchased a datacenter GPU that doesn’t also have a regular PCIe connector, caught it in my gaming PC with an adapter, and now I’ve 32GB of VRAM throughout two GPUs operating a 27 billion parameter mannequin at 32 tokens per second. The entire factor price me £200.
This is a Tesla V100 SXM2 16GB. It was designed for NVIDIA’s DGX servers and hyperscaler racks. The SXM2 type issue means it doesn’t have a PCIe slot. It doesn’t have show outputs. It doesn’t have a standard energy connector. It sits on a proprietary board inside a server rack and communicates over NVLink.
You can’t plug this right into a motherboard. Not with out assist.
But right here is the factor: it is a Volta GPU with 16GB of HBM2 reminiscence, 5120 CUDA cores, and I picked it up for about £150 on eBay. The compute remains to be actual. The VRAM remains to be actual. And the reminiscence bandwidth is the place it will get genuinely stunning.
HBM2 is a distinct class of reminiscence. The V100 has a 4096-bit reminiscence bus delivering 900 GB/s of bandwidth. To put that in perspective, my RTX 4080 with its fancy GDDR6X manages 736 GB/s. The V100 from 2017 has 22% extra reminiscence bandwidth than a GPU that launched in 2022.
And it’s not simply NVIDIA’s client playing cards that lose. Apple’s M3 Max does 400 GB/s. The M4 Max does 546 GB/s. The model new M5 Max, which can set you again over £3,000 for a laptop computer, manages 614 GB/s. A GPU from 2017 beats each Mac in the marketplace.
The closest AMD competitors to my 4080 is the RX 7900 XTX, which does 960 GB/s on its 24GB of GDDR6. Technically that edges out the V100, however the 7900 XTX prices £700+ and ROCm assist for LLM inference remains to be tough in comparison with CUDA. The V100 offers you 94% of that bandwidth for lower than 1 / 4 of the value, and it simply works with llama.cpp.
The solely client GPU that comfortably beats it’s the RTX 5090 at 1,792 GB/s, and that card prices over £2,000. For LLM inference, the place reminiscence bandwidth is the bottleneck that determines your tokens per second, this issues greater than nearly the rest.
The solely drawback is the connector.
Turns out, somebody makes an SXM2-to-PCIe adapter. It shouldn’t be made by NVIDIA. It shouldn’t be formally supported by anybody. It is a naked PCB with the SXM2 socket on one aspect and a PCIe edge connector on the opposite. I paid about £50 for it. Half of that may simply be the copper.
So for about £200 complete, I had a 16GB VRAM GPU that might slot into my motherboard alongside my RTX 4080. That is 32GB of complete VRAM. A single RTX 5090 with 32GB prices over £2,000. I’m not saying this is similar expertise. I’m saying the VRAM is similar.
Before I may do something helpful with the V100, I needed to take care of the fan.
The V100 SXM2 was designed to reside inside a 2U server with industrial cooling. The fan on the adapter shouldn’t be refined. It shouldn’t be quiet. It shouldn’t be one thing you need in a room you additionally sleep in.
I measured it with my Apple Watch:
82 decibels. That is someplace between a rubbish disposal and a lawnmower, effectively previous “loud PC” and into “should I be wearing earplugs in my own house” territory.
And the worst half: you can’t management it. I attempted nvidia-smi, I attempted scanning for it on Linux, I even tried Afterburner on Windows (extra on that later, the entire setup barely works on Windows). Nothing. The fan on this adapter shouldn’t be designed to be managed. It is designed to run at 100%, endlessly, inside a server rack the place no one has to listen to it.
Here is me attempting to determine the fan pinout. I guessed it could be a typical case fan pinout on a bizarre connector, so I jammed two jumper wires into VCC and floor and prodded a 9V battery in opposition to them. It spun. And it was a lot quieter than the 12V it usually will get:
That confirmed the pinout and gave me hope that the fan may really be tamed.
The 9V battery check informed me the pinout was commonplace case fan territory, simply with a bizarre connector. The subsequent query was whether or not the fan would really reply to PWM management if I wired the tachometer and PWM pins to my motherboard.
So I shoved some jumper wires into the connector and jammed the opposite ends right into a spare fan header (flip your quantity up):
It works. The motherboard can learn the RPM and the fan responds to PWM. I maintain it at 10%. It by no means goes above 50C even at full load, and I can’t actually hear it.
Now I simply wanted a correct cable as a substitute of jumper wires held in by hope.
The fan connector on the adapter is a small JST PH2.0 plug with 4 pins. Motherboard fan headers use a typical 0.1 inch (2.54mm) pitch. The GPU fan makes use of a 2.0mm JST PH connector. The pins are nearer collectively and the plug is smaller.
The answer was a 2.54mm male to PH2.0 feminine jumper cable. The feminine PH2.0 finish plugs into the fan’s tachometer and PWM pins, and the male 2.54mm finish goes right into a spare fan header on the motherboard:
That went from 82dB ear harm to one thing I can really reside with.
With the fan state of affairs dealt with, the V100 slotted proper in alongside my 4080:
llama.cpp can break up the mannequin throughout each GPUs utilizing tensor splitting. It pipelines the layers throughout the PCIe bus so the 4080 handles some layers and the V100 handles the remaining. It shouldn’t be as quick as having a single GPU with 32GB, however it works, and it price me roughly 10% of what a 32GB GPU would price. For what it’s price, essentially the most I’ve ever seen the V100 pull is round 150W. That shouldn’t be nothing, however it’s not out of this world for a GPU operating native LLM inference.
The V100 additionally is available in a 32GB variant. It prices greater than double what I paid, however we’re nonetheless speaking about just a few hundred kilos for 32GB of HBM2 reminiscence on a single card. Two of these would offer you 64GB of VRAM for roughly 20% of what an RTX 5090 prices in right now’s market.
You may cluster them. The SXM2 format helps NVLink natively, which implies if you’re constructing a correct multi-GPU setup, these playing cards can discuss to one another at very excessive bandwidth. Even by the PCIe adapter, the tensor break up efficiency is strong.
This half was surprisingly clean because of NixOS. The V100 is a Volta chip. NVIDIA dropped Volta assist beginning with driver department 560. The final driver that helps each my RTX 4080 (Ada) and the V100 (Volta) is department 550.x, which maps to nvidiaPackages.legacy_535 on NixOS.
That driver solely helps CUDA as much as 12.2. Current nixpkgs ships CUDA 12.6 minimal. So I needed to pull CUDA 12.2 from nixpkgs 24.05.
Also, the driving force requires kernel 6.6. Newer kernels usually are not supported with the legacy driver.
And here’s a bizarre one: although it is a headless inference server, providers.xserver.allow = true is required. Without it, the NVIDIA kernel modules don’t load.
NixOS made most of this easy. Here is the important thing configuration for getting the driving force and kernel proper:
boot.kernelPackages = pkgs.linuxPackages_6_6;
{hardware}.nvidia.package deal = config.boot.kernelPackages.nvidiaPackages.legacy_535;
providers.xserver.allow = true;
providers.xserver.videoDrivers = [ "nvidia" ];
And for loading CUDA 12.2 from an older nixpkgs for the reason that present one solely ships 12.6+:
nixpkgs.overlays = [
(final: prev: {
cudaPackages_12_2 = nixpkgs-cuda.legacyPackages.${prev.system}.cudaPackages_12_2;
})
];
The vital factor is: it really works. Both GPUs present up, CUDA is purposeful, and NixOS dealt with the entire thing elegantly. If you need to replicate this, the complete machine definition is in this commit on my dotfiles repo, together with the llama.cpp service definition and the customized construct pinned to the correct model.
I’m operating Qwen3.6-27B-MTP quantized at Q5_K_M, which is available in at about 19GB. With each GPUs, the complete mannequin suits in VRAM with room for context:
| Setting | Value |
|---|---|
| Model | Qwen3.6-27B-MTP Q5_K_M (19GB) |
| Context dimension | 128k tokens |
| GPU layers | 99 (all offloaded) |
| Tensor break up | -ts 1.0,1.0 (even throughout each GPUs) |
And the efficiency:
| Metric | Value |
|---|---|
| Inference velocity | ~32 tok/s |
| Prompt processing | ~133-160 tok/s |
32 tokens per second is quick sufficient for interactive use. It is quicker than most cloud API endpoints if you consider community latency. And that is with tensor splitting throughout two completely different GPU architectures linked by PCIe.
I need to be clear about one thing. This shouldn’t be “good for a local model.” This shouldn’t be “acceptable if you lower your expectations.” Qwen3.6-27B ties with Claude Sonnet 4.6 on Artificial Analysis’s Agentic Index. It beats Sonnet 4.6 on MMMU-Pro and Terminal-Bench 2.0. A 27 billion parameter mannequin operating on secondhand {hardware} is genuinely aggressive with the newest cloud fashions from Anthropic.
Yes, Sonnet 4.6 edges it out on GPQA and SWE-Bench Verified. It ought to, it’s a large proprietary mannequin. And sure, if you’d like the very best, Opus 4.8 exists. It additionally prices extra per 20 minutes of heavy use than I paid for this complete GPU and adapter setup mixed. But the hole is shockingly small. We have reached the purpose the place the mannequin you run in your bed room is in the identical dialog as those that cost you per token.
The MTP within the mannequin identify stands for Multi-Token Prediction. Normal LLM inference predicts one token at a time. Predict one token, settle for it, predict the subsequent token, repeat. MTP modifications this by having the mannequin predict a number of future tokens directly, then verifying which of them had been appropriate. Accepted tokens are primarily free. Wrong predictions fall again to the conventional path.
The result’s roughly 1.5-2x quicker technology with no accuracy loss. On my setup which means inference goes from round 32 tok/s to probably 50-60 tok/s when MTP hits its stride, particularly on predictable output like code.
The catch is that MTP assist in llama.cpp is new. The model in nixpkgs doesn’t assist the Qwen3.6 MTP structure, so I needed to construct llama.cpp from supply at a particular commit that added assist. On NixOS that is painless. I’ve a customized derivation pinned to the correct commit, and the entire thing is reproducible. When I need to replace the mannequin or change the llama.cpp model, I modify one line in my config, run nixos-rebuild change, and I’m finished. No dependency hell, no reinstalling by hand, no questioning whether or not I constructed in opposition to the correct CUDA model.
The Qwen3.6-27B mannequin helps picture enter by a separate multimodal projector file (mmproj). This is about 928MB additional, and it’s fascinating.
The approach it really works is {that a} imaginative and prescient encoder (just like what ChatGPT and Claude use) takes picture pixels and interprets them into the LLM’s token embedding area. The mannequin doesn’t “see” the picture the way in which a human does. Instead, the imaginative and prescient encoder compresses the picture right into a sequence of vectors that reside in the identical mathematical area as textual content tokens. The LLM then processes these vectors as in the event that they had been simply one other sequence of tokens.
What this implies in observe: you ship the mannequin a picture URL alongside your textual content immediate, and it could actually describe, analyze, and cause about what it sees. The complete imaginative and prescient functionality provides about 1GB to the mannequin dimension. That is it. One gigabyte and your native LLM can learn pictures.
In llama.cpp, the flags are easy:
--mmproj /mnt/nas/llamacpp/mmproj-F16.gguf --mmproj-offload
The --mmproj-offload flag masses the imaginative and prescient encoder onto GPU alongside the mannequin, so you continue to get quick inference even with pictures.
I exploit this setup with OpenCode, which is an AI coding assistant that may run in opposition to native fashions. The LLM server runs on my desktop, however I don’t use it from that machine. I exploit it from every other machine in my home over the community, or from outdoors over Tailscale (however that could be a weblog publish for an additional time). Pointing OpenCode on the llama.cpp server is so simple as setting the API URL. The mannequin runs regionally, the responses are quick, and nothing leaves my community.
All the fashions reside on my TrueNAS server, mounted by way of NFS:
fileSystems."/mnt/nas" = {
machine = "truenas-nfs.tymscar.com:/mnt/oasis/services";
fsType = "nfs";
choices = [ "nfsvers=4" "_netdev" "auto" "nofail" ];
};
The llama.cpp service depends upon mnt-nas.mount, so it doesn’t begin till the NAS is on the market. This means I can retailer terabytes of fashions with out worrying about native disk area.
The complete OS runs from a Corsair MP600 MINI in a DockCase USB-C NVMe enclosure. No inner drive modification wanted. When I need to recreation, I unplug the drive and reboot into my essential Windows set up, and recreation usually on the 4080. When I need to do LLM stuff, I plug the drive again in, reboot into NixOS, and each GPUs can be found.
This shouldn’t be as elegant as a dual-boot menu, however it’s easy and it really works. No GRUB, no bootloader conflicts, no partition administration. Just a bodily change.
The V100 often disappears from lspci and nvidia-smi after a heat reboot (the place the OS restarts however the motherboard stays powered). This appears to be an ACPI enumeration problem with the PCIe slot. A chilly reboot (bodily energy off, wait just a few seconds, energy again on) all the time restores it.
When the V100 is absent, llama.cpp fails to start out as a result of it can’t match the mannequin on a single 16GB GPU. The service crash-loops till the GPU comes again. This shouldn’t be a giant deal in observe since I’m normally round once I reboot, however it’s price realizing about. It offers me the identical vibes because the notorious AMD GPU reset bug, the place passing by an AMD GPU to a VM after which shutting it down leaves the GPU in a state that solely a full host energy cycle can repair.
For £200, I obtained:
The solely actual price was the noise, and I solved that with £2 price of jumper cables and a little bit of connector spelunking. The V100 shouldn’t be the quickest GPU for inference, and the tensor break up throughout two completely different architectures shouldn’t be as clear as a single GPU. But for the value, it’s absurdly good worth.
If you need to run correct fashions regionally, take a look at the secondhand server GPU market. You don’t even want an present GPU. I occur to have a 4080 in my gaming PC, however a single V100 in an affordable server field would offer you 16GB of VRAM and a wonderfully usable native LLM for little or no cash. The V100 SXM2 shouldn’t be the one choice. The P40 offers you 24GB for comparable cash, although it’s slower and has no Tensor Cores. The V100 32GB variant prices extra however nonetheless undercuts any client GPU with that a lot VRAM.
Just be prepared for the fan.
This web page was created programmatically, to learn the article in its unique location you possibly can go to the hyperlink bellow:
https://blog.tymscar.com/posts/v100localllm/
and if you wish to take away this text from our web site please contact us
This web page was created programmatically, to learn the article in its unique location you'll…
This web page was created programmatically, to learn the article in its unique location you'll…
This web page was created programmatically, to learn the article in its unique location you…
This web page was created programmatically, to learn the article in its authentic location you…
This web page was created programmatically, to learn the article in its unique location you'll…
This web page was created programmatically, to learn the article in its authentic location you…