I had to assemble a new PC for LLM inference work since my existing development machine lacked a discrete GPU, so running any local LLM was extremely slow. My aim is a dedicated development machine running Linux that will not be used for any gaming. I wanted to keep the budget between $1500 and $2500, and I wanted it to be quiet and compact. In the end, I had to make some compromises, and this is what I ended up with:
Components
Gigabyte GeForce RTX 4070 Ti Super 16GB GPU
Traditionally, a development machine didn't need a graphics card, but that changes when working with LLMs, so I started with the GPU and picked the Gigabyte GeForce RTX 4070 Ti Super. The two main makers are Nvidia and AMD, but AMD cards seem to be less well supported and would require more effort to get working. The general consensus is that VRAM is the most important factor for running LLMs, and you want to load as much of the model into VRAM as possible, so I aimed for 12-16GB*. I picked the Gigabyte card because its length is on the shorter side and it wouldn't require as large a case.
*A rough estimate of the memory a model needs is the number of parameters multiplied by 2 bytes (a 16-bit parameter type = 2 bytes). A 7B model would use 14GB, an 8B model 16GB, etc.
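That back-of-the-envelope rule can be sketched in a few lines of Python (the function name is my own; this estimates weights only and ignores context and runtime overhead):

```python
# Rough memory floor for a model stored as fp16: params x 2 bytes.
def fp16_gb(params_billion):
    # 2 bytes per 16-bit parameter; ignores KV cache and other overhead
    return params_billion * 2

print(fp16_gb(7))  # 7B model -> 14 (GB)
print(fp16_gb(8))  # 8B model -> 16 (GB)
```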
**"How come my GPU only has 12GB of VRAM but I can run an 8B model that would use 16GB?" You might be using a quantized (compressed) model, or you're using both the GPU's VRAM and system RAM. For example, if you download the default Llama 8B model from Ollama, it uses 4-bit quantization, so it doesn't take the full 16GB of memory. If you don't have enough VRAM, then Ollama will use both system RAM and VRAM:
Loading the 27B parameter Gemma model with 4-bit quantization shows that it requires 18GB of memory, and Ollama loaded 82% of it into the GPU's VRAM (~14.7GB):
> ollama ps
NAME ID SIZE PROCESSOR UNTIL
gemma2:27b 53261bc9c192 18 GB 18%/82% CPU/GPU 4 minutes from now
nvidia-smi shows that 14GB of the card's 16GB is in use, matching what Ollama reports:
> nvidia-smi
...
0 N/A N/A 2338 C ...unners/cuda_v12/ollama_llama_server 14020MiB
...
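Extending my earlier back-of-the-envelope rule to quantized models (again, my own rough estimate, not how Ollama computes its numbers): the weights alone come to parameters times bits-per-parameter divided by 8, and the gap between that and what Ollama reports is context (KV cache) and runtime overhead.

```python
# Back-of-the-envelope weight size for a quantized model.
# Estimates weights only; Ollama's reported size also includes
# context (KV cache) and runtime overhead.
def weights_gb(params_billion, bits_per_param):
    return params_billion * bits_per_param / 8

print(weights_gb(8, 16))  # 8B at fp16   -> 16.0 GB
print(weights_gb(8, 4))   # 8B at 4-bit  -> 4.0 GB
print(weights_gb(27, 4))  # 27B at 4-bit -> 13.5 GB (Ollama reports 18GB total)
```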
AMD Ryzen 7 7700X CPU (8 cores, 16 threads, 4.5GHz base, 5.5GHz max boost)
A solid performer for development work, with integrated graphics. AMD's integrated graphics are pretty good, and running my window manager and GUI on the integrated graphics saves the discrete GPU, and all of its VRAM, for the LLMs.
ASUS B650M-PLUS WIFI AM5 Motherboard
I didn't want a fancy motherboard with RGB, but I did want something compatible that supports modern peripherals.
Corsair Vengeance DDR5 64GB Memory
I got DDR5 memory for the speed, and 64GB because when the VRAM isn't enough for the LLM, some of the model gets loaded into system memory.
be quiet! Pure Rock 2 CPU Cooler
I don't plan to overclock this system, and the Pure Rock is well rated for being quiet and affordable.
Corsair RM750e (2023) Power Supply
Corsair 4000D Airflow Mid-Tower ATX Case
Although I would much prefer a small form factor case, that limits the options for the GPU and other components, so this is a compromise. The 4000D comes with two fans and has good airflow; when the system doesn't get as hot, the fans don't have to work as hard and the system stays quieter.
The 1TB model was out of stock, and the 2TB was still relatively well priced.
The total price for the system came under $1900 so I was able to stay in my budget range.
Assembly and Usage
glxinfo | grep "OpenGL renderer" # See which GPU the system is rendering with
sudo lsof /dev/dri/* # Show which processes are using each GPU