Not sure if this is the right place; if not, please let me know.

GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let’s keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.

Which GPU are you using to run which LLMs? How is the performance of the LLMs you have selected? On average, what size of LLM are you able to run smoothly on your GPU (7B, 14B, 20-24B, etc.)?

What GPU do you recommend for a decent amount of VRAM for the price (MSRP)? If you’re using the TOTL RX 7900XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.

My use case: code assistants for Terraform + general shell and YAML, plain chat, some image generation. And to still be able to pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA YOU JERK). I would prefer to have GPUs for under $600 if possible, but I also want to run models like Mistral Small, so I suppose I don’t have a choice but to spend a huge sum of money.

Thanks


You can probably tell that I’m not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.

  • skozzii@lemmy.ca
    English · 4 upvotes · 23 hours ago

    Hopefully once Trump crashes the economy we will see some bankruptcies and markets flooded with commercial GPUs as AI companies go under.

  • umami_wasabi@lemmy.ml
    English · 5 upvotes · edited · 1 day ago

    Using a 7900XTX with LMS. Speeds are everywhere and driver dependent. With QwQ-32B-Q4_K_M, I got about 20 tok/s, with all VRAM filled. Phi-4 runs at about 30-40 tok/s. I can give more numbers if you can wait for a bit.

    If you don’t enjoy finding out which driver works best, I strongly advise against running AMD for AI workloads.

  • j4k3@lemmy.world
    English · 6 upvotes · edited · 2 days ago

    Anything under 16 GB of VRAM is a no-go. The number of CPU cores you have is important. Use Oobabooga Textgen for an advanced llama.cpp setup that splits the model between the CPU and GPU. You'll need at least 64 GB of RAM or be willing to offload layers to NVMe with DeepSpeed. I can run up to a 72b model with 4-bit quantization in GGUF on a 12700 laptop with a mobile 3080Ti, which has 16 GB of VRAM (mobile is like that).
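
As a rough illustration of the arithmetic behind that kind of CPU/GPU split, here is a minimal sketch that estimates the size of the weights alone from the parameter count and the bits per weight. The function names and the `reserve_gib` allowance are hypothetical, and real 4-bit GGUF files store extra scale data, so they come out somewhat larger than this estimate.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def split_weights(params_billion, bits_per_weight, vram_gib, reserve_gib=2.0):
    """Return (GiB placed on the GPU, GiB left for system RAM).

    reserve_gib is a hypothetical allowance for context and scratch
    buffers; tune it for your own setup.
    """
    total = weights_gib(params_billion, bits_per_weight)
    on_gpu = min(total, max(vram_gib - reserve_gib, 0.0))
    return on_gpu, total - on_gpu


if __name__ == "__main__":
    total = weights_gib(72, 4)
    gpu, cpu = split_weights(72, 4, vram_gib=16)
    print(f"72B at 4 bits: {total:.1f} GiB total, "
          f"{gpu:.1f} GiB on GPU, {cpu:.1f} GiB in system RAM")
```

Under these assumptions, most of a 72b model ends up in system RAM on a 16 GB card, which is consistent with the advice to have plenty of RAM.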

    I prefer to run an 8×7b mixture of experts model because only 2 of the 8 experts are ever running at the same time. I am running that in 4-bit quantized GGUF and it takes 56 GB total to load. Once loaded it is about like a 13b model for speed but has ~90% of the capabilities of a 70b. The streaming speed is faster than my fastest reading pace.

    A 70b model streams at my slowest tenable reading pace.

    Both of these options are exponentially more capable than any of the smaller model sizes, even if you screw around with training. Unfortunately, this streaming speed is still pretty slow for most advanced agentic stuff. Maybe if I had 24 to 48 GB it would be different, I cannot say. If I were building now, I would be looking at which hardware options have the largest L1 cache and the most cores that include the most advanced AVX instructions. Generally, anything with efficiency cores is removing AVX, and because the CPU schedulers in kernels are usually unable to handle this asymmetry, consumer junk has poor AVX support. It is quite likely that all the problems Intel has had in recent years have been due to how they tried to block consumer stuff from accessing the advanced P-core instructions, which were only blocked in microcode. Using them requires disabling the e-cores or setting up CPU set isolation in Linux or BSD distros.
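
One lightweight way to approximate the CPU set idea on Linux is to pin the inference process to a chosen set of logical CPUs. The sketch below uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which are Linux-only. It is affinity pinning rather than full cpuset isolation, and `pick_cpus` and `pin_to` are hypothetical helper names. Which logical CPU numbers belong to P-cores is an assumption you have to check on your own machine, for example with `lscpu --extended`.

```python
import os


def pick_cpus(wanted, available):
    """Keep only the wanted logical CPUs that the OS actually exposes."""
    chosen = set(wanted) & set(available)
    if not chosen:
        raise ValueError("none of the requested CPUs are available")
    return chosen


def pin_to(wanted, pid=0):
    """Restrict a process (0 means the caller) to the chosen CPUs.

    Linux only: os.sched_setaffinity does not exist on other platforms.
    """
    available = os.sched_getaffinity(pid)
    chosen = pick_cpus(wanted, available)
    os.sched_setaffinity(pid, chosen)
    return chosen
```

For example, if logical CPUs 0-15 turn out to be the P-core threads on your system (an assumption, not a given), calling `pin_to(range(16))` before launching the model keeps this process and any children it starts afterwards off the other cores.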

    You need good Linux support even if you run Windows. Most good and advanced stuff with AI will be done with WSL if you haven’t ditched doz for whatever reason. Use https://linux-hardware.org/ to see support for devices.

    The reason I mentioned avoiding consumer e-cores is that there have been some articles popping up lately about all-P-core hardware.

    The main constraint for the CPU is the L2 to L1 cache bus width. Researching this deeply may be beneficial.

    Splitting the load between multiple GPUs may be an option too. As of a year ago, the cheapest option for a 16 GB GPU in a machine was a second-hand 12th gen Intel laptop with a 3080Ti, by a considerable margin when all of it is added up. It is noisy, gets hot, and I often hate it, wishing I had gotten a server-like setup for AI, but I have something and that is what matters.

    • marauding_gibberish142@lemmy.dbzer0.com (OP)
      English · 2 upvotes · 1 day ago

      I don’t mind multiple GPUs but my motherboard doesn’t have 2+ electrically connected x16 slots. I could build a new homeserver (I’ve been thinking about it) but consumer platforms simply don’t have the PCIe lanes for 2 actual x16 slots. I’d have to go back to Broadwell Xeons for that, which are really power hungry. Oh well, I don’t think it matters considering how power hungry GPUs are now.

      • j4k3@lemmy.world
        English · 2 upvotes · 1 day ago

        I haven’t looked into the issue of PCIe lanes and the GPU.

        I don’t think it should matter with a smaller PCIe bus, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. Like with Oobabooga when I load a model, most of the time my desktop RAM monitor widget does not even have the time to refresh and tell me how much memory was used on the CPU side. What is loaded in the GPU is around 90% static. I have a script that monitors this so that I can tune the maximum number of layers. I leave overhead room for the context to build up over time but there are no major changes happening aside from initial loading. One just sets the number of layers to offload on the GPU and loads the model. However many seconds that takes is irrelevant startup delay that only happens once when initiating the server.
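
The monitoring script itself is not shown in the thread, so the following is only a guess at what such a helper could look like on an Nvidia card. It reads memory usage through `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` and estimates how many layers fit. The function names, the equal-size-layers simplification, and the context reserve are all assumptions.

```python
import subprocess

# Query used and total memory in MiB, one CSV row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def max_layers(free_mib, layer_mib, context_reserve_mib):
    """How many equally sized layers fit after reserving room for context."""
    usable = free_mib - context_reserve_mib
    return max(usable // layer_mib, 0)


def read_gpu_memory():
    """Run nvidia-smi and return parsed memory usage for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(out.stdout)
```

A typical loop would call `read_gpu_memory()` after loading, compare the used figure with the total, and raise or lower the layer count on the next load until the remaining headroom matches what the context is expected to need.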

        So assuming the kernel modules and hardware support the narrower bandwidth, it should work… I think. There are laptops that have options for an external Thunderbolt GPU too, so I don’t think the PCIe bus is too baked in.
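
To put a number on the one-time loading cost, here is a small back-of-the-envelope sketch. It uses the commonly quoted usable throughput per PCIe lane and gives a best-case figure for the bus only. In practice the disk read speed is often the slower part, and the 14 GB payload is just an example value.

```python
# Approximate usable throughput per lane in GB/s, after encoding overhead.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def load_seconds(model_gb, gen, lanes):
    """Best-case seconds to copy model_gb gigabytes over a PCIe link."""
    return model_gb / (GBPS_PER_LANE[gen] * lanes)


if __name__ == "__main__":
    for lanes in (16, 8, 4):
        seconds = load_seconds(14, 4, lanes)
        print(f"PCIe 4.0 x{lanes}: {seconds:.1f} s for 14 GB")
```

Under these assumptions, even a x4 link moves a full 16 GB card's worth of weights in a few seconds, which supports the point that link width mostly affects startup time rather than steady-state generation.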

  • liliumstar@lemmy.dbzer0.com
    English · 3 upvotes · edited · 2 days ago

    I know you said consumer GPU, but I run a used Tesla P40. It has 24 GB of VRAM. The price has gone up since I got it a couple of years ago, so there might be better options in the same price category. Still, it’s going to be cheaper than a modern full-fat consumer GPU, with a reasonable performance hit.

    My use case is text generation, chat kind of things. In most cases, the inference is more than fast enough, but it can get slow when swapping out large context lengths.

    Mostly I run quantized 8-20B models, with the sweet spot being around 12B. For specialized use cases outside of general language, you can run more compact models. The general output is quite good, and I would never have thought it was possible 10 years ago.

    ETA: I paid about $200 USD for the P40 a couple of years ago, plus the price of a fan and a 3D-printed shroud.

  • AMillionMonkeys@lemmy.world
    English · 1 upvote · 1 day ago

    I would prefer to have GPUs for under $600 if possible

    Unfortunately that's not possible for a new Nvidia card (you want CUDA) with 16 GB of VRAM. You can get them for ~$750 if you’re patient. This deal was available for a while earlier today:
    https://us-store.msi.com/Graphics-Cards/NVIDIA-GPU/GeForce-RTX-50-Series/GeForce-RTX-5070-Ti-16G-SHADOW-3X-OC
    Or you could try to find a 16 GB 4070Ti Super like I got. It runs DeepSeek 14B and stuff like Stable Diffusion no problem.

    • marauding_gibberish142@lemmy.dbzer0.com (OP)
      English · 1 upvote · 1 day ago

      I am OK with either Nvidia or AMD, especially if Ollama supports it. With that said, I have heard that AMD takes some manual effort whilst Nvidia is easier. It depends on how difficult ROCm is.