TL;DR

A user combined an RTX 5080 and an RTX 3090 to run Qwen 3.6 27B at Q8 quantization at more than 80 tokens/sec. The result shows what mixed-generation consumer GPUs can deliver for local LLM inference.

A user reports that a dual-GPU system pairing an RTX 5080 with an RTX 3090 runs Qwen 3.6 27B at Q8 quantization at more than 80 tokens per second, a notable result for a locally hosted large language model (LLM) on consumer hardware.

The setup depends on specific hardware choices, BIOS settings, and driver modifications to make the two cards work together. The custom build uses an Asus Prime X570-Pro motherboard, a PCIe 4 riser, and patched NVIDIA drivers, and the user reports more than 80 tokens/sec with peaks around 90. The model, Huihui-Qwen3.6-27B-abliterated-ggml, is quantized at Q8 and fits within 39GB of VRAM.
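As a rough sanity check on the memory figures, the sketch below estimates the weight footprint of a 27B-parameter model at Q8. It assumes the llama.cpp Q8_0 layout (32 8-bit weights plus one 16-bit scale per block, so 8.5 bits per weight), a round 27 billion parameters, and the cards' nominal 16 GB and 24 GB capacities. The variant of Q8 the user ran is not stated in the report, and KV cache and compute buffers are not counted here.

```python
# Back-of-the-envelope VRAM estimate; all inputs are assumptions, not measurements.
Q8_0_BITS_PER_WEIGHT = 8.5  # 32 int8 weights + one fp16 scale per 32-weight block


def q8_weight_gb(n_params: float) -> float:
    """Approximate size of Q8_0 weights in decimal gigabytes."""
    return n_params * Q8_0_BITS_PER_WEIGHT / 8 / 1e9


weights_gb = q8_weight_gb(27e9)       # assumed parameter count: 27 billion
total_vram_gb = 16 + 24               # nominal RTX 5080 + RTX 3090 capacity
headroom_gb = total_vram_gb - weights_gb  # left for KV cache, buffers, display

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB of {total_vram_gb} GB")
```

Under these assumptions the weights alone take well under the combined capacity, and the rest of the reported footprint would be context cache and runtime buffers.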

Key BIOS steps were disabling CSM, enabling Above 4G Decoding, and setting ReSize BAR Support to Auto. On the software side, the user built llama.cpp with CUDA architectures covering both Ampere (RTX 3090) and Blackwell (RTX 5080), and with NCCL disabled, since it proved counterproductive when the two GPUs are different models.
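The report does not list the exact build flags, so the following is only a sketch of how a configure command targeting both GPU generations could be assembled. It assumes compute capability 8.6 for the RTX 3090 and 12.0 for the RTX 5080, the standard CMake variable CMAKE_CUDA_ARCHITECTURES, and llama.cpp's GGML_CUDA option. The helper name and build directory are hypothetical, and the NCCL setting is left out because the flag the user used is not given.

```python
import shlex

# Assumed compute capabilities, written the way CMake expects them.
CUDA_ARCHS = {
    "RTX 3090": "86",   # Ampere
    "RTX 5080": "120",  # Blackwell (consumer)
}


def cmake_configure_args(archs, build_dir="build"):
    """Build a cmake configure command for a CUDA build covering all given archs.

    Hypothetical helper for illustration; not taken from the user's build.
    """
    arch_list = ";".join(sorted(set(archs), key=int))
    return [
        "cmake",
        "-B", build_dir,
        "-DGGML_CUDA=ON",
        f"-DCMAKE_CUDA_ARCHITECTURES={arch_list}",
    ]


args = cmake_configure_args(CUDA_ARCHS.values())
# shlex.join quotes the ';' so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```

Compiling for both architectures in one binary avoids relying on runtime PTX compilation for either card. A CUDA toolkit recent enough to know the Blackwell target is a prerequisite.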

The user's timing logs show inference running at roughly 81-91 tokens/sec. This suggests that pairing an RTX 5080 with an RTX 3090 can substantially speed up local LLM workloads, even with a model quantized at Q8.
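For readers checking their own runs, throughput is the number of generated tokens divided by the elapsed generation time. The sketch below shows that arithmetic. The function name is hypothetical and the sample numbers are made up for illustration; they are not taken from the user's logs.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and elapsed time in milliseconds to tokens/sec."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / (elapsed_ms / 1000.0)


# Illustrative values only: 512 generated tokens in 6000 ms.
rate = tokens_per_second(512, 6000.0)
ms_per_token = 1000.0 / rate
print(f"{rate:.2f} tokens/sec, {ms_per_token:.2f} ms per token")
```

Generation speed and prompt-processing speed are usually reported separately, so it helps to compare like with like when setting a run against the figures above.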

Potential for High-Performance Local AI with Dual-GPU Setup

The result matters because it shows that two high-end consumer GPUs from different generations, the RTX 5080 and RTX 3090, can be pooled to run a large language model locally at useful speeds. That lets enthusiasts and researchers deploy capable models without cloud services, which cuts costs and keeps data under their own control. It also shows how much BIOS tuning and driver modifications contribute to getting the hardware to work together.

As an affiliate, we earn on qualifying purchases.


Advances in Multi-GPU AI Deployment Techniques

Over the past year, users have increasingly experimented with multi-GPU setups to speed up AI inference. Enthusiasts are now pairing the newer RTX 5080 with the RTX 3090, which is valued for its large VRAM, to push local LLM performance further. Earlier efforts focused on single-GPU configurations or less optimized multi-GPU arrangements, but this test indicates that the right BIOS and driver configuration can yield significant gains. The setup builds on ongoing community work to tune hardware and software for AI workloads, particularly PCIe configuration, driver patches, and model quantization.

“Achieving over 80 tokens/sec with this dual-GPU setup shows promising potential for local AI deployment at near-data-center speeds.”

— User/Tester


Uncertainties in Long-Term Stability and Compatibility

It is still unclear how stable this setup is under prolonged use or with other workload types. Compatibility problems may arise with other models or hardware configurations, and the driver modifications are not officially supported, which could cause system instability or driver failures. Further testing is needed to confirm long-term reliability and broader applicability.


Next Steps for Broader Adoption and Optimization

Testing across more models, workloads, and hardware combinations will help establish how stable and scalable this approach is. Software updates and driver improvements may also improve performance and compatibility. The community is expected to keep exploring BIOS tuning, driver patches, and software optimizations to replicate and extend these results, which could enable wider use in local AI deployments.


Key Questions

Can I replicate this setup with different GPU models?

The user combined an RTX 5080 and an RTX 3090, but compatibility and performance may vary with other GPU models. The BIOS, driver, and software configuration all need to be right, and some trial and error may be needed.

Is this performance achievable with other models or just Huihui-Qwen3.6?

The reported 80+ tokens/sec is specific to the Huihui-Qwen3.6-27B model with Q8 quantization. Results may differ with other models or quantization schemes.

What are the main technical challenges in setting up this system?

Key challenges include BIOS configuration, driver patching, ensuring PCIe and CUDA support, and managing hardware compatibility issues, especially when using different GPU models.

Will this setup work with other motherboards or only the Asus Prime X570-Pro?

The specific BIOS and PCIe configurations are tailored to the Asus Prime X570-Pro. Other motherboards may require different settings or may not support the same level of performance without additional modifications.

Source: Hacker News

