This paper narrows the benchmark question to a same-family deployment problem: what changes when Gemma 4 moves from a smaller variant served in BF16 on a workstation to a larger variant served 4-bit quantized on a lower-power GB10 system. That framing is operationally useful because it helps practitioners reason about the interplay between hardware memory bandwidth, model precision, and output quality: three factors that are often treated independently but interact in ways that produce counterintuitive results.
The contribution is threefold. First, the study demonstrates that the RTX 3090, despite being a 2020 Ampere GPU, outperforms the 2025 GB10 Grace-Blackwell SoC for LLM inference by roughly 4×, because autoregressive token generation is memory-bandwidth-bound and the RTX 3090's GDDR6X provides 936 GB/s versus the GB10's ~273–301 GB/s LPDDR5X. This confirms a known theoretical principle with practical empirical data and explains why the GB10 is better understood as a fine-tuning and development platform than as a high-throughput inference accelerator. Second, the study provides evidence that, at this scale, effective parameter count matters more for output quality than quantization precision: the larger quantized model (Gemma 4 E4B, Unsloth 4-bit Quantized) outscores the smaller full-precision model (Gemma 4 E2B, ~9.3 GB BF16) in judged benchmark quality, a finding with direct practical implications for deployment decisions. Third, it contextualizes the result with Qwen3.5-4B in BF16 on the same GB10 hardware, demonstrating that full precision at appropriate model scale delivers both higher speed and higher quality for vision tasks.
Gemma 4 E2B was served on an RTX 3090 workstation in BF16. Gemma 4 E4B (Unsloth 4-bit Quantized) was served on a GB10 DGX Spark system. The benchmark was initiated from a Mac client over HTTP, with GPU telemetry sampled every 5 seconds. The reported suite included basic Q&A, reasoning, coding, multilingual prompting, summarization, image description, color identification, transcription, and audio question answering.
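The measurement loop described above can be sketched in a few lines. The snippet below is a minimal client, assuming an OpenAI-compatible chat endpoint that returns a `usage.completion_tokens` field; the endpoint URL, payload shape, and field names are assumptions about the serving stack, not details published here.

```python
import json
import time
import urllib.request

def run_prompt(endpoint: str, payload: dict) -> tuple[int, float]:
    """POST one benchmark prompt and return (completion_tokens, elapsed_seconds).

    Assumes an OpenAI-compatible /v1/chat/completions endpoint; this is
    an illustrative sketch, not the paper's actual harness."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return data["usage"]["completion_tokens"], elapsed

def tokens_per_second(completion_tokens: int, elapsed: float) -> float:
    """Throughput metric as reported in the result tables."""
    return completion_tokens / elapsed
```

Per-task throughput is then just completion tokens divided by wall-clock latency, which is how the tok/s figures in the tables below should be read.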
| Item | Gemma 4 E2B | Gemma 4 E4B (Unsloth 4-bit Quantized) | Qwen3.5-4B reference |
|---|---|---|---|
| Primary deployment | RTX 3090 workstation | GB10 DGX Spark | GB10 DGX Spark |
| Precision mode | BF16 | bnb-4bit via Unsloth | BF16 |
| Modalities in run | Text, image, audio, video | Text, image, audio, video | Text, image |
| Benchmark coverage | 9/9 tasks | 9/9 tasks | 7/7 tasks |
| Observed average throughput | 48.5 tok/s | 11.6 tok/s | 12.4 tok/s |
| Observed average latency | 2.87s | 9.33s | 15.0s |
The benchmark compares two very different operating envelopes. The RTX 3090 offers far higher instantaneous throughput and dedicated VRAM, while the GB10 system emphasizes compactness, unified memory, and lower energy draw. Because the E4B system is also quantized, the study is best read as a deployment comparison rather than a pure architectural comparison between two unmodified checkpoints.
| Platform metric | RTX 3090 host | GB10 DGX Spark |
|---|---|---|
| GPU class | Ampere discrete GPU (2020) | Grace Blackwell GB10 SoC (2025) |
| Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB unified LPDDR5X (CPU+GPU) |
| Memory bandwidth | 936 GB/s (GDDR6X, 384-bit bus) | ~273–301 GB/s (LPDDR5X) |
| Bandwidth significance | Primary inference speed driver | 3.1–3.4× lower than RTX 3090 |
| Best workload fit | High-throughput inference | Fine-tuning / multi-model low-power serving |
| Cooling profile | Air cooled | Liquid cooled |
| Observed average power in run | 77.6 W for E2B | 27.7 W total for E4B + Qwen |
| Observed peak temperature | 53 C | 48 C |
Autoregressive language model inference, the process of generating one token at a time with each token conditioned on all previous tokens, has a computational profile structurally different from training. During the generation (decode) phase at batch size 1, the GPU must read every model parameter from memory to compute a single output token. For a model whose BF16 weights occupy approximately 9.3 GB (such as Gemma 4 E2B), generating each token therefore requires streaming roughly 9.3 GB across the memory bus.
The operative quantity is arithmetic intensity, defined as the ratio of floating-point operations to bytes of memory traffic. For autoregressive decoding at batch size 1, arithmetic intensity is approximately 1–2 FLOPs per byte. In contrast, the compute-to-bandwidth ratio (the roofline point) of a modern discrete GPU is typically 200–600 FLOPs per byte. Because measured arithmetic intensity (1–2) is orders of magnitude below the roofline (200–600), the workload is firmly memory-bandwidth-bound: the GPU cannot compute faster than data arrives from memory, regardless of how many tensor cores are present. Token generation throughput therefore scales approximately linearly with available memory bandwidth.
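The argument above can be made concrete with a few lines of arithmetic. The sketch below (a simplified model using figures quoted in the text; the helper names are ours) computes the decode-throughput ceiling and checks the workload against the roofline point:

```python
def decode_ceiling_tok_s(weight_bytes_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode throughput: every token must stream
    the full weight set once, so tok/s <= bandwidth / model size."""
    return bandwidth_gb_s / weight_bytes_gb

def is_memory_bound(ai_flops_per_byte: float, peak_tflops: float,
                    bandwidth_gb_s: float) -> bool:
    """A workload is memory-bound when its arithmetic intensity sits below
    the hardware roofline point (peak FLOPs / peak bytes per second)."""
    roofline = peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)
    return ai_flops_per_byte < roofline

# RTX 3090 serving E2B's ~9.3 GB of BF16 weights:
ceiling = decode_ceiling_tok_s(9.3, 936)  # ~100 tok/s hard ceiling
bound = is_memory_bound(2, 142, 936)      # True: decode is bandwidth-bound
```

The observed 48.5 tok/s average sits well under the ~100 tok/s bandwidth ceiling, as expected once attention-cache traffic, kernel launch overhead, and prefill time are included.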
The RTX 3090 uses 24 GB of GDDR6X memory with a peak bandwidth of 936 GB/s. GDDR6X is a high-speed dedicated graphics memory technology optimized specifically for maximum data transfer rate. It sits physically on the same PCB as the GPU die and is connected via a wide 384-bit bus.
The GB10 inside the NVIDIA DGX Spark uses 128 GB of LPDDR5X unified memory shared between the Grace CPU and Blackwell GPU components. LPDDR5X is a low-power double data rate memory designed for integrated and mobile platforms, trading peak bandwidth for energy efficiency and large capacity. The measured and estimated peak bandwidth is approximately 273–301 GB/s. The RTX 3090 therefore has a memory bandwidth advantage of approximately 3.1–3.4× over the GB10.
In the memory-bandwidth-bound inference regime, this bandwidth ratio translates directly into a throughput ratio: controlling for model size, the RTX 3090 should generate approximately 3.1–3.4× more tokens per second than the GB10 for the same model. The observed ratio in this benchmark is 48.5 / 11.6 ≈ 4.18×, somewhat above the pure bandwidth ratio; the additional gap is plausibly attributable to E4B's larger parameter count and the dequantization overhead of its 4-bit weights, though separating those contributions would require a controlled ablation.
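The ratio check reduces to simple division; the snippet below reproduces it with the figures quoted in the text:

```python
def throughput_ratio(fast: float, slow: float) -> float:
    """Expected decode-throughput ratio between two platforms in the
    bandwidth-bound regime."""
    return fast / slow

# Bandwidth ratio bounds from the quoted LPDDR5X range (GB/s):
bw_low = round(throughput_ratio(936, 301), 2)      # 3.11
bw_high = round(throughput_ratio(936, 273), 2)     # 3.43
# Observed end-to-end ratio from the benchmark tables (tok/s):
observed = round(throughput_ratio(48.5, 11.6), 2)  # 4.18
```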
The GB10's architectural design decisions are well-suited to the training workload profile rather than the inference workload profile. Training and fine-tuning require large matrix multiplications during the forward and backward passes over mini-batches: a compute-intensive workload where arithmetic intensity is high (100–1000 FLOPs/byte at typical batch sizes). This high arithmetic intensity places training workloads on the compute-bound side of the roofline, where tensor core throughput, not memory bandwidth, is the binding constraint.
The GB10 provides several training-centric advantages: 128 GB of unified memory enables full-precision fine-tuning of models that require more than 24 GB (the RTX 3090's limit); FP8 tensor core support enables high-throughput mixed-precision training with reduced memory footprint; NVLink-C2C provides a high-speed interconnect between the Grace CPU and Blackwell GPU for efficient gradient accumulation; and the 30W typical power draw enables sustained fine-tuning in power-constrained environments such as home labs or edge deployments. For inference at interactive batch sizes, these advantages do not offset the lower memory bandwidth.
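The capacity argument for fine-tuning can be sanity-checked with a back-of-the-envelope memory model. The sketch below uses common per-parameter byte counts for BF16 weights, BF16 gradients, and two FP32 Adam moment tensors; these defaults are assumptions, and activation memory is excluded entirely:

```python
def full_finetune_floor_gb(params_billions: float,
                           weight_bytes: int = 2,
                           grad_bytes: int = 2,
                           optimizer_bytes: int = 8) -> float:
    """Rough lower bound on full fine-tuning memory, in GB:
    weights + gradients + FP32 Adam moments per parameter,
    activations excluded. Byte counts are common defaults, not
    figures from this benchmark."""
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)

# A 4B-parameter model needs roughly 48 GB before activations: beyond
# the RTX 3090's 24 GB VRAM, comfortably inside the GB10's 128 GB.
floor = full_finetune_floor_gb(4.0)  # 48.0
```

This is why the same memory system that handicaps the GB10 for decode throughput is an asset for full-precision fine-tuning.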
This benchmark confirms the known architectural principle with direct empirical evidence: a five-year-old discrete GPU with high-bandwidth GDDR6X outperforms a modern unified-memory SoC for LLM inference, while the SoC's advantages — large capacity, low power, compute density — remain highly relevant for fine-tuning and multi-model deployment.
| Property | RTX 3090 | GB10 DGX Spark | Impact on inference |
|---|---|---|---|
| Memory bandwidth | 936 GB/s (GDDR6X) | ~273–301 GB/s (LPDDR5X) | Decisive — 3.4× RTX advantage → 4× tok/s advantage |
| Memory capacity | 24 GB VRAM dedicated | 128 GB unified (CPU+GPU) | GB10 wins — enables larger models without quantization |
| Memory type | GDDR6X (high bandwidth, low capacity) | LPDDR5X (high capacity, lower bandwidth) | GDDR6X optimized for inference throughput |
| GPU architecture | Ampere (2020) | Grace-Blackwell (2025) | Newer ≠ faster for bandwidth-bound workloads |
| Tensor core TFLOPS | ~142 TFLOPS (BF16) | ~1 PFLOPS (BF16, Blackwell) | Irrelevant at batch size 1 — compute is not the bottleneck |
| FP8 support | No | Yes | Significant for training, minimal for decode inference |
| Typical power (inference) | 77.6 W avg (observed) | ~14 W per model (observed) | GB10 far more efficient — 5.5× per token per watt |
| Best workload fit | High-throughput inference | Fine-tuning, multi-model low-power serving | Use RTX for inference, GB10 for training/finetune |
Gemma 4 E2B leads every major throughput category in this dataset. Its best text results are near 50 tok/s across reasoning, coding, multilingual prompting, and summarization. The Gemma 4 E4B (Unsloth 4-bit Quantized) deployment stays in the 11.5 to 13.0 tok/s range on most tasks. The same-family comparison therefore indicates that the throughput penalty from the lower-power quantized deployment is approximately 4× in this benchmark.
| Task | Gemma 4 E2B | Gemma 4 E4B (Unsloth 4-bit Quantized) | Qwen3.5-4B |
|---|---|---|---|
| Basic Q&A | 40.7 tok/s | 4.8 tok/s | 12.4 tok/s |
| Reasoning | 50.2 tok/s | 12.9 tok/s | 7.3 tok/s |
| Coding | 50.5 tok/s | 13.0 tok/s | 7.3 tok/s |
| Multilingual | 50.7 tok/s | 12.9 tok/s | 8.5 tok/s |
| Summarization | 50.2 tok/s | 12.7 tok/s | 17.0 tok/s |
| Image description | 49.4 tok/s | 12.4 tok/s | 17.5 tok/s |
| Color identification | 47.2 tok/s | 12.3 tok/s | 17.1 tok/s |
| Transcription | 47.9 tok/s | 11.7 tok/s | Not applicable |
| Audio Q&A | 49.9 tok/s | 11.5 tok/s | Not applicable |
The throughput gap is mirrored by a latency gap. Gemma 4 E2B averages 2.87 seconds across the full benchmark, while Gemma 4 E4B (Unsloth 4-bit Quantized) averages 9.33 seconds. Despite that difference, both Gemma deployments passed all nine benchmark tasks, which indicates that the quantized GB10 setup preserves functionality even when it sacrifices responsiveness.
| Metric | Gemma 4 E2B | Gemma 4 E4B (Unsloth 4-bit Quantized) | Qwen3.5-4B |
|---|---|---|---|
| Average latency | 2.87s | 9.33s | 15.0s |
| Pass rate | 9 / 9 | 9 / 9 | 7 / 7 |
| Audio support in run | Supported | Supported | Not supported |
| Video support in run | Supported | Supported | Not supported |
The efficiency result is the most important systems-level counterweight to the RTX throughput lead. The RTX 3090 run averaged 77.6 W and peaked at 219.8 W for a single model. The GB10 system averaged 27.7 W and peaked at 31 W while concurrently hosting both the Gemma 4 E4B (Unsloth 4-bit Quantized) endpoint and the Qwen reference endpoint. That is a materially smaller operational envelope.
| Metric | Gemma 4 E2B | Gemma 4 E4B (Unsloth 4-bit Quantized) |
|---|---|---|
| Average power draw | 77.6 W | ~27.7 W shared-system total |
| Peak power draw | 219.8 W | 31.0 W shared-system total |
| Average temperature | 46 C | 44.5 C |
| Peak temperature | 53 C | 48 C |
The same-family result supports a clean deployment split. E2B defines the throughput frontier within Gemma 4 for this benchmark. Gemma 4 E4B (Unsloth 4-bit Quantized) defines the compact-efficiency frontier by preserving full pass rate and multimodal coverage at a much smaller power and thermal envelope. Qwen3.5-4B further reinforces that the GB10 platform is useful for efficient text-and-vision serving even when it does not match workstation-class generation speed.
| Deployment regime | Preferred model | Why it stays on the frontier |
|---|---|---|
| Interactive multimodal workstation | Gemma 4 E2B | Highest throughput and lowest average latency in the benchmark |
| Low-power always-on local serving | Gemma 4 E4B (Unsloth 4-bit Quantized) | Maintains 9/9 pass rate with far smaller observed system power |
| Text-and-vision efficiency reference | Qwen3.5-4B on GB10 | Shows the same GB10 hardware can remain useful for efficient multi-model hosting |
Output quality was assessed using GPT (latest reasoning model) as an independent judge across three evaluation categories: Text Reasoning, Image Understanding, and Multimodal Reasoning. Each model response was scored on a 0–10 scale across dimensions of correctness, reasoning depth, faithfulness to input constraints, hallucination control, and instruction adherence. This methodology provides a more granular quality signal than binary pass/fail task completion and allows direct cross-model quality comparison on identical prompts.
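The aggregation from per-dimension scores to category scores is not specified here; as a hedged sketch, one plausible implementation averages the 0–10 dimension scores with equal weight:

```python
def category_score(dimension_scores: dict[str, float]) -> float:
    """Collapse per-dimension judge scores (0-10) into one category score
    by unweighted mean. Equal weighting is an assumption; the judge's
    actual aggregation formula is not published."""
    for name, score in dimension_scores.items():
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"{name} out of range: {score}")
    return round(sum(dimension_scores.values()) / len(dimension_scores), 1)

# Illustrative values only (not scores from this benchmark):
example = category_score({
    "correctness": 9.0,
    "reasoning_depth": 8.5,
    "faithfulness": 10.0,
    "hallucination_control": 9.5,
    "instruction_adherence": 9.0,
})  # 9.2
```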
The GPT-evaluated scores reveal a quality ranking that diverges from what quantization theory alone would predict. Qwen3.5-4B achieved the highest overall score at 9.4, driven by its exceptional performance in Image Understanding (9.8) and Multimodal Reasoning (9.7). Gemma 4 E4B (Unsloth 4-bit Quantized) ranked second with an overall score of 9.0, achieving the highest Text Reasoning score of any model (9.8) and demonstrating exceptional faithfulness and hallucination control. Gemma 4 E2B (BF16) ranked third with an overall score of 7.2, performing weakest on Image Understanding (6.8) and Multimodal Reasoning (7.2).
| Model | Text Reasoning | Image Understanding | Multimodal Reasoning | Overall Score |
|---|---|---|---|---|
| Qwen3.5-4B (BF16 · GB10) | 8.7 | 9.8 ★ | 9.7 ★ | 9.4 — 1st |
| Gemma 4 E4B (Unsloth 4-bit Quantized) (GB10) | 9.8 ★ | 8.5 | 8.8 | 9.0 — 2nd |
| Gemma 4 E2B (BF16 · RTX 3090) | 7.5 | 6.8 | 7.2 | 7.2 — 3rd |
The behavioral breakdown reveals distinct capability profiles. Gemma 4 E4B (Unsloth 4-bit Quantized) achieves perfect faithfulness (10.0) and near-perfect hallucination control (9.8) and instruction following (9.8), making it the most reliable model for high-stakes tasks requiring strict accuracy and constraint adherence. Qwen3.5-4B leads on visual precision (9.8) and reasoning depth (9.6), consistent with its architecture's strengths in multimodal grounding. Gemma 4 E2B shows the weakest behavioral scores across all categories, with visual precision at 6.5 and instruction following at 7.0.
| Behavior | Gemma 4 E4B (Unsloth 4-bit Quantized) | Qwen3.5-4B | Gemma 4 E2B (BF16) |
|---|---|---|---|
| Faithfulness to input | 10.0 ★ | 8.5 | 9.0 |
| Hallucination control | 9.8 ★ | 8.5 | 7.5 |
| Instruction following | 9.8 ★ | 9.0 | 7.0 |
| Visual precision | 8.5 | 9.8 ★ | 6.5 |
| Reasoning depth | 8.8 | 9.6 ★ | 7.0 |
A key finding from this quality analysis is that model capacity — measured in effective parameter count — has greater influence on output quality than quantization precision at this scale range. Gemma 4 E4B (Unsloth 4-bit Quantized) has 4 billion effective parameters compressed to 4-bit NF4, while Gemma 4 E2B has 2 billion effective parameters at full BF16 precision. Despite the quantization penalty on E4B, it scores 9.0 overall versus E2B's 7.2. The 2× parameter advantage of E4B compensates substantially for the quantization error, resulting in a net quality gain of 1.8 points overall and 2.3 points on Text Reasoning alone.
This does not imply that quantization is costless. The theoretical 3–6% degradation from 4-bit quantization is real — a BF16 version of E4B would likely score higher than 9.0. However, the practical deployment comparison here is between E2B at BF16 and E4B at 4-bit, not between E4B at BF16 and E4B at 4-bit. In this comparison, the parameter advantage dominates. Practitioners should not assume that choosing a smaller model in full precision will preserve quality relative to a larger model at 4-bit when the parameter count ratio is 2× or greater.
Qwen3.5-4B leads all models in both vision speed and vision quality: averaged over the two image tasks it reaches 17.3 tok/s versus E4B's 12.4 tok/s on identical hardware, roughly a 40% advantage, alongside the top image-understanding score of 9.8. Its multimodal reasoning score of 9.7, the highest in the benchmark, reflects deep integration of visual and textual reasoning. Qwen3.5-4B is described in its model card as prioritizing strong multimodal grounding, which this benchmark confirms empirically. For any workload requiring image analysis, visual question answering, or combined vision-language reasoning, Qwen3.5-4B on GB10 is the recommended choice among the three configurations tested.
Both Gemma variants completed the same nine-task multimodal benchmark with a 9/9 pass rate. However, the GPT-evaluated quality scores reveal a meaningful quality gap that pass/fail alone cannot capture. Gemma 4 E4B (Unsloth 4-bit Quantized) scored 9.0 overall versus Gemma 4 E2B's 7.2 — a 1.8-point gap driven primarily by E4B's superior text reasoning (9.8 vs 7.5), image understanding (8.5 vs 6.8), and multimodal reasoning (8.8 vs 7.2). The practical consequence is that users of E2B receive faster responses but meaningfully lower-quality outputs, particularly on complex multi-step tasks.
The benchmark reveals three distinct deployment frontiers rather than a simple two-way tradeoff. Gemma 4 E2B owns the speed frontier. Gemma 4 E4B (Unsloth 4-bit Quantized) owns the reliability frontier — highest faithfulness, lowest hallucination rate, best instruction adherence at low power. Qwen3.5-4B owns the quality frontier overall. No single model dominates on all dimensions, and the appropriate choice depends on whether the application is latency-sensitive, accuracy-critical, or vision-intensive.
Several limitations should be noted when interpreting the results. First, the comparison combines multiple confounding variables simultaneously: model size (E2B vs E4B), hardware platform (RTX 3090 vs GB10), and quantization method (BF16 vs 4-bit NF4). Isolating each factor independently would require additional experimental configurations — for example, running Gemma 4 E4B in BF16 on both platforms to separate the quantization effect from the hardware effect. This paper is therefore best interpreted as a deployment-oriented benchmark rather than a controlled ablation study.
Second, the power measurements on the GB10 platform are shared-system figures, as both Gemma 4 E4B (Unsloth 4-bit Quantized) and Qwen3.5-4B were active simultaneously; per-model power allocation is estimated rather than directly measured. Third, output quality is scored by a single LLM judge (GPT) on benchmark outputs rather than by large-scale blind human evaluation or comprehensive automated metric suites (MMLU, HumanEval, etc.). The quality findings reported here are directionally consistent with published quantization literature but should be validated with larger evaluation sets for any quality-critical deployment decision. Fourth, the GB10 memory bandwidth figure (~273–301 GB/s) is based on platform documentation and independent measurements; NVIDIA does not publish a single canonical GB10 peak bandwidth figure as it does for discrete GPUs.
This benchmark produces three findings that, taken together, form a coherent and practically actionable picture of LLM deployment under hardware constraints.
Finding 1 — Memory bandwidth determines inference throughput. The NVIDIA RTX 3090 (Ampere, 2020, 936 GB/s GDDR6X) outperforms the NVIDIA GB10 DGX Spark (Grace-Blackwell, 2025, ~273–301 GB/s LPDDR5X) by approximately 4× for LLM autoregressive decoding. This result follows directly from the memory-bandwidth-bound nature of token generation at batch size 1, where arithmetic intensity (~1–2 FLOPs/byte) is orders of magnitude below the GPU's compute-to-bandwidth roofline. The GB10 is better understood as a fine-tuning and development platform: its 128 GB unified memory capacity, FP8 tensor core support, and low power envelope make it excellent for training, but its LPDDR5X bandwidth is insufficient to compete with GDDR6X for inference speed.
Finding 2 — Effective parameter count outweighs quantization precision at this scale. Gemma 4 E4B (Unsloth 4-bit Quantized) produced higher judged output quality than Gemma 4 E2B in BF16 (~9.3 GB VRAM) across the evaluated categories, 9.0 versus 7.2 overall, even though 4-bit quantization imposes a real penalty of approximately 3–6% on standard benchmarks, with larger degradation on tasks requiring precise reasoning and code generation. The implication is direct: practitioners should not assume that a smaller model in full precision preserves quality relative to a larger 4-bit model when the parameter ratio is 2× or greater, while still preferring full precision whenever the larger model also fits available memory.
Finding 3 — Qwen3.5-4B leads overall quality with dominant vision and multimodal scores. Qwen3.5-4B achieved a 9.4 overall quality score, winning both Image Understanding (9.8) and Multimodal Reasoning (9.7) categories. It also outperformed Gemma 4 E4B (Unsloth 4-bit Quantized) on vision throughput by 40% (17.3 vs 12.4 tok/s) on identical hardware. Qwen3.5-4B is the recommended deployment choice for vision-heavy and multimodal workloads on the GB10 platform. Its weaker performance relative to E4B on pure text faithfulness and hallucination control (8.5 vs 9.8) means it is less appropriate for high-stakes text-only applications where reliability is the primary constraint.
Taken together, the three findings support a single unified deployment heuristic: match hardware to workload, run full precision wherever memory permits, and do not assume that a newer or larger model automatically produces better or faster results. Within the Gemma 4 family, E2B on the RTX 3090 defines the throughput frontier; Gemma 4 E4B (Unsloth 4-bit Quantized) on GB10 defines the compact always-on low-power frontier while also delivering the higher judged output quality. Both frontiers remain operationally relevant, but their respective trade-offs are now more precisely characterized than prior same-family benchmarks have provided.