Gemma 4 E2B vs Gemma 4 E4B (Unsloth 4-bit) vs Qwen3.5-4B:
Speed, Quality, and Reliability Across RTX 3090 and GB10 DGX Spark
Tarun Chawdhury  ·  Mousumi Chawdhury
DLYog Lab Research Services LLC
April 14, 2026

Abstract

We evaluated three open multimodal models — Gemma 4 E2B (BF16, RTX 3090), Gemma 4 E4B (Unsloth 4-bit, GB10 DGX Spark), and Qwen3.5-4B (BF16, GB10 DGX Spark) — across 16 structured test cases covering text, image, audio, and video modalities. Throughput: E2B led at 47.5 tok/s, 2.4× faster than E4B (20.2 tok/s) and 2.8× faster than Qwen (16.9 tok/s), driven by the RTX 3090's GDDR6X memory bandwidth advantage (936 GB/s vs ~273 GB/s LPDDR5X on GB10). Quality (scored 0–10 by Claude Sonnet 4.6 as AI evaluator): Qwen3.5-4B led overall at 8.5/10, particularly on text reasoning and image analysis. E4B scored 7.6/10; E2B scored 7.5/10. E2B was the strongest and most reliable on audio (8.5/10, 5/5 success vs E4B's 4/5 with one HTTP 500 failure). Key insight: model capacity and architecture matter more than quantization precision at this scale — E4B at 4-bit beats E2B at BF16 on image quality despite lower precision. All three models failed to identify a neural network architecture diagram, describing it as a generic network graph.
Keywords: Gemma 4, Qwen3.5, Unsloth quantization, RTX 3090, GB10 DGX Spark, multimodal evaluation, memory bandwidth, inference throughput, AI evaluation, response quality

1. Setup

Model | Hardware | Precision | Modalities | Endpoint
Gemma 4 E2B | RTX 3090 24GB (dlyog04) | BF16 | Text · Image · Audio · Video | port 9000
Gemma 4 E4B / Unsloth | GB10 DGX Spark (dgx1) | 4-bit NF4 | Text · Image · Audio | port 9001
Qwen3.5-4B | GB10 DGX Spark (dgx1) | BF16 | Text · Image | port 8002

Evaluation tool: run_eval.py (DLYog Lab, April 2026). Dataset: 5 text, 5 image, 5 audio, 1 video cases from test_data/dataset/. All responses saved to result.json for qualitative review. Quality scores assigned by Claude Sonnet 4.6 (Anthropic) acting as an independent AI evaluator.
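The per-model summary statistics in Section 2 can be reproduced from the saved per-case records. The sketch below is illustrative, not the actual run_eval.py: the field names (`ok`, `tokens`, `latency_s`) are assumptions about the result.json schema.

```python
# Hypothetical aggregation over per-case records (field names are
# assumptions; the actual run_eval.py / result.json schema may differ).
def summarize(results: list[dict]) -> dict:
    """Compute avg tok/s, avg latency, and success rate for one model."""
    ok = [r for r in results if r["ok"]]
    n = len(results)
    return {
        "avg_tok_s": round(sum(r["tokens"] / r["latency_s"] for r in ok) / len(ok), 1),
        "avg_latency_s": round(sum(r["latency_s"] for r in ok) / len(ok), 1),
        "success_rate": f"{len(ok)} / {n}",
    }

# Toy records: two successful cases and one hard failure (cf. audio_04).
cases = [
    {"ok": True, "tokens": 190, "latency_s": 4.0},
    {"ok": True, "tokens": 230, "latency_s": 5.0},
    {"ok": False, "tokens": 0, "latency_s": 0.1},
]
print(summarize(cases))
```

Failed cases are excluded from the throughput and latency averages but counted against the success rate, mirroring how the E4B audio_04 failure is reported in Section 2.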

2. Throughput and Reliability

Model | Avg tok/s | Avg latency | Success rate | Failures
Gemma 4 E2B | 47.5 | 4.6 s | 16 / 16 | None
Gemma 4 E4B / Unsloth | 20.2 | 11.2 s | 14 / 15 | audio_04: HTTP 500
Qwen3.5-4B | 16.9 | 17.7 s | 10 / 10 | None (text + image only)

The throughput gap between E2B and the GB10 models (2.4–2.8×) is attributable to memory bandwidth: the RTX 3090 provides 936 GB/s via GDDR6X versus the GB10's ~273 GB/s via LPDDR5X. Autoregressive token generation at batch size 1 is memory-bandwidth-bound (~1–2 FLOPs/byte arithmetic intensity, well below the GPU roofline), making bandwidth the decisive metric.
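As a sanity check on this argument, the bandwidth-bound decode ceiling can be sketched directly: at batch size 1, every generated token must stream all model weights from memory, so tok/s cannot exceed bandwidth divided by the weight bytes read per forward pass. Parameter counts and bytes-per-weight below are illustrative assumptions, and the ceiling is an upper bound, not an achieved figure.

```python
# Back-of-envelope roofline for single-stream decode: each token
# reads all model weights once, so throughput is capped at
# bandwidth / weight_bytes. Values below are nominal assumptions.
def decode_ceiling_tok_s(params_b: float, bytes_per_param: float, bw_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed implied by memory bandwidth."""
    weight_gb = params_b * bytes_per_param  # GB of weights streamed per token
    return bw_gb_s / weight_gb

# RTX 3090 (936 GB/s) serving a ~2B-param model in BF16 (2 bytes/param):
rtx3090_e2b = decode_ceiling_tok_s(2.0, 2.0, 936.0)
# GB10 (~273 GB/s) serving a ~4B-param model in 4-bit NF4 (~0.5 bytes/param):
gb10_e4b = decode_ceiling_tok_s(4.0, 0.5, 273.0)
print(rtx3090_e2b, gb10_e4b)
```

Both measured figures (47.5 and 20.2 tok/s) sit well below these ceilings, as expected once KV-cache reads, activation traffic, and kernel overhead are accounted for, but the ratio between the ceilings tracks the observed gap.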

3. Response Quality

3.1 Scores by Modality (Claude Sonnet 4.6, AI Evaluator)

Modality (n) | Gemma 4 E2B | Gemma 4 E4B | Qwen3.5-4B
Text (5) | 7.0 | 7.0 | 8.5 ★
Image (5) | 7.5 | 8.0 | 8.5 ★
Audio (5) | 8.5 ★ | 7.5 | N/A
Video (1) | 7.0 | N/A | N/A
Weighted overall | 7.5 | 7.6 | 8.5 ★

3.2 Case-Level Findings

The following case produced the clearest quality differential:

Illustrative contrast — audio_01 (solar system):
E2B: "The solar system contains eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Jupiter is the largest planet."
E4B: "The audio discusses the solar system and its planets."
E2B's response matches ground truth exactly. E4B's response is a summary that loses all enumerable content — significant for downstream retrieval or extraction tasks.
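The cost of the E4B summary for downstream use can be made concrete with a simple recall check over the enumerable ground-truth items, here the eight planet names. This is a minimal sketch, not the scoring rubric used by the AI evaluator.

```python
# Recall of enumerable ground-truth items in each model response.
# A summary that drops the enumeration scores zero, which is what
# makes it unusable for downstream retrieval or extraction.
GROUND_TRUTH = ["Mercury", "Venus", "Earth", "Mars",
                "Jupiter", "Saturn", "Uranus", "Neptune"]

def item_recall(response: str, items: list[str]) -> float:
    """Fraction of ground-truth items literally present in the response."""
    found = [it for it in items if it.lower() in response.lower()]
    return len(found) / len(items)

e2b = ("The solar system contains eight planets: Mercury, Venus, Earth, "
       "Mars, Jupiter, Saturn, Uranus, and Neptune. Jupiter is the largest planet.")
e4b = "The audio discusses the solar system and its planets."

print(item_recall(e2b, GROUND_TRUTH))  # 1.0
print(item_recall(e4b, GROUND_TRUTH))  # 0.0
```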

4. Key Findings

  1. Speed is hardware-determined, not model-determined. E2B on RTX 3090 (47.5 tok/s) is 2.4× faster than E4B on GB10 (20.2 tok/s) despite E4B having 2× more parameters. The RTX 3090's 936 GB/s GDDR6X bandwidth advantage over the GB10's ~273 GB/s LPDDR5X is the primary driver. A 2020 Ampere GPU outperforms a 2025 Grace-Blackwell SoC for single-stream inference.
  2. Model capacity outweighs quantization precision at this scale. E4B at 4-bit (4B parameters) achieves higher image quality (8.0) than E2B at BF16 (7.5), despite lower precision. The 2× parameter advantage compensates for quantization error. Practitioners should not assume full-precision inference yields superior output quality when the parameter-count ratio is 2× or greater.
  3. Qwen3.5-4B leads text and image quality on shared hardware. On identical GB10 hardware, Qwen (BF16, 8.5/10) outperforms E4B (4-bit, 7.6/10) in overall quality. BF16 precision at the same parameter scale as E4B, combined with Qwen's architecture, produces measurably better reasoning and visual analysis outputs.
  4. E2B is the only reliable choice for audio and video. E2B scored 8.5/10 on audio with 5/5 success. E4B scored 7.5/10 with one hard server failure. For production pipelines requiring audio transcription, E2B is the safe deployment choice. Video capability exists only in E2B.
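The capacity-versus-precision tradeoff in finding 2 also shows up in weight memory. A rough footprint calculation (nominal parameter counts; NF4 block-scale overhead approximated as ~0.5 extra bits per weight, not measured) illustrates that E4B at 4-bit doubles the parameter count while using less weight memory than E2B at BF16:

```python
# Approximate weight-storage footprint for each configuration.
# Parameter counts are nominal and the NF4 overhead (absmax scales
# per quantization block) is a rough assumption, not a measurement.
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB at a given effective precision."""
    return params_b * bits_per_param / 8

e2b_bf16 = weight_gb(2.0, 16)    # ~2B params at BF16
e4b_nf4 = weight_gb(4.0, 4.5)    # ~4B params at NF4 with ~0.5-bit overhead
print(round(e2b_bf16, 2), round(e4b_nf4, 2))
```

Under these assumptions the 4B 4-bit model fits in roughly half the weight memory of the 2B BF16 model, which is why capacity can be traded against precision at a fixed memory budget.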

5. Deployment Recommendations

Use case | Recommended model | Reason
Speed-critical / interactive | Gemma 4 E2B (RTX 3090) | 47.5 tok/s, 100% reliability, all modalities
Best text + image quality | Qwen3.5-4B (GB10) | 8.5/10 quality, strong reasoning depth
Audio / video transcription | Gemma 4 E2B (RTX 3090) | Only model with audio + video and a 100% success rate
Low-power multi-model serving | E4B + Qwen on GB10 | Both models together draw ~28 W on the shared GB10; efficient for always-on serving

6. Limitations

Quality scores are from a single AI evaluator (Claude Sonnet 4.6) on 16 cases — not large-scale human evaluation or standard automated benchmarks (MMLU, HumanEval). Scores are directionally reliable but should be validated on larger datasets for quality-critical decisions. The audio_04 E4B failure was a single occurrence; a larger run is needed to establish a reliable failure rate. Power measurements for GB10 are shared-system figures (E4B and Qwen ran concurrently).

References

  1. Google DeepMind. Gemma 4 model materials. HuggingFace, 2025.
  2. Unsloth AI. Unsloth 4-bit quantization documentation. unsloth.ai, 2024–2025.
  3. NVIDIA Corporation. RTX 3090 specifications. 2020. (936 GB/s GDDR6X).
  4. NVIDIA Corporation. DGX Spark GB10 platform documentation. 2025. (~273–301 GB/s LPDDR5X).
  5. Qwen Team, Alibaba Cloud. Qwen3 Technical Report. HuggingFace, 2025.
  6. Anthropic. Claude Sonnet 4.6 (claude-sonnet-4-6) — used as AI evaluator for response quality scoring in this paper. 2026.
  7. DLYog Lab. run_eval.py evaluation runner and result.json. April 14, 2026.