Gemma 4 E2B vs Gemma 4 E4B (Unsloth 4-bit) vs Qwen3.5-4B:
Speed, Quality, and Reliability Across RTX 3090 and GB10 DGX Spark
Tarun Chawdhury · Mousumi Chawdhury
DLYog Lab Research Services LLC
April 14, 2026
Abstract
We evaluated three open multimodal models — Gemma 4 E2B (BF16, RTX 3090), Gemma 4 E4B
(Unsloth 4-bit, GB10 DGX Spark), and Qwen3.5-4B (BF16, GB10 DGX Spark) — across 16
structured test cases covering text, image, audio, and video modalities.
Throughput: E2B led at 47.5 tok/s, 2.4× faster than E4B (20.2 tok/s) and
2.8× faster than Qwen (16.9 tok/s), driven by the RTX 3090's GDDR6X memory bandwidth
advantage (936 GB/s vs ~273 GB/s LPDDR5X on GB10).
Quality (scored 0–10 by Claude Sonnet 4.6 as AI evaluator):
Qwen3.5-4B led overall at 8.5/10, particularly on text reasoning and image analysis.
E4B scored 7.6/10; E2B scored 7.5/10. E2B was the strongest and most reliable on audio
(8.5/10, 5/5 success vs E4B's 4/5 with one HTTP 500 failure).
Key insight: model capacity and architecture matter more than quantization
precision at this scale — E4B at 4-bit beats E2B at BF16 on image quality despite lower precision.
All three models failed to identify a neural network architecture diagram, describing it as a
generic network graph.
Keywords: Gemma 4, Qwen3.5, Unsloth quantization, RTX 3090, GB10 DGX Spark,
multimodal evaluation, memory bandwidth, inference throughput, AI evaluation, response quality
1. Setup
| Model | Hardware | Precision | Modalities | Endpoint |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | RTX 3090 24GB (dlyog04) | BF16 | Text · Image · Audio · Video | port 9000 |
| Gemma 4 E4B / Unsloth | GB10 DGX Spark (dgx1) | 4-bit NF4 | Text · Image · Audio | port 9001 |
| Qwen3.5-4B | GB10 DGX Spark (dgx1) | BF16 | Text · Image | port 8002 |
Evaluation tool: run_eval.py (DLYog Lab, April 2026). Dataset: 5 text, 5 image,
5 audio, and 1 video case from test_data/dataset/. All responses saved to
result.json for qualitative review. Quality scores assigned by Claude Sonnet 4.6
(Anthropic) acting as an independent AI evaluator.
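To make the harness concrete, a single evaluation call reduces to roughly the following. This is a minimal sketch, not the actual run_eval.py source: the /v1/chat/completions route, the payload fields, and the model identifiers are assumptions; only the hosts and ports come from the setup table.

```python
import requests

# Hypothetical endpoint map mirroring the setup table. Routes and payload
# shape assume an OpenAI-compatible server; run_eval.py may differ.
ENDPOINTS = {
    "gemma4-e2b": "http://dlyog04:9000/v1/chat/completions",
    "gemma4-e4b-unsloth": "http://dgx1:9001/v1/chat/completions",
    "qwen3.5-4b": "http://dgx1:8002/v1/chat/completions",
}

def run_case(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send one text test case and return the raw JSON response."""
    resp = requests.post(
        ENDPOINTS[model],
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
        timeout=timeout,
    )
    resp.raise_for_status()  # surfaces hard failures such as HTTP 500
    return resp.json()
```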
2. Throughput and Reliability
| Model | Avg tok/s | Avg Latency | Success Rate | Failures |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 47.5 | 4.6s | 16 / 16 | None |
| Gemma 4 E4B / Unsloth | 20.2 | 11.2s | 14 / 15 | audio_04: HTTP 500 |
| Qwen3.5-4B | 16.9 | 17.7s | 10 / 10 | None (text+image only) |
The throughput gap between E2B and the GB10 models (2.4–2.8×) is attributable to memory
bandwidth: the RTX 3090 provides 936 GB/s via GDDR6X versus the GB10's ~273 GB/s via LPDDR5X.
Autoregressive token generation at batch size 1 is memory-bandwidth-bound: its arithmetic
intensity (~1–2 FLOPs/byte) sits far below the ridge point of the GPU's roofline, so sustained
memory bandwidth, not peak compute, is the decisive metric.
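A back-of-envelope roofline check makes this concrete: during decode, every active weight is streamed from memory once per token, so token rate is bounded by bandwidth divided by weight bytes. The sketch below assumes ~2B active parameters for E2B and ~4B for E4B and Qwen (the 2× ratio stated in Section 4); the counts are approximations, not official figures.

```python
def decode_ceiling(bandwidth_gbs: float, params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode tok/s when weight streaming dominates."""
    bytes_per_token = params_b * 1e9 * bytes_per_param  # weights read once per token
    return bandwidth_gbs * 1e9 / bytes_per_token

print(decode_ceiling(936, 2.0, 2.0))  # E2B, BF16, RTX 3090     -> ~234 tok/s ceiling
print(decode_ceiling(273, 4.0, 0.5))  # E4B, NF4, GB10          -> ~137 tok/s ceiling
print(decode_ceiling(273, 4.0, 2.0))  # Qwen3.5-4B, BF16, GB10  -> ~34 tok/s ceiling
```

Measured throughput lands well under these ceilings (KV-cache reads, activations, dequantization, and kernel-launch overhead all consume bandwidth budget), but the ordering the ceilings predict matches the observed one.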
3. Response Quality
3.1 Scores by Modality (Claude Sonnet 4.6, AI Evaluator)
| Modality (n) | Gemma 4 E2B | Gemma 4 E4B | Qwen3.5-4B |
| --- | --- | --- | --- |
| Text (5) | 7.0 | 7.0 | 8.5 ★ |
| Image (5) | 7.5 | 8.0 | 8.5 ★ |
| Audio (5) | 8.5 ★ | 7.5 | N/A |
| Video (1) | 7.0 | N/A | N/A |
| Weighted Overall | 7.5 | 7.6 | 8.5 ★ |
3.2 Case-Level Findings
The following cases produced the clearest quality differentials:
- text_02 (reasoning — sheep puzzle). E4B: "There are 9 sheep left" with no reasoning shown. E2B: "classic riddle" framing without explanation. Qwen: a numbered, step-by-step analysis of the "all but = all except" idiom. Qwen was clearly superior on pedagogical reasoning.
- text_05 (summarization — meditation). Only Qwen cited cortisol levels — matching the ground truth's scientific specificity. E2B and E4B produced accurate but generic responses.
- img_02 (cyberpunk city street). Only Qwen identified Chinese characters in the neon signs, correctly localising the scene as East Asian. E2B missed the cyberpunk/futuristic framing. E4B noted "Asian city like Tokyo."
- img_04 (home library). Qwen hallucinated two fireplaces; E4B described the room accurately — single fireplace, correct furnishings. E4B was the most reliable responder on this case.
- img_05 (neural network diagram). All three models described the image as a generic "network graph." None identified it as a neural network architecture. This is a shared limitation across the evaluated models.
- audio_01 (solar system). E2B produced a near-perfect transcript naming all eight planets. E4B gave only a summary paraphrase missing all specifics.
- audio_03 (medical clinical info). E4B produced a structured bullet list with proper clinical units (mmHg); E2B gave accurate prose. E4B's structured format is more suitable for clinical review pipelines.
- audio_04 (Stanford AI news). E2B: perfect transcription. E4B: HTTP 500 server failure — the only hard failure in the run.
Illustrative contrast — audio_01 (solar system):
E2B: "The solar system contains eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn,
Uranus, and Neptune. Jupiter is the largest planet."
E4B: "The audio discusses the solar system and its planets."
E2B's response matches ground truth exactly. E4B's response is a summary that loses all
enumerable content — significant for downstream retrieval or extraction tasks.
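The practical consequence is easy to demonstrate. A naive downstream extractor recovers every entity from E2B's transcript and none from E4B's paraphrase; the planet list below is hard-coded purely for illustration.

```python
import re

E2B = ("The solar system contains eight planets: Mercury, Venus, Earth, Mars, "
       "Jupiter, Saturn, Uranus, and Neptune. Jupiter is the largest planet.")
E4B = "The audio discusses the solar system and its planets."

PLANETS = {"Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"}

def extract_planets(text: str) -> set[str]:
    """Recover planet mentions with a simple word-boundary match."""
    return {p for p in PLANETS if re.search(rf"\b{p}\b", text)}

print(len(extract_planets(E2B)))  # 8 -- full enumeration recovered
print(len(extract_planets(E4B)))  # 0 -- the paraphrase loses every entity
```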
4. Key Findings
- Speed is hardware-determined, not model-determined. E2B on RTX 3090 (47.5 tok/s) is 2.4× faster than E4B on GB10 (20.2 tok/s) despite E4B having 2× more parameters. The RTX 3090's 936 GB/s GDDR6X bandwidth advantage over the GB10's ~273 GB/s LPDDR5X is the primary driver. A 2020 Ampere GPU outperforms a 2025 Grace-Blackwell SoC for single-stream inference.
- Model capacity outweighs quantization precision at this scale. E4B at 4-bit (4B parameters) achieves higher image quality (8.0) than E2B at BF16 (7.5), despite lower precision. The 2× parameter advantage compensates for quantization error. Practitioners should not assume full-precision inference yields superior output quality when the parameter-count ratio is 2× or greater; a reproduction sketch of the 4-bit configuration follows this list.
- Qwen3.5-4B leads text and image quality on shared hardware. On identical GB10 hardware, Qwen (BF16, 8.5/10) outperforms E4B (4-bit, 7.6/10) in overall quality. BF16 precision at the same parameter scale as E4B, combined with Qwen's architecture, produces measurably better reasoning and visual analysis outputs.
- E2B is the only reliable choice for audio and video. E2B scored 8.5/10 on audio with 5/5 success. E4B scored 7.5/10 with one hard server failure. For production pipelines requiring audio transcription, E2B is the safe deployment choice. Video capability exists only in E2B.
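For readers who want to reproduce the 4-bit side of this comparison, the standard Hugging Face + bitsandbytes NF4 configuration looks like the sketch below. This is a generic recipe, not the exact Unsloth loading path used in our runs, and the model identifier is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 weights with BF16 compute, matching the precision column in Section 1.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "google/gemma-4-e4b"  # placeholder, not a verified repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```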
5. Deployment Recommendations
| Use case | Recommended model | Reason |
| --- | --- | --- |
| Speed-critical / interactive | Gemma 4 E2B (RTX 3090) | 47.5 tok/s, 100% reliability, all modalities |
| Best text + image quality | Qwen3.5-4B (GB10) | 8.5/10 quality, strong reasoning depth |
| Audio / video transcription | Gemma 4 E2B (RTX 3090) | Only model with audio+video and 100% success rate |
| Low-power multi-model serving | E4B + Qwen on GB10 | Both models fit in a shared ~28 W envelope; efficient for always-on serving |
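Where these recommendations are baked into serving infrastructure, routing reduces to a lookup. A hypothetical sketch; the use-case labels are illustrative, and the model keys match the endpoint map sketched in Section 1.

```python
# Hypothetical router mirroring the table above; keys are illustrative only.
ROUTES = {
    "interactive": "gemma4-e2b",          # speed-critical, all modalities
    "text_image_quality": "qwen3.5-4b",
    "audio_video": "gemma4-e2b",          # only model covering audio + video
    "low_power_serving": "gemma4-e4b-unsloth",
}

def pick_model(use_case: str) -> str:
    """Map a workload class to the recommended model, defaulting to E2B."""
    return ROUTES.get(use_case, "gemma4-e2b")
```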
6. Limitations
Quality scores are from a single AI evaluator (Claude Sonnet 4.6) on 16 cases — not large-scale
human evaluation or standard automated benchmarks (MMLU, HumanEval). Scores are directionally
reliable but should be validated on larger datasets for quality-critical decisions. The audio_04
E4B failure was a single occurrence; a larger run is needed to establish a reliable failure rate.
Power measurements for GB10 are shared-system figures (E4B and Qwen ran concurrently).
References
- Google DeepMind. Gemma 4 model materials. HuggingFace, 2025.
- Unsloth AI. Unsloth 4-bit quantization documentation. unsloth.ai, 2024–2025.
- NVIDIA Corporation. GeForce RTX 3090 specifications (936 GB/s GDDR6X). 2020.
- NVIDIA Corporation. DGX Spark GB10 platform documentation (~273–301 GB/s LPDDR5X). 2025.
- Qwen Team, Alibaba Cloud. Qwen3 Technical Report. HuggingFace, 2025.
- Anthropic. Claude Sonnet 4.6 (claude-sonnet-4-6), used as the AI evaluator for response-quality scoring in this paper. 2026.
- DLYog Lab. run_eval.py evaluation runner and result.json. April 14, 2026.