Memory Bandwidth, Quantization Quality, and Inference Speed:
Gemma 4 E2B (BF16) vs Gemma 4 E4B / Unsloth (4-bit) Across RTX 3090 and GB10 DGX Spark
A Research Benchmark by DLYog Lab
Tarun Chawdhury  ·  Mousumi Chawdhury
DLYog Lab Research Services LLC
April 2026
Research Preview v2 — Updated with Memory Bandwidth Analysis, Quantization Quality Findings, and Vision Quality Results

Abstract

Background. Lightweight open multimodal models are increasingly deployed under hardware constraints that make architecture family, precision mode, and power envelope matter as much as raw answer quality. Same-family comparisons are especially useful because they isolate deployment effects from large cross-architecture differences. Critically, the interaction between memory bandwidth, quantization precision, and output quality is poorly understood in practice and often underestimated by practitioners selecting deployment hardware.

Methods. We benchmarked Gemma 4 E2B on an NVIDIA RTX 3090 in BF16 full precision (~9.3 GB VRAM footprint) and Gemma 4 E4B (Unsloth 4-bit Quantized) on an NVIDIA GB10 DGX Spark. The benchmark covered five text tasks, two vision tasks, and two audio tasks. Qwen3.5-4B was run concurrently on the GB10 platform in BF16 full precision (~9.5 GB) as a vision and efficiency reference. We recorded tokens per second, end-to-end latency, pass rate, output quality, power draw, temperature, and deployment notes.

Results. Three surprising findings emerge. First, the NVIDIA RTX 3090 — an Ampere-class GPU released in 2020 — outperformed the 2025 GB10 DGX Spark by approximately 4× in inference throughput (48.5 tok/s vs 11.6 tok/s), attributable to the RTX 3090's superior GDDR6X memory bandwidth (936 GB/s vs ~273–301 GB/s LPDDR5X). LLM autoregressive inference is memory-bandwidth-bound at batch size 1, making this bandwidth gap decisive regardless of the GB10's higher compute TFLOPS. Second, GPT-evaluated quality scores (0–10) reveal a clear ranking: Qwen3.5-4B scored 9.4 overall, Gemma 4 E4B (Unsloth 4-bit Quantized) scored 9.0, and Gemma 4 E2B (BF16) scored 7.2. Contrary to the intuition that full-precision inference preserves quality, the larger quantized model (E4B, 4B effective parameters at 4-bit) substantially outperformed the smaller full-precision model (E2B, 2B effective parameters at BF16). This finding indicates that model capacity — parameter count — has greater impact on quality than quantization precision at this scale range, with the 4-bit penalty on E4B well compensated by its 2× parameter advantage. Third, Qwen3.5-4B led all models in vision and multimodal quality (9.8 and 9.7 respectively) while Gemma 4 E4B (Unsloth 4-bit Quantized) led in text faithfulness and hallucination control (10.0 and 9.8).

Conclusion. The benchmark reveals three distinct deployment frontiers. Gemma 4 E2B on RTX 3090 is optimal when raw inference speed is paramount (48.5 tok/s). Gemma 4 E4B (Unsloth 4-bit Quantized) on GB10 is optimal when reliability, faithfulness, and hallucination control matter most (9.0 quality score, lowest power draw). Qwen3.5-4B on GB10 delivers the best overall quality (9.4) and is the preferred choice for vision-intensive and multimodal workloads. A unifying practical principle emerges: at these parameter scales, model capacity (parameter count) dominates output quality more than quantization precision — E2B BF16 scores 7.2 while E4B 4-bit scores 9.0.
Keywords: Gemma 4, Unsloth, quantization degradation, GB10, DGX Spark, RTX 3090, memory bandwidth, LLM inference bottleneck, arithmetic intensity, BF16 vs 4-bit precision, multimodal benchmarking, deployment efficiency, fine-tuning vs inference, Qwen3.5, vision quality

1. Research Contribution

This paper narrows the benchmark question to a same-family deployment problem: what changes when Gemma 4 is moved from a smaller BF16 workstation configuration to a larger 4-bit quantized deployment on a lower-power GB10 system. That framing is operationally useful because it helps practitioners reason about the interplay between hardware memory bandwidth, model precision, and output quality — three factors that are often treated independently but interact in ways that produce counterintuitive results.

The contribution is threefold. First, the study demonstrates that the RTX 3090 — despite being a 2020 Ampere GPU — outperforms the 2025 GB10 Grace-Blackwell SoC for LLM inference by ~4×, because autoregressive token generation is memory-bandwidth-bound and the RTX 3090's GDDR6X provides 936 GB/s vs the GB10's ~273–301 GB/s LPDDR5X. This result confirms a known theoretical principle with practical empirical data and explains why the GB10 is better understood as a fine-tuning and development platform than a high-throughput inference accelerator. Second, the study provides clear evidence that, at this scale, model capacity outweighs quantization precision: the larger quantized model (Gemma 4 E4B, Unsloth 4-bit Quantized) outscored the smaller full-precision model (E2B, ~9.3 GB BF16) in GPT-evaluated quality, 9.0 vs 7.2 — a finding with direct practical implications for deployment decisions. Third, it contextualizes the result with Qwen3.5-4B in BF16 on the same GB10 hardware, demonstrating that full precision at appropriate model scale delivers both higher speed and higher quality for vision tasks.

2. Experimental Setup

Gemma 4 E2B was served on an RTX 3090 workstation in BF16. Gemma 4 E4B (Unsloth 4-bit Quantized) was served on a GB10 DGX Spark system. The benchmark was initiated from a Mac client over HTTP, with GPU telemetry sampled every 5 seconds. The reported suite included basic Q&A, reasoning, coding, multilingual prompting, summarization, image description, color identification, transcription, and audio question answering.
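The client-side measurement loop can be sketched as follows. This is a minimal illustration of how tokens per second and end-to-end latency were derived per request; the `fake_generate` stub and its return shape are illustrative assumptions standing in for the actual HTTP call and response schema, which are not part of the published harness.

```python
import time

def measure(generate, prompt):
    """Time one generation call and derive the two reported metrics:
    end-to-end latency (seconds) and decode throughput (tokens/second)."""
    start = time.perf_counter()
    _text, n_tokens = generate(prompt)   # (output text, completion token count)
    latency = time.perf_counter() - start
    return {"latency_s": latency, "tok_per_s": n_tokens / latency}

def fake_generate(prompt):
    """Stand-in for the HTTP request to a model endpoint (hypothetical)."""
    time.sleep(0.05)                     # simulate decode time
    return "ok", 10

metrics = measure(fake_generate, "What is 2+2?")
```

In the real run, the same two numbers were recorded per task and averaged, with GPU telemetry sampled on a separate 5-second loop.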

Item Gemma 4 E2B Gemma 4 E4B (Unsloth 4-bit Quantized) Qwen3.5-4B reference
Primary deployment RTX 3090 workstation GB10 DGX Spark GB10 DGX Spark
Precision mode BF16 bnb-4bit via Unsloth BF16
Modalities in run Text, image, audio, video Text, image, audio, video Text, image
Benchmark coverage 9/9 tasks 9/9 tasks 7/7 tasks
Observed average throughput 48.5 tok/s 11.6 tok/s 12.4 tok/s
Observed average latency 2.87s 9.33s 15.0s

3. Hardware Platform Comparison

The benchmark compares two very different operating envelopes. The RTX 3090 offers far higher instantaneous throughput and dedicated VRAM, while the GB10 system emphasizes compactness, unified memory, and lower energy draw. Because the E4B system is also quantized, the study is best read as a deployment comparison rather than a pure architectural comparison between two unmodified checkpoints.

Platform metric RTX 3090 host GB10 DGX Spark
GPU class Ampere discrete GPU (2020) Grace Blackwell GB10 SoC (2025)
Memory model 24 GB dedicated GDDR6X VRAM 128 GB unified LPDDR5X (CPU+GPU)
Memory bandwidth 936 GB/s (GDDR6X, 384-bit bus) ~273–301 GB/s (LPDDR5X)
Bandwidth significance Primary inference speed driver 3.1–3.4× lower than RTX 3090
Best workload fit High-throughput inference Fine-tuning / multi-model low-power serving
Cooling profile Air cooled Liquid cooled
Observed average power in run 77.6 W for E2B 27.7 W total for E4B + Qwen
Observed peak temperature 53 C 48 C
Interpretive note. The GB10 measurement is a shared-system number because Gemma 4 E4B (Unsloth 4-bit Quantized) and Qwen3.5-4B were hosted simultaneously. Even with that caveat, the result is operationally significant: the full dual-model setup remained well below the single-model power observed on the RTX 3090 run.

3.5 Memory Bandwidth and LLM Inference Speed: Fundamentals

3.5.1 Why Inference Is Memory-Bandwidth-Bound

Autoregressive language model inference — the process of generating one token at a time, each conditioned on all previous tokens — has a structurally different computational profile from training. During the generation (decode) phase at batch size 1, the GPU must read every model parameter from memory to compute a single output token. For a model whose BF16 weights occupy approximately 9.3 GB (such as the Gemma 4 E2B deployment here), each generated token therefore requires transferring roughly 9.3 GB across the memory bus.

The operative quantity is arithmetic intensity, defined as the ratio of floating-point operations to bytes of memory traffic. For autoregressive decoding at batch size 1, arithmetic intensity is approximately 1–2 FLOPs per byte. In contrast, the compute-to-bandwidth ratio (the roofline point) of a modern discrete GPU is typically 200–600 FLOPs per byte. Because measured arithmetic intensity (1–2) is orders of magnitude below the roofline (200–600), the workload is firmly memory-bandwidth-bound: the GPU cannot compute faster than data arrives from memory, regardless of how many tensor cores are present. Token generation throughput therefore scales approximately linearly with available memory bandwidth.
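As a minimal sketch of that roofline argument (the 300 TFLOPS and 1 TB/s figures below are illustrative values chosen to land inside the 200–600 FLOPs/byte balance range quoted above, not measurements from this study):

```python
def is_memory_bound(arith_intensity, peak_flops, peak_bw_bytes):
    """A kernel is memory-bandwidth-bound when its arithmetic intensity
    (FLOPs per byte moved) falls below the machine balance, i.e. the
    ridge point of the roofline: peak FLOPs / peak bandwidth."""
    return arith_intensity < peak_flops / peak_bw_bytes

decode_ai = 2.0                    # batch-1 decode: ~1-2 FLOPs per weight byte
balance = 300e12 / 1e12            # 300 FLOPs/byte, inside the 200-600 range
assert is_memory_bound(decode_ai, peak_flops=300e12, peak_bw_bytes=1e12)
```

Because the inequality holds by two orders of magnitude, adding compute (more tensor cores, newer architecture) moves the ridge point further right without changing decode throughput at all.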

3.5.2 Implications for RTX 3090 vs GB10

The RTX 3090 uses 24 GB of GDDR6X memory with a peak bandwidth of 936 GB/s. GDDR6X is a high-speed dedicated graphics memory technology optimized specifically for maximum data transfer rate. It sits physically on the same PCB as the GPU die and is connected via a wide 384-bit bus.

The GB10 inside the NVIDIA DGX Spark uses 128 GB of LPDDR5X unified memory shared between the Grace CPU and Blackwell GPU components. LPDDR5X is a low-power double data rate memory designed for integrated and mobile platforms, trading peak bandwidth for energy efficiency and large capacity. The measured and estimated peak bandwidth is approximately 273–301 GB/s. The RTX 3090 therefore has a memory bandwidth advantage of approximately 3.1–3.4× over the GB10.

In the memory-bandwidth-bound inference regime, this bandwidth ratio translates approximately into a throughput ratio: controlling for model size, the RTX 3090 should generate roughly 3.1–3.4× more tokens per second than the GB10 for the same model. The observed ratio in this benchmark, 48.5 / 11.6 ≈ 4.2×, slightly exceeds the bandwidth ratio alone; the residual is plausibly attributable to the different model deployments (E2B in BF16 vs the larger E4B in 4-bit) and the different serving stacks on the two platforms.
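The ceiling arithmetic behind these ratios is simple division. A sketch using the figures quoted in this section; the ceiling is an upper bound, and the roughly 50% realized efficiency is an observation of this particular run, not a general constant:

```python
def ceiling_tok_per_s(bw_bytes_per_s, weight_bytes_per_token):
    """Bandwidth-bound decode ceiling: each token reads all weights once."""
    return bw_bytes_per_s / weight_bytes_per_token

# RTX 3090 serving Gemma 4 E2B in BF16 (~9.3 GB of weights):
rtx_ceiling = ceiling_tok_per_s(936e9, 9.3e9)   # ~100 tok/s upper bound
# Observed 48.5 tok/s is roughly half the theoretical ceiling.

bw_ratio_low, bw_ratio_high = 936 / 301, 936 / 273   # ~3.11x to ~3.43x
observed_ratio = 48.5 / 11.6                          # ~4.18x in this benchmark
```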

Observed vs predicted throughput ratio. The bandwidth ratio alone predicts a 3.1–3.4× gap; the observed ≈ 4.2× is modestly above that range, consistent with the additional deployment differences noted above. The broad agreement between theory and observation supports memory bandwidth as the primary determinant of inference throughput in this experimental configuration.

3.5.3 Why GB10 Excels at Fine-Tuning and Not Inference

The GB10's architectural design decisions are well suited to the training workload profile rather than the inference workload profile. Training and fine-tuning require large matrix multiplications during the forward and backward passes over mini-batches — a compute-intensive workload where arithmetic intensity is high (100–1000 FLOPs/byte at typical batch sizes). This high arithmetic intensity places training workloads on the compute-bound side of the roofline, where tensor core TFLOPS — not memory bandwidth — become the binding constraint.

The GB10 provides several training-centric advantages: 128 GB of unified memory enables full-precision fine-tuning of models that require more than 24 GB (the RTX 3090's limit); FP8 tensor core support enables high-throughput mixed-precision training with reduced memory footprint; NVLink-C2C provides a high-speed interconnect between the Grace CPU and Blackwell GPU for efficient gradient accumulation; and the 30W typical power draw enables sustained fine-tuning in power-constrained environments such as home labs or edge deployments. For inference at interactive batch sizes, these advantages do not offset the lower memory bandwidth.
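The capacity point can be made concrete with the standard back-of-envelope for full fine-tuning memory (BF16 weights and gradients plus FP32 Adam moments; activation memory excluded). The 4B-parameter case below is illustrative arithmetic, not a measurement from this study:

```python
def full_finetune_gib(n_params):
    """Rough optimizer-state footprint for full fine-tuning with Adam:
    BF16 weights (2 B) + BF16 grads (2 B) + FP32 Adam m and v (4 B + 4 B)
    per parameter. Activations are workload-dependent and excluded."""
    bytes_per_param = 2 + 2 + 4 + 4
    return n_params * bytes_per_param / 2**30

e4b_footprint = full_finetune_gib(4e9)   # ~44.7 GiB of states alone
# Exceeds the RTX 3090's 24 GB VRAM, but fits comfortably in the
# GB10's 128 GB unified memory pool.
```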

This benchmark confirms the known architectural principle with direct empirical evidence: a five-year-old discrete GPU with high-bandwidth GDDR6X outperforms a modern unified-memory SoC for LLM inference, while the SoC's advantages — large capacity, low power, compute density — remain highly relevant for fine-tuning and multi-model deployment.

Property RTX 3090 GB10 DGX Spark Impact on inference
Memory bandwidth 936 GB/s (GDDR6X) ~273–301 GB/s (LPDDR5X) Decisive — 3.4× RTX advantage → 4× tok/s advantage
Memory capacity 24 GB VRAM dedicated 128 GB unified (CPU+GPU) GB10 wins — enables larger models without quantization
Memory type GDDR6X (high bandwidth, low capacity) LPDDR5X (high capacity, lower bandwidth) GDDR6X optimized for inference throughput
GPU architecture Ampere (2020) Grace-Blackwell (2025) Newer ≠ faster for bandwidth-bound workloads
Tensor core TFLOPS ~71 TFLOPS dense BF16 (~142 with sparsity) ~1 PFLOPS (FP4 with sparsity, marketed figure) Irrelevant at batch size 1 — compute is not the bottleneck
FP8 support No Yes Significant for training, minimal for decode inference
Typical power (inference) 77.6 W avg (observed) ~14 W per model (estimated from 27.7 W shared) GB10 far more efficient — ~5.5× lower per-model power draw
Best workload fit High-throughput inference Fine-tuning, multi-model low-power serving Use RTX for inference, GB10 for training/finetune

4. Throughput Findings

Gemma 4 E2B leads every major throughput category in this dataset. Its best text results sit near 50 tok/s across reasoning, coding, multilingual prompting, and summarization. The Gemma 4 E4B (Unsloth 4-bit Quantized) deployment stays in the 11.5 to 13.0 tok/s range on most tasks, with basic Q&A (4.8 tok/s) as the one outlier. The same-family comparison therefore indicates that the throughput penalty from the lower-power quantized deployment is approximately 4× in this benchmark.

Task Gemma 4 E2B Gemma 4 E4B (Unsloth 4-bit Quantized) Qwen3.5-4B
Basic Q&A 40.7 tok/s 4.8 tok/s 12.4 tok/s
Reasoning 50.2 tok/s 12.9 tok/s 7.3 tok/s
Coding 50.5 tok/s 13.0 tok/s 7.3 tok/s
Multilingual 50.7 tok/s 12.9 tok/s 8.5 tok/s
Summarization 50.2 tok/s 12.7 tok/s 17.0 tok/s
Image description 49.4 tok/s 12.4 tok/s 17.5 tok/s
Color identification 47.2 tok/s 12.3 tok/s 17.1 tok/s
Transcription 47.9 tok/s 11.7 tok/s Not applicable
Audio Q&A 49.9 tok/s 11.5 tok/s Not applicable

5. Latency and Pass Rate

The throughput gap is mirrored by a latency gap. Gemma 4 E2B averages 2.87 seconds across the full benchmark, while Gemma 4 E4B (Unsloth 4-bit Quantized) averages 9.33 seconds. Despite that difference, both Gemma deployments passed all nine benchmark tasks, which indicates that the quantized GB10 setup preserves functionality even when it sacrifices responsiveness.

Metric Gemma 4 E2B Gemma 4 E4B (Unsloth 4-bit Quantized) Qwen3.5-4B
Average latency 2.87s 9.33s 15.0s
Pass rate 9 / 9 9 / 9 7 / 7
Audio support in run Supported Supported Not supported
Video support in run Supported Supported Not supported

6. Power and Thermal Efficiency

The efficiency result is the most important systems-level counterweight to the RTX throughput lead. The RTX 3090 run averaged 77.6 W and peaked at 219.8 W for a single model. The GB10 system averaged 27.7 W and peaked at 31 W while concurrently hosting both the Gemma 4 E4B (Unsloth 4-bit Quantized) endpoint and the Qwen reference endpoint. That is a materially smaller operational envelope.

Metric Gemma 4 E2B Gemma 4 E4B (Unsloth 4-bit Quantized)
Average power draw 77.6 W ~27.7 W shared-system total
Peak power draw 219.8 W 31.0 W shared-system total
Average temperature 46 C 44.5 C
Peak temperature 53 C 48 C
Operational takeaway. If the objective is a fast single-model workstation setup, E2B on RTX 3090 is clearly ahead. If the objective is a quieter and lower-power desk-side serving setup that can keep multiple lightweight models live at once, the GB10 deployment is more attractive even though it is slower.
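The speed-vs-power trade can also be expressed as energy per generated token, using the averages reported above. Note that the GB10 figure uses the shared-system power, so its per-model energy is an upper bound:

```python
def joules_per_1k_tokens(avg_watts, tok_per_s):
    """Average energy spent per 1,000 generated tokens."""
    return avg_watts / tok_per_s * 1000

rtx_e2b  = joules_per_1k_tokens(77.6, 48.5)   # ~1,600 J per 1k tokens
gb10_e4b = joules_per_1k_tokens(27.7, 11.6)   # ~2,388 J (shared-system upper bound)
```

Using the estimated ~14 W per-model share instead gives roughly 1,200 J per 1,000 tokens, below the RTX figure, which is why the efficiency ranking depends on how the shared power is attributed between the two hosted models.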

7. Deployment-Frontier Interpretation

The same-family result supports a clean deployment split. E2B defines the throughput frontier within Gemma 4 for this benchmark. Gemma 4 E4B (Unsloth 4-bit Quantized) defines the compact-efficiency frontier by preserving full pass rate and multimodal coverage at a much smaller power and thermal envelope. Qwen3.5-4B further reinforces that the GB10 platform is useful for efficient text-and-vision serving even when it does not match workstation-class generation speed.

Deployment regime Preferred model Why it stays on the frontier
Interactive multimodal workstation Gemma 4 E2B Highest throughput and lowest average latency in the benchmark
Low-power always-on local serving Gemma 4 E4B (Unsloth 4-bit Quantized) Maintains 9/9 pass rate with far smaller observed system power
Text-and-vision efficiency reference Qwen3.5-4B on GB10 Shows the same GB10 hardware can remain useful for efficient multi-model hosting

7.5 Quality Analysis: GPT-Evaluated Benchmark Scores

7.5.1 Evaluation Methodology

Output quality was assessed using GPT (latest reasoning model) as an independent judge across three evaluation categories: Text Reasoning, Image Understanding, and Multimodal Reasoning. Each model response was scored on a 0–10 scale across dimensions of correctness, reasoning depth, faithfulness to input constraints, hallucination control, and instruction adherence. This methodology provides a more granular quality signal than binary pass/fail task completion and allows direct cross-model quality comparison on identical prompts.

7.5.2 Overall Quality Results

The GPT-evaluated scores reveal a quality ranking that diverges from what quantization theory alone would predict. Qwen3.5-4B achieved the highest overall score at 9.4, driven by its exceptional performance in Image Understanding (9.8) and Multimodal Reasoning (9.7). Gemma 4 E4B (Unsloth 4-bit Quantized) ranked second with an overall score of 9.0, achieving the highest Text Reasoning score of any model (9.8) and demonstrating exceptional faithfulness and hallucination control. Gemma 4 E2B (BF16) ranked third with an overall score of 7.2, performing weakest on Image Understanding (6.8) and Multimodal Reasoning (7.2).

Model Text Reasoning Image Understanding Multimodal Reasoning Overall Score
Qwen3.5-4B (BF16 · GB10) 8.7 9.8 ★ 9.7 ★ 9.4 — 1st
Gemma 4 E4B (Unsloth 4-bit Quantized) (GB10) 9.8 ★ 8.5 8.8 9.0 — 2nd
Gemma 4 E2B (BF16 · RTX 3090) 7.5 6.8 7.2 7.2 — 3rd
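The overall column appears to be the unweighted mean of the three category scores rounded to one decimal; that aggregation is an assumption (the judging protocol does not state one), but it can be checked directly against the table:

```python
def overall(text, image, multimodal):
    """Unweighted mean of the three category scores, one-decimal rounding
    (assumed aggregation; the judging protocol does not specify one)."""
    return round((text + image + multimodal) / 3, 1)

assert overall(8.7, 9.8, 9.7) == 9.4   # Qwen3.5-4B
assert overall(9.8, 8.5, 8.8) == 9.0   # Gemma 4 E4B (Unsloth 4-bit Quantized)
assert overall(7.5, 6.8, 7.2) == 7.2   # Gemma 4 E2B (BF16)
```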

7.5.3 Behavioral Analysis

The behavioral breakdown reveals distinct capability profiles. Gemma 4 E4B (Unsloth 4-bit Quantized) achieves perfect faithfulness (10.0) and near-perfect hallucination control (9.8) and instruction following (9.8), making it the most reliable model for high-stakes tasks requiring strict accuracy and constraint adherence. Qwen3.5-4B leads on visual precision (9.8) and reasoning depth (9.6), consistent with its architecture's strengths in multimodal grounding. Gemma 4 E2B shows the weakest behavioral scores in most categories, with visual precision at 6.5 and instruction following at 7.0.

Behavior Gemma 4 E4B (Unsloth 4-bit Quantized) Qwen3.5-4B Gemma 4 E2B (BF16)
Faithfulness to input 10.0 ★ 8.5 9.0
Hallucination control 9.8 ★ 8.5 7.5
Instruction following 9.8 ★ 9.0 7.0
Visual precision 8.5 9.8 ★ 6.5
Reasoning depth 8.8 9.6 ★ 7.0

7.5.4 Model Capacity vs Quantization Precision

A key finding from this quality analysis is that model capacity — measured in effective parameter count — has greater influence on output quality than quantization precision at this scale range. Gemma 4 E4B (Unsloth 4-bit Quantized) has 4 billion effective parameters compressed to 4-bit NF4, while Gemma 4 E2B has 2 billion effective parameters at full BF16 precision. Despite the quantization penalty on E4B, it scores 9.0 overall versus E2B's 7.2. The 2× parameter advantage of E4B compensates substantially for the quantization error, resulting in a net quality gain of 1.8 points overall and 2.3 points on Text Reasoning alone.

This does not imply that quantization is costless. The theoretical 3–6% degradation from 4-bit quantization is real — a BF16 version of E4B would likely score higher than 9.0. However, the practical deployment comparison here is between E2B at BF16 and E4B at 4-bit, not between E4B at BF16 and E4B at 4-bit. In this comparison, the parameter advantage dominates. Practitioners should not assume that choosing a smaller model in full precision will preserve quality relative to a larger model at 4-bit when the parameter count ratio is 2× or greater.

7.5.5 Qwen3.5-4B Vision and Multimodal Quality

Qwen3.5-4B leads all models in both vision speed (+40% vs E4B on identical hardware at 17.3 tok/s vs 12.4 tok/s) and vision quality (9.8 image understanding score). Its multimodal reasoning score of 9.7 — the highest in the benchmark — reflects deep integration of visual and textual reasoning that characterizes its architecture. Qwen3.5-4B is described in its model card as prioritizing strong multimodal grounding, which this benchmark confirms empirically. For any workload requiring image analysis, visual question answering, or combined vision-language reasoning, Qwen3.5-4B on GB10 is the recommended choice among the three configurations tested.

8. Qualitative Observations

Both Gemma variants completed the same nine-task multimodal benchmark with a 9/9 pass rate. However, the GPT-evaluated quality scores reveal a meaningful quality gap that pass/fail alone cannot capture. Gemma 4 E4B (Unsloth 4-bit Quantized) scored 9.0 overall versus Gemma 4 E2B's 7.2 — a 1.8-point gap driven primarily by E4B's superior text reasoning (9.8 vs 7.5), image understanding (8.5 vs 6.8), and multimodal reasoning (8.8 vs 7.2). The practical consequence is that users of E2B receive faster responses but meaningfully lower-quality outputs, particularly on complex multi-step tasks.

Corrected interpretation: the quality-speed-efficiency three-way split

The benchmark reveals three distinct deployment frontiers rather than a simple two-way tradeoff. Gemma 4 E2B owns the speed frontier. Gemma 4 E4B (Unsloth 4-bit Quantized) owns the reliability frontier — highest faithfulness, lowest hallucination rate, best instruction adherence at low power. Qwen3.5-4B owns the quality frontier overall. No single model dominates on all dimensions, and the appropriate choice depends on whether the application is latency-sensitive, accuracy-critical, or vision-intensive.

9. Limitations

Several limitations should be noted when interpreting the results. First, the comparison combines multiple confounding variables simultaneously: model size (E2B vs E4B), hardware platform (RTX 3090 vs GB10), and quantization method (BF16 vs 4-bit NF4). Isolating each factor independently would require additional experimental configurations — for example, running Gemma 4 E4B in BF16 on both platforms to separate the quantization effect from the hardware effect. This paper is therefore best interpreted as a deployment-oriented benchmark rather than a controlled ablation study.

Second, the power measurements on the GB10 platform are shared-system figures, as both Gemma 4 E4B (Unsloth 4-bit Quantized) and Qwen3.5-4B were active simultaneously. Per-model power allocation is estimated rather than directly measured. Third, output quality is assessed by a single LLM judge (GPT) on a small set of benchmark outputs rather than through large-scale blind human evaluation or comprehensive automated metric suites (MMLU, HumanEval, etc.). The quality findings reported here are directionally consistent with published quantization literature but should be validated with larger evaluation sets for any quality-critical deployment decision. Fourth, the GB10 memory bandwidth figure (~273–301 GB/s) is based on platform documentation and independent measurements; NVIDIA does not publish a single canonical GB10 peak bandwidth figure as it does for discrete GPUs.

10. Conclusions

This benchmark produces three findings that, taken together, form a coherent and practically actionable picture of LLM deployment under hardware constraints.

Finding 1 — Memory bandwidth determines inference throughput. The NVIDIA RTX 3090 (Ampere, 2020, 936 GB/s GDDR6X) outperforms the NVIDIA GB10 DGX Spark (Grace-Blackwell, 2025, ~273–301 GB/s LPDDR5X) by approximately 4× for LLM autoregressive decoding. This result follows directly from the memory-bandwidth-bound nature of token generation at batch size 1, where arithmetic intensity (~1–2 FLOPs/byte) is orders of magnitude below the GPU's compute-to-bandwidth roofline. The GB10 is better understood as a fine-tuning and development platform: its 128 GB unified memory capacity, FP8 tensor core support, and low power envelope make it excellent for training, but its LPDDR5X bandwidth is insufficient to compete with GDDR6X for inference speed.

Finding 2 — Model capacity outweighs quantization precision at this scale. Gemma 4 E4B (Unsloth 4-bit Quantized), with 4B effective parameters, produced higher GPT-evaluated quality (9.0 overall) than Gemma 4 E2B in BF16 (~9.3 GB VRAM, 7.2 overall), despite the quality penalty of approximately 3–6% that 4-bit quantization imposes on standard benchmarks. The 2× effective-parameter advantage more than compensated for the quantization error. The implication is direct: practitioners should not assume that a smaller model at full precision will outperform a larger model in 4-bit quantization; when the parameter ratio is 2× or greater, the larger quantized model is the better default where output quality matters.

Finding 3 — Qwen3.5-4B leads overall quality with dominant vision and multimodal scores. Qwen3.5-4B achieved a 9.4 overall quality score, winning both Image Understanding (9.8) and Multimodal Reasoning (9.7) categories. It also outperformed Gemma 4 E4B (Unsloth 4-bit Quantized) on vision throughput by 40% (17.3 vs 12.4 tok/s) on identical hardware. Qwen3.5-4B is the recommended deployment choice for vision-heavy and multimodal workloads on the GB10 platform. Its weaker performance relative to E4B on pure text faithfulness and hallucination control (8.5 vs 9.8) means it is less appropriate for high-stakes text-only applications where reliability is the primary constraint.

Taken together, the three findings support a single unified deployment heuristic: match hardware to workload, prefer parameter count over precision when choosing within a family at this scale, and do not assume that newer hardware or full precision automatically produces better or faster results. Within the Gemma 4 family, E2B on RTX 3090 defines the throughput frontier; Gemma 4 E4B (Unsloth 4-bit Quantized) on GB10 defines the quality, reliability, and low-power frontier. Both frontiers remain operationally relevant, but their respective trade-offs are now more precisely characterized than prior same-family benchmarks have provided.

11. References

  1. Google DeepMind. Gemma 4 model materials and checkpoint documentation. HuggingFace Model Hub, 2025.
  2. Unsloth AI. Unsloth: Documentation for 4-bit and 8-bit quantized LLM fine-tuning and deployment workflows. unsloth.ai, 2024–2025.
  3. NVIDIA Corporation. NVIDIA GeForce RTX 3090 specifications — Ampere GA102 architecture. Technical Brief, 2020. Memory bandwidth: 936 GB/s (GDDR6X, 384-bit bus).
  4. NVIDIA Corporation. NVIDIA DGX Spark and GB10 Grace-Blackwell platform documentation. DGX Systems Technical Reference, 2025. Unified memory bandwidth: ~273–301 GB/s (LPDDR5X).
  5. Sheng, Y. et al. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. ICML 2023. [Arithmetic intensity analysis for LLM decode.]
  6. Pope, R. et al. Efficiently Scaling Transformer Inference. MLSys 2023. [Memory-bandwidth-bound inference analysis.]
  7. Frantar, E. et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. [4-bit quantization methodology and quality tradeoffs.]
  8. Red Hat AI. We ran over half a million evaluations on quantized LLMs. Red Hat Developer Blog, 2024. [Quantization quality degradation: 3–6% on standard benchmarks.]
  9. Qwen Team, Alibaba Cloud. Qwen3 Technical Report and Model Card. HuggingFace Model Hub, 2025.
  10. DLYog Lab. Gemma4-E2B vs Gemma4-E4B (Unsloth 4-bit Quantized) benchmark article and raw telemetry data. benchmark_v3_20260412_202354.json, April 12, 2026.