thc1006/qwen3.6-speculative-decoding-rtx3090

First public benchmark of llama.cpp speculative decoding on Qwen3.6-35B-A3B with a single RTX 3090 (post PR #19493 merge, 2026-04-19). 19 configurations covering ngram-cache, ngram-mod, and classic draft-model decoding with a vocabulary-matched Qwen3.5-0.8B. Finding: no variant achieves a net speedup on Ampere + A3B MoE. Includes raw JSON, plots, and full reproduction steps.
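Why a draft pipeline can fail to pay for itself follows from a simple cost model. The sketch below is not taken from this repo: the function name is hypothetical, the i.i.d. acceptance approximation is a simplification, and every timing number is a placeholder for illustration only. Substitute measured values from the repo's raw JSON to see how close a given configuration sits to breakeven.

```python
# Back-of-envelope model of speculative-decoding speedup.
# All timings are HYPOTHETICAL placeholders, not measurements
# from this repo's JSON.

def spec_decode_speedup(t_target: float, t_draft: float,
                        k: int, accept_rate: float) -> float:
    """Expected speedup of speculative decoding over plain decoding.

    t_target    -- target-model time per token (ms), plain autoregressive
    t_draft     -- draft-model time per token (ms)
    k           -- speculative tokens drafted per verification step
    accept_rate -- per-token acceptance probability (i.i.d. approximation)
    """
    # Expected tokens emitted per step: accepted drafts plus the one
    # token the target always contributes at the rejection/bonus position.
    expected_tokens = 1.0 + sum(accept_rate ** i for i in range(1, k + 1))
    # Cost per step: k draft-model tokens plus one batched target pass.
    # Treating verification of k+1 tokens as one target forward is
    # optimistic for a MoE target, where a wider batch activates
    # more experts and the verify pass is not free.
    step_cost = k * t_draft + t_target
    # Plain decoding would spend one target pass per emitted token.
    plain_cost = expected_tokens * t_target
    return plain_cost / step_cost

# Illustrative placeholder numbers: a fast A3B MoE target whose
# per-token decode is already cheap leaves the draft little to save.
print(spec_decode_speedup(t_target=25.0, t_draft=8.0, k=4, accept_rate=0.6))
```

With these placeholder inputs the model lands near 1.0x, consistent with the repo's headline finding: when the target's per-token decode is already cheap, drafting overhead consumes the gain from accepted tokens.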


On the radar — signal detected

Stars: 9 · Forks: 0 · Contributors: 0 · Language: Python

Score updated Apr 24, 2026
