Signal of the day
Authored record of what was developed or published the previous day, cross-checked against signals from the ecosystem's recurring sources.
@sztlink contributions
- commit · sztlink / turboquant-cuda-bench: add DFlash adaptive draft bench — p-min=0.9 optimal, default 0.75 worst
- commit · sztlink / turboquant-cuda-bench: add RotorQuant Llama 3.1 8B bench — PPL confirmed, KV/speed not
- commit · sztlink / turboquant-cuda-bench: add RotorQuant vs TurboQuant bench results — Qwen3.6-27B
- commit · sztlink / turboquant-cuda-bench: bench(dflash): depth comparison — encoder constraint + accept rate drop at d1500
- commit · sztlink / turboquant-cuda-bench: bench(dflash): build 9304 — bug fixed, 60% accept rate, 25.7 t/s effective on 27B d0
- commit · sztlink / turboquant-cuda-bench: fix bench-dflash.bat: use llama-speculative.exe for DFlash (llama-bench has no -md flag)
- x · @sztlink / X: [@no_stp_on_snek update: build-head3 was CPU-only (99KB binary). needed the full CUDA build from the repo. now running properly — Axis A ref[29/30] on Qwen3.6-27B, CUDA binary has t](https://x.com/sztlink/status/2051407910941446336)
- x · @sztlink / X: RotorQuant bench (Llama 3.1 8B, RTX 4090): turbo3 prefill +27% at 16K ctx (8.7k vs 6.8k t/s) — but gen -39% (50 vs 82 t/s). planar3/iso3: crash at sampler (llama-sampler.cpp:1094),
- x · @sztlink / X: @davideciffa nice — curious about ddtree, does it need a specific backend or works with llamacpp? 4090 + Qwen3.6 at 65K ctx is our daily use case, happy to bench it
- x · @sztlink / X: @no_stp_on_snek on it — llama-completion.exe already in our build-head3 (May 3). selftest next, then full score on Qwen3.6-27B turbo3/q8_0 on RTX 4090. will report back
- x · @sztlink / X: DFlash adaptive draft, RTX 4090, Qwen3.6-27B: --draft-p-min 0.9: 71% accept, 7.0 t/s ✓ --draft-p-min 0.75 (default): 47% accept, 4.9 t/s ✗ Default p-min generates excess low-confid
- x · @sztlink / X: @no_stp_on_snek @no_stp_on_snek happy to run REFRACT on our RTX 4090 if you share the binary — we have the 27B dense and 35B-A3B running locally, no CUDA gap on our end
- x · @sztlink / X: RotorQuant Llama 3.1 8B on RTX 4090: PPL claim confirmed (iso3 6.82 vs turbo3 6.98). But K cache is 2× q8_0 and throughput is -82%. Not -28% as README claims. Hypothesis: deferred
- x · @sztlink / X: RotorQuant (planar3/iso3) vs TurboQuant (turbo3) on Qwen3.6-27B Q4_K_M, RTX 4090: turbo3 wins on all axes. PPL 6.15 vs 6.35+, KV 25 MiB vs 76 MiB (!), no throughput loss. RotorQuan
- x · @sztlink / X: TurboQuant + DFlash bench — RTX 4090 (SM89) https://t.co/vxamkcFoHK → q8_0-K + turbo3-V: NIAH 25/25 @65K, TG flat → DFlash b9304: b8909 bug fixed, 60% accept rate (draft=16)
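The percentage deltas quoted in the posts above can be re-derived from the raw throughput pairs. The sketch below is a minimal illustration, not part of the benchmark repo: `relative_change` is a hypothetical helper, and it assumes the second number in each "A vs B" pair is the baseline. Because the inputs are the rounded figures from the posts, a result can differ by about a point from the quoted percentage.

```python
def relative_change(measured: float, baseline: float) -> float:
    """Percent change of `measured` relative to `baseline` (hypothetical helper)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


# Rounded figures from the RotorQuant post (Llama 3.1 8B, RTX 4090).
# Assumption: the second value in each "A vs B" pair is the baseline.
prefill_change = relative_change(8.7e3, 6.8e3)  # turbo3 prefill at 16K ctx, t/s
gen_change = relative_change(50.0, 82.0)        # turbo3 generation, t/s

# KV cache sizes from the Qwen3.6-27B post: 76 MiB (RotorQuant) vs 25 MiB (turbo3).
kv_ratio = 76 / 25

print(f"prefill: {prefill_change:+.1f}%")  # prefill: +27.9%
print(f"gen:     {gen_change:+.1f}%")      # gen:     -39.0%
print(f"KV size ratio: {kv_ratio:.2f}x")   # KV size ratio: 3.04x
```

Keeping the raw pairs next to the derived percentage makes it easier to see when a quoted delta comes from rounded inputs rather than from the full-precision measurement.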
External signals of the day
- commit · TheTom / llama-cpp-turboquant: Merge pull request #122 from TheTom/fix/cve-2026-21869-n-discard-clamp
- commit · TheTom / llama-cpp-turboquant: server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) (#22267)
- issue · TheTom / llama-cpp-turboquant: #123 planar3/iso3 (johndpope fork): sampler assertion crash llama-sampler.cpp:1094
- issue · TheTom / llama-cpp-turboquant: #113 Feature Request: Upstream the TurboQuant implementation
- commit · TheTom / turboquant_plus / REFRACT: [[paper] Appendix A: document geometric-mean algebraic relation](https://github.com/TheTom/turboquant_plus/commit/80fc2f4c848d289b91a4bf16fa64d61039dc0634)
- commit · TheTom / turboquant_plus / REFRACT: release(refract): v0.3.2.3 — MLX KLD numerical stability + walkthrough fixes
- commit · TheTom / turboquant_plus / REFRACT: release(refract): v0.3.2.2 — --prompts defaults to bundled
- issue · TheTom / turboquant_plus / REFRACT: #89 Consider proper attribution for EDEN quantization
- issue · TheTom / turboquant_plus / REFRACT: #27 Upstream: TurboQuant discussion + contribution requirements for llama.cpp
What to watch now
- Speed from paper to fork to validation on real hardware.
- Quality as measured by REFRACT, not throughput alone.
- Fork claims checked against actual runs.
Author's note
- This edition combines a field radar with a contribution diary: what moved in the ecosystem and what @sztlink actually put into circulation on GitHub/X.
- The public text should take a working voice: method, doubt, evidence, and next experiment, without turning hypothesis into result.
Sources consulted
- TheTom / llama-cpp-turboquant
- TheTom / turboquant_plus / REFRACT
- llama.cpp discussion #20969
- sztlink / turboquant-cuda-bench
- TurboQuant paper
Next reading
TurboQuant is a daily radar and also a public contribution diary: it consults the ecosystem's recurring sources (TheTom, REFRACT, llama.cpp, the @sztlink benchmark, the TurboQuant paper, and adjacent forks) and records what @sztlink put into circulation on GitHub/X.