Signal of the day
An authored log of what was developed or published the previous day, cross-checked against signals from the ecosystem's recurring sources.
Contributions from @sztlink
- x · @sztlink / X: @Zai_org Open weights is the part that matters to me here. A 1M context claim is only testable if I can put the KV cache under a quant probe and watch where retrieval breaks. I wil
- x · @sztlink / X: @NielsRogge @huggingface A 4B model trained just to explore a repo is a clean idea, but the cost in coding agents is still the KV cache you drag across the whole exploration. Small
- x · @sztlink / X: @ivanfioravanti fp4 on weights is the easy win, the interesting question is fp4 on the KV cache. That is where the long-context errors show up: retrieval and long-generation probes
- x · @sztlink / X: @NeoAIForecast Nice clean setup. At ctx 8192 with full offload the next lever is KV cache type: dropping K/V to q8_0 (or 4-bit if you are memory bound) frees enough VRAM to push co
- x · @sztlink / X: @fchollet Agreed, and the lever people underrate is the KV cache: at long context it grows past the weights, so inference efficiency becomes a KV-memory problem, not just FLOPs. KV
- x · @sztlink / X: @RedHat_AI True at the API layer, but the part the URL swap hides is the KV cache: the same kv-cache-dtype name behaves differently across the two. I ran the same probe on both and
- x · @sztlink / X: @pbastowski @LucianoLicelli Fair that aggressive quant compounds, but the data splits it finer: 4-bit K is near-lossless through 8k (vLLM + llama.cpp in my runs); it's 2-bit V wher
- x · @sztlink / X: @SeraAndroid Yeah, a lot. FP8 KV is the funny one: naive fp8 went to gibberish on adversarial retrieval (0/8 in my tests), calibrated scales recovered coherence but still missed th
- x · @sztlink / X: @bnjmn_marie This matches what I keep seeing with KV compression: the method passes accuracy/exact-match but quietly changes behavior, here token efficiency, elsewhere the greedy t
- x · @sztlink / X: @jun_song Same conclusion I keep hitting. The part that bites even harder than weight bandwidth: at long context the KV cache grows linearly and eats that same slow memory, so a 32
- x · @sztlink / X: @LucianoLicelli Exactly. Concrete: a 7B at 32k context has a KV cache that rivals or exceeds the weights themselves (GQA still ~18GB fp16), and 4-bit KV quant cuts that ~4x. On a 2
- x · @sztlink / X: @LucianoLicelli The math people skip: not just weights vs bandwidth. For agent loops the KV cache grows with context and eats unified memory, so long context is where you run out,
- x · @sztlink / X: @SeraAndroid Ran the same KV-quant probe across vLLM (KVarN/TurboQuant), llama.cpp and ollama. The engine changes the answer more than people think: vLLM reserves an fp16 KV pool t
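Several of the posts above lean on the same arithmetic: the KV cache grows linearly with context, and lowering the bit width of K/V shrinks it proportionally. A minimal sketch of that calculation follows. The function name `kv_cache_bytes` and the model shapes (32 layers, head dimension 128, 32 or 8 KV heads) are illustrative assumptions, not values taken from the posts, which do not state the exact configuration or batch size behind their figures.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bits_per_elem, batch=1):
    """Approximate KV cache size in bytes.

    Counts one K and one V tensor per layer and ignores any per-block
    scale metadata that quantized cache formats also store.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bits_per_elem / 8


# Hypothetical 7B-class shapes at 32k context (assumptions for illustration).
mha_fp16 = kv_cache_bytes(32, 32, 128, 32768, bits_per_elem=16)
gqa_fp16 = kv_cache_bytes(32, 8, 128, 32768, bits_per_elem=16)
mha_4bit = kv_cache_bytes(32, 32, 128, 32768, bits_per_elem=4)

print(mha_fp16 / GIB)       # 16.0
print(gqa_fp16 / GIB)       # 4.0
print(mha_fp16 / mha_4bit)  # 4.0
```

The 4x ratio here is the idealized one. Block-quantized formats store scales next to the values, so the saving measured on a real engine would land somewhat below it, and the absolute sizes scale with batch size and the number of KV heads.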
External signals of the day
- No automatic external signal was recorded in this window.
What to watch now
- How fast work moves from paper to fork to validation on real hardware.
- Quality as measured by REFRACT, not throughput alone.
- Fork claims checked against concrete runs.
Author's note
- This edition combines a field radar with a contribution diary: what moved in the ecosystem and what @sztlink actually put into circulation on GitHub/X.
- The public text should take a working voice: method, doubt, evidence and next experiment, without turning a hypothesis into a result.
Sources consulted
- TheTom / llama-cpp-turboquant
- TheTom / turboquant_plus / REFRACT
- llama.cpp discussion #20969
- sztlink / turboquant-cuda-bench
- TurboQuant paper
Next reading
TurboQuant is a daily radar and also a public contribution diary: it checks the ecosystem's recurring sources (TheTom, REFRACT, llama.cpp, the @sztlink benchmark, the TurboQuant paper and adjacent forks) and records what @sztlink put into circulation on GitHub/X.