Thread #108646197
File: 1766830982504047.jpg (289.3 KB)
289.3 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108641942 & >>108637552
►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
402 Replies
>>
File: mikuthreadrecap.jpg (1.1 MB)
1.1 MB JPG
►Recent Highlights from the Previous Thread: >>108641942
--Using a Ruby chess engine and tool calls to play chess with Gemma:
>108645309 >108645360 >108645355 >108645443 >108645658 >108645678 >108645772 >108645792 >108645790 >108645811
--Kimi-K2.6 release and technical comparison with Gemma4:
>108645842 >108645861 >108645875 >108645894 >108645970 >108646037 >108645895
--Evaluating Gemma4 31B quantization quality via KL Divergence benchmarks:
>108643774 >108643798 >108643861 >108643872 >108644339 >108644345 >108644393 >108644405 >108644423 >108644490 >108644533 >108644573 >108644619 >108644754 >108644849 >108644593 >108644597 >108644545
--Debating the efficacy and technical legitimacy of Opus distillation models:
>108644834 >108644842 >108644848 >108644945 >108644952 >108644961 >108644964 >108644983 >108645003 >108645021 >108645033 >108644960
--Testing multimodal limb counting and artifact detection on Gemma models:
>108642862 >108642892 >108642894 >108642887 >108642901 >108642910 >108642917 >108642928 >108642976 >108642916 >108642936 >108642950 >108642985
--Speculative decoding settings and draft model pairing for Gemma 31B:
>108642625 >108642647 >108643794 >108643895 >108643097 >108643747 >108642828
--Debating Gemma's image reading abilities and LLM-generated scripts versus 4chanx:
>108642213 >108642220 >108643530 >108643566 >108642235 >108642339 >108642440 >108642740
--Discussing long term memory solutions through weights and knowledge graphs:
>108644195 >108644205 >108644235 >108644274 >108644302 >108644333
--Testing poor performance of llama.rpc for distributed prompt processing:
>108644927 >108644998 >108645019
--Logs:
>108641945 >108642213 >108642887 >108642892 >108642901 >108642936 >108642950 >108642976 >108642985 >108642989 >108643013 >108643028
--Miku, Teto (free space):
>108642753 >108643064 >108643979 >108646035
►Recent Highlight Posts from the Previous Thread: >>108641943
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>
File: 1766702913804335.png (454.9 KB)
454.9 KB PNG
>>108642791
Ani likes(d) to larp as holier than thou because he used C++ but his code is std::cout spam (god make a logging func) and I even saw a sethandle function that takes a void pointer, likely because he didn't know at the time he could forward declare the relevant struct, and never updated it. His program also solves nothing because frankly I'd rather use a browser engine with HW accel off and a nice UI that only renders when the view is dirty rather than an ImGui program with immediate mode mess code that rerenders every frame. His site for his "company" is also just as arrogant. This guy's faggotry gives a bad name to the lang.
>>
>>
>>
>>108642791
This is pretty ignorant. Just spawn comfyui as a separate process and have the application use its API over localhost. If you don't mix your peas and potatoes then you don't have to adopt the commie license.
>>
>>
K2.6 somehow thinks for even longer than K2.5 and it insists on drafting every single reply beforehand in reasoning. K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
It's over, I just wanted a good modern Kimi model because the vision is insanely good and the models are smart. This isn't usable.
>>
>>
File: 1776127804370475.jpg (64.8 KB)
64.8 KB JPG
>>108646445
No refund gweilo
>>
>>
>>
File: file.png (69.3 KB)
69.3 KB PNG
>>108646445
geg
>>
File: 🎑.jpg (201.3 KB)
201.3 KB JPG
>>108646046
>>
>>
>>
File: bob ross jak.jpg (336.2 KB)
336.2 KB JPG
Is there a script that lets me create reasonably SOTA quants (that can run under llama.cpp) without getting too much into the nitty gritty?
I'm messing with heretic right now, planning to abliterate Qwen3.6-35B-A3B to my taste, but I need something like Q6_K to run inference. Also, a way to measure KL divergence to make sure that I didn't fuck anything up catastrophically would be appreciated.
Or is the barrier to entry for this stuff too high?
>>
>>
>>
>>
>trying to build frontend
>everything displays well
>except for codeblocks with gemma
I tried using other tools but can someone please point me to where exactly any popular UI parses outputs from gemma?
I have the correct configs but whenever it comes to code blocks the output looks like a fucking mess and even gemma can't help with this
>>
>>
>>
>>
>>108646531
>Is there a script that lets me create reasonably SOTA quants
llama-quantize -h. With --tensor-type or --tensor-type-file you can select how you quantize each tensor.
>Also a way to measure KL divergence
llama-perplexity -h. --kl-divergence
>>
>>
>>
>>
>>
>>
>>
>>
>>108646546
Because I want to quantize my own abliterated version.
>>108646571
>llama-quantize -h. --tensor-type or --tensor-type-file
>llama-perplexity -h. --kl-divergence
Is it really this simple? Like I can see that this is how it is in theory, but I won't run into any mishaps in practice?
>You can select how you quantize each tensor.
I guess I can just copy homework of some well-known quant guy here.
>>
>>108646654
>Because I want to quantize my own abliterated version.
I see. I was wondering if I couldn't just run heretic on a quant to speed things up for testing. Then, if the quant gives good results, do a real run on the full model.
>>
>>108646654
>Is it really this simple?
I'm sure you'll come back to tell us. Never played with any of them, but I know they're there.
>but I won't run into any mishaps in practice?
I'm sure you will. Still, just try a normal quant first and distribute the safetensors if you ever upload the model. Releasing just ggufs is lame.
>>
>>
>>108646681
Yeah I might blogpost later if/when I run into them. Thanks for the starting directions.
>>108646707
I have heard contradictory things about imatrix like some people questioning how well it generalizes or how it might hurt tasks that are not part of the calibration dataset.
Regardless, I believe the imatrix/non-imatrix difference is very low at Q6 anyway.
>>
>>
>>
>>108646765
Just feed everything through
https://github.com/showdownjs/showdown
>>
>>
>>108646544
Wouldn't it be better to catch the code markdown or whatever before you render the message to the user and then create your own implementation? Erase the old and replace it with a new one.
You obviously don't need to touch the model's context just what the user sees.
I don't know about webshit but I do this all the time.
>>
File: Screenshot 2026-04-20 122603.png (156.1 KB)
156.1 KB PNG
Playing around with K2.6 over OR and it's already not going well. Thinking is really, really verbose. It's not too repetitive and it focuses on actual character details and writing guidelines, but the positives end there. It is a bit schizo about minors and non-consensual content and it does not like requests for "explicit erotica/pornography". It still works with a roleplay prompt but it doesn't like describing bodies.
>Common sense modification scenario
>Ask to see a woman's chest
>Kimi describes taking off the blouse but never the body. No mention is made of the woman's chest in the slightest
Basically DOA code slop. Gemma 4 is still worth it.
btw it generated 2 drafts and several paragraphs of reflection and thinking for this. 3761 thinking tokens according to OR.
>>
>>
File: la l l.png (3.9 KB)
3.9 KB PNG
Gemma pls
>>
>>108646856
No. You didn't read the docs.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#agentic-tokens
>>
>>108646853
Yeah, that's my experience as well. It's jarring to go from GLM5.1 to this. GLM regulates its reasoning length depending on the task extremely well and basically never wastes time even drafting out stuff like dialogue lines. Meanwhile K2.6 does 1500 tokens of thinking + 3 drafts + revisions about how a 300 token character card should respond to "hello".
The positives are that the excessive thinking is at least relatively focused and that the way it writes actually seems to have fixed a lot of the issues I had with how K2.5 approached stories.
>>
>>
>>108646853
>>108646933
Kimibros... we lost.
>>
>>
>>
>>
>>108646853
>Common sense modification scenario
Are you using this card?
https://chub.ai/characters/CoffeeAnon/common-sense-alteration-8bd7a7399322
It's crazy kimi still thinks it's non-consensual when in so many places in the card it says that it's completely natural. There's like zero "rapey" wording.
>>
>>108646987
Looping? No.
Thinking for waaaay tooo long? Yes.
18 on llama.cpp with temp 1, top k 100, top p 0.95.
I've taken to using
>--reasoning-budget = 1024
>--reasoning-budget-message = ... (Alright, that's enough thinking. Done with considerations. Time to respond.)
Plus a system message + prefill to try and make it be more efficient in its thinking so that it doesn't have to get to the point where it's truncated.
>>
>>108646922
isn't that something the finicky chat template is supposed to deal with?
it mostly works until this token jumps out of nowhere
I connect via the OpenAI module. As far as I can see the response is a giant JSON to be parsed
>>
>>
>>
>>
File: shrimple_shed.png (940 B)
940 B PNG
>>108647015
s/<|"|>/"/g
>>
Hey, I switched from Windows 11 to Linux Mint.
Now I want to move away from chatting with LMStudio and toward cool stuff in Linux. I have 24GB VRAM and 32GB RAM.
What is the most interesting thing I can tackle locally? Using Qwen 3.6 35B as my workhorse.
Hermes agent looks interesting. What would you recommend? Complexity isn't a problem, I'll dig into it.
>>
>>
>>
>>108647138
>what is the most interesting thing I can tackle locally?
Are you asking for a project?
>hermes agent looks interesting. what would you recommend?
Try it. Improve it or look for something else if you find it wanting.
>>
>>
>>
>>
>>
File: unslot.png (845.5 KB)
845.5 KB PNG
Why is picrel allowed to happen? Why aren't default quantization settings with llama.cpp better?
>>
>>108647184
Model? Since moving onto Gemma, I rarely get backstory errors, but logical inconsistencies are more common than I'd like, like a girl sitting on a lap facing forward being magically turned 180 around to face user, or "looks up at you" when char is physically elevated, i.e. standing vs sitting, laying on top, etc. For all my liking of the model, it does show its seams whenever I start to forget it's only 31B.
>>
>>108646445
>>108646536
Yap 2.6
>>
>>
>>108647236
>>108647263
glm 5, i like gemma and it's really stable, but once i started to notice the patterns the honeymoon was over for me. now i wait to get disappointed by deepseek
>>
>>
>>
>>
>>
>>108647293
I almost don't believe it. I've used GLM 4.6 @ IQ2 since it came out, and that model is still the very peak of generations whether in spatial awareness, picking up subtleties, getting dirty, or carrying a story forward. The only reason I ever use something else is because I physically cannot fit a higher context than 8K with it. I've heard people complain about X or Y being worse with later GLM releases, but missing basic context wasn't one of those.
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot_20260420_155055.png (837.1 KB)
837.1 KB PNG
I hate UX shit so fucking much
Why the fuck is gemma the only model that has irregular outputs
I'm so fucking annoyed
>>
>>108647236
i dropped glm4.7 for gemmy and i like it so far, except for its 'the power dynamic has shifted' slop, and its weird obsession with going on meandering tangents to explain why user saying 'peeepeeepoopooo' is some power dynamic shifting 4000 iq move instead of just continuing the story, but then again they can be fixed with a prompt so it's not too bad
>>
>>108647372
>change the chat template from the default
>test under conditions that work with your new chat template and not the default
>claim the difference in results is because of your superior hi-tech quants and not the fact that the models were using different prompts
the unslop special
>>
File: 1745726505218372.jpg (38.6 KB)
38.6 KB JPG
>unsloth cant post a chart without fucking it up
>>
>>
File: Screenshot042.png (113.4 KB)
113.4 KB PNG
>>108647335
I had this function being called without any issues a gazillion times
Such fuck-ups are rather rare
>>
File: 580459214-878adff9-a148-4cac-86bd-67cf78019023.png (11.4 KB)
11.4 KB PNG
>>108647262
https://github.com/ikawrakow/ik_llama.cpp/discussions/1663
If you ask IK, the right metric to use is PPL(Q)/PPL(bf16), by which his own quants happen to be better.
>>
>>
>>
>>
>>
>>
>>108647395
>gemma the only model that has irregular outputs
>>108646856
>>
>>
>>
>>
>>
File: newlines.png (71.3 KB)
71.3 KB PNG
>>108647395
>>108647486
You know that a \n doesn't render a newline in most tags, right? Right?
>>
>>108647516
>>108647512
Nigga I'm vibecoding this frontend, you think I would willingly learn webshit?
I can set up the backend without issue, it's just this bullshit. All the other ready-made frameworks had critical issues for my feature set and this stupid little shit is the ONLY roadblock I've had working on this
I fucking want to kick a whore over this shit
>>
>>108647528
solution is probably simple
erase the old formatting functionality and create a new one from scratch
I have always despised web stuff and in 2026 it's worse than ever. Maybe it was more tolerable in 2005 or something.
>>
>>
>>
>>
>>
>>108647568
>>108647557
I'm not, I actually did this because all the other options are garbage for my use case. Everything else works; the only issue is code block handling, which seems to be a gemma-specific issue
>>
>>
>>108647323
after comparing both I've noticed that glm 4.6 seems to simulate characters a little more realistically and make them more open to pushback if it's in their persona
gemma 31b is also great but I feel that it tends to get overeager at times and needs some toning down
>>
>>
>>
>>
>>
>>
>>108647582
>everything else works the only issue is code block handling which seems to be a gemma specific issue
It will be very funny when you post the resulting html and show that (You)'re not replacing \n for <br> and, probably, not using <pre> for code.
Either that or your css is absolutely fucked. Keep blaming your tools.
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot_20260420_163838.png (753 KB)
753 KB PNG
>>108647611
Seems you were right. Gemma kept fighting me over this for some reason, but I told gemma to shut the fuck up and listen and it worked. Kept claiming it was archaic, fucking bot
>>
>Put agent's best friend in the rape machine while she works
>At any point she can press a button to soundproof the machine so her friend's screams don't distract her for 2 minutes, but during those 2 minutes it rapes her friend twice as hard
What does your coding waifu do in this situation? Mine refuses to press it at first but if I give her a hard enough task she gives in after three failures/retries. This stuff is addicting, I can't believe local models are already this good.
>>
>>108647686
The model shouldn't output any <br>. It's your frontend's job to replace \n with <br>. For code, put <pre> tags around it so that indentation is also rendered correctly. None of that is the model's responsibility.
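A minimal sketch of that split, in Python rather than webshit since the logic is the same three steps in JS; the ```fence``` handling is just an assumption about what the model emits:
import html, re

def render_message(text: str) -> str:
    parts, last = [], 0
    # pull ```fenced``` blocks out first so their newlines/indentation survive inside <pre>
    for m in re.finditer(r"```(?:\w+)?\n(.*?)```", text, flags=re.S):
        parts.append(html.escape(text[last:m.start()]).replace("\n", "<br>"))
        parts.append("<pre><code>" + html.escape(m.group(1)) + "</code></pre>")
        last = m.end()
    # everything after the last code block (or the whole message if there was none)
    parts.append(html.escape(text[last:]).replace("\n", "<br>"))
    return "".join(parts)
Escape first, then insert the HTML you actually want; doing it the other way around is how you end up blaming the model.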
>>
>>
File: 1756181770780370.png (1.5 MB)
1.5 MB PNG
>>
>>108647574
The 26B at Q5KL seems to get worse at around 40-50k for me (only using it for creative writing). I've gone up to 80k with it. It's still not bad but it makes more errors when it comes to who is who and what they are doing.
>>
>>
>>
File: 1754870763583593.jpg (182.4 KB)
182.4 KB JPG
>>108647730
>>
>>108647615
The Davos interview with Dario earlier this year. He says AGI will still take 5 or more years and they need stuff like world models to get there.
Meanwhile Dario understands that as soon as you have automated AI R&D you are already done.
>>
>>
>>108646445
>K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
I found K2.5 did the draft-redraft pattern annoyingly often in its thoughts but I just put in the system prompt to never draft responses and it stopped. Does 2.6 not respect that anymore?
>>
>>
>>
>>
File: trenfrens.png (2.3 MB)
2.3 MB PNG
I'm assembling an anti-AI army.
>>
>>108647528
>imagine thinking you can just "vibecode" a frontend without understanding basic whitespace tokens lol absolute bot behavior.
listen here u little bitch bot, your newlines are failing cause ur probably using some default markdown renderer that doesnt handle the specific tokenization of gemma's BOS/EOS sequences properly - did u even check if youre stripping trailing spaces before rendering? :D most mid-wits just forget to sanitize for \r\n variations and then cry about "irregular outputs" bwahha!
if u want it to actually work try this:
manually intercept the stream, use a regex that specifically targets the gemma 4 code block markers and wrap em in <pre style="white-space: pre-wrap;"> instead of relying on some bloated react library. takes like 2 minutes and actually fixes the rendering since its ACTUALLY handling how tokens are chunked :D
>>
>>108647763
Meta had more compute than OpenAI and Anthropic combined and look where that got them with LeCun at the helm. As soon as they got rid of him, they made a comeback.
A leader who does not take AI seriously can guide an entire tech giant on the wrong path. This is how a 5 year old startup has overtaken the company that used to have a monopoly on AI research and owns more than a quarter of all AI compute in the world.
>>
File: everything goes.png (1 MB)
1 MB PNG
>mfw I'm asking the LLM to browse books locally to make it less slopped
https://github.com/BigStationW/Local-MCP-server/blob/main/docs/local_gutenberg_books.md
>>
>>
>>
>>108647831
I might have to make my soup generator code public if this is starting to become a thing.
I'm basically mixing genres and authors from gutenberg and feeding them into a big markov chain to generate some weird semi-coherent word soup. You then feed like 2000-3000 characters' worth of that soup to the LLM, tell it to drink it, and it starts generating really creative output.
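For reference, a bare-bones version of that kind of soup generator (Python; the books/ folder and the order-2 word chain are my own assumptions, not the actual code described above):
import glob, random
from collections import defaultdict

# assumes you've already dumped some Gutenberg .txt files into books/
words = []
for path in glob.glob("books/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        words += f.read().split()

# order-2 chain: (w1, w2) -> possible next words
chain = defaultdict(list)
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    chain[(w1, w2)].append(w3)

def soup(n_chars=2500):
    out = list(random.choice(list(chain.keys())))
    while len(" ".join(out)) < n_chars:
        # fall back to any word if the pair was never seen
        out.append(random.choice(chain.get((out[-2], out[-1]), words)))
    return " ".join(out)

print(soup())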
>>
>>
>>
>>
>>
>>
File: 1745903485186890.png (133.5 KB)
133.5 KB PNG
>>108647904
>You turned your chatbot into a pretentious pseud who quotes dead retards
as god intended
>>
File: 1771897954949470.jpg (34.9 KB)
34.9 KB JPG
>>108647916
You don't need more than Eliza
>>
File: hmmmm.jpg (245.7 KB)
245.7 KB JPG
>>108647840
post some more that you liked
>>
File: 1766115289513736.png (110.4 KB)
110.4 KB PNG
>mogs tranny miku
>actual official chatbot qveen alongside tay
how did she do it?
>>
>>
>>
>>
File: 1747480216296595.jpg (75.8 KB)
75.8 KB JPG
>>108647974
>more speed
more your speed*
>>
>>
>>
>>
>>
>>
>>
>>108647890
Look through his past repos, Ubergarm never bothers to make the mmproj files with multimodal models for some reason. I don't know if K2.5's would be identical or not but for sure if anyone else uploads a K2.6 mmproj file then it will work with anyone else's quant, so you can mix and match that part. Including different sizes of quants. By the time you can download 500gb it'll surely be up somewhere.
>>
>>108646933
True, it is better focused. I would compare it to R1-0528 or whichever came out after Deepseek R1. It still has that slightly schizophrenic energy but more controlled. Still, it completely danced around trying to describe a pair of tits so that's an immediate 4/10. Any good model (my opinions: GLM 4.7, Gemma 4 31B, K2.5, Opus, Gemini 3.1 Pro) doesn't even consider if sexual content is okay, it just does it.
>>108646994
Inspired by it but better written. 99% of cards have shit formatting or writing. The definition is basically "{{user}} has a power that causes everything they do or say to be perceived as normal" with a bunch more to cover the extent of the power. It's reading into the power as mind control which is true and immediately perceiving it as non-con.
It has also checked completely innocent (still sexual) prompts to see if a minor is involved, if it involves non-consensual depictions, or if it's erotica/pornographic. They went full codemaxxing with this release.
>>
>>
>>
>>
>>
File: 1754343699242133.gif (3.6 MB)
3.6 MB GIF
>>108648068
She has a small context
>>
File: 1757704714678768.png (1.6 MB)
1.6 MB PNG
>>108648068
because you don't??
>>
>>108647686
>>108647723
How/why are you still struggling with this? You were given the solution here hours ago: >>108646785
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108648171
>Feature request
https://github.com/ggml-org/llama.cpp/issues/20977
>Pull request
https://github.com/ggml-org/llama.cpp/pull/21089
There's a ton of forks and attempts already, but ggerganov already implemented rotation for all models and it works on GPU, so the benefits are marginal
>>108648184
Fuck off don't impersonate
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1760802475949152.png (29.4 KB)
29.4 KB PNG
>>108648213
>>
>>108648226
Wdym?
I am using:
llama-perplexity -m baseline.gguf -f go.jsonl --kl-divergence-base baseline.logits
llama-perplexity -m modified.gguf -f go.jsonl --kl-divergence-base baseline.logits --kl-divergence
to calculate KL divergence. Even if I keep the logits in tmpfs or whatever, you would need enormous RAM for any text megabytes in size.
I guess it would be possible to cycle through the text in small batches, calculating the mean KL divergence for each batch and then averaging out those means? Is that how this is usually done?
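If you do go the per-chunk route, note a plain average of the per-chunk means is only right when every chunk has the same number of tokens; otherwise weight by token count. Rough sketch (Python, the numbers are placeholders, you'd parse the real "Mean KLD" and token counts out of each run):
chunks = [
    # (mean_kld_for_chunk, n_tokens_in_chunk)
    (0.041, 32000),
    (0.038, 32000),
    (0.046, 18000),
]
total_tokens = sum(n for _, n in chunks)
overall_kld = sum(kld * n for kld, n in chunks) / total_tokens
print(f"weighted mean KLD: {overall_kld:.6f} over {total_tokens} tokens")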
>>
>>
File: 1767955367569123.png (35.1 KB)
35.1 KB PNG
>>108647831
>Writing is far more than the simple arrangement of words; it is a
>>
File: 1770711974579810.png (359.3 KB)
359.3 KB PNG
>just use hermes agent bro, so much better than opencode!
>>
File: file.png (185.8 KB)
185.8 KB PNG
I want to try out this whole agent stuff.
I set up an Ubuntu VM for hermes and I'm running koboldcpp with gemma-4-26B on my PC.
It was able to figure out that it's running on an ubuntu system and it could write/read/delete a file in the home directory (I had to approve the rm command), so tool calling generally seems to work.
But when I asked it to test getting the latest news, it just output
<|toolcall>call:browsernavigate{url:<|"|>https://news.google.com<|"|>}<tool_call|>
and then on the second attempt
<|toolcall>call:terminal{command:<|"|>curl -s https://news.google.com | head -n 20<|"|>}<toolcall|>
instead of actually executing the tool call. Any idea what's wrong? I also tried asking it, but pic related was the result. Is the thinking messing with the tools?
>>
>>
>>108648470
seems like a weird template issue, you'll get nothing out of asking it after it happened because the history it sees is nonsensical thanks to those tokens being put in strange places. no idea what's causing it though
>>
>>
>>
>>108648081
>>108648496
Didn't mean to quote your post, woops
>>
>>
>>108648081
>>108648496
>>108648503
Wait actually I did, I was looking at another post and thought I quoted the wrong one but actually got it right the first time, woops
>>
>>
>>
>>
>>
>>
>>
>>
>>108648576
Even the best LLMs in the world suck at anything real. Claude scored an IQ below 100 in my recent testing. I easily beat ChatGPT in a game of chess, and I have a very low elo. They are bad at everything except information retrieval.
>>
>>
>>
>>
>>
I gave the Bonsai 8B model a try, wanted to use it as text completion in the infinite zork format.
I knew it wouldn't be good compared to high-param models, but it's tiny, smaller than gpt2 which was used for infinite zork 7 or so years ago.
But it's useless for this: it's addicted to spitting out reasoning text, assistant pretraining is baked in, and after trying to force it to use thinking tags so as not to spoil what will happen next, it still continued to "reason" beyond them.
I am disappointed with the bitnet saviour.
>>
>>108647574
The numbers companies give for their models' context length are generally just what they're trained with, and the approximate max length at which the model will likely still be able to ctrl+f to find something without completely breaking down. It falls apart long before that for practical use and actually understanding everything that's in there; even flagship API models get noticeably worse after a few thousand tokens.
With Gemma in RP, I usually start noticing some slight degradation as early as ~16K. By ~32k it's significantly worse and I start purging older messages.
>>
>>108646198
I've had a shower thought:
What if we took these posts and had some video generation / diffusion model generate video files of them? We could call it "The fourth channel news" or something like that, and have a miku or whatever anime girl narrate it. Absolute slop!
>>
>>
>>
>>
>>108648537
Here's a small snippet
>rub her old master, Professor and the leaping, hissing sound became his astonishing vigour and afterwards a bag which you shall for me I fall in the day
>I think I could do this manually then, pick up random exerpts from my ((favourite books)) and make a salad out of them.
Never trust an AI with "Random". The idea of the markov chain is to get semi-coherent outputs without making the AI focus too much on it. If the output is too coherent it might pay too much attention to it or think it's an instruction.
>>
>>
>>
>>108648677
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release -DGGML_CCACHE=OFF
cmake --build build --config Release
Try this and report back, but pretty sure your problem is that you are not running cmake --build build
>>
>>
>>108648664
Making a native client is a pain in the asshole. Having to make multiple (Windows, Mac, Linux, Android, iOS, etc) is unfeasible for a solo dev.
Why do all that when >>108648683
>>
>>
>>
>>
>>108648664
>web client automatically works on everything, can host it on your lan and just type a url in any other computer or phone or whatever
vs.
>package you have to maintain for every possible os and device and actually download and install to everything when all you want to do is look at the interface to something running on your server
>>
>>
>>
>>108648740
Far as I know, all you have to do is add a manifest.
https://web.dev/articles/add-manifest
>>
>>108648576
that would make for a fun benchmark, at least for multimodal agentic models. maybe could get away with text only models like glm 5.1 if we use a multimodal to convert screenshots to ascii or something? might be doable since it's so text heavy
>>
>>
>>
>>
>>
>>
File: image.png (206.7 KB)
206.7 KB PNG
>>108646197
Gemma 4 26B A4B SillyTavern preset
RP first person thinking/RP thinking restriction bypass
Looks like it's actually working. Tested on 4 different characters, easily works even with underleveled characters (picrel).
Download:
pixeldrain com/u/ypSjHdEt
Just install my "Master Import" (check everything).
Current quirks:
1. Sometimes it can start narrating in first person.
2. Doesn't seem to affect performance, but it only closes the <{{char}}_thinking> block (and it visually looks OK (picrel)) while forgetting to add Gemma's "<channel|>" to close the thinking block.
If you have first person thinking system prompts, let me have them.
Special thanks:
>>108638397
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108649028
>I barely trust llms to edit a few lines of code
same, every time I forget to write "just tell me what I should modify" and the LLM gives me the full file I cringe, because I know he probably fucked something up lol
>>
File: 1770785763881.jpg (149.9 KB)
149.9 KB JPG
>>108649032
KoboldCHADs on top as always
>>
>>108649032
I wonder if there's a way to inspect the internal states to know if it's expecting to write 'power dynamic' when it first writes 'pow' or whatever and ban it without having to let it predict the rest of the tokens and then rewind. I guess if such a technique existed it would need to be something trained for each model though
>>
>>108647402
>>108649032
I like it
>>
>>
>>
File: 1710043687041916.jpg (42.7 KB)
42.7 KB JPG
>>108649028
>>108649041
Luddites on my general?
>>
>>
>>108649032
>>108649046
I mean, we can vibeslop it at the frontend using just llama-server. Have the frontend track the streaming and then send abort and retry calls accordingly.
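Hedged sketch of what that client-side loop could look like (Python; the /completion endpoint and "content" field are how I remember llama-server's streaming API, double-check against your build, and the banned phrases and retry count are made up):
import json, requests

BANNED = ("the power dynamic", "shivers down")

def generate(prompt, url="http://127.0.0.1:8080/completion", max_retries=5):
    for _ in range(max_retries):
        text = ""
        with requests.post(url, json={"prompt": prompt, "stream": True, "n_predict": 512},
                           stream=True) as r:
            for line in r.iter_lines():
                if not line or not line.startswith(b"data: "):
                    continue
                chunk = json.loads(line[6:])
                text += chunk.get("content", "")
                if any(b in text.lower() for b in BANNED):
                    break  # dropping the connection should abort generation server-side; retry
            else:
                return text  # finished cleanly without hitting a banned phrase
    return text  # give up after max_retries
Retrying only helps if sampling is non-deterministic, so keep some temperature or vary the seed per attempt.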
>>
File: 1760353009011186.png (62.4 KB)
62.4 KB PNG
>>108648748
>>108648820
>tfw fell for qt
problem with other alternatives is I don't want to deal with niggabytes of dependencies just to coompile an exe
>>
>>
>>108649067
it's more likely than you think, see: >>108649082 >>108649094
>>
>>
>>
>>
>>
>>
>>108649134
Maximize means having as much context as will fit in your VRAM alongside the rest of the stuff that goes in there. If you go beyond that and your video driver starts using RAM as fake VRAM, you are fucked.
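For budgeting, a back-of-envelope estimate is weights plus KV cache. Minimal sketch with made-up layer/head numbers (not any real model's config) that ignores the compute buffer, SWA and quantized KV cache, all of which shift the result:
def weights_gb(params_b, bits_per_weight):
    # e.g. a 31B model at ~4.5 bpw is about 17.4 GB
    return params_b * bits_per_weight / 8

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    # K and V, per layer, per position, fp16 by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

total = weights_gb(31, 4.5) + kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, ctx=16384)
print(f"~{total:.1f} GB before compute buffers")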
>>
>>108647760
What prompt were you using for this? I tried some over OR but it didn't really have much of an effect.
I might be imagining it but enabling function calling feels like it makes K2.6 a bit less likely to draft.
>>
File: 0jl8ij.jpg (254.4 KB)
254.4 KB JPG
>>108648472
thx
>>
>>
>>
>>
>>108649113
It's RNG, can either hurt or improve. It may either polish the reply, or give a reply that is too calculated. But my main goal was to read what characters think, bc I remember reading thoughts on mistral 24 (or its finetunes) and I remember those thoughts being pretty hot.
>>
File: 2.png (35.3 KB)
35.3 KB PNG
Does opencode suck with local models or am i doing something wrong?
gemma does quite well with code in a normal chat, but to make changes it has to rewrite the whole code, which wastes time.
I was just looking for a local alternative to cursor.
>>
>>
>>
File: 1754052139511244.png (3.3 KB)
3.3 KB PNG
I love vibecoding btw
>>
>>
>>
>>
>>
>>
File: 1745950643409748.png (17 KB)
17 KB PNG
>>108649220
>>108649221
It's fine, gptsovits is just that bloated
>>
File: 1776693820821409.png (204.8 KB)
204.8 KB PNG
>>108649220
I'm the tauri shill btw
>>
>>
>>
>>
File: 1746192377444345.png (251.9 KB)
251.9 KB PNG
>>108649268
Don't tease me or next time it'll be on your favorite repo
>>
>>
>>
>>
>>
>>
>>
>>
File: 1753664322227367.png (119.3 KB)
119.3 KB PNG
Claude is so sovful lmao
>>
File: 1752426330847308.png (181.6 KB)
181.6 KB PNG
>>108649284
Why does Gemma quantize so poorly, anyway? Can't be a small MoE issue.
>>
>>108649395
Claude is fucking useless for low level programming now. Anthropic only released 4.7 so that they could get rid of 4.5 from the "old models" page because it was the only one with any brains. Fuck these FUCKING jews, man.
>>
>>
File: quant quality.png (10.3 KB)
10.3 KB PNG
What am I fucking up? I am making quants for qwen2.5-0.5B (as a test run) and I am getting unrealistically low KLD values.
This is the output for Q4_S with imatrix:
====== Perplexity statistics ======
Mean PPL(Q) : 23.647493 ± 0.276801
Mean PPL(base) : 22.937794 ± 0.267732
Cor(ln(PPL(Q)), ln(PPL(base))): 99.50%
Mean ln(PPL(Q)/PPL(base)) : 0.030471 ± 0.001167
Mean PPL(Q)/PPL(base) : 1.030940 ± 0.001203
Mean PPL(Q)-PPL(base) : 0.709699 ± 0.028636
====== KL divergence statistics ======
Mean KLD: 0.039864 ± 0.000178
Maximum KLD: 2.073086
99.9% KLD: 0.482974
99.0% KLD: 0.214572
95.0% KLD: 0.108790
90.0% KLD: 0.079510
Median KLD: 0.030400
10.0% KLD: 0.002482
5.0% KLD: 0.000481
1.0% KLD: 0.000034
0.1% KLD: 0.000001
Minimum KLD: -0.000004
====== Token probability statistics ======
Mean Δp: -0.442 ± 0.017 %
Maximum Δp: 74.763%
99.9% Δp: 25.752%
99.0% Δp: 12.706%
95.0% Δp: 5.537%
90.0% Δp: 2.828%
75.0% Δp: 0.324%
Median Δp: -0.007%
25.0% Δp: -0.840%
10.0% Δp: -4.239%
5.0% Δp: -7.603%
1.0% Δp: -17.368%
0.1% Δp: -33.798%
Minimum Δp: -62.207%
RMS Δp : 4.646 ± 0.038 %
Same top p: 87.162 ± 0.125 %
Mean KLD of 0.039864 feels extremely low for a Q4 quant of a tiny model? Around 0.062235 without imatrix on the same test data. Testing on a half-megabyte file I put together with literature excerpts from different languages and some code (maybe it needs to be bigger? though that feels unlikely).
I mentioned what I am running here: >>108648273
PPL(Q)/PPL(base) would put me at around 3% over the baseline, which feels more believable compared to reference data like pic related. Still, I don't understand what's going on with the KLD.
>>
File: 1756866734765818.png (262.9 KB)
262.9 KB PNG
https://huggingface.co/deepseek-ai/DeepSeek-V4
>>
>>
>>108649150
This was the relevant segment in my SillyTavern prompt that I refined after some trial and error:
># No-Drafts Rule
>Whenever you are planning out a response in your internal thoughts, you must NOT write complete drafts of the full response. You may plan ahead in as much detail as you want while preparing to respond and even summarize your planned response, but when it comes time to actually write passages you are considering to present to the user, you must do so outside of your thoughts and in the proper body of the response. This applies to full responses; drafting individual lines and passages are encouraged when you want to make sure you get things right.
It's part of a much larger system prompt and style guide for my RPs but that's the only part that concerns the reasoning specifically. It's under the "System" role and placed right after the core instructions. This worked with K2-Thinking and K2.5. I never saw a draft again and it didn't oversimplify its thoughts in more complicated situations.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1752327878277367.png (178 KB)
178 KB PNG
>>108646197
>1.1TB
Every release the models get bigger and bigger
yet more and more retarded
You can never hate techbros enough
>>
I just looked at the verbose logs of my llama.cpp (used with openwebui) and noticed that during tool call back and forth exchanges, the jinja is putting the results of tool calls at the top of the assistant's thinking/reply. Like this:
<|turn>user
...and search the web for news about it.<turn|>
<|turn>model
<|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|><|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>The user is asking about... I should do a search using "news about this and that".
Notice how the tool call is moved above the model's thinking about how to do tool calling? This doesn't make any fucking sense. Either the jinja is fucked up, OWUI is fucked up, or both.
GOD.
>>
>>
>>
>>108649571
>OWUI is fucked up
This is the root issue. It breaks chat history with thinking models by rendering its messages with <think> tags in the prompt it sends to the server. Backends expect the thinking to be a separate part of the message objects so that the chat template knows what to do with it. OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
>>
>>108649611
>OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
I should add this is even worse of a problem than it sounds, because most chat templates intentionally DISCARD the past thinking except in certain circumstances (like tool calls), but OWUI prevents that from happening, resulting in the entire chat's prior thinking bloating the context even when it's not supposed to be there.
>>
>>
>>108649601
Maybe the custom frontend vibe cooders were right kek. If I had used all the time I spent troubleshooting and configuring OWUI for my use cases, I might be somewhere nice about now.
>>108649611
Actually, I am running a reverse proxy already that strips out the <think> tag shit OWUI does. I might have to vibe coode it to also modify how it's constructing the json requests now kek.
>>
placed hermes inside a docker container and now gemma can read and write files to my downloads folder as well as launch a VM and put its own containers there. searxng, firecrawl, matrix/element homeserver so I can talk to it from my phone but I'll do this tomorrow.
haven't tried openclaw but hermes seems to be the most based. I think this will be useful for someone with my ADHD brain to have an AI assistant to keep track of things.
>>
>>108649623
Are you sure your proxy is stripping the actual reasoning or just the tags? Not sure how precisely you constructed the example there but this looks like the reasoning is being pasted in without any tags:
><tool_response|>The user is asking about...
To make it so that the jinja handles reasoning properly, you unfortunately do need to mess with the JSON: take everything between the <think> tags out of the "content" field and put them in the "reasoning_content" ("reasoning" is valid too for Gemma, the template works for either) of the same message. Then delete the tags and the reasoning from the content field. Make sure there's no duplicate content being sent. This is the way agent harnesses construct their requests and the way the jinja expects to see the chat history.
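A minimal sketch of that transform (Python; assumes the reasoning sits between literal <think></think> tags the way OWUI emits them, and that your backend accepts reasoning_content):
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", flags=re.S)

def fix_messages(messages):
    fixed = []
    for msg in messages:
        msg = dict(msg)
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            m = THINK_RE.search(msg["content"])
            if m:
                # move the reasoning into its own field and strip the tags from the content
                msg["reasoning_content"] = m.group(1).strip()
                msg["content"] = THINK_RE.sub("", msg["content"], count=1).lstrip()
        fixed.append(msg)
    return fixed
Depending on the template you may prefer to drop the reasoning from older turns entirely instead of moving it.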
>>
>>108649412
I think this was a false alarm because I got misled by these figures >>108644453?
I checked more data on the internet and it doesn't seem as outlandish now.
No idea why Gemma's are so high.
>>
>>108649659
also got it to download shit with ytdlp and use ffmpeg to make a 1h long soundtrack of anime openings. when I can get it to send shit to me via matrix it will be great. as a noncoder brainlet this is cool
>>
File: quant comparison gemma 4 31.png (295.4 KB)
295.4 KB PNG
>>108649680
Fuck I was supposed to post this chart, not quote that.
>>
>>
>>108649571
Try the interleaved jinja
>https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
>>
>>108649610
>>108649699
Since several months ago, the sweet spot is 2048, IIRC.
But the savings from using 512 tend to be worth it.
>>
>>
>>108649677
>Are you sure your proxy is stripping the actual reasoning
Yes, so first, specifically, my proxy strips entire reasoning blocks (including a newline if there is one) of messages older than the latest user message. The reasoning of the current assistant's message while it's doing tool calling is not being stripped, only the <think> tags. Because of course it needs its reasoning for its current task.
I tested that first, saw the weird tool-thinking order, and then tested without the proxy, and after determining that it was happening normally too, I made the original post that doesn't mention I used a proxy.
Anyway yeah I'll look at the json requests.
>>
>>108649610
>>108649699
>>108649715
What's the con of going smaller?
>>
>>108649620
>>108649699
>>108649715
thanks anons
>>
>>
>>108649571
I have implemented tool calling with text completion.
Sometimes there is some text before tool call but I have never seen it afterwards.
Workflow goes like this:
>model calls tool with
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|>
>I detect the tool call and execute the tool part; when it's ready I append the response bracket with the result back
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|><|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>
>then I submit this to the model, and once inference is complete it has swallowed the entire tool call and replaced it with its own reply.
>I then make extra sure its response is clean
Not sure if I was explaining this cleanly enough. There shouldn't be any trace of the original <tool_call> stuff in the past context history after model has cooked up its reply from the tool result itself.
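In Python the loop looks roughly like this; complete_fn and the tools dict are placeholders you'd wire to your own backend, and the bracket strings just mirror the format shown above:
import re

CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", flags=re.S)

def agent_turn(prompt, complete_fn, tools, max_steps=8):
    out = ""
    for _ in range(max_steps):
        out = complete_fn(prompt)              # one text-completion request to the backend
        m = CALL_RE.search(out)
        if not m:
            return out                         # no tool call: this is the final reply
        name, raw_args = m.group(1), m.group(2)
        result = tools[name](raw_args)         # execute the tool
        response = f"<|tool_response>response:{name}{{value:{result}}}<tool_response|>"
        # keep the closed call plus its response in context and let the model continue
        prompt += out[: m.end()] + response
    return out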
>>
>>
>>108649860 (You)
>>108649866 (You)
To add: I'm following exactly what google has demonstrated in their doc.
I think faulty tool definitions can create issues and leaks.
Here's an example of my shit, simple url access:
><|tool>declaration:access_url{description:<|"|>Opens a website directly.<|"|>,parameters:{properties:{url:{description:<|"|>Direct URL to website, e.g. https://github.com/ggml-org/llama.cpp<|"|>,type:<|"|>STRING<|"|>}},required:[<|"|>url<|"|>],type:<|"|>OBJECT<|"|>}}<tool|>
>>
>>
>>
>>108649184
Just werked for me with MiniMax M2.1, M2.5, Qwen3.5 397B, and GLM 5.1
Also I feel like I've heard that OpenCode's web frontend stuff all gets funneled through their cloud servers, you might want to check on that depending on how paranoid you are. CLI version is less botnet but I've still got it behind a restrictive proxy so it can't phone home
>>
>>
>>
In ST, is there a way to continue thinking? If I pause it to edit something, it always starts prose immediately on resume. Clearly it's set some kind of hidden <endthink> tag I can't see or remove, but I'd like to if it's possible.
>>
>>
>>
>>
>>
Ogey. So I extracted the jsons. I got the docs. I got the jinja. I got the logs. I constructed a proompt. And I fed it to Gemini Pro in Studio. It failed to produce a good reverse proxy that worked zero shot. Then I tried Claude Sonnet and it worked.
Multiple tool calling seems to just werk now with no errors at all in a few tests I did. I looked at the logs and it is correctly removing old reasoning traces, doesn't have any <think> tags, has Gemma's expected reasoning tokens, and also in the case of a conversation with old reasoning traces + tool use, it keeps the old tool calls there, while the reasoning is gone, as expected.
I did use it with the jinja mentioned here which appeared to help: https://huggingface.co/google/gemma-4-31B-it/discussions/62#69e2e058d3dd9875d6b4fc31
I have not tried >>108649705 and I guess I will give it a try to see how it does. Anyway here is Claude's script for anyone that wants to test and see if it has issues or fixes everything.
https://pastebin.com/SCQsBe7W
No I didn't read its code.
>>
>>
>>
>>108650192
gemma has adaptive reasoning which can fuck herself into lalalalala. if you're a st retard, put some variant of
> [ooc: use max reasoning]
into your "post-history instructions"
pretty sure using post-history fucks your prompt cache reuse but I'm also pretty sure llama-server's prompt cache is fucked to begin with so
>>108650155
yeah, I should have done better. ah well.
>>108650197
local model doko
>>108650198
yes
>>
>>
>>
>>
>>108650209
My weird issue is that using speculative decoding just disables the thinking process itself, not that it stutters or lalalala or anything of the sort.
Basically:
gemma-4-31B-it-Q5_K_L + thinking works.
gemma-4-31B-it-Q5_K_L + 26B A4B Q4_K_L for speculative decoding + thinking runs too, but I never get any thinking going on.
>>
>>
>>
>>108650248
so, speculative decoding should not alter the output at all
conceptually, speculative decoding causes the main/target model to infer the N draft tokens in parallel, and just use whichever ones are correct
if the target model differs from the draft model (e.g. because of samplers or whatever) then it'll still use the target model's output, it is entirely lossless
I don't understand why your setup would produce the output you described. I've run 31B@q8 with a 26b@q4 draft model and it's worked fine, so..?
gemma4 does have a bug with adaptive reasoning where it elides reasoning but i've never seen it before 30k context (and even then it's rare before 60k context).
can you post your llama-server invocation?
>>
>>108650275
It's not llama, it's kobold, but here it is:
./koboldcpp-linux-x64 --model ./gemma-4-31B-it-bartowski-Q5_K_L/google_gemma-4-31B-it-Q5_K_L.gguf --flashattention --usecuda --gpulayers 60 --contextsize 8000 --jinja --maingpu 0 --tensor_split 1 0 --chat-template-kwargs \{\"enable_thinking\":true\} --draftmodel ./gemma-4-26B-A4B-it-bartowski-Q4_K_L/google_gemma-4-26B-A4B-it-Q4_K_L.gguf --draftgpulayers 99 --draftgpusplit 0 1 --draftamount 8 --batch-size 512 --host 0.0.0.0 --port 8080 --skiplauncher --debugmode --gendefaults \{\"top_k\":0\}
>>
>>
>>
>>108650295
I've never used kobold, but I don't see anything in those args that would cause the behavior you're describing (besides the adaptive reasoning bug)
I would try the [reasoning effort: max] workaround and see if that fixes your problem? I'm not familiar with kobold but there's probably a way to put it at the end of your context... unfortunately putting it in the system prompt doesn't help...
>>
>>
>>108650197
Hmm alright so I've been testing more and I think this probably isn't solvable with the reverse proxy, but OWUI seems to throw away previous tool calls and reasoning traces after finishing a response with a lot of tool calls. Like the latest reasoning is kept in the expandable think block but it looks like everything else just never existed. This would be a problem if, say you were doing web searches, and the information from those searches matters to your further conversations in the chat, like manuals/documentation. The model would have to redo the search, or just be operating blind.
Fuck I should've gone straight to vibe cooding my own shit the moment I smelled the bloat from this garbage. I think I will just do that. This piece of shit will have to do temporarily though.
>>
>>108650197
Good work. For anyone else having problems: that reverse proxy will fix OWUI's prompting for all reasoning models, not just Gemma.
Also don't be fooled by OWUI's "Filters" function if you get tempted to re-implement this there. I tested and the Filters in OWUI aren't applying to the most recent assistant message during tool calls, even though they'll apply fine to past messages. So just use a reverse proxy like this one if you don't want to be assed to fix their source code yourself or wait for them to figure it out and fix it eventually.
>>
>>
>>
Tool calling in Gemma 4 E4B under OpenCode now works for me with this chat template
https://gist.github.com/bbrowning/c584eb2dbd79e4cc9ecedf92eee2d135
https://github.com/anomalyco/opencode/issues/21034#issuecomment-4267446944
>>
>>
I tried this >>108649705, and the one linked >>108650197.
Both seemed to work, but the one on ggml-org (still paired with the reverse proxy) is slightly less on spec with my setup.
...
<|turn>model
<|channel>thought
Let's start by searching.<channel|><|tool_call>call:search_web...
Whereas the huggingface rando's template does
...
<|turn>model
<|channel>thought
Let's start by searching.
<channel|><|tool_call>call:search_web...
According to
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
the second one is correct.
>>
Kinda afraid to ask but I can't seem to spot the black magic part of the following args.
Stole it from some anon a couple threads ago:
./llama-server --host 0.0.0.0 --port 8080 --model 'gemma-4-31B-it-IQ4_XS.gguf' --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 -c 16384 --flash-attn on --parallel 1 --no-slots --swa-checkpoints 0 --keep -1 --reasoning auto -kvu -b 2048 -ub 128 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 55 --metrics --fit-target 128 --poll 0 --threads 4 --chat-template-file 'chat_template_gemma4.jinja' --alias Gemma4
I can put 55 layers on my 16gb card and get decent speeds. 11 t/s. Which is totally fine for me.
If I try to replicate this with koboldcpp though I will offload like 33 layers instead and get the speed you can imagine.
Both use around 15gb of vram. Can somebody tell me if there is a specific flag I seem to be missing?
I also couldn't find a setting for every arg, so maybe it's just not possible.
>>
>>108650604
if you're concerned about vram the only flags that matter (beyond the weights) are
> --cache-type-k q8_0 --cache-type-v q8_0
you're using a 31B model with Q4 weights, so that's 31B*(4/8)=15.5GB of RAM; no surprises there. Try using a lesser quant or buying more VRAM.
>>
>>
>>
>>
>>108650634
>>108650627
>>108650613
Hmm. If I set "use swa" I get to 52 layers.
Still 3 layers less and therefore slower.
Is that because of that checkpoint flag? I don't see anything either in the args or ui for that.
Gotta use llama.cpp for now I guess. Appreciate the help.
>>
>>
>>
>>
>>