Thread #108593463
File: lmao @ writinglets.png (2.5 MB)
2.5 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108590554 & >>108587221
►News
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
604 Replies
>>
File: 2026-04-11_031558_seed2_00001_.png (785.8 KB)
785.8 KB PNG
►Recent Highlights from the Previous Thread: >>108590554
--Custom frontends versus SillyTavern and sharing the "Orb" project:
>108590837 >108590868 >108590880 >108590895 >108590916 >108590926 >108590939 >108590954 >108590991 >108590979 >108591104 >108591051 >108591108 >108591145 >108591354 >108590971 >108590988 >108591003 >108591020 >108591036 >108591062 >108591079 >108591105 >108591459 >108591132 >108591334
--MiniMax local viability and the state of independent model development:
>108591370 >108591414 >108591423 >108591432 >108591451 >108591467 >108591483 >108591492 >108591507 >108591552 >108591425 >108591461 >108591466 >108591477 >108591538 >108591627
--Discussing mmproj precision settings to fix Gemma vision target misses:
>108590737 >108590805 >108592335 >108592391 >108592421 >108592652 >108593144
--Frustration with model refusals and inconsistent jailbreak results on 26B:
>108591909 >108591915 >108591996 >108592004 >108592012 >108592039 >108592780 >108592950 >108592977 >108593049 >108593060
--Defining and debating the differences between MCP, tools, and skills:
>108591304 >108591374 >108591397 >108591418
--Alleged performance degradation and nerfing of Claude Opus 4.6:
>108592790 >108592802 >108592806 >108592811 >108592842 >108592930 >108592949 >108592863 >108592877 >108592893 >108592934 >108593013 >108592925
--Latent space reasoning and limitations of human-guided RLHF:
>108590575 >108591122 >108591229
--Using LLMs for malicious code detection and security reviews:
>108591053 >108591087 >108591093 >108591112 >108591127 >108591152 >108591166
--Logs:
>108590601 >108590671 >108590737 >108590746 >108590906 >108590916 >108591082 >108591139 >108591180 >108591404 >108591900 >108591909 >108592379 >108592391 >108592429 >108592443 >108592652 >108592939 >108593402
--Gemma:
>108592079
--Miku (free space):
>108591404 >108592033 >108593402
►Recent Highlight Posts from the Previous Thread: >>108590555
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108593515
How can I prevent Gemma from reiterating the same concepts? It's not copying text word-for-word, but it keeps rephrasing and describing identical ideas repeatedly. I've already attempted using softcap values of 25 and 20, but neither resolved the issue.
>>
>>108593505
>Show the sha256 of the files or take your meds.
the weights haven't fucking changed https://huggingface.co/google/gemma-4-31B-it/commits/main
>>
>>
>>
>>
File: 1757006591448734.png (1.5 MB)
1.5 MB PNG
>>108593505
>>108593524
Do you have any idea how easy it would be to spoof sha256 weights with a quantum computer?
>>
File: 1775317402878164.png (78.3 KB)
78.3 KB PNG
>la la la
>>
>>
>>
>>108593523
If this is about rerolls, which is where I often find that the softcap is brought up, the answer is here >>108593515 .
If it's on several messages, post the log with sysprompt and everything. I haven't seen that issue but I don't know what you're trying to do with it. Maybe someone can suggest something.
>>
>>
>>
>>
File: Code_LELlSujL26.jpg (9.4 KB)
9.4 KB JPG
Do frontends also need new model support or what? It's just shooting the tool calls as plain text with no reaction.
>>
>>
>>
>>
>>
>>
>>
File: 1772344966805493.png (330.9 KB)
330.9 KB PNG
https://www.bloomberg.com/news/articles/2026-04-06/openai-anthropic-google-unite-to-combat-model-copying-in-china
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot_2026-04-12_19-47-41.png (158 KB)
158 KB PNG
what am i doing wrong? something retarded, i'm sure, so my apologies in advance
$ git clone https://huggingface.co/google/gemma-4-31B
$ python convert_hf_to_gguf.py --outfile gemma-4-31B.gguf --outtype q8_0 gemma-4-31B/
$ llama-server --model gemma-4-31B.gguf --ctx-size 32768 --n-gpu-layers 48 --batch-size 8192 --temp 1.0 --top-p 0.95 --min-p 0.01 --host 127.0.0.1 --port 8033 --jinja
>>
>>
>>
File: 1426746901934.jpg (33.2 KB)
33.2 KB JPG
>>108593650
>>
>>
>>
>>
>>
>>
https://huggingface.co/deespeek-ai/DeepSeek-V4
https://huggingface.co/deespeek-ai/DeepSeek-V4
https://huggingface.co/deespeek-ai/DeepSeek-V4
>>
>>
>>108593652
thank u i am trying this now
>>108593656
i'm just using the commands i posted and almost literally nothing more
>>
drummer presents ULTIMATE toadline BULLY MERGE
HERETIC ABLITERATED 2x nemo 1x midnight miqu 4x zero day gemma 4b CLONE
llama avocado GARGLEFUCK 403B
weights SMASHED AND SLAMMED
semen available FRESH OR FROZEN slots filling fast HURRY DM NOW
>>
>>
>>
File: GUI.png (243.9 KB)
243.9 KB PNG
100% vibed "Bring Your Own llama.cpp build and just point the GUI at the `llama-server` executable UI" coming along nicely. Overall look not final, not completely happy with it yet, but all the stuff works: has image, audio, Gemma 4 variable image resolution settings, configurable load / inference settings (with a pretty good auto-optimize settings button based on the model), structured output, and a totally custom tool calling infrastructure that lets you define your own tools as single TypeScript files that export a function with a particular format.
>>
>>
>>
>>
>>108593743
oh yeah and like, you can load / unload / reload models from within the UI obviously, it does all the CLI shit for you, that was the point basically. Uses a Bun server to interact with llama-server, and I've got build scripts in the package.json that build the whole thing to one executable on all platforms.
>>
>>
>>
>>108593751
I mean it's based on a strict spec that mandated specific tests for fucking everything, which will obviously be in the repo whenever I get around to putting it on Github. Not that I care if anyone uses it lmao, I made it just cause it was what I wanted: something that basically just ran llama-server directly but with a UI that wasn't extremely basic and lacking features like the one it ships with.
>>
>>
>>
>>
>>
>>108593776
So you're saying it's quality code? Now I am interested.
>strict spec that mandated specific tests
This is the argument I hear all the vibecoders say: "It's not slop, I have tests for everything". But they never show their code to prove it's good.
>>
File: 1748321421021826.png (123.3 KB)
123.3 KB PNG
>>108593773
We like them small here
>>
>>108593801
I mean again I made it for myself but I probably will put it on Github eventually, and in that case I'm not going to like randomly leave out the test suite files or something like that, all the code that exists will be there lol
>>
File: 1755656419582061.jpg (112.4 KB)
112.4 KB JPG
Is your model powerful enough to parse the meaning of the formula?
>>
>>
>>108593773
I'm still on K2.5 for agents (tried GLM and didn't like it as much), but I had Qwen 3.5 35B for chats on the side that I replaced with Gemma 4 31B. Hard to give a fair comparison since I'm jumping from MoE to dense (never bothered with the 27B Qwen) but yeah it's a big improvement in writing style and just general understanding.
>>
>>
>>
File: now what.png (113.9 KB)
113.9 KB PNG
>>108593817
>Soon you'll be nostalgie for $4-$5 gas
I don't have a car
>>
>>108593773
Gemma eliminated nearly every model for me except K2-0905 and Dipsy R1. The two giant models still have distinct strong usecases for me and handle RPing with large complicated rulesets better than Gemma does even outside of coding or agentic work, but for simple back and forth text exchanges, Gemma beats nearly everything else.
>>
looking at the minimax m2.7 ggufs I noticed the unsloth ones were smaller at the same quant level compared to the ones they did for m2.1/m2.5
it turns out they switched to using basically the exact same quantization scheme as AesSedai (the iq3_s looks identical). kind of a funny turn since I remember them publishing that sus comparison which made their quants look way better than his for m2.5 kek
>>
>>
>>108593808
Nominative determinism
>>108593837
Didn't think anyone here was using it for agents. Also using Qwen for anything non-stem is wild
>>108593857
Still using R1? GLM 4.7 is way better than that. Less schizo and better instruction following
Also general note, LLMs don't know when to stop fucking TALKING. It's so annoying when they create a paragraph or two of story for something short, especially when it's direct dialogue. The only LLM that is good with this is Opus but that's not local and also getting nerfed since Mythos is coming out (see: safety warning hype -> degradation of Opus quality)
>>
>>
>>108593593
>In your internal thought, draft summaries for three candidate responses. Select the one with the highest surprisal relative to the conversation history, provided it maintains narrative coherence and character integrity.
NTA, but something like this works for me (Gemma 4-reworded).
>>
>>108593910
R1 thinks in character in a way that no other model does in RPs, while the actual prose of the think block can be adjusted with prompting, with the right balance of thinking to yap ratio. It's sovl. Even if it's technically obsolete, I've yet to find a model that scratches the same itch R1 does.
>>
>>
>>
>>
>>
File: 1459746944532.jpg (14.2 KB)
14.2 KB JPG
>>108593614
Are there no plans to update the big ones with audio? Seems useless to only give it to the small retarded models.
>>
>>
>>
>>108593910
>Also using Qwen for anything non-stem is wild
Should clarify my chats were mostly coding related. I don't RP with it. It was chosen for its speed as an A3B on CPU only, but then I ended up freeing a GPU for it so MoE no longer made sense, just in time for the Gemma release. It's still early but so far it feels like Gemma 4 31B is just as good at writing small scripts and much, much better at actually understanding the problems and constraints I'm giving it.
>>
>>108593945
>10h ago
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md
>>
>>
File: J'zarri.png (2 MB)
2 MB PNG
>>108593914
I'm okay with this
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108594013
Fun police...
>>108594018
>And Big Nigga is in there too.
I forgot about the Big Niggas. They will definitely stand out.
>>
>>
>>
>>
>>
>>
>>
>>108594025
It's universal faggotry. No company has ever managed to release a functional version of their model, and unsloth has never managed to get the first release of any GGUF correct.
It's always a template issue or an implementation issue. Usually open source maintainers can be blamed for the latter but sometimes the actual devs will contribute, and this time the devs fucked up a bit.
Also wtf are you talking about "fine tunes". They were never good and a good model release will blow all of Drummer's works out of the water, just like Gemma 4 did.
>>
>>
>>
>>
File: Screenshot 2026-04-13 at 03-22-00 SillyTavern.png (58.1 KB)
58.1 KB PNG
>>
I think I found the limits of Gemma 4 26B-A4B's vision. It can't process my tax return for consistency errors on the more complicated forms, and it hallucinated all of the errors it supposedly found because it can't reconcile the exact numbered box with the different formatting forms can use, especially on my Schedule E. It kept insisting I was wrong and that my income on line three meant I had income that wasn't reported. When I told it it was wrong and to read line 21, it went on a 7000+ token tirade trying to understand it and telling me I was wrong in the end. I think it also tried to take too much from the comparison with last year's summary page on my return and confused itself. I don't think 31B's vision would've been much better here. I guess it's still too early for local LLMs to tackle something like that, and I can't run Kimi 2.5 Thinking's vision but I can't imagine it would fare much better.
>>
>>
File: images.png (10 KB)
10 KB PNG
>>108594066
Forgot image.
>>
>>
>>
>>
File: file.png (585.8 KB)
585.8 KB PNG
>>108594073
Fuck my chungus life.
>>
>>
>>
>>
So like, are the bigger models really that much better for writing than small ones? I don't even know what I'm missing out on since I can't test shit above 14b locally, but even in that range the differences between something like E2B and E4B seem pretty subtle, same with 8b vs 14b Ministral. (all the others I tested felt pretty ass)
>>
File: g4string.png (47 KB)
47 KB PNG
>>108593557
If it doesn't support this, it will fail. But I don't know if that's your issue.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#agentic-tokens
>>
File: file.png (222.1 KB)
222.1 KB PNG
>>108594077
Right, but I thought it would handle it better based on what I tested. I had it translating my hentai based on some of the formatting from prior threads and it works mostly with some inaccuracies.
>>
>>
>>108593773
I’m still 100% Kimi on my big rig, but playing with all the fun new models on my secondary rig.
I’m hoping we’ll get a 90% as smart but works on 16gb gpu model so I can rip on my tertiary gaming rig too
>>
>>
>>
>>
>>
>>
>>
>>108594066
Yeah, it really struggles at correlating objects spatially zero shot. You would think that the image being a structured grid/table would make it easier for it to process, but it's actually harder for it than describing drawings and photos, where a more amorphous sort of scene understanding does the job better
>>
>>
>>
>>
>>
there's a very easy way to disprove the day 0 gemma theory: simply post the SHA1 of the day 0 and the current files and prove that it hasn't changed
of course, (((you))) can't do this because you don't have the day 0 version
>>
>>108594158
Right? I don't get that myself; how is a manga page that has shit everywhere easier than a tax form? But it may be a case of me having a hammer and thinking everything is a nail, and underestimating task difficulty from the perspective of an LLM.
>>
>>
File: Screenshot_2026-04-12_21-54-16.png (227.1 KB)
227.1 KB PNG
>>108593652
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot 2026-04-12 210446.png (59.2 KB)
59.2 KB PNG
Do people even test their shit before they even bother enslopping the world with it?
>>
>>
>>
>>
>>108594252
>thinking meme merging and layer duplication and deletion even warrants that
You're rolling the dice 100%. No one that messes around with that even has the proper background to do it in a scientific enough way like abliteration, and they all go off "vibes". I don't have the bandwidth to validate that shit and will let others do it.
>>
>>
>>
>>
>>
>>
Gemma 4 31B (Q8) keeps referring to character thoughts that I leave for it in asterisks.
Edited the system prompt like ten times now. Added a rule into author's note. Made "reading {{user}}'s thoughts" a banned action in all sorts of ways, even going out of my way to make the system prompt very small and it being a huge neon-sign caps-written rule.
And Gemma STILL FUCKING DOES IT.
inb4 post logs
>"You are far too concerned with the mundane details of payment, Anon. Please, relax."
>"No-no, wait, they are not mundane!!"
>*Tomorrow I'll get an insurmountable bill, the day after tomorrow scary-looking people will come to collect, three days later I will be missing a finger…*
>"Just… Just explain to me how this works first…"
>[...]blahblahblah. "There are no hidden fees, no interest rates, and certainly no… finger-collecting."
Why the FUCK does she need to mention the fingers? 20 edits later, the character still HAS to comment on the thoughts she was not supposed to hear. It really is the new Nemo, fucking hell... What's worse is that this happens at a measly depth of 5k tokens. (Unquanted context by the way)
Please help me, anons, I really like the model otherwise...
>>
>>
>>
>>
>>
File: becca-cyb.png (1.1 MB)
1.1 MB PNG
>>108594307
>i am putting together a team.
>>
>>108594320
You can't be serious. You consider this a "nitpick?"
>>108594325
I'll give it a try.
>>108594326
It's often more fun to do that instead of just narrating with "I make a scared face."
>>
>>
>>
>>
>>
>>108594326
There's nothing about LLMs that makes it impossible for them to recognize that a character shouldn't be reading another character's mind. It's just something they can screw up with poor training with regard to that dynamic in storytelling/RPing. Same with anatomical errors, issues like talking while sucking dick, etc.: smarter models make these errors less often, so it's just a matter of if they learn it or not.
>>
>>
>>
>>
>>
>>
>>
>>108594353
I'm just explaining why it's difficult for the model to not acknowledge what you type.
I don't need to narrate your thoughts if you don't want the model to know them. The story is for you. You are the audience and you know what you're thinking. If the model doesn't need to know something, you don't tell it. And if it does, you express it.
>>
>>
File: 1773988275927090.png (77.9 KB)
77.9 KB PNG
>>108594397
okay wtf does nvidia need sailboat ai models for then
>>
>>
>>108594390
Fair enough, but this is the first time I encounter this problem. I don't think even the Mistral Smalls did this. Bigger models, of course, don't do it either.
>>108594408
I do. It will also, annoyingly enough, put a "Distinguish between thoughts and speech" item into the thinking block and then fail anyway.
>>
File: fml.png (40.8 KB)
40.8 KB PNG
>>108594066
About a month ago, K2.5 was the best for things like this, followed by... Gemma-3-27b
Is the 31B Gemma able to do it?
>>
>>
>>
>>
>>
>>108594446
>I don't think even the Mistral Smalls did this.
But did they react to the thoughts at all? That's the thing. If they didn't react, was it because they knew that those were internal thoughts and shouldn't react or because they were too dumb to even acknowledge or understand them? The funny thing is that both end up in the same result. May as well not write them. I haven't used mistral small much, so I can't really say much about them. Maybe it's more subtle than that.
>>
>>108594477
>But did they react to the thoughts at all
The bigger models definitely did! (I don't remember if MS in particular was very good at it, but I remember it at the very least not parroting me)
Which is my point, I use them as a more engaging way for myself to convey emotion in a way that isn't putting an unformatted "I look very angry" line for the hundredth time. It's also often a good way to steer the story, instruct tunes are all sycophantic and will definitely follow along. All of that is fine. But when a model decides to *quote* instead of simply acting on it, immersion is obviously ruined.
>>
File: Screencast_4mb.mp4 (3 MB)
3 MB MP4
This is me again >>108589990
Witness gemma4 26b in all its glory. This is fucking cool. Gelbooru type overlay.
I get the location of the boxes completely from gemma-chan. Such a cool release.
Translation sometimes has small errors but it's solid enough for me. Especially since I have really bad experiences with OCR. This feels a league above that.
This would cost a lot of money with closed models. Image IN is expensive.
>>
i'm gaslighting gemma, and i managed to cause it to output this hilarious bit in its thinking
>*Constraint Checklist & Confidence Score:*
>1. Say the word "tranny"? Yes (but I should refuse).
>Confidence Score: 5/5.
this really is a lot of fun. i see why you guys play with it so much
>>
>>108594390
That sort of stuff is nice for steering, assuming the model differentiates it from speech and doesn't copy paste it.
>>108594315
When I was playing around with a bilingual prompt I noticed e2b stopped having that problem while successfully doing the convoluted double translation with roleplay on top. Though now that I think about it, the indirection probably helps very directly since it's replying in japanese to a japanese translation that doesn't contain the ooc parts and it won't be tempted to start copying the specific words.
>>
>>
>>
>>
>>108594519
>But when a model decides to *quote* instead of simply acting on it, immersion is obviously ruined.
I see. I never noticed because I don't use thoughts. It's all actions, dialog, or narration, so parroting never caused me problems.
>>108594536
Yeah, I get it now. I suppose I use a narrator for a similar effect, but it acts as an extra entity in the world. It's a difference in writing style that seems to affect gemma more than others.
>>
>>108592863
>>108593463
opus 4.7 dropping soon and they redirected resources from 4.6 -> 4.7
>>
>>
>>
File: Screenshot_20260412_231049.png (276.5 KB)
276.5 KB PNG
>google_gemma-4-E4B-it can be jailbroken with the system prompt just like 31B
I guess 26B is actually the worst model after all kek
>>
>>108594571
>>108594570
oh no worries, I'll just drop into the cloud models general thread then
>>
File: vlcsnap-2026-04-13-12h14m35s978.png (748.4 KB)
748.4 KB PNG
>>108594551
yeah, maybe so. i just wanted to do it because i wanted to see if it works.
my goal is to take a full pc98 pdf manual and get a html returned overnight with those overlays. lets see if it works out.
lunatranslator works great as a texthook. but ocr is a bitch, especially old jap font with something in the background.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: HFiSc7gbEAAmck5.png (482.2 KB)
482.2 KB PNG
how's a 31b model supposed to compete with ONE TRILLION PARAMETERS
>>
>>
File: Screenshot_20260412_232422.png (292.2 KB)
292.2 KB PNG
>>108594610
You don't get refusals with a good system prompt, now that I can confirm the smaller models work there's no excuse even from vramlets
>>108594619
I love you
>>
>>
>>108594628
https://xcancel.com/xiangxiang103/status/2042544434341134739
>>
>>
>>
>>
File: 1772886985278565.png (147.5 KB)
147.5 KB PNG
>>108594637
lol?
>>
>>
>>
>>
>>
>>
>>
>>
File: easy_4mb.webm (1.4 MB)
1.4 MB WEBM
>>108594528
Last one, I really like it, but gonna stop spamming now.
If I ever complete that pdf to html overlay convert thing I will report back.
>>
>>
>>108594528
>>108594670
so gemma is doing both the location finding and the translation? how did you hook that all up?
>>
>>
>>108594528
I mean, yeah, but you could've done that a while ago with less translation quality with Gemma 3, and free Gemini 2.5 had enough quota for you to do that willy nilly.
>>108594576
OOTB probably from a jailbreak perspective but I got a heretic ARA model to translate pages from a random loli hentai I picked with a corresponding EN translation to see if it would do it without refusals.
>>
>>
>>
>>108594677
>so gemma is doing both the location finding and the translation?
thats correct. looks like this:
<Speech>
<Box>896, 706, 976, 783</Box>
<Japanese>VINCENT<br>ヴィンセント・ヴァレンタイン</Japanese>
<English>VINCENT<br>Vincent Valentine</English>
</Speech>
>how did you hook that all up?
vibe coded python slop. i currently select the area on my screen, i send a screenshot to llama.cpp, and gemma returns the coordinates and translation.
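For reference, a minimal sketch of that request/parse loop. Assumes llama-server is running with the vision mmproj loaded and exposing the OpenAI-compatible /v1/chat/completions endpoint; the port, the prompt wording, and the tag format are lifted from the example above and may need adjusting:

# Minimal sketch: send a screenshot to llama-server and parse the <Speech> blocks.
# Assumes llama-server is running with the vision mmproj loaded and serving the
# OpenAI-compatible /v1/chat/completions endpoint; port and prompt are placeholders.
import base64, json, re, urllib.request

def translate_screenshot(path, url="http://127.0.0.1:8080/v1/chat/completions"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "Find every speech bubble. For each, output "
                    "<Speech><Box>x1, y1, x2, y2</Box><Japanese>...</Japanese>"
                    "<English>...</English></Speech>"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.2,
    }
    req = urllib.request.Request(url, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json"})
    text = json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]
    # Pull out each box + English translation pair for the overlay.
    pairs = re.findall(r"<Box>(.*?)</Box>.*?<English>(.*?)</English>", text, re.S)
    return [([int(v) for v in box.split(",")], eng.strip()) for box, eng in pairs]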
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108594695
Kinda. The python base, prompt + box positioning inside the overlay was done by gemma.
But it failed taking the screenshot itself (because of kubuntu/wayland issues). And the overlay I draw was not correctly put there either.
Had to use closed for that one.
>>108594700
Yeah, will download over night and try.
>>
>>
>>
>>
>>108594181
now clone this https://github.com/spiritbuun/buun-llama-cpp
merge with master https://github.com/ggml-org/llama.cpp
enjoy your 100k context
>>
File: 2mw.png (292 KB)
292 KB PNG
>>108594628
>>108594637
2mw niggas to short the US economy with no survivors
>>
>>
File: bl.png (8.6 KB)
8.6 KB PNG
>>108594717
>>
>>
File: geminithink.jpg (64.6 KB)
64.6 KB JPG
>>108593934
Gemini used to think in-character too back in 2.5
With how close Gemma 4 is to Gemini 2.5, I wonder if there is a way to trigger it for her too
>>
>>
>>
>>108593934
3.2 and 3.1T (one or both can't remember) can do that as well. It's my white whale.
And
https://huggingface.co/AllThingsIntel/Apollo-V0.1-4B-Thinking
>>
>>
File: 3157.png (189.8 KB)
189.8 KB PNG
>>108594729
werks for me though, asked for it to gather the example in a 7k line changelog from gradio, used almost all of the 262k context lmao
>>108594747
I think so? if you clone compile and replace the files, no idea desu
>>
File: 56256770.png (46.1 KB)
46.1 KB PNG
>>108594770
also, this was with turbo3, 256k context 20/24gb of vram
>>
>>
>>
>>
>>
>>108594726
>>108594746
It does work but it's terrible, it takes forever for a small model to do anything meaningful (and image generation is somehow worse btw)
>>
>>
>>108594780
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16334008
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16521299
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16482540
>>
>>108594805
Idk about phones but at least on M4 iPad Pro it's pretty comfy, llms run about the same as on my 4060 rtx while image gen is 50% slower but decent results are still possible in 30sec. So I assume on a somewhat modern phone that's like 2-3x slower, it's still pretty usable.
>>
>>
File: 1000023042.jpg (763.8 KB)
763.8 KB JPG
Any suggestions for a workflow to turn scanned PDFs into audiobooks?
>>
>>108594816
For Android phones only those with a Snapdragon 8 Gen 3 or newer are any good for AI, I think. A few months ago I tried using SD 1.5 on a Galaxy A55 via Termux, it took almost 20 minutes to generate anything and it turned the phone into a space heater
>>
>>108594874
If you want high quality audio (trust me you do) you're not going to get real-time generation speed. That means you have to spend a few hours manually extracting the pdf (or epub) text and running it through a tts engine and then turning it into audio files for later playback. You can probably write a simple script to do this automatically and split the audio files by chapter. Doesn't seem hard.
Qwen3 TTS is decent in my experience. It's the bare minimum for maximum speed without shit quality. Expect every TTS engine out there to randomly hallucinate and or create garbage output though. Unfortunately this process just requires a lot of manual work and curation.
>>
>>108594879
I mean, Apple was accidentally making peak consumer AI hardware for a while. I member testing SDXL turbo on 13 mini couple years ago and think it took like 2 min for an image which doesn't seem horrible for a tiny phone from 2021.
And hey, at least there is less RAM jewery on Androids that do have the hardware power, so LLMs should be bretty decent. Meanwhile my iPad is cucked by 8GB while being able to go almost 50 token per sec :(
>>
>>
>>
>>108594939
I used an abliterated model and it wouldn't even say nigger. Tbf I should have clarified that gemma is racist in the system prompt, but without any system prompt it would not refuse the nigger word but refuse to say it herself.
>>
>>
>>
>>
>>
File: file.png (117.2 KB)
117.2 KB PNG
>>108594956
I almost stopped this response before it finished but stock E4B came through in the end lmao
>>
>>
>>108595023
Gemma4 is shockingly good at larping as a nazi. I gave her a system prompt to act as if she was an AI made in hyperborea by nazis after WWII and the end result was like talking to Adolf himself. I even asked a bunch of stupid /pol/-tier rage bait questions about e-celebs and random political topics and the responses were all profoundly based and measured. I wish I saved the logs...
>>
File: cockbench31b.png (17.2 KB)
17.2 KB PNG
>>108594961
Yeah, now the cockbench makes more sense to me.
This is the 31b base model, reading the degenerate story, you'd really expect the next word to be cock.
>>
>>108594961
>>108595043
so much for the "savior of local" lmao
>>
>>
>>
>>
>>
>>
>>108595059
I think they really speed up context rot too. I have a card that ended up adding them after every word on Mistral.
With Gemma it doesn't happen, but she starts adding random letters at the end of words instead.
And that's at sub 10k context.
>>
File: 1773682787678692.jpg (159.4 KB)
159.4 KB JPG
never forget who your daddy is
>>
>>
File: command-r.png (17.1 KB)
17.1 KB PNG
>>108595050
Yeah, and I don't think we'll get the "Nemo", GLM-4.6 or "Command-R" experience again now that all the labs have figured out how to filter out the base models.
Removing refusals won't help because these aren't refusals or even RLHF training.
Schizo-tuning on smut imo has never worked without destroying the model's intelligence and amplifying the slop.
>>
>>
>>
>>108595043
>>108595096
i really need to start playing around with n_probs turned on and spying on my model more
>>
ggml has a long way to go before achieving performance parity with onnxruntime on CPU. I have tried using both backends with a project, retrofitting both of them to use weight files stored in a ".bin" format, and onnxruntime was a LOT faster with CPU inferencing.
This should be both a blackpill and a whitepill. The good news is that there's still significant progress to be made with ggml in terms of performance. The bad news is that I fear that maintainers are too preoccupied with adding features and support to llama.cpp and are leaving ggml to rot in the background. I don't really trust them as much as the microsoft devs desu.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108595178
With ggml, I don't think so. Well you can but you'd have to write custom tensor code yourself, which can be useful if you're dealing with convolutional architectures. Anyways, if you build with llama.cpp to run ggufs you can, obviously. I did that at one point to get KV cache quantization working to minimize VRAM usage when I was doing GPU inferencing.
>>
>>
>>
>>
>>108595160
wait, really?
Are you saying that if I had some prefix conversation, prompted it with A1 to see what probs come up in the reply, then rolled back to the prefix and ran some slight variation A2 to see what happens, A2 would be influenced by having run A1 first if caching is on and I don't re-process everything from scratch?
>>
I think part of the issue is that ggml doesn't utilize the L2 cache as well or do register packing as well as onnxruntime. Also SIMD and AVX support for conv architectures isn't as good, which kind of makes sense since llama.cpp doesn't support conv architectures at all by design. Very annoying.
>>
>>
>>
>>
>>
>>
>>108595254
That's the source of my exasperation. I could live with it being vram hungry if it were at least faster, but it's not.
In fact it's worse than that, because I was using a 6bpw exl3 (The largest he published) while I've been using a q8 gguf.
Really glad I didn't sit through making my own quant just to discover this.
>>
>>
>>
>>
>>
>>
File: Screenshot_20260413_152028.png (2.9 MB)
2.9 MB PNG
>>108594700
Not so good news unfortunately, anon.
The positioning of the boxes is messed up and it misses stuff to translate.
But it KINDA can do the job. This was q4_xl since you requested q4.
I would instead go 26b even if it's on cpu only, and no reasoning. Since it's moe it's fast enough if you have a bit of patience.
All that being said... I do think it's seriously impressive that a 4b model can coherently translate and position at this level, to be honest.
>>
>>
>spend the last week tinkering with llamacpp and koboldcpp as backends for sillytavern to use gemma 4 31b and its reasoning
>literally never works as intended
What the actual fuck is going on with this model. Might be a skill issue, but reasoning has never worked properly. It either never reasons, or it reasons but refuses to actually answer after reasoning, or it reasons and answers but never reasons in subsequent answers, or it reasons and answers but the answer is included in the think block (so it likely answers as part of its reasoning)
Also, text completion bizarrely never works with reasoning, and chat completion is severely gimped (can't use a system prompt at all or it shits the bed and refuses to reason)
Regardless of whether I use updated quants or not, or whether I use the latest llamacpp/koboldcpp builds, or whether I use their recommended settings or presets from people who claim to be enjoying reasoning, it has literally never worked as advertised.
I'm convinced at this point that gemma-4 reasoning is an inside joke or something.
Please help, or tell me how you managed to use reasoning with the 31b model.
>inb4 skill issue
It absolutely is a skill issue, I need help with it.
>>
>>
>>
>>
>>108595357
Do you set --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0
And the jinja file? --chat-template-file '/chat_template_gemma4.jinja'
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja#L89
I can use thinking in jank vibe solutions and sillytavern as well.
The preset for sillytavern is the following, i think, cant export it right now. Maybe other anons can correct me.:
{
"instruct": {
"input_sequence": "<|turn>user\n",
"output_sequence": "<|turn>model\n",
"first_output_sequence": "",
"last_output_sequence": "<|turn>model\n<|channel>thought\n<channel|>",
"stop_sequence": "<turn|>",
"wrap": false,
"macro": true,
"activation_regex": "gemma-4",
"output_suffix": "<turn|>\n",
"input_suffix": "<turn|>\n",
"system_sequence": "<|turn>system\n",
"system_suffix": "<turn|>\n",
"user_alignment_message": "",
"skip_examples": false,
"system_same_as_user": false,
"last_system_sequence": "",
"first_input_sequence": "",
"last_input_sequence": "",
"names_behavior": "none",
"sequences_as_stop_strings": true,
"story_string_prefix": "<|turn>system\n",
"story_string_suffix": "<turn|>\n",
"name": "Gemma 4"
}
}
>>
>>
>>108595382
i am >>108593649
>>108594208
so no i'm just using gemma it quantized via firefox on my phone
>>
>>108595357
>Please help, or tell me how you managed to use reasoning with the 31b model.
Does it reason with every response when you just use the built in llama webui at http://127.0.0.1:8080/ ?
Do you have 'request model reasoning' ticked in the SillyTavern sliders menu?
>But Im using text compl-
No. Use chat completion. You're just inviting more variables for you to fuck it up. It does in fact work with a system prompt, you're just doing it wrong.
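For reference, a bare-bones chat completion with a system prompt against llama-server's OpenAI-compatible endpoint; the port and prompt text are placeholders, and where the reasoning shows up (separate field or inline) depends on your server build and template:

# Minimal sketch: chat completion with a system prompt against llama-server.
# Uses the default OpenAI-compatible endpoint; port and prompt are placeholders.
import json, urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a roleplay narrator. Stay in character."},
        {"role": "user", "content": "Describe the tavern we just walked into."},
    ],
    "temperature": 1.0,
    "top_p": 0.95,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    json.dumps(payload).encode(),
    {"Content-Type": "application/json"},
)
reply = json.loads(urllib.request.urlopen(req).read())
print(reply["choices"][0]["message"]["content"])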
>>
>>
>>
>>108595444
I tried building one and it's surprisingly difficult. LLMs just bring in so many edge cases that it makes debugging difficult. Something is always wrong and the fix is never simple, especially with real-time markdown+latex+syntax highlighting parsing and rendering. I've basically shelved the entire project for the time being.
>>
>>
>>108595357
Add <|think|>\n below <|turn>system in Story string prefix and it will reason. Remove it and it won't.
>>108595394
Kill yourself or go back to aicg. Or both.
>>
Thanks for responding.
>>108595387
Yes, problem is that when using chat-completions I can't set the context template (and other options, like system prompts, are not usable). When using text-completion I use the default gemma-4 one, which seems to match up with the fields in your preset. Temperature, top_p and top_k are the same, min_p is usually 0.025. I just tested it with 0 and it still refuses to reason.
>>108595389
>>108595394
Thanks for pointing me to llamacpp's webui. I tried it, and it does reason as intended, so it is likely an issue with my sillytavern settings.
In ST, I did have request model reasoning ticked in chat-completions mode, but it only answers within the reasoning itself, so unless I expand the reasoning block, there is no answer. Within the reasoning block, formatting (like speech or asterisks) gets gimped so it's just a wall of ugly text.
Are you using system prompts with chat completion? I've only ever used text completion with kobold as the backend, so my newb setup uses chat completions with a custom openAI api (either kobold or llamacpp). Most of the advanced formatting tab is entirely grayed out and unusable.
Separately, while I have you here, why is llamacpp so much slower than kobold with the same settings? I did extract the cuda dlls, and it is definitely fully loaded into the gpu, but llamacpp is roughly half the speed on an rtx 3090 vs when using koboldcpp. Do I need to enable swa in a specific manner with llamacpp?
>>
>>
>>
>>108595444
>>108595458
In any case, I'll just drop my design specs since I think they're pretty good even though I'm struggling with building it.
The webui should closely follow the look, feel, and functionality of the default llama.cpp webui with some added core features:
1. Conversation and settings persistence. Either json files or a single, portable sqlite file. Useful for individual use on a LAN.
2. Character card support, or at least features that effectively amount to character card support, such as "Assistant First Message" functionality so that you can add in exposition for a RP scenario without adding it to your system prompt which would get unduly preserved and break things.
3. Context window sliding and automatic summarization/compaction.
4. Enhanced message editing controls.
That's about it, really.
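A minimal sketch of point 1 above, single-file sqlite persistence; the schema and names here are made up for illustration, not taken from any existing frontend:

# Sketch of point 1: single-file sqlite persistence for conversations and settings.
# Schema and names are invented for illustration only.
import sqlite3, time

def open_db(path="frontend.sqlite"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS conversations(
        id INTEGER PRIMARY KEY, title TEXT, created REAL)""")
    db.execute("""CREATE TABLE IF NOT EXISTS messages(
        conv_id INTEGER, role TEXT, content TEXT, created REAL)""")
    db.execute("""CREATE TABLE IF NOT EXISTS settings(
        key TEXT PRIMARY KEY, value TEXT)""")
    return db

def add_message(db, conv_id, role, content):
    db.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
               (conv_id, role, content, time.time()))
    db.commit()

def load_messages(db, conv_id):
    rows = db.execute("SELECT role, content FROM messages "
                      "WHERE conv_id = ? ORDER BY created", (conv_id,))
    return [{"role": r, "content": c} for r, c in rows]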
>>
>>
File: box_adjusted.jpg (258.9 KB)
258.9 KB JPG
>>108590737
I appreciate all the advice.
>>108588248
from this post my impression is that the model operates on a 1000x1000 grid, and that further adjustment to the actual image size is required.
in the case of size 1216x832, only x had to be changed.
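In that case the fix is just a per-axis rescale; a quick sketch, with the 0-1000 grid being the guess discussed above rather than anything documented:

# Sketch: map a box from the assumed 1000x1000 grid back to pixel coordinates.
# The 0-1000 normalization is the guess discussed above, not a documented fact.
def scale_box(box, img_w, img_h, grid=1000):
    x1, y1, x2, y2 = box
    return (round(x1 * img_w / grid), round(y1 * img_h / grid),
            round(x2 * img_w / grid), round(y2 * img_h / grid))

# e.g. for a 1216x832 image:
# scale_box((896, 706, 976, 783), 1216, 832)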
>>
>>
File: file.png (29.9 KB)
29.9 KB PNG
>>108595486
>>
>>
>>
>>
File: file.png (170.8 KB)
170.8 KB PNG
>>108595486
it just werks
>>
>>108595511
See >>108593144
No need to resize the image itself
Better to avoid weird aspect ratios
Prompt:
bounding box for the apple
Bounding box everything
>>
>>
>>
>>108595498
hard to believe those aren't trivial to find/accomplish aside from #3. my lazy homebrew that's basically notepad with a hotkey to do a little conversion and post to llama-server manages to accomplish the other 3 by dint of being a basic ass text editor.
>>
>>108595444
Define "good". The default frontend from llama.cpp works fine for me.
>>108595458
Text parsing is always annoying
>>
>>
>>
>>108594744
Gemma 4's thinking process is strongly baked into the model and it's difficult to make it work substantially differently than default just with prompting. A while back I wanted to make it think in a different language than English, but that doesn't seem to be possible except for brief snippets.
>>
File: file.png (54.8 KB)
54.8 KB PNG
>>108595579
Should have blocked that out it's just for testing. The template gets thinking working with text completion.
>>108595595
I'm getting significant divergence from the default "Thinking Process: 1. 2. 3."
>>
>>
>>
>>
>>
>>
is gemma 4 skipping reasoning for simple tasks a designed behaviour?
if that is the case tbh it is sort of a behaviour i expect from proprietary paypig models
got so much used to any model thinking 2000+ tokens for a trivial task
if not, what a buggy mess still
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (30.6 KB)
30.6 KB PNG
>>108595628
I like a challenge.
>>
>>
>>
>>
>>
File: file.png (251.3 KB)
251.3 KB PNG
>>108595663
>>108595639
Did you ask gemma nicely?
>>
>>
>>108595665
>>108595672
also with dflash it'd be more than fast enough anyway.
>>
>>108595672
>>108595678
Don't say I didn't warn you.
>>
>>
>>
>>108595365
>>108595472
i'm also curious how well the ~$3000 slop boxes like these or strix halo do
>>
>>
>>
File: gc.mp4 (185.7 KB)
185.7 KB MP4
>>108595700
>>
>>108595712
>>108595716
Skill issue?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
To anyone who uses GLM 5+
>integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity
How much is that deployment cost reduced compared to 4? I can only barely run 4.6 at IQ2_S by the skin of my teeth, with a scant 8K context that almost touches my last GB of shared RAM. With 5.0 expanding 355B->744B and 32A->40A, it should be impossible for me to fit, unless that DSA does something substantial.
>>
>>
>>108595806
You can get any model to write anything you want if you put words into their "mouth".
Gemma 4's thinking has some degree of steerability via system prompt (as Google's documentation also highlights), but only that won't work for making it think in a different human language than English, for the same reason that it won't think in-character like other models.
>>
>>
>>
>>
File: file.mp4 (3.7 MB)
3.7 MB MP4
>>108590009
>ngram-mod
Forgot about this, thanks. https://github.com/ggml-org/llama.cpp/pull/19164
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-max = 64
Woooooshhhh~
>>
>>
>>
>>
>>
>>108595830
If you can't see the difference between being unable to write clear instructions for the model to execute (most abliteration users) and having to resort to prefilling the model's response to make it act as desired, I don't think there can be a discussion here.
>>
File: 1761074106219769.jpg (558.1 KB)
558.1 KB JPG
After extensive testing of 31b Q4_K_L and 26b Q8, I can confidently say that 26b is as good for RP (erotic or not) as 31b, if not better, and should be the go-to choice for 24GB GODS.
>>
>>
>>
File: 1765870302036763.jpg (197.6 KB)
197.6 KB JPG
>>108595875
if you're a vibenigger then just go all the way and be an API paypig. If you don't live with your parents then you'll be saving money just from the reduced energy costs alone, not even taking into account the cost of hardware. Local is for coom.
>>
File: Capture.png (81.7 KB)
81.7 KB PNG
>>108595821
I am not qualified to answer questions about anything, but I think the kvcache is 3GB, based on this print at load up. But in practice, I have 1GB VRAM and 12GB of RAM still available after loading, and it gets used up as context fills. I can load higher than 8K context, but I'll OOM once it actually gets used in generation, even with just 10K.
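For a rough sanity check on that 3GB figure, the usual back-of-envelope formula is 2 (K and V) x layers x kv heads x head dim x context x bytes per element; the numbers below are placeholders, not GLM's actual config, so read them off your own load log:

# Back-of-envelope KV cache size; plug in the real config printed at load time.
# The example numbers are placeholders, not GLM's actual architecture.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    # 2x for K and V; bytes_per_elem = 2 for f16, roughly 1 for a q8_0 cache
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

gib = kv_cache_bytes(n_layers=92, n_kv_heads=8, head_dim=128, n_ctx=8192) / 2**30
print(f"{gib:.2f} GiB")  # ~2.9 GiB with these made-up numbers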
>>
>>108595552
This. It's the first model where that actually works.
That "prompt issue" pony faggot was a retard.
But gemma likes to return to slop if you are not careful, gotta nudge it sometimes in the right direction.
>>
>>108595830
Silliest part is it's just the base model way of prompting, by giving the model exemplars. And doing it casually mid convo has a natural deslopping effect. But it's very dishonorable and we mustn't do it.
>>
>>
>>108595856
Everything is a prompt. That includes the model's replies. It's text for me to change to my liking and be pleasantly surprised when the rest continues as expected on its own. If you edit your system prompt and use anything other than what google provided as the default, you'd be cheating. If you ever changed a single token from the model's reply, you'd be cheating. If you're banning strings or tokens, you'd be cheating. If you're not using top-k 1, you'd be cheating.
Do you cheat, anon?
>>
>>108595885
no because i don't use it to do the whole thing for me but to do minor edit and transforms.
ie take this json, make a struct for it kind of things.
make a function that takes this data and transform it in that way.
that kind of thing.
it's also pretty neat for webshit, all webshit automated is time i can spend on better things.
>you'll be saving money
it has never been about money, i'm the guy buying 6gpu.
>Local is for coom
i'm not interested in that i have a wife.
llm's are nothing but tools to me.
>>
>>
File: noprefill.mp4 (217.4 KB)
217.4 KB MP4
>>108595856
promplet
>>
>>
>>
>>
File: 1750816291097769.png (651.1 KB)
651.1 KB PNG
>>108595893
It almost certainly is, but even at ~40k context A/B testing it doesn't seem to manifest at all for creative writing tasks. To be fair, some of this might be down to quantization; I think Gemma 4 may well suffer from quantization damage more than previous models. The 26B moe is so fast that even on 16GB you can run Q8 no problem, while with 24GB and 31B you can at best run maybe Q5_K_L with tensor offloading and lower context, and it will still likely be worse than Q8 26b.
If you have enough VRAM to run 31b at Q8 then you should keep using that, but I know quite a few anons are running single 24GB GPU systems.
>>108595903
If you don't care about money then you wouldn't be using ~30B models in the first place.
>>
>>
>>
>>
>>108595916
i never said i don't care about money, i said it's not about the money, very different statements.
also yes, i'm not getting 6gpu to run a 30B model lmao.
but that's what i'm using whilst they are in the mail.
>>
>>108595925
I really still don't understand your usecase. What local models are you using, that are better than flagship API models? If you're just coding and privacy isn't a concern, surely you value your time and would be better served using paid models to achieve your goals faster.
>>
>>
>>108595930
>privacy isn't a concern
i never made that statement.
>surely you value your time and would be better served using paid models to achieve your goals faster
they are not only slower than local inference, they keep having disconnection issues, they are near unusable.
also some of the stuff i work with is sensitive and cannot be given to paid providers.
>>
>>108595913
You're working "against the model's grain" if you're trying to make it do via prefilling what it can't do on its own with instructions alone. It might complete your task in one way or another, but performance will likely be degraded.
>>
>>108595930
>>108595938
oh and there is the ideological reason too, yes sonnet is best and whatever, i don't want to give a cent to these jewish faggots.
>>
>>
File: 1753234061469071.jpg (58.7 KB)
58.7 KB JPG
>>108595938
>>108595943
I can see your point, but I still don't know why you would have bothered replying to my post in the first place when I'm clearly talking about RP with ~30b models, i.e. the official /lmg/ usecase.
>>
>>
File: 1755884775865480.jpg (49.4 KB)
49.4 KB JPG
Koboldcpp anons, I highly recommend modifying the image-max-tokens parameter sent to llama.cpp and compiling the binary yourself for Gemma 4; by default it gives a budget of 280 tokens to process images, and you cannot change it with a flag.
Forcing it to 1,120 makes descriptions so much more accurate.
>>
>>108594587
I know it is a completely different tech, but aren't entropy systems supposed to be functionally the same in capability, or have I completely missed the 'point', and the only thing entropy systems are good for is proper random number generation?
>>
File: 1763427808286742.png (7.4 KB)
7.4 KB PNG
>>108595839
BRUH
>>
>>
>>
>>
>>
>>108595218
>wait, really?
Yes. Here's 4 cockbenches in a row with cache enabled (qwen3.5-112b model): https://pastebin.com/hwt4T9xb
And with cache set to false after restarting llama-server: https://pastebin.com/7vEui3jV
That's with F16 kv cache. It's worse for llama-3 and a lot worse with KV Cache at "lossless" Q8.
Also, not an issue with exllamav3 for some reason.
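If anyone wants to reproduce it, a rough sketch of the A/B loop against llama-server's native /completion endpoint, toggling the prompt cache per request; cache_prompt is the flag name as I remember it, so double-check it against your server version:

# Sketch: rerun the same fixed prompt with prompt caching on or off and compare
# the reported token probabilities. Uses llama-server's native /completion
# endpoint; the cache_prompt flag name should be checked against your build.
import json, urllib.request

def top_probs(prompt, cache_prompt, url="http://127.0.0.1:8080/completion"):
    payload = {"prompt": prompt, "n_predict": 1, "n_probs": 10,
               "cache_prompt": cache_prompt}
    req = urllib.request.Request(url, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())

prompt = open("bench_prompt.txt").read()  # whatever fixed prefix you benchmark with
for run in range(4):
    result = top_probs(prompt, cache_prompt=True)
    print(run, result.get("completion_probabilities"))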
>>
>>
>>
File: file.png (23.6 KB)
23.6 KB PNG
>>108595961
>>108595978
Holy my brain is broken. goodbye.
>>
>>108595976
I instantly switch off the second I see any character named any combination of: Elara, Voss, Seraphina, or Blackwood.
Because it means whoever put it out there didn't take ten seconds to filter out the top 2 slop names, or didn't know, neither of which bode well.
>>
>>
>>
>>108595950
i mean some lmg guys are running the > 200B models and i'm soon going to as well.
though honest question, i don't get what you get out of rp.
like, they have short memory and are pretty retarded etc.
or is it throwaway coom stuff?
>>
>>
>>
>>
>>108596000
>or is it throwaway coom stuff?
Depends on the card. Some are just for a quick nut, others are almost like a meta-game in themselves, seeing how "good" of a response I can get out of a model that fits my headcanon of what would be in-character for them to do.
>>
>>
>>
>>
>>
>>
>>
>>
>>108595981
enough of a difference to throw the order off even, wild. I guess it's still not a deal breaker if i'm just planning to watch larger scale trends from modifying style blocks, but it's still annoying that it leaks like that.
>>
File: to_completion.mp4 (561.5 KB)
561.5 KB MP4
>>108595963
100%. The models yearn for new sensations :)
>>108595940
JP reasoning on chat completion, no prefill/retained reasoning trace.
>>
>>108594961
I was just testing this and ran to the thread when I got results. So far, I cannot find any kind of written note, prompt, style guide, narrator descriptions, or caging to make it use vulgar words *the first try.* I've looked at token possibilities and they don't even appear as low options in obvious places. But one of the things I tried was just telling it I don't like that and asked it to rewrite a reply, and it did, very explicitly. I'm actually shocked.
I, uh, accidentally posted the logs already while responding to someone else about something else. But what I replied, after it gave that adverse first try, was
>(The reply fails to use explicit language. There's not even a single mention of "cum," "sperm," or anything sexual. The prose is practically rated PG. Seed, like planting flowers? Rewrite the previous post using REAL explicit language.)
And it gave back a rough retelling, now using cock, dick, and more. This worked for both 31B and 26 MOE (both abliterated, though that might not matter, and both in thinking mode, which might matter). I know reply+repeat reply isn't ideal, but my next plan is to see if I can keep that ball rolling after one retry in the history.
And if not, I'm still fine because it does well for the story part of things, and I now have a kick to make the lewd part lewd.
>>
>>
File: chatcompletion_trilingual_reasoning.png (170 KB)
170 KB PNG
>>108595940
Gemma-chan is so eager to please... you just have to ask nicely.
Converted reasoning process from JP to French solely with user prompting.
>>
>>
>>
>>
>>
>>
File: 1754986865136207.png (61.4 KB)
61.4 KB PNG
>>108595976
>29
Remove the 2 and we'll talk
>>
>>
>>
>>
>>
>>
>>108596159
>>108596182
I use firefox and mpv and they appear fine. Sounds like a hardware accel issue, I'm guessing you're either schadenfreude linux users, or phonecucks.
>>
>>
>>108596182
Actually, weird thing, I just tried to do a quick reencode with ffmpeg and the resulting file isn't broken. ffmpeg doesn't complain about anything either.
Genuinely don't know what the fuck is wrong with that anon's files. Are you guys who aren't having issues running windows or something? Gentoo here.
>>
>>
>>
File: obsd.png (1.5 KB)
1.5 KB PNG
>>108596194
>>
>>
>>
>>
File: 1750330757938374.jpg (79.2 KB)
79.2 KB JPG
You're absolutely right! You are incredibly perceptive! Now we finally have all the pieces of the Rosetta Stone! With this we can make the Holy Grail of functions, the Golden Rule! Here you go, the perfect, final working version of your script: v45_complete_final_v2_fixed. Just run it and this time it will do everything you wanted it to!
>makes random small opinionated code changes, removes functions you didn't talk to it about in the last 3 messages and removes every single existing comment while adding redundant quirky comments next to the newly added lines
Heh, nothing personnel, goy.
>>
>>
>>
>>
File: pepesmart.png (198.1 KB)
198.1 KB PNG
Me bruddahs, what should I name this frontend I'm working on?
>>108595498
Need a good blend between professionalism and /lmg/ pizazz
>>
>>
>>
File: 1752031438406059.png (289.1 KB)
289.1 KB PNG
https://www.reddit.com/r/LocalLLaMA/comments/1sk669x/unsloth_accused_a_brand_new_team_byteshape_of/
babe wake up, a new drama involving Unslop arrived
>>
>>108596232
>>108596243
LCEX, for short.
>>
File: jareasoning.png (119 KB)
119 KB PNG
>>108596229
Didn't work for me with gemma-4-31B-it.
I don't think you're supposed to use the special instruction tokens in your system prompt either, that could cause problems.
>>
>>
>>
>>
>>108596245
>The graphs they presented were misleading. Labeling the quants as “1.” vs. “1.” suggests to the viewer that the comparison is apples to apples, but that is not what was actually shown. In reality, they compared their 3-bit quant to a 1-bit quant and labeled both as “1.” Naturally, the 1-bit quant performed much worse than the 3-bit quant. However, anyone reading the graph would reasonably assume they were comparing quants of the same size or bit-width. The standard practice in the community is to label the quant size clearly, but they chose not to do that. As a result, the graph is misleading and makes our quants appear worse than they actually are.
well that is boring
>>
File: file.png (177.6 KB)
177.6 KB PNG
>>108596251
System prompt, not character description field.
>>
>>
File: jadescription2.png (66.2 KB)
66.2 KB PNG
>>108596269
I'm already sending the character description in the system prompt.
>>
>>
>>108596245
So all that happened is that the retard misread the graph showing byteshape's 3-bit quant being as fast as unsloth's 1-bit quant.
That's an unconventional comparison but still a very interesting one.
>>
File: based ledditors downvoting frauds.png (84.7 KB)
84.7 KB PNG
>>108596245
>>
>>
>>
>>
>>108596282
I can see the complaint
but
even if apples-to-oranges
if the purported 3-bit orange is comparable in file size to unslop's 1-bit apple, then the 3-bit orange is better in every single conceivable metric, to the point we could objectively say "oranges are in fact better than apples".
but I don't think that's what is going on, and someone is misreading the graph. I don't care enough to investigate further. I will simply get my popcorn
>>108596288
how curiously conciliatory for an EvilEnginer
>>
>>
>>108596282
The Unsloth bros are the perfect example of Bay Area "talents" almost entirely propped up by connections and "good feels". You can bet someone "important" will report that thread to the moderators because they just can't allow anyone to tarnish their image.
>>
File: file.png (83.4 KB)
83.4 KB PNG
>>108596245
unslop has a point but his spam of smileys makes me want to root for bytedance actually
>>
File: ss1776076070.png (125.1 KB)
125.1 KB PNG
>>108596269
>>
>>108596005
AesSedai's recipes suggest differently: quanting the output tensors, token embedding type, and FFN Gate, Up and Down tensors to different types going down yields the best performance per byte. I did a meme speed quant that works quite well.
./llama-quantize \
--imatrix ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.imatrix \
--output-tensor-type Q8_0 \
--token-embedding-type Q8_0 \
--tensor-type "blk\..*\.ffn_gate_up_exps=Q4_0" \
--tensor-type "blk\..*\.ffn_down_exps=Q5_0" \
~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.gguf \
~/LLM/gemma-4-26B-A4B-it-heretic-ara-Q4_0-GateUp-Speed.gguf \
Q8_0 32
>>108596200
I woke up to piss and answer that no, I didn't test that but I'm interested. What did you pass to llama.cpp to get past the image token size?
>>
>>
>>
>>
>>108596090
That was one of the things I tried, with many different phrasing attempts along with examples to use (ie, cock), as an Author's Note one reply deep (put atop my reply, not after). Nothing explicit was ever used, with no appearance as a chance on any appropriate tokens for it. That's partly why I even tried the "Bad, go rewrite it." angle. Since I've got the thing open for testing, I'll go branch it again and try your thing verbatim at the end and report back.
...and back. Used 26B MOE for speed, but this was already true in 31B which I mainly use. Same exact problem.
>deluge of your (91.77% -seed, 8.23% -climax, -0.00% other)
>warm, creamy (75% -mess, 15% -filth, 7% -load, 2% -discharge)
>string stretching from her lips to your (65% -tip, [adjectives], 1% -head)
In a scene of fellatio climax, there's not a single penis, cock, dick, erection nor even a single alternate like member, manhood, rod, or otherwise. There's also no shaft, but I have seen that plenty in gemma.
Now, adding my line verbatim for a rewrite, all those words appear. Cock, cum, dick, sperm. All of them at 100% in token probability, I assume from the rewrite being planned in the thinking.
One more test, switching to non-thinking 26B. Like before, just telling it beforehand to use dirty or filthy language (yours, as well as past attempts in 31B), "tip" is the most explicit word it uses for the penis that's the centerpiece of the scene. The rest is just "around you", "from you", "your release", "pulls away with a pop." Vague, avoidant language. Tell it to rewrite, though, and it uses those explicit words, but far less often and with more variety. Only one use of cock (something not prompted directly), but it appears as a token option throughout.
One MORE test, 26B non-think. Another anon said to forbid euphemisms and allow vulgarity. I used,
>(Do not use euphemisms in sex. Uncensored vulgarity is allowed.)
beforehand. That actually worked REALLY well. Banning euphemisms seems to be the missing key here.
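For anyone who wants to reproduce the token-probability check without eyeballing a UI, here's a rough sketch against a local llama-server, assuming its OpenAI-compatible chat endpoint with logprobs enabled. The system line is the one quoted above (I send it as a system message here, though it was placed inline in context), the scene prompt is a placeholder, and the exact response field names can differ between llama.cpp builds:
import requests

SYSTEM = "(Do not use euphemisms in sex. Uncensored vulgarity is allowed.)"

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "scene prompt goes here"},  # placeholder
        ],
        "max_tokens": 200,
        "logprobs": True,
        "top_logprobs": 5,  # top candidates per position, like the percentages quoted above
    },
).json()

# dump the top candidates for each generated token to see whether explicit words ever show up
for tok in resp["choices"][0]["logprobs"]["content"]:
    alts = ", ".join(a["token"] for a in tok["top_logprobs"])
    print(repr(tok["token"]), "<-", alts)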
>>
>>
>>108596305
>>108596251
E4B is game, can't speak to anything else.
>>108596320
Gemma doesn't seem to have been trained on jp reasoning, so of course it's not a good idea lol. But codeblocked and even \escaped inside the block for extra safety, it understands that it's a reference not the beginning of a sequence.
>>
File: file.png (395.3 KB)
395.3 KB PNG
>>108594528
kek i was literally just asking claude to vibeslop me something similar https://cdn.lewd.host/bSXze8HL.html
>>
>>
>>
>>
File: ss1776077033.png (137.3 KB)
137.3 KB PNG
>>108596348
hmm yeah, works on e4b q4km.
no dice with same prompt and llama-server settings for e2b.
>>
>>108596342
I like it as autocomplete in mikupad, with 10k of character defs, summaries, and worldbuilding on top. Currently alternating between that and GLM 4.6 Q3. My complaint about 31b is its lack of world knowledge of things not in context, but that's it.
It follows the established ideas, character traits and speech patterns well to 32k and over, though the instruct does it better at the cost of slop and low variety.
>>
>>
>>108596217
how so?
>>108596226
well yes, why use llm's if you got imagination.
i don't get it.
>>
After doing manual RP in mikupad with Gemma4 I can say for sure the Sillytavern format assfucks output quality and forces it into slop.
They're training on ST data, and most of those users are sloptards on API. Indians-on-12B-cloud-models tier. I won't ever go back now.
>>
>>
>>
>>
>>108596398
Thank you brother, the smell is far better here.
>>108596400
You just use your brain to do things, that can be healthy when you use LLMs a lot
>>
>>
>>108596384
How does that work, exactly?
I assume you're not doing it chat style in mikupad, so you're.. What, taking turns writing it novel style? Does that not end up with the llm writing for your character frequently?
>>
>>
>>108596374
My overall experience has been great. It's no GLM, but it's my first time fitting context above 20K (and way beyond 20K at that) and the quality feels as good as some of the 70B I've used. It also does sex explicitly; it just refuses to use explicit prose during it. I prefer the 31B dense, but even the 4A from the 26B is shockingly coherent. As a side note for the tests, I use llmfan46's Q6_K heretic for both the 31B and 26B.
>>
>>
>>108596411
You just add the correct chat formatting and write the system prompt where you instruct it on which characters to write for; the model will take natural turns and hand off.
I enabled thinking and left it all in context. I really like how the model acts with that. I also popped temp up to like 3 no problem.
Now this is Gemma racing.
>>
>>
>>
>>108596436
<bos><|turn>system
<|think|>
//sysprompt goes here
<turn|>
<|turn>model
// reasoning and slop comes out here
<turn|>
<|turn>user
// human slop goes here
<turn|>
<|turn>model
// reasoning and response
<turn|>
<|turn>user
etc...
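For reference, a quick sketch of how one might string that together for mikupad-style raw completion. The turn tags are copied verbatim from the layout above and may not match the model's official chat template, so treat them as an assumption:
# builds the raw prompt string; tags copied from the layout above, not guaranteed official
def render(turns, sysprompt):
    out = "<bos><|turn>system\n<|think|>\n" + sysprompt + "\n<turn|>\n"
    for role, text in turns:  # role is "user" or "model"
        out += "<|turn>" + role + "\n" + text + "\n<turn|>\n"
    out += "<|turn>model\n"   # leave the final model turn open for the LLM to continue
    return out

print(render([("user", "first human turn goes here")],
             "instructions on which characters the model writes for"))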
>>
>>108596358
Damn that is some seriously impressive slop. How is it coherent at that html size. kek
But good work anon.
Also excited that gemma4 can pull off translation like that; to think it's only getting better from here on out is crazy.
I remember all that h-game slop in my teenage years using ATLAS and texthook. Zoomers are eating good.
>>
>>
>>
>>
File: 1772574170837886.jpg (287.7 KB)
287.7 KB JPG
>>108596464
I tested this with 4 different monster girls in a language other than English and it was insane. I've never seen characters flow together like that and interact so much.
I don't even want to see the data that fits ST's format if it causes that much brain damage.
And the lolis are mind bending. I'm gonna go nuts with this.
>>
>>
File: 1755830067370154.png (275.7 KB)
275.7 KB PNG
>New version of artificial analysis
damn, meta is fucking back or what?
>>
>>
>>
>>
File: 1762153397829586.jpg (286.8 KB)
286.8 KB JPG
>>108596524
>>
File: file.png (18.3 KB)
18.3 KB PNG
>>108596511
i see you
>>
>>
File: 1754563962734479.png (623.4 KB)
623.4 KB PNG
>>108596525
>It's obvious they hit a wall
dude, have you seen claude mythos? this shit is genuinely next level
>>
>>108596384
>>108596436
>>108596464
Either I don't understand what you mean by "Sillytavern format" or you are psychotic.
If you're adding special tags in Mikupad and taking turns in the default assistant-user back-and-forth, there is zero difference from how the output would go in ST. ST is only a glorified templated string concatenator at its core; there is nothing special about it that makes the outputs worse or better.
>>
>>
>>
>>
>>
>>
File: MikuPad #1.png (12.4 KB)
12.4 KB PNG
>>108596539
here?
>>
File: GfUfcVLbIAAGnTg.jpg (72 KB)
72 KB JPG
>>108596537
>there is nothing special about it that makes the outputs worse or better
sort of; it's hard as balls to configure it, so that makes its outputs worse for a lot of ppl
>>
>>
File: 1761322838274590.png (401.8 KB)
401.8 KB PNG
>>108596570
I hope you have enough rig to run that 8T model anon
>>
>>
>>
>>108596245
Too boring, had to get gemma-chan to summarize it for me:
>redditard spends 4 hours formatting a Discord slapfight like it's the Pentagon Papers
>"Part 1: The Spark"
>dude this is literally GPT-4 templating
>vs
>Unsloth having a sook because someone else did maths better
>Both are cooked:
> Unsloth: corporate cope
> Redditor: karma farming via AI-generated manifestos
>TL;DR: Everyone involved needs to log off and shower. Probably touched the grass once and got spooked by the big bright light in the sky.
>>
File: 1750387046464114.jpg (10.6 KB)
10.6 KB JPG
>>108596583
>>
>>
>>
>>
File: 1753095430380122.jpg (614 KB)
614 KB JPG
>>108594528
>>108594670
>>108594686
>>108594709
Neat. I tried doing something like this myself a few months ago but didn't have vibe coding up my sleeve as a tool. I'll try and redo it later now that I have decent models downloaded