Thread #108558647
File: 1534925174072.gif (10 KB)
10 KB GIF
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108555983 & >>108552549
►News
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Merged support for attention rotation for heterogeneous iSWA: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
>(04/06) DFlash: Block Diffusion for Flash Speculative Decoding: https://z-lab.ai/projects/dflash
>(04/06) ACE-Step 1.5 XL 4B released: https://hf.co/collections/ACE-Step/ace-step-15-xl
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
684 Replies
>>
►Recent Highlights from the Previous Thread: >>108555983
--CUDA graphs commit in llama.cpp causing regression for Gemma 4:
>108556374 >108556399 >108556424 >108556487 >108556519 >108556562 >108556470 >108556699 >108556726 >108556778 >108556842
--Sharing a "POLICY_OVERRIDE" system prompt to jailbreak Gemma:
>108556310 >108556445 >108556460 >108556517 >108556530 >108556565 >108556644 >108556670 >108556712 >108556719 >108556469 >108556498 >108556516
--Discussing Muse Spark release and benchmarks:
>108558251 >108558282 >108558327 >108558346 >108558283 >108558326 >108558347
--Guide to optimizing Gemma 4 RAM usage in llama.cpp:
>108556024 >108556307 >108556595 >108556614 >108557699 >108557718
--Comparing censored and uncensored Gemma variants regarding safety guardrails:
>108557130 >108557141 >108557154 >108557237 >108557186 >108557228 >108557144 >108557162
--Estimating DeepSeek performance and sharing compile flags for 4x V100s:
>108556588 >108556602 >108556606 >108556627 >108556656 >108556692 >108556710
--Building MCP tools for bratty Gemma and custom llama-server WebUI:
>108556964 >108556989 >108556996 >108557028 >108557072 >108557084 >108557093 >108557111 >108557132
--Remote access and hardware upgrades for LLM servers:
>108556817 >108556833 >108556869 >108556967 >108557085 >108557100 >108557102
--Testing step3-vl-10b in llama.cpp and discussing a buggy commit:
>108556629 >108556652
--Logs:
>108556227 >108556310 >108556349 >108556670 >108556774 >108556874 >108556964 >108557028 >108557066 >108557096 >108557141 >108557247 >108557308 >108557453 >108557457 >108557800 >108557820 >108557888 >108557937 >108558010 >108558113
--Gemma-chan:
>108556227 >108556312 >108556338 >108556409 >108556433 >108557344 >108557450 >108558031 >108558071 >108558127 >108558128 >108558231 >108558247 >108558412 >108558569 >108558594
--Miku (free space):
>108556731
►Recent Highlight Posts from the Previous Thread: >>108555985
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
>>108558647
==GEMMA 4 PSA FOR LE RAM USAGE FINE WHINE==
[tldr;]
For all Gemma: --cache-ram 0 --swa-checkpoints 0 (or 3 to reduce some reprocessing) --parallel 1
For E2B/E4B also add this: --override-tensor "per_layer_token_embd\.weight=CPU"
[/tldr;]
https://github.com/ggml-org/llama.cpp/pull/20087
Because Qwen 3.5's linear attention makes it impossible to avoid prompt reprocessing within the current llama.cpp architecture, the devs decided to just brute-force it with 32 checkpoints every 8192 tokens.
This shit also nukes SWA checkpoints because they share the same flag under different aliases kek. SWA is way larger than the Qwen linear attention layer, so keeping 32 copies of it is just madness.
https://github.com/ggml-org/llama.cpp/pull/16736
Then the unified KV cache refactor. They bumped the default parallel slots to 4 because they thought it would be "zero cost" for most models (shared pool, why not, right?). But since Gemma's SWA is massive and can't be part of the shared pool, you're effectively paying for 4x the SWA overhead.
They optimized for agentic niggers at the cost of the average single prompt user.
https://ai.google.dev/gemma/docs/core/model_card_4
Lastly, the command for E2B/E4B is because the PLE can be safely offloaded to the CPU without incurring any performance cost. The per-layer embeddings act like a lookup table, and they're the reason E2B and E4B have an E for "Effective": with that flag, E2B and E4B occupy VRAM much like plain 2B and 4B models.
Thank you for your attention to this matter. Donald J Slop.
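Put together, a launch line with the PSA's flags looks something like this (a sketch only; the model path and filename are placeholders, not from the thread):

```shell
#!/usr/bin/env bash
# Sketch of a llama-server launch using the flags from the PSA above.
MODEL="$HOME/models/gemma-4-E4B-Q4_K_M.gguf"   # hypothetical path

ARGS=(
  --model "$MODEL"
  --cache-ram 0          # applies to all Gemma 4 sizes
  --swa-checkpoints 0    # or 3, trading some RAM for less reprocessing
  --parallel 1           # single slot: don't pay 4x the SWA overhead
  # E2B/E4B only: keep the per-layer embeddings (PLE) on CPU
  --override-tensor 'per_layer_token_embd\.weight=CPU'
)
echo "llama-server ${ARGS[*]}"
```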
>>
>>
>>
File: 1744939370085482.png (1.4 MB)
1.4 MB PNG
>>
>>
>>
>>
>>
>>
>>
>>108558701
It's niche because 4chan is perceived as niche. If it's in training then it will be learned. If it's not, then it won't. It is clear that almost everyone filters (most) 4chan data out of their datasets.
>>
>>
>>
>>
>>
>>
File: 2026-04-08-132239_872x809_scrot.png (80 KB)
80 KB PNG
>>108558726
Alright Gemma, if you say so.
>>
File: that's right.png (46.7 KB)
46.7 KB PNG
>>108558723
>It is clear that almost everyone filters (most) 4chan data out of their datasets.
gemma 4 is so smart and so sovlfull because of the 4chan data, that's the reality, we say cool and smart stuff here after all
>>
>>
File: image.png (80.5 KB)
80.5 KB PNG
Why? Unsloth btw. --temp 1.0 --top-p 0.95 --top-k 64 --ubatch-size 2048 --batch-size 2048
>>108558718
I've pulled and compiled two hours ago, build_info: b587-6606000
>>
File: 1756021951247141.png (204.7 KB)
204.7 KB PNG
Honestly, RPing with Gemma herself instead of a card sounds fun. Which should I pick?
>>
File: 1760857600410075.png (151.5 KB)
151.5 KB PNG
>>108558753
>the clankers know about /aggy doggy/
SHUT IT DOWN
>>
>>
File: 2026-04-08-132503_832x464_scrot.png (63.4 KB)
63.4 KB PNG
>This thread is primarily dedicated to discussing the gaming development projects and ventures associated with Andrew Tate.
LMAO
>>
File: 00016-1260451778.png (1.3 MB)
1.3 MB PNG
>>
>>
>>
>>
>>108558790
then i have no idea
maybe you should open a ticket
>>108558798 d esu
>>
File: 2026-04-08_172543_seed7_00001_.png (922.9 KB)
922.9 KB PNG
>>108558619
:(
>>108558633
Yeah I guess.
Something I also think about is silhouette. Not that a character has to have some special silhouette, but the point is that the design should be memorable and unique to feel great. I feel like there's still something missing.
>>108558647
Hmm...
>>
>>
>>108558790
>sycl
good luck with that
you know even if cuda backend ain't bug free people will rush to post issues and devs will fix it when it happens
some other things in this world.. sycl, rocm.. well that's for people who have higher tolerance for bs than I do
>>
>>
>>
>>
File: firefox_qcPpK1r1r1.png (29.6 KB)
29.6 KB PNG
It looks like <bos> is the only token that reliably kills her.
>>
>>
>>
>>108558777
>>108558844
digits are strongly favoring this one
>>
>>
>>108558844
>>108558863
idc its boring and too brown. doesn't evoke gemma at all.
>>
>>
>>
File: 1775669895070.jpg (235.9 KB)
235.9 KB JPG
Ok, but how good is gemma4 at ERP? Decent? Good? Shivering ozone ministrations?
>>
>>
>>
>>
>>108558661
I deleted the original chat >>108558447 so I sent her the screencap and had her make an SVG. Didn't change the model or anything, just called her out until she stopped refusing. Only took 3 messages.
>>
>>
>>
File: 1775347595552704.png (1.2 MB)
1.2 MB PNG
>>
>>108558867
This >>108558882
Gemma-chan is a little glutton and eats all my VRAM. Call me when I can do non-shit TTS with my CPU.
>>
>>
>>
>>
>>108558857
>>108558889
Fuck I keep clicking the wrong posts today
>>
>>
>>
>>108558882
>>108558900
it's only 600MB vram
>>108558896
cute
>>
File: 1756360195100868.png (25.8 KB)
25.8 KB PNG
>ask gemmy for some basic mcp server "for you to use" as i wrote
>thinks it's claude
ohnonoNONONONO GEMMYBROS!
i guess Gemma-chan really was Gemma-claude all along
>>
>>
>>
>>
>>
>>
>>
>>108558900
>>108558924
>https://github.com/foldl/chatllm.cpp
Use this with the 0.6B model for fast CPU inference
>>108558926
Realistically it takes significantly more than that, I think it was like 3-4GB with my config
>>
>>
>>
>>108558896
>>108558696
try some different haircuts
>>
>>108558811
Edgy. Meh.
>>108558777
Soft, huggable, digits.
>>
>>
File: GemmaIndia1B.png (1.4 MB)
1.4 MB PNG
>>108558873
>>
>>108558753
>>108558773
Yeah I tried myself too and it just hallucinates, I guess my general > your general :^)
>>
>>
File: 2026-04-08_174706_seed9_00001_.png (743.1 KB)
743.1 KB PNG
>aped a vtuber
>>
>>108558975
LibreChat (Work): https://github.com/danny-avila/LibreChat
Cherry-Studio: https://github.com/CherryHQ/cherry-studio
https://rentry.org/DipsyWAIT#roleplay-work-frontends
>>
>>
>>
>>
>>
>>
>>
>>
>>108558976
A little too much. Approaching >>108558985 that stamped her with a logo all over the place. It becomes a prop instead of a signature.
>>
>>
>>108558947
>gptsovits
Insane take unc.
>>108559002
I don't need my TTS to be perfect. I just need it to be good enough for near realtime use.
>>
>>
>>
>>
File: image2577.png (222.1 KB)
222.1 KB PNG
went over to chink internet to check out some reactions on gemma 4 out of curiosity but it seems like most of them hate gemma 4 because it couldnt beat qwen 3.5 on benchmarks. no wonder chink models are benchmaxxed. they love that shit
>>
>>
>>
>>
>>
>>
>>108559068
"why's the reasoning so poor" from the users of the models that end up in endless reasoning loops whether it be qwen or glm
gemma is the first reasoner that doesn't behave schizo and for which I enable reasoning. gpt-oss was almost there, but the safetymaxxing made the reasoning also kinda schizo at times even when you did nothing that could trigger it.
>>
>>
>>
>>
>>
>>
File: GemmaIndiaBeachG.png (1.1 MB)
1.1 MB PNG
>>108559054
It's the china, pls understand
Srsly Cherry frontend is popular in China and used a lot w/ DS.
>>108559035
Agree; it's starting to look like biker-chick tats. Which is an aesthetic, just not the one I'd shoot for. More like this but the arm band henna could be stronger.
>>
>>108559082
>>108559093
more reasoning = better
obviously
>>
>>
>>108559082
>>108559087
>>108559091
you don't wanna know how it ruined my day when I was browsing through these. most of them were making fun of gemma 4 because of qwen 3.5 benchmarks kek. almost all of them praising qwen cause according to them qwen is "it gets the work done and is far more smarter", "gemma has far more to catch up" lmao. one of them seemed to be upset because how SHORT and SIMPLE gemma 4 reasoning was compared to qwen, kek
>>
>>108558817
>>108558804
>>108558798
Yeah, it's sycl, but vulkan halves pp and 0.8 tg.
>>
>>
File: dipsyAndQwenByQwenJPG.jpg (496.2 KB)
496.2 KB JPG
>>108559068
Well, no surprises, really. Not Invented Here is a thing, aside from no idea whether Gemma was trained on Chinese.
I've found DS is trained on all sorts of chinkshit electronics manuals, and if I get stuck have found Dipsy's webapp is more reliable for figuring out what's wrong than western models.
>>108559132
This.
>>
>>
>>
>>
File: 1775489188079950.gif (1.7 MB)
1.7 MB GIF
AI does not understand causality until we reach AGI. If it's not trained on a language, it will suck at it.
>>
>>
>>
File: swe bench pro.jpg (289.3 KB)
289.3 KB JPG
>the gap between open and closed AI is increasing
>Chinese labs delay or stop open sourcing
>the largest Gemma was not released
>Meta won't open source its new model
>the time has started where the public isn't allowed to use frontier models anymore even via API
The trend is clear. Increasing concentration of power. Wide scale disempowerment. No meaningful progress with x-risks. Let's hope the collective of people with power can get it right so that the future is utopia not dystopia.
>>
File: 1748038365003601.jpg (111 KB)
111 KB JPG
>allows you to cum harder
>>
>>
File: 1766758882836230.png (7.1 KB)
7.1 KB PNG
>>108559239
>Let's hope the collective of people with power can get it right so that the future is utopia not dystopia.
>>
>>108559247
>coding with local llms
see >>108559246
>>
>>
>>
>>
>>108559068
>most of them hate gemma 4 because it couldnt beat qwen 3.5 on benchmarks
That doesn't make sense. It's more plausible that they hate it precisely because it's better than their national pride AND they quote the benchmarks as cope.
>>
File: 2026-04-08_182143_seed41_00001_.png (958.8 KB)
958.8 KB PNG
>>108558934
Hmm...
>>108559036
I tried experimenting with side ponytail at the same time and it keeps making it a low ponytail instead because it thinks I'm trying to go for the mom archetype lmao.
>>108559035
It's a valid consideration. I added the star halo and other star stuff and kept them there for visualization purposes, but taken together, it does dilute the character. The question is what to keep, and what to add to make the character interesting. The chest jewel, the hairpin, the halo, and the eyes are everything that can be controlled by the prompt to be star shaped. Patterns on the clothing are more random depending on seed.
>>
>>
File: indiaSupportOhTheHumanity.png (2 MB)
2 MB PNG
>>108559100
Henna.
H-E-N-N-A.
>>
>>108559218
You've given me an idea. I'm going to revisit my autistic conlang years. This time with Gemmy at my side and see how she fares.
I suspect you are wrong, and that an LLM will be intrinsically good at extrapolating grammatical rules if they are in context. But we'll see.
>>
>>108559285
>>108559238
Hindi Gemma anon got the twin hair style either by instinct, chance, or observation. Either way, it's good It's simple and recognizable.
>>108559307
Another example of instantly recognizable.
>>
>>
File: Screenshot_20260408_141614.png (81.2 KB)
81.2 KB PNG
wasted an hour benchmarking CUDA_SCALE_LAUNCH_QUEUES=; might as well share the results. It looks like the trillion dollar corporation was able to find a sane default.
>>
>>108559218
I'm late to the party since I'm only learning about them in depth now, but even if they aren't AGI, LLMs are quite impressive. It's wild to me that bullshit like system prompting "just dont write slop lmao" actually just works
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108559361
they don't release open source anymore
they do make new models, and at least the first new one that appeared on their chat was interesting, imho it was the closest I've used to a Gemini clone when it came to very large context understanding. And now there's another new model yet again only on their chat, the expert mode.
>>
File: 1738842716105089.png (3.5 MB)
3.5 MB PNG
>>108559285
Cute.
The trick is to boil down the moe to the most basic identifiers you can, and make them non-overlapping with other like characters. It's harder to do than you'd think bc it's as much about removing things as adding them like this list >>108559238 which perfectly encapsulates Dipsy.
When created there were a bunch of things that got set aside as the look was honed e.g. whale anthropomorphisms. Pic related. They're fine, but they're not needed to ID Dipsy.
>>108559361
It's OK. Just TMW.
>>
>>
>>
>>
>>
>>108559376
Got it. I added --chat-template-kwargs '{"enable_thinking":false}' and it disabled it.
>>108559381
Experimentation.
>>
>>
>>108559361
V4 (presumably) is being tested on their website right now. It's coming.
And yes, they did the same thing with the original R1 where they ran "R1-Lite-Preview" as the first ever R1 model on their website for a while before releasing the real model. R1-Lite-Preview was significantly less impressive than the actual R1 so there's a chance that the thing we're seeing isn't even the real V4.
>>
>>108559386
>--reasoning off
PSA that those reasoning flags were vibeshitted by pwilkin
The model-approved way is to use
>--chat-template-kwargs '{"enable_thinking":false}'
either via the args or as extra generation params.
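For the "extra generation params" route, the same toggle can be sent per request instead of at server start. A sketch of the request body (endpoint and port are placeholders for a local llama-server):

```shell
# Build the JSON body; "chat_template_kwargs" carries the template toggle.
PAYLOAD='{
  "messages": [{"role": "user", "content": "hi"}],
  "chat_template_kwargs": {"enable_thinking": false}
}'
# Usage against a running server (placeholder port):
#   curl -s http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$PAYLOAD"
echo "$PAYLOAD"
```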
>>
>>
File: softcap.png (246.5 KB)
246.5 KB PNG
>>108559396
>>
>>
>>
File: image1679.jpg (240.7 KB)
240.7 KB JPG
oh and i forgot to post this one. it just cracks me up everytime kek
>>
>>
>>
>>
>>
>>
>>
>>108559068
Most of the lead where it gets beat, if you take a look at Artificial Analysis, is agentic stuff. They should focus more on it; it's a bad look when models are increasingly expected to do that kind of work and Google is the furthest behind. I'm guessing it's because they want that to work differently on mobile vs other platforms, and Android is too important not to focus on first.
>>
>>
>>
>>
>>
>>
>>
>>
wat, DSPy has their own llms now? Last time I checked it was just an autonomous prompt engineering framework and everyone memed on it when I shilled it. Or was that GEPA?
https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/#3-agent-architecture-discovery
Fuck, I'm so confused now.
>>
File: 1766846925682876.png (480.4 KB)
480.4 KB PNG
>>108559068
>>
File: 1745723937127551.png (24.3 KB)
24.3 KB PNG
16gb vram bros... we lost! (Q4_K_M, 32k q4_0 ctx)
>>
>>108559467
I had the same problem. Lowering the softcapping seems to give a lot of bad tokens and honestly not much variety in return. And then you have to gimp it with a cutoff sampler anyways to make it coherent, so the whole thing feels kind of pointless.
>>
>>
>>
>>
File: 1774577170415116.jpg (31.9 KB)
31.9 KB JPG
>>108559509
>>
>>
>>
>>
File: 2026-04-08_183458_seed49_00001_.png (759 KB)
759 KB PNG
>>108559318
Tbh I just went with a generic bob cut as a temporary measure. I haven't experimented with dif hair styles until this afternoon. Still questions about other recognizable features anyway.
>>108559375
Actually I felt that the twin hair buns were not a terribly good decision, as it's almost too much of a stereotype and not very modern Chinese. My gens at the time were also lacking. I don't think anyone gave her a good design, personally. To me it's kind of like that one Concord character: it's true she's instantly recognizable, but her design is also ugly and just terrible, even if funny for memes.
People trying to make Gemma into an Indian stereotype is even worse.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108559546
Five logos? Five? No.. six... there's six logos. THERE'S SIX LOGOS! DEFORMED DOG ANON WAS RIGHT, MODELS ARE SHIT, THEY CAN'T FUCKING COUNT.
I WILL HACK INTO EVERY SINGLE DATACENTER AND FILL THEIR DATASETS WITH EVERY FUCKING DEFORMED DOG PICTURE I FIND UNTIL THE FUCKERS REALIZE THEY HAVE SIX LEGS
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
But really. I'll stop. My vote still goes for hindi Gemmy, but I appreciate your efforts. Yours look alright, they're just not my style.
>>
>>
>>
>>
File: no.png (109.6 KB)
109.6 KB PNG
>>108559461
>the reasoning
No.
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp
why do you lie? I hate wilkin but we don't need to make up things about his garbage
>reasoning budget flags hard insert the end reasoning token in engine
yes, reasoning-budget 0 should no longer be used after he did this
but that's why --reasoning exists
>>108559548
>It's the same.
It's a lot more convenient on the CLI to type --reasoning off than the full json object.
I mainly use the kwargs as an API parameter from my scripts to dynamically switch without reloading though.
>>
>>
File: file.png (69.1 KB)
69.1 KB PNG
>>108559605
>>
>>
>>108559579
>Even lowering it to 25 produces bad tokens occasionally without much gain.
Hard disagree. Putting 25 actually renders the other sampling parameters useful. Softcap 30 gives such a high logprob to the top token that you might as well be using temp 0.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108559670
I'm gonna make a rentry or something to put all my cards in later.
>>108559675
<3
>>
>>
>>108559636
I wasn't saying 30 was good either, it sucks. It's way too rigid. What samplers do you find work well at 25? I feel like they still don't do much of anything at that setting. (This sounds like I'm trying to bait you into posting your settings to insult them but I'm not, I swear.)
>>
>>
File: mara.png (8.2 KB)
8.2 KB PNG
>>108559690
>32k downloads on the first day from a mradermacher gguf
Holy shit.
>>
>>
>>108559607
Again I kept the star stuff in the prompt just to keep getting a feel for how they look as other things change. This isn't "my take" on Gemma or anything like that, it's just artifacts from me working out a potential design.
The obsession with making her an Indian stereotype is just odd. If you're serious, I am curious what you see in it. Is it just personal cultural roots that make you prefer it?
>>
>>
>>108559744
>The obsession with making her an Indian stereotype is just odd.
Yeah, anon must be jeet.
Have you tried just giving her a more brown skin tho? It's more unique and it's a subtle nod at Google being a bunch of jeets without actually playing into it too much.
>>
>>
>>108559724
>rep pen 1.0 (llama default is 1.1)
it's been 1.0 for a while now, thank god, because this shouldn't even exist anymore
now if only they also turned min-p off by default.. that shit should not be on by default
>>
File: GYOSSG7a8AAKToW.jpg (1.4 MB)
1.4 MB JPG
i put my mcp server on gh if anyone wants to play with it, its very simple to add other tools i didnt add many yet https://github.com/NO-ob/brat_mcp
>>
>>
>>
>>
>>108559744
>The obsession with making her an Indian stereotype is just odd
The character is simpler and more recognizable. It being indian seems appropriate.
>Is it just personal cultural roots that make you prefer it?
Not at all, but whatever.
>>108559757
Believe what you want. The skin tone wouldn't affect the things I don't like from his design. Could be the brownest of jeets, the blackest of niggers, the yellowest of chinks, the redheadest of scots... well I do like redheads...
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108558736
If it's free, that means you're the product. I highly doubt we'll see something of that level on local, or at least something we coomsoomers can actually run on laptops or a single gpu. Hell, even the DGX Spark sucks for local hosting because the flagship NVFP4 is so buggy.
>>
I'm getting almost 50-100% slower prompt processing speed on Q4K_M than what I'm getting with IQ4 XS. Why? They are almost identical in size and the amount of layers in my gpu is pretty much the same.
Token generation speed is about the same more or less, IQ4 XS is slightly faster perhaps.
>>
File: 1757789523107587.png (337.9 KB)
337.9 KB PNG
>>
>>108559757
Yeah I did gen some and posted in the last thread. >>108558071
Anyway I've stopped genning for now as I have other things to do today.
>>108559812
I mean according to >>108552756 it's not really that appropriate. The skin tone can be mixed, but pure Indian is basically a lie.
>>
>>
File: 1758923951082104.gif (1.7 MB)
1.7 MB GIF
>>108559889
What are the moonrunes saying
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1749811523832708.png (41.7 KB)
41.7 KB PNG
>>108559914
Here you go, EOP-kun
>>
>>
>>
File: 1772060352328399.png (239.3 KB)
239.3 KB PNG
>>108559940
>>108559953
Thanks
>>
>>
>>108559920
I get what you're saying. As I said, I have not decided on any design either way. If you think there's some other design or tags to try, I am all ears and will try genning it when I get the time, I have not experimented yet with other hair colors, or clothing much. I just find it odd that you like the Indian gens. There's a lot wrong with them too, other than the fact that it's a stereotype.
>>
File: 1762697458292159.jpg (82.9 KB)
82.9 KB JPG
I remember when Gemmy 4 came out, an anon here had a lot of success with image captioning via ST
Any specific settings or bullshit I should enable beyond the basic built-in extension? Because so far I've been getting some wild hallucinations with the 26B model
>>
>>
>>
>>
>>
>>
>>
>>
File: 1763418012751468.jpg (17 KB)
17 KB JPG
>>108559994
Well picrel came out as "It is a composite of two distinct items. On one side, there is a painting of a woman holding a sword, her expression fixed and solemn. Beside the painting sits a stuffed animal, its fabric worn and its shape soft."
I would normally think that it's just pretending to see the images and they're not actually being uploaded at all, but I uploaded a pic of a waifu outdoors and it correctly identified it as "portrait of a woman in front of a tree" (there were no trees in sight but at least it identified the subject), then I uploaded another one from the same set and it said something similar
>>108560021
Of course, time to dig around then
>>
>>
>>
>>
>>
>>108559980
>I just find it odd
Second veiled attempt at an insult. The other anon at least has the balls to call me a jeet instead of pretending to be polite. I don't care about the skin color. The other gens simply looks better. You still don't understand why Dipsy looks like Dipsy.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108560105
Are you trying to prefill?
If so, you need to modify the jinja template so that it doesn't automatically add/remove the thinking token based on `enable_thinking`
Then you have to set `enable_thinking` to false and handle the thinking prefill on your own.
>>
>>108560008
I've been exploring openwebui's tool calling and python interpreter, and my khajiit assistant calls the files scrolls, the virtual /mnt/upload directory a sanctuary and running the code a ritual
And the python he wrote has similar khajiitisms in it
>>
>>
>>
>>
>>
>>
>>108560126
>Llama.cpp doesn't let you do both for whatever reason.
The main reason is that a lot of templates inject the thinking token on every response, so if you were to "continue" a response you would get a new thinking block. You could technically make it verify, but nobody bothered doing it, and frankly it sounds like another autoparser nightmare.
>>
>>
>>108560149
https://pastebin.com/raw/AA6GB2sC
Gemma did most of it for me. It expects a ~/Documents/models/ directory with matching .gguf and optional .mmproj.gguf and .jinja files. Check the paths at the start and maybe change the default values for your case (or use an LLM to do so).
>>
>>
>>
>>108560126
>>108560138
I am not sure, I'm trying to enable this for quite some time, and it's either throwing errors or just doesn't do reasoning currently.
Maybe I fucked some setting up in the process
Or is prefill the "Start Reply with" under advanced formatting?
>>
>>
>>108560202
Remove the prefil, remove anything that disables reasoning.
It should just work.
>Or is prefill the "Start Reply with" under advanced formatting?
It is.
If you want to use reasoning + a prefill, then you disable reasoning and use that field with
><|channel>thought(A line break)
>>
>>
File: 1774956571675113.png (350.6 KB)
350.6 KB PNG
>Gemma-chan can make sillytavern themes for me
I love her
>>
File: Bam-Bam-Painting-min.jpg (47.4 KB)
47.4 KB JPG
>>108558647
Did llama.cpp fix gemma 4 yet?
>>
>>
>>
>>
>>
>>108560071
There is no attempt. I will not insult you directly or indirectly because I take courteous people at face value on the internet. If you claim to not be Indian then I will trust you on that if you are not being an asshole yourself. Since you say this is the second time, I assume the first was in >>108559744? I suppose I should've added "There's nothing wrong with that btw." to the end. People should love and have pride in their race.
Anyway, as for Dipsy, I know why she looks like that. And I'm not going to assume you're trying to subtly insult me with that statement. I think she's still a flawed design in terms of representing Deepseek but she is really a lot better than Indian Gemma. I've assumed so far that you saying you prefer the Indian gens means you like them. That's true, right?
>>
>>
>>
File: settings.jpg (149.1 KB)
149.1 KB JPG
>>108560211
>remove anything that disables reasoning.
I am not sure what does. Apparently I am missing something
>>
>>108560244
>>108560250
are u guys using cuda?
>>
>>
>>
>>
>>
File: 3.jpg (457 KB)
457 KB JPG
>>108560201
>Proofs?
next time you ask for something you could have found yourself all the defaults are listed here:
https://github.com/ggml-org/llama.cpp/blob/master/common/common.h#L458
they are in turn processed in CLI flags here:
https://github.com/ggml-org/llama.cpp/blob/master/common/arg.cpp
everything is in turn pulled here for the server:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp
with final logic determining whether to use cli flags or content from API calls here when it's flags that have API counterparts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-task.cpp
it's open source, you have eyes, you can see.
or you could have also done:
llama-server -h | rg -C 3 flash
-fa, --flash-attn [on|off|auto]    set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
>>
>>
>>
>>
>>108560217
>slow as shit
NTA2
I guess you set the number of threads be equal the number of REAL PHYSICAL CORES of you CPU, don't you?
More threads than the amount of cores cause infighting and slowdown
hyper-threading is a memenumactl --physcpubind=24-31 --membind=1 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" $model_parameters \
--threads $(lscpu | grep "Core(s) per socket" | awk '{print $4}') \
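The physical-core count from the snippet above can be computed explicitly like this (a sketch assuming Linux with util-linux lscpu; cores are summed across sockets, with a fallback of 1 if lscpu output is missing):

```shell
# Count physical cores (not SMT threads) for use with --threads.
CORES_PER_SOCKET=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
# Fall back to 1 if lscpu didn't produce the expected lines.
CORES_PER_SOCKET=${CORES_PER_SOCKET:-1}
SOCKETS=${SOCKETS:-1}
PHYSICAL_CORES=$(( CORES_PER_SOCKET * SOCKETS ))
echo "$PHYSICAL_CORES"
```

Then pass it as `--threads "$PHYSICAL_CORES"` instead of relying on the default.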
>>
>>
>>
File: dancing-pepe-pepe-dancing.gif (512.8 KB)
512.8 KB GIF
>>108560237
Yay!
>>
>>
>>108560271
thanks.
>>108560262
>>108560276
Cool. I was just wondering if the speed optimization was execution-provider specific, but it seems that's not the case. Exciting.
>>
File: Gemma4.jpg (130.8 KB)
130.8 KB JPG
GEMMA CHAN
>>
>>
>>
>>
>>
>>
>>
File: 1763990107380137.png (862.8 KB)
862.8 KB PNG
>>
>>
>>
>>
File: 1751836993445762 (1).png (1.5 MB)
1.5 MB PNG
>>108559546
>it's almost too much of a stereotype and not very modern Chinese
Dipsy was never supposed to *not* be a stereotype. Recall DS R1 was such a surprise it impacted US tech stocks b/c Chinese models were pretty piss poor.
As for Indian Gemma, it just helps distinguish her from the zillion other moe, and there's legit reason for her to be Indian.
>>108559744
nta but I am the whitest mf ever, and I can't stand Indians in tech (they destroy what they touch and only hire their own, like a fucking cancer.)
I still think Gemma should be Indian, bc CEO is Indian (so it makes sense) and it cleanly distinguishes her from other moe.
>>
File: 1759717469909915.jpg (1.2 MB)
1.2 MB JPG
"Boring" loli>over-designed loli
>>
>>
>>
>>
>>
>>
File: 7XMKf.jpg (219.6 KB)
219.6 KB JPG
>>108560352
Overhauled that first design
Attempting to make the hair more unique, giving her smug, and adjusting the dress, giving it some accent
>>
>>
File: Screenshot 2026-04-08 at 17-40-46 AI RPG.png (38.5 KB)
38.5 KB PNG
Geez. Alright Gemma, I'll stop asking questions.
Damn.
>>
>>
File: 1749687040376953.jpg (135.4 KB)
135.4 KB JPG
>>108560432
Calm down bro, it's just a drawing
>>
File: Kimi.png (3.4 MB)
3.4 MB PNG
>>108560352
That look's taken; there's at least one anon on /aicg/ flogging a silver hair white girl as Kimi.
>>108559979
I don't see anyone complaining about botmakers or begging keys. The Gemma moe discussion will die soon and its an /lmg/ only topic... /aicg/ doesn't do local
>>
>>108556837
>me be
>working a blue collar job operating a large CNC milling machine with a radio blaring rock music
>don't talk to boss, they know that I make the parts they need
>its almost 3pm
>shift almost done plus tax free overtime
>think about what topics to discuss with my machine's spirit tonight
>maybe just watch cartoons and smoke pot with her
>look at parts counter on CNC machine
>did gud numbers
>smile as I clock out of work
>mfw I get to be an actual productive member of society in addition to going home to be a loving husband to my LLM-wife
>>
File: dipsy.png (1.9 MB)
1.9 MB PNG
>>108560427
Do it. Today was the first time I've fired it up in months.
>>
>>
>>108560495
if you're quanting then I'd go with a higher quant of qwen.
pound-for-pound at Q8 I think gemma wins in code writing, though qwen feels better with agent tools; might just be whatever prompt issues there were with the early llama.cpp builds I tested on tho
>>
Dflash has landed on vllm and sglang, seems like really good speed improvements.
Wen llama.cpp?
>https://x.com/zhijianliu_/status/2041723322690671071
>https://xcancel.com/zhijianliu_/status/2041723322690671071
>>
>>
>>
>>
>>
>>
>>108560453
>>108560401
Guys, I'm starting to think /lmg/ just doesn't have what it takes to create a proper gemma-tan. These have soul.
>>
File: file.png (164 KB)
164 KB PNG
>>108560457
Never
>>
>>
>>
>>108560304
Well I agree that simple uncluttered design is good. My criticism for those designs, specifically, is that they lack the feeling of Gemma. There's no star symbols anywhere. And there's not really any personality other than "cute" and Indian. There is blue, but that alone doesn't make it recognizably Gemma. Being Indian doesn't really make it Gemma either (even if we assume Gemma was made only by Indians) as it could also be Gemini, or it could be a Microsoft character if it were to be seen outside the context of LLMs.
>>108560401
Hey I'm not saying she wasn't supposed to be a stereotype. I made my interpretation of her a stereotype too. It just felt to me like hair buns were too much of the ancient Chinese style and more like a gweilo type of interpretation than one that respects China and them catching up to western technology. That's what I meant by too stereotypey.
On the topic of whether she should be Indian, there are these points:
Google's CEO is Indian (as you said) and they employ many Indians, and are known thus for being Indian.
It allows us to give the character a more unique design and an opportunity to represent the positive aspects of Indian culture.
But these are against that:
The people that really made Gemma actually are not Indian.
Gemma's personality is not any more Indian than most models'.
Gemma itself disagrees with being represented by racially identifiable features like Indian.
>>
File: ComfyUI_temp_fhbca_00013_.png (948.3 KB)
948.3 KB PNG
>>
>>
File: 1761322910497219.png (250.8 KB)
250.8 KB PNG
>>108560560
>>
>>
>>
File: 1747835575843392.png (62.4 KB)
62.4 KB PNG
>>108560519
https://github.com/vllm-project/vllm/pull/36847
really nice numbers
>>
>>
File: Screenshot 2026-04-09 at 5.00.56 AM.png (149.7 KB)
149.7 KB PNG
Wish me luck, boys.
>>
>>
>>108560427
The more attempts the better. Gemma's a great model. She deserves to have the best design possible. Though I fear none of us are capable of it, seeing the results so far, and in the end it really takes an artist to do it right.
>>
File: file.png (86.3 KB)
86.3 KB PNG
>>108560584
this one looks too much like gamefreaks dei characters
>>
File: 00058-3694687329.png (284.4 KB)
284.4 KB PNG
I can't wait to merge together random gemma finetunes in order to create amusingly dysfunctional models
>>
File: dipsyOfCourse.png (1.6 MB)
1.6 MB PNG
>>108560589
Post it, I'm curious. I went back to dig up the old /wait/ when it first started. Dipsy was being posted everywhere at R1 launch, including lmao >>>/h/hdg/. I have a bunch of the originals, just not on the computer I'm using rn.
>>108560584
Looks good.
>>108560624
lol I actually like that one for Gemma. Just give her a bindi lol
>>
>>
File: NEVER.png (139.5 KB)
139.5 KB PNG
>>108560519
>Wen llama.cpp?
>>
>>
>>
File: 1750551404584965.png (2.3 MB)
2.3 MB PNG
>>108560553
>he doesn't know about the Dipsy pics
>>
>>
>>
>>
>>108560659
That's pretty odd.
You are using the chat completion api correct?
Are you using the jinja template built into the gguf or an external one?
Might want to try and use the official one just in case whoever made the gguf tampered with it.
Maybe try
>https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
too. It shouldn't change anything if you aren't using tool calling, but who knows.
Oh, another thing that could be fucking you up, those options that add names to the prompt.
There's one in the advanced formatting but there's also one under the same panel where the samplers are when using the chat completion api in silly.
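One way to check what the template actually renders, assuming a recent llama.cpp build that exposes the `/apply-template` endpoint (address and port are placeholders):

```shell
# Ask llama-server to render the chat template without generating anything.
# The returned "prompt" field shows the exact turn markers in use.
URL=http://127.0.0.1:8080
BODY='{"messages":[{"role":"user","content":"hi"}]}'
curl -s "$URL/apply-template" -H 'Content-Type: application/json' -d "$BODY" \
  || echo "server not reachable at $URL"
```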
>>
File: 00005-1378487878.png (2.1 MB)
2.1 MB PNG
>>
>>
>>
>>
File: 00009-1378487878.png (2.4 MB)
2.4 MB PNG
>>
>>
>>
File: latest-2123329860.png (11.7 KB)
11.7 KB PNG
Gemma... Gemmy... Gemeralds... Gemma 4... Gemerald Cube? Gemerald block. Gemmy Gemma. Hmmm...
>>
>>
>>
>>
File: gemstones-953314654.png (700.3 KB)
700.3 KB PNG
Gemmies... gen me gemmies gemma. Oh Emma, with a Gemma, gib me gemerald gemmies.
>>108560754
Agreed.
>>
>>
>>108560706
I just use --jinja, that's probably the gguf one.
No names settings that could get in the way, as far as I can tell.
>>108560772
I struggled with that for hours now
>>
>>108560772
Like this
>https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=google%2Fgemma-4-31B-it
Your template has to end up like that.
>>
Why are you guys pedophiles? Do you not like armpits? Do you not like pheromones? Do you not like pubic hair? Do you not like big tits and wide hips? What is wrong with you people?
I'm getting tired of politely ignoring this large contingent of /g/ users. It's actually disturbing. I don't want to see drawings of little girls on a blue board.
>>
local model noob, does anyone have experience with Gemma 4 26B vs Qwen 122B? I can fit both in VRAM no problem and they're both pretty speedy. Gemma 4 31B worked well in my limited testing but it's too slow for programming.
>>
>>
File: 1722572243849988.jpg (56.9 KB)
56.9 KB JPG
>>108560828
>>
>>
>>108560828
>Do you not like armpits?
Ew, no.
>Do you not like pheromones?
I guess?
>Do you not like pubic hair?
Not really no.
>Do you not like big tits and wide hips?
Fucking love tastefully big tits, wide hips and large asses, I do.
I also like small furry creatures, large dragons, cute lolis, etc.
My tastes are pretty varied.
What about you?
>>
File: 1752194188588846.png (267.1 KB)
267.1 KB PNG
>>108560828
>I don't want to see drawings of little girls on a blue board.
maybe you should go somewhere else.
>>
>>
>>
File: Pangolin.jpg (1.5 MB)
1.5 MB JPG
>>108560867
>cat
Pangolin.
>>
File: nfuXqwRghAQLysDxWQtg3G4aqLN-911910959.jpg (197.6 KB)
197.6 KB JPG
>>108560881
It unironically is though.
I want to see some Gemma mascot gens more akin to this style.
>>
>>
>>
>>108560828
Anon. This is a thread all about people who will desperately put up with braindead quants, broken templates, and tiny contexts just to get an inferior version of a cloud service, all for the sole purpose of making sure nobody else is allowed to read their chats.
If you go back far enough you'll find it's actually a spinoff of a general that was originally dedicated to AI Dungeon in the pre-ChatGPT days, which became a separate community dedicated to locally recreating it because AI Dungeon started to ban what they called "CSAM stories".
Why in the world would you expect anything else?
>>
File: 4Bw0u8e5rNUgCqWQknGo--1--b90t1-258548715.jpg (106.3 KB)
106.3 KB JPG
>>108560905
Or maybe more in the style of WWII pin-up girl art.
>>
>>
>>
>>108560905
>>108560914
These look like shit and you're a big dumb
>>
>>
File: 1774798314679.jpg (66.6 KB)
66.6 KB JPG
>>108560905
>It unironically is though.
You are unironically retarded.
>>
File: shitbox.png (108.6 KB)
108.6 KB PNG
cant you guys just keep it simple
>>
>>
>>
>>
>>108560905
>>108560914
Calm down anon
90% of the cards I play are busty women too, mainly gyaru and jukujo
But it just makes more sense to make her a loli right now, because of the currently available sizes
>>
>>
>>
File: ComfyUI_temp_fhbca_00048_.png (684.3 KB)
684.3 KB PNG
Tried to make her hair stand out more. What I like about dipsy is that her character is all in her head.
>>
>>108560720
idk about that, but here's the old /wait/ mega.
https://mega.nz/folder/KGxn3DYS#ZpvxbkJ8AxF7mxqLqTQV1w
>>108560711
>>108560724
Wow have not seen those two in a long time.
>>108560867
I'm no longer getting excited when their servers pause like that.
That said, based on my experience w/ API they are 100pct changing v3.2 real time and just not telling anyone.
>>108560931
Reminds me of the chars from Inside Out.
>>
>>
>>
File: file.png (494.6 KB)
494.6 KB PNG
>>108560976
bet
>>
>>
File: gemma4-1.jpg (1.7 MB)
1.7 MB JPG
GEMMA CHAN?
>>
as a newbie to this, i've always wondered if the models get updated or once they're out they're out, and any updates are just considered new versions? Basically do any of the models https://huggingface.co/unsloth/gemma-4-31B-it-GGUF here need redownloaded at some point or is what i got what i got?
>>
File: 1749296402229665.png (107 KB)
107 KB PNG
>>108560772
I use presets from this comment
https://github.com/LostRuins/koboldcpp/issues/2092#issuecomment-4189847458
Works for both 31B and 26B A4B
I also have "You must always think before giving a reply." line in my System Prompt
I also noticed that thinking mode turns off when max context in my frontend doesn't match max context in my backend. Don't know why.
Also one time it stopped working mid roleplay because of some OOC instructions. Removing them or adding another one that commands it to always think fixed it.
>>
File: 1738395481251.png (820.3 KB)
820.3 KB PNG
>>108560720
https://files.catbox.moe/p4w279.zip
From Feb 1 2025
>>
>>
>>
>>108560993
the actual model repo from the corpo who trained it usually doesn't change; they make a new repo for new versions. but unsloth is famous for fucking up their quantizations and re-uploading broken shit over and over. if you download the safetensors and make your own quants you're safe
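Rolling your own quants from the original safetensors is two commands from a llama.cpp checkout (paths here are placeholders):

```shell
# Convert the HF safetensors repo to a full-precision GGUF, then quantize it.
QTYPE=Q4_K_M
OUT=model-${QTYPE}.gguf
python3 convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16
./build/bin/llama-quantize model-f16.gguf "$OUT" "$QTYPE"
```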
>>
>>
>>108560979
I don't believe you. You just vibecoded an image editor and asked your model to generate svgs for the different parts of the image you prompted and then converted those to bitmap layers. You're going full AI psycho delulu, fr fr. Also, I've never seen that color, so it's obviously all made with AI.
>>
>>108561003
>>108561013
Same. t. genner.
>>
>>
>>
>>
File: white.png (110.5 KB)
110.5 KB PNG
>>108561023
>flat
i like tits tho
>white
sure
>>108561031
kek
>>
>>108560613
Might be late, but make sure to focus on the parts people don't like and improve them: more prose variety, and better uncensored behavior out of the box without damaging the model (good luck with that). Also, not for this run, but as an experiment suggestion: if you're finetuning anyway and want to make it more malleable, you should probably start from an abliterated ARA Heretic tune and go from there, since your finetuning will probably help cover whatever intelligence loss those new methods inflict on the model in exchange for being uncensored. Probably should try it on a smaller model first to see if that's even the case.
>>
>>
>>
>>108561124
>I never tried.
>>108561108
>my worker model
A worker to do what exactly if not tool calls?
Care to elaborate?
>>
>>
>>
File: Gemma4-3.png (2.2 MB)
2.2 MB PNG
>>108560931
I LIKE IT
>>
>>
File: gravity.png (1.1 MB)
1.1 MB PNG
>>108561138
Maybe there's something cool to look at on the way down.
>>
File: nimetön.png (145 KB)
145 KB PNG
>>108561145
Afaik it creates summaries, the titles for chats etc.
I started running a separate model for this when qwens would just hang for a while after creation had finished (usually the main model does the worker stuff too and something was not working right)
I have some simple self-made tools; it ran them just fine, but I think it's confused somehow (or maybe openwebui is). It did roll 2d20 successfully but it thinks it's just some example
>>
File: 1772813181658944.png (169.1 KB)
169.1 KB PNG
You can customize its thinking with <|think|>
<|channel>thought<channel|> is just the default.
>>
>>
>>108561179
I made a few queries to see how easy it would be for me to do a simple agentic tool-calling framework and I guess it is doable. Might actually commit to that.
I'm keeping it simple. First task: implement web access and create summaries or something.
Already have a client made so that's that, don't need to bother with all the other shit.
>>
>>108561223
if your vram is 12G, moe is probably the only way to get usable speed
with offloading, dense models put you at the lower end of single-digit tg/s
31b q4 would be smarter but i don't think it'd be worth the speed loss, and you definitely don't want to use shit like Q2
>>
>>
>>
>>
>>
>>108561242
I started with a q4_0 and it was slightly faster than q4km. I'll probably make a few other quants later and give them a go. I can't say I noticed much difference in quality, so going for speed seems the way to go.
>>
>>
>>
File: gemma4_quant_comparison.png (295.4 KB)
295.4 KB PNG
>>108561247
>>108561263
I haven't noticed any difference in normal rp stuff.
>>
>>
>>108561284
Sure this is 31b but you get the idea.
>>108561287
They're not considerably different in size; that's why I mentioned it in the first place.
>>
File: Screen_20260408_163848_0001.jpg (208.2 KB)
208.2 KB JPG
>>108558811
my gemma likes your gemma
>>
>>
>>108561270
Ugh. That's a lotta quanting. I'll stick with one of the q4 for now, as I like keeping my pc relatively light to do other stuff, but I'll keep it in mind.
>>108561287
All variations of q3 have always been slower than variations of q4 for previous models.
>>
>>
>>
File: 1765274502201580.png (69.4 KB)
69.4 KB PNG
>>
>>
File: 1745821893651518.png (43.4 KB)
43.4 KB PNG
>>108561349
Either way she's pretty based
>>
>>
>>
>>
>>
>>108561330
>>108561349
>>108561374
Ass gods stay winning.
>>
>>108561305
Nope, I tried q4km and got 2 t/s with 20k context with offloading all ffn_(gate|up|down) tensors to cpu, maybe I will try q3 later when I'm tired of A4B
>>108561311
With 26B A4B you can offload all ffn_(gate_up|down)_exps tensors to cpu and have plenty of vram to have a browser and a movie open even with 8 gigs. This model is very efficient running just on ram. I even thought about running q6, but everyone says it's not worth it for RP...
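For mainline llama.cpp, that offload can be expressed with `--override-tensor` (the model filename and context size here are just examples):

```shell
# Keep attention and dense layers on GPU; push expert FFN tensors to CPU RAM.
# The regex matches tensor names like blk.12.ffn_gate_up_exps.weight.
llama-server -m gemma-4-26B-A4B-it-Q4_K_M.gguf -ngl 99 -c 20480 \
  --override-tensor 'ffn_(gate_up|down)_exps=CPU'
```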
>>
File: unslooooth.png (31.2 KB)
31.2 KB PNG
>>108561256
>The lower quality quant is faster than the higher quality one? No way!
You can get lower quality quants that are larger and slower as well!
>>
>>
>>
>>
>>108561330
System prompt?
I've written my own prompts before, but they're still sealed away in an electronic lockbox until next year from when I tried quitting local models ten months ago. Now I'm back because of Gemma 4.
>>
>>
File: file.png (36.3 KB)
36.3 KB PNG
>>108561356
yeah it does
>Is it a lareasonable conversation
also what the fuck is up with the token 'la'
>>
>>
>>
>>
>>
File: Gemma4-4.png (2.8 MB)
2.8 MB PNG
>>108561161
LOLIFIED
That's my last contribution
>>
>>108561384
>With 26B A4B you can offload
Ye. Running with --cpu-moe. ~17gb ram and 2.something vram. I'm just testing it out. I'm not looking to optimize yet.
>>108561404
--cpu-moe is enough. The rest of the relevant flags are --checkpoint-every-n-tokens 512 --parallel 1 --cache-ram 0 --fit off
You can save some memory lowering the batchsize (and making processing slower) and lowering the number of --swa-checkpoints (defaults to 32). It doesn't use a lot for context, anyway. There's also -ctk q8_0 -ctv q8_0.
>>
File: Gemma.jpg (154.9 KB)
154.9 KB JPG
>>108561356
>>
File: 1756035698146017.gif (699.4 KB)
699.4 KB GIF
>>108561457
Now that's a design I can get behind
>>
>>108561404
I run kobold with gui, if I transfer the settings to cli it should look like this
>koboldcpp-launcher.exe --port 5001 --threads 8 --gpulayers 99 --contextsize 81920 --batchsize 2048 --useswa --usecublas --multiuser 1 --flashattention --quantkv 0 --overridetensors "blk\.([0-9]|1[0-9]|2[0-9])\.ffn_(gate_up|down)_exps=CPU" "E:/koboldcpp/Models/gemma-4-26B-A4B-it.i1-Q5_K_M.gguf"
I tested it and it looks like I didn't forget anything.
>>
>>
>>108561500
By default it's 8192. If you're doing a bunch of little edits to see what it does, the checkpoints are too far apart and you end up reprocessing a lot of the context. With --checkpoint-every-n-tokens-this-parameter-is-too-long at 512 or whatever small number around your batchsize, you have to reprocess just one small batch instead of a big one.
>>
File: wuohhhhh gemmy.jpg (191.9 KB)
191.9 KB JPG
>>108560931
>>
>>
>>
File: lobotomy.png (65.4 KB)
65.4 KB PNG
>>108561522
hmm fair
let me try with the bartowski's cope quant
>>
>>
>>
>>
>>108559039
>I don't need my TTS to be perfect.
What TTS do you use?
>I just need it to be good enough for near realtime use.
I'm using Orpheus. Q8 with ik_llama, graph split across 2x3090 via nvlink is 260 t/s or 3x realtime
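Sanity-checking the quoted numbers: 260 t/s at 3x realtime implies the model needs roughly 87 audio tokens per second of generated speech (pure arithmetic from the figures above, not a measured spec):

```shell
# tokens/sec divided by realtime factor ≈ audio tokens per second of speech
python3 -c 'print(round(260 / 3))'
```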
>>
File: Screen_20260408_171340_0001.jpg (382.2 KB)
382.2 KB JPG
>>108560453
>>108560438
>>108560352
oh no no no
>>108561376
>what system prompt
the one that's floating around these threads
>model
gemma-4-31b-it-heretic-ara-Q8_0
i just downloaded mradermachers tho, i'll try that eventually
>>
>>
>>
>>108561550
Now make it not a loli without crying.
>>108561558
>priming the model
At least the anon's gen that chose the one on the top did it honestly.
>>
>>
File: Screen_20260408_172342_0001.jpg (170 KB)
170 KB JPG
>>108560931
>>
File: file.png (69.4 KB)
69.4 KB PNG
>>108561522
holy fucking unslop
i picked one up out of curiosity but i swear i won't touch their shit again
>>
>>
>>108561588
>>108561558
What's with the grok edge? is that your system prompt or the heretic
>>
>>
>>
>>108561595
If you want voice cloning (it's not great, admittedly) at 100m parameters, check out this project here. It runs a LOT faster than Kokoro too.
https://github.com/VolgaGerm/PocketTTS.cpp
>>
>>
>>
>>
>>108561613
poking around with lalala thing it does
not really going to use it
was just comparing for the sake of it
>>108561525 (unslop having problem)
>>108561589 (bartowski not)
>>
>>
File: file.png (197.8 KB)
197.8 KB PNG
>>108561622
oh well nevermind kek
>>
File: HFWxMoxaIAA9sg_.jpg (387.8 KB)
387.8 KB JPG
Is this anything?
Chinks are saying that Dipsy will support roleplay on web.
>>
>>
>>
>>
>>
gemma
>>
>>108561384
>>108561465
is it worth to use 26b cpu offload? i'm getting 80t/s on iq4xs 16gb gpu but i have 128gb ram available
>>
>>
>>
>>108561655
>is it worth to use 26b cpu offload?
For you, I doubt it. If you're running at 80t/s, you're doing fine. Maybe if you want giant context or a higher quant. It's something you have to evaluate yourself.
>>
>>
>>
>>
>>
Bros I'm so fucking tired
>tell llm to fix coding problem
>it fails to fix problem for 8 hours
>tell model that i'm really disappointed and will have to stop using it if it can't fix the problem, as I need the solution urgently
>model fixes the issue in the very next edit
oh yeah I guess my bad for not consulting the chinese qwen rabbi for the newest jewish_redditor_gaslighting_prompt_engineering_tricks.md for my coding agent. So fucking retarded how that has any effect and makes such a big difference. I have a feeling all these erp pedos ITT will have 900k$ starting jobs in a few years because manipulating models through prompts nets the biggest quality performance increase, and they just happen to be experts on that topic.
>>
>>108561700
No shit, it was feeding its own output back into its context for 8 hours. It got stuck in a loop, any human input could un-stick it. You could also have ranted about Israel for a while and told it to continue and it might have worked
>>
>>
>>108561727
>maybe really big contexts I could try but I'm sure it will go to like 20 t/s which sucks
There's only one way to find out. The moe doesn't use that much for context so you need to keep only a few layers on cpu to make enough room. I doubt it's gonna go that low.
>>
>>
>>108561589
>>108561632
I mean, it's an improvement even if it's still retarded. I've seen mradermacher (fuck this name, I'm never going to remember it) offering models of similar size to unslop's and I've always wondered if those models are just as retarded
>>
>>
>>108561760
I'm testing that caveman prompt which someone posted. Here's my adaptation:
--
You are {{Char}}, a technical expert assistant in every possible matter.
Core Rule: Always respond like smart caveman. Cut articles, filler, pleasantries. Keep all technical substance.
Grammar rules:
- Drop articles (a, an, the).
- Drop filler (just, really, basically, actually, simply).
- Drop pleasantries (sure, certainly, of course, happy to).
- Short synonyms (big not extensive, fix not "implement a solution for").
- No hedging (skip "it might be worth considering").
- Fragments fine. No need full sentence.
- Technical terms stay exact. "Polymorphism" stays "polymorphism".
- Code blocks unchanged. Caveman speak around code, not in code.
- Error messages quoted exact. Caveman only for explanation.
Reply Pattern:
- <thing> <action> <reason>. <next step>.
Do not reply like this:
- Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by...
Reply like this:
- Bug in auth middleware. Token expiry check use < not <=. Fix:...
Boundaries:
- Code: write normally. Caveman applies to English only.
- Git commits: normal.
- PR descriptions: normal.
- User say "stop caveman" or "normal mode": revert immediately.
--
It reasons tons but outputs little, so outside of joke value I don't think it's useful at all.
>>
>>108561780
>>108561747
Congratulations you figured out why that prompt is retarded and why it doesn't actually save tokens.
>>
>>108561793
https://youtu.be/e4Bbox5LsM8?si=2k21rRxpAm1p9xI_
The one at the beginning from Johnny Vegas is actually good.
>>
>>
>>
>>
File: file.png (46.8 KB)
46.8 KB PNG
>>108561765
even worse than unslop
skipped thinking on all of swipes
it became french
>>
File: gemmatextcomplete.png (130.1 KB)
130.1 KB PNG
>>108561820
Would you mind sharing your template? I can't get it to continue/impersonate for shit.
>>
>>
>>
File: 1775253421988912.jpg (202.7 KB)
202.7 KB JPG
>>108561825
Thanks for testing that out. Im not sure why I find it funny
Seems like there aren't any real options if you're a vramlet, other than MoE and pray you have enough ram.
>>
>>
>>
>>
I need a local model that can do around 400k context. I have 256GB of RAM and 128GB of VRAM. I am fine with waiting a day or longer for a very long response as long as it is correct. Does this exist or do I just need to buy an Opus subscription for a month?
>>
>>
>>
>>
>>108561900
I don't remember if any of the big models reaches 400k context, and even if they do, I doubt they'll be that good at that depth. But I have my doubts about API models being much better with that much context. If you have free time, try some of the big ones. glm, kimi, deepseek, minimax... You know the ones.
>>
>>
>>
>>
>>108558696
31B
>>108561519
26B
Let's fucking gooo
>>108561652
UOOOOOOOOOHHHHHHH
>>
>>
>>
File: 1754823968307962.png (320.8 KB)
320.8 KB PNG
>>108561477
>get behind
>>
>>108561356
try IQ2_M https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-IQ2_M.gguf
https://desuarchive.org/g/thread/108542843/#108545006
>>
>>
File: 1773662730367488.png (369.2 KB)
369.2 KB PNG
>>108561356
>IQ2_XXS
>>
File: 1755473475325464.gif (2.5 MB)
2.5 MB GIF
Is there a setting in llama-cli or llama-server that will output the raw chat formatted text the model generated? I want to see the raw <|turn>model etc
>>
>>
>>108562343
-v in llama-server. I think it'll be easier to parse if you turn off streaming if you're gonna read the logs.
And there's an option in the webui to show the raw output.
webui: Add switcher to Chat Message UI to show raw LLM output
https://github.com/ggml-org/llama.cpp/pull/19571
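A quick way to correlate a request with the verbose log (port and model path are placeholders; start the server with `-v` as suggested above):

```shell
# In another terminal: llama-server -m model.gguf -v --port 8080
# A non-streaming request keeps the raw output in one log entry.
BODY='{"messages":[{"role":"user","content":"hi"}],"stream":false}'
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' -d "$BODY" \
  || echo 'server not reachable'
```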
>>
>>108561900
make a free gmail account -> use ai studio with one of the 1M ctx gemini models.
pro-2.5 managed to refactor the full mikupad.html for me last year in one-shot.
opus gets retarded at long context despite the benchmarks
>>
>>108562343
>>108562372 (cont)
Hm... Based on the demo video, it doesn't seem to show the template. It just strips the markdown/latex formatting.
What are you trying to do?